Agent Observability: Logs, Traces, and Cost per Task

By Lazar Milicevic · Published June 16, 2026 · 11 min read


You shipped an agent last month. It mostly works. Then on Tuesday a client emails: "It answered nothing for 40 minutes and you charged me $14." You open your logs. You have print("calling tool") and a stack trace from three days ago. You have no idea which model call cost what, how many retries happened, or whether the planner went into a loop calling search_web seventeen times before giving up.

This is the moment most teams realize: an agent you cannot observe is an agent you cannot trust — and definitely cannot bill. Observability is not a phase-two concern. It is the substrate that lets you debug, price, and improve the thing.

Why agents break the old observability model

A traditional web service has one request, one response, and a stack you can profile. An agent is a non-deterministic graph: a planner picks tools, tools call models, models call other tools, retries happen, and a "task" can take 30 seconds or 8 minutes with 50+ LLM calls inside.

The old triad — metrics, logs, traces — still applies, but the shape changes:

Concern	Web service	Agent
Unit of work	HTTP request	Task (multi-step)
Latency	p95 endpoint	p95 task, p95 per step
Cost	Server time	Token spend per task
Failure	5xx rate	Hallucination, loops, tool errors, partial success
Debug	Stack trace	Causal trace across LLM + tools

If your dashboard only shows "API up, 200 OK," you are blind to the things that actually matter: did the agent finish the job correctly, how much did it cost, and where did it waste time.

The three layers you must instrument

Think of agent observability as three layers, each answering a different question.

1. Logs — What happened at this step? Structured records of every decision, prompt, tool call, and result. One row per event.

2. Traces — How did the steps connect? A parent-child tree that ties a user task to every LLM call and tool invocation underneath it, with timing.

3. Metrics — How is the system behaving over time? Aggregates: cost per task, success rate, p95 latency, tool failure rate, tokens per task.

Skip any layer and you lose something. Logs without traces means you cannot follow a task across tools. Traces without metrics means you cannot spot regressions. Metrics without logs means you cannot debug the outliers.

What to log per step (the minimum useful schema)

Every step an agent takes should emit a structured event. JSON, one line per event, append-only. If you cannot replay a task from your logs, you are not logging enough.

Here is a schema that has held up well in production:

{
  "task_id": "tsk_01HQ...",
  "trace_id": "trc_01HQ...",
  "span_id": "spn_01HQ...",
  "parent_span_id": "spn_01HQ...",
  "step_type": "llm_call",
  "timestamp": "2024-11-12T14:22:31.412Z",
  "duration_ms": 1843,
  "agent": "lead_qualifier",
  "model": "claude-sonnet-4",
  "input_tokens": 1240,
  "output_tokens": 312,
  "cost_usd": 0.0089,
  "tool_name": null,
  "tool_args_hash": null,
  "status": "ok",
  "retry_count": 0,
  "user_id": "usr_abc",
  "tenant_id": "ten_xyz",
  "prompt_version": "qualifier_v7",
  "error": null
}

A few things to notice:

task_id is the billing unit. Everything rolls up to it.
trace_id and span_id follow the OpenTelemetry model so you can use existing tools.
Hash tool args instead of logging them raw. PII safety, smaller storage, still groupable.
prompt_version is non-negotiable. When a regression hits, you need to know which prompt was live.
Cost is computed at write time, not later. Prices change over time, and recomputing from token counts afterwards means guessing which price applied; the cost at the moment of the call is the source of truth.

For tool calls, swap model/input_tokens for tool_name and tool_latency_ms. Same envelope, different payload.
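As a rough illustration of the points above, here is a minimal standard-library sketch that hashes tool arguments, computes cost at write time, and emits one JSON line per event. The price table, the model name "example-model", and the helper names (hash_args, llm_cost_usd, make_event) are all assumptions made for the example, not part of any library.

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical price table in USD per million tokens; use the rates you actually pay.
PRICES_PER_MTOK = {"example-model": {"input": 3.00, "output": 15.00}}

def hash_args(args: dict) -> str:
    """Stable short SHA-256 of tool arguments: groupable without storing raw values."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

def llm_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call, using the price table as it stands right now."""
    price = PRICES_PER_MTOK[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return round(total / 1_000_000, 6)

def make_event(task_id: str, span_id: str, step_type: str, **fields) -> str:
    """Serialize one event as a single JSON line, ready to append to a log."""
    event = {
        "task_id": task_id,
        "span_id": span_id,
        "step_type": step_type,
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
        **fields,
    }
    return json.dumps(event, sort_keys=True)

line = make_event(
    "tsk_demo", "spn_1", "llm_call",
    model="example-model", input_tokens=1240, output_tokens=312,
    cost_usd=llm_cost_usd("example-model", 1240, 312),
)
```

One caveat on hashing: a plain hash of low-entropy arguments (a phone number, say) can be brute-forced, so treat it as a way to group and shrink data rather than as strong anonymization.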

Tracing a task across tools

A trace is a tree. The root span is the user-facing task ("qualify this inbound lead"). Children are LLM calls and tool invocations. Grandchildren are sub-tool calls. You want to be able to open one trace and see the whole story.

OpenTelemetry is the right default. Most LLM observability tools (Langfuse, Arize Phoenix, Helicone, LangSmith) either speak OTel or layer on top of it. Pick one, but instrument with OTel underneath so you are not locked in.

A minimal Python wrapper looks like this:

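import time
from contextlib import contextmanager

from opentelemetry import trace

tracer = trace.get_tracer("agent")

@contextmanager
def span(name, **attrs):
    """Open a child span of the current span and record its duration."""
    with tracer.start_as_current_span(name) as s:
        for k, v in attrs.items():
            if v is not None:  # OpenTelemetry attribute values cannot be None
                s.set_attribute(k, v)
        start = time.perf_counter()  # monotonic clock, safe for measuring durations
        try:
            yield s
        finally:
            s.set_attribute("duration_ms", int((time.perf_counter() - start) * 1000))

def run_task(task_id, user_input):
    # make_plan() and execute() stand in for your own planner and tool runner.
    with span("task.qualify_lead", task_id=task_id):
        plan = make_plan(user_input)
        result = None  # stays None if the plan has no steps
        for step in plan:
            with span(f"step.{step.type}", tool=step.tool):
                result = execute(step)
        return result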

The discipline: every LLM call and every tool call gets its own span. Retries are child spans of the original. Parallel tool calls are sibling spans. If your agent framework hides this from you, wrap it.
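To make the "retries are child spans" rule concrete, here is a small sketch that uses plain dictionaries instead of a tracing SDK, so the parent-child shape is easy to see. record_span, call_with_retries, the SPANS list, and flaky_crm_lookup are all hypothetical names for this illustration; in a real agent the same structure would be expressed with OpenTelemetry spans.

```python
import itertools

_ids = itertools.count(1)
SPANS = []  # in-memory stand-in for a span exporter

def record_span(name, parent_span_id=None, **attrs):
    """Append a span record and return it so callers can update its status."""
    span = {"span_id": f"spn_{next(_ids)}", "parent_span_id": parent_span_id,
            "name": name, **attrs}
    SPANS.append(span)
    return span

def call_with_retries(tool, parent_span_id, max_attempts=3):
    """Run tool(); every attempt becomes a child of one tool-call span."""
    call = record_span("tool_call", parent_span_id, status="pending")
    for attempt in range(1, max_attempts + 1):
        try:
            result = tool()
        except Exception as exc:
            record_span("attempt", call["span_id"], retry_count=attempt - 1,
                        status="error", error=str(exc))
            continue
        record_span("attempt", call["span_id"], retry_count=attempt - 1, status="ok")
        call["status"] = "ok"
        return result
    call["status"] = "error"
    raise RuntimeError(f"tool failed after {max_attempts} attempts")

# A fake tool that fails twice with a 502, then succeeds.
_calls = {"n": 0}

def flaky_crm_lookup():
    _calls["n"] += 1
    if _calls["n"] < 3:
        raise ConnectionError("502 Bad Gateway")
    return {"lead": "ok"}

root = record_span("task.qualify_lead")
result = call_with_retries(flaky_crm_lookup, root["span_id"])
```

Reading SPANS afterwards shows one task span, one tool-call span beneath it, and three attempt spans beneath that, which is the tree you want to see when a CRM call starts returning 502s.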

When something goes wrong at 2am, you do not want to read 4,000 lines of logs. You want to open the trace for task_id=tsk_01HQ..., see that step 3 retried four times because the CRM API returned 502, and ship a fix before the standup.

Cost per task: the metric that keeps you honest

Cost per task is the single most useful number for an agent system. It is the metric that:

Tells you if a customer is profitable
Catches prompt regressions (cost spike = new prompt is verbose or looping)
Surfaces silent failures (cost crashes to near-zero = agent is short-circuiting)
Lets you price the product without guessing

Compute it at the task level by summing every child span's cost_usd. Then track the total, the average, and the p95 per agent and tenant:

-- cost per task, last 7 days, by agent and tenant
SELECT
  agent,
  tenant_id,
  COUNT(DISTINCT task_id) AS tasks,
  SUM(task_cost) AS total_cost,
  AVG(task_cost) AS avg_cost_per_task,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY task_cost) AS p95_cost
FROM (
  SELECT
    task_id, agent, tenant_id,
    SUM(cost_usd) AS task_cost
  FROM agent_spans
  WHERE timestamp > NOW() - INTERVAL '7 days'
  GROUP BY task_id, agent, tenant_id
) t
GROUP BY agent, tenant_id
ORDER BY total_cost DESC;

Watch the p95, not just the average. A handful of pathological tasks (loops, runaway tool calls) will eat your margin even if the median is healthy. If p95 is 8× the median, you have a tail problem worth investigating.

A useful complement is cost per successful task. If your agent succeeds 70% of the time but you still pay for the 30% that fail, your real unit cost is avg_cost / success_rate. That is the number to put on a pricing model.
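The arithmetic is small enough to sketch directly. The function below is an illustration with made-up numbers, assuming you already have one (cost, succeeded) pair per task; cost_summary and its field names are hypothetical.

```python
from statistics import median, quantiles

def cost_summary(tasks):
    """tasks: list of (task_cost_usd, succeeded) pairs, one per task (at least two)."""
    costs = [cost for cost, _ in tasks]
    success_rate = sum(1 for _, ok in tasks if ok) / len(tasks)
    avg_cost = sum(costs) / len(costs)
    # 95th percentile with linear interpolation between data points.
    p95 = quantiles(costs, n=100, method="inclusive")[94]
    return {
        "avg_cost": avg_cost,
        "success_rate": success_rate,
        # You pay for failures too, so spread total spend over successes only.
        "cost_per_successful_task": avg_cost / success_rate if success_rate else float("inf"),
        "p95_to_median": p95 / median(costs),
    }

# Made-up numbers: ten tasks, three failures, one expensive outlier.
tasks = [(0.10, True)] * 6 + [(0.10, False)] * 3 + [(1.00, True)]
summary = cost_summary(tasks)
```

With these inputs the average is $0.19 per task, but at a 70% success rate the cost per successful task is about $0.27, and a single outlier is enough to pull the p95 well above the median.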

Catching the failure modes observability reveals

Once you have logs, traces, and cost in one place, certain failure patterns become obvious. Here are the ones that show up most often.

Tool-call loops. The agent calls search_web, gets a result, decides it is not good enough, calls again with a slightly different query, ten times in a row. In the trace: ten sibling spans, same tool, escalating cost. Fix: hard cap on tool calls per task, plus a planner check that fails loudly when the cap is hit.
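One way to implement the hard cap is a small budget object that the agent charges before every tool call. This is a sketch under assumed limits; ToolBudget, ToolBudgetExceeded, and the default numbers are hypothetical and should be tuned per agent.

```python
from collections import Counter

class ToolBudgetExceeded(RuntimeError):
    """Raised when a task exceeds its tool-call budget."""

class ToolBudget:
    def __init__(self, max_total=20, max_per_tool=5):
        self.max_total = max_total
        self.max_per_tool = max_per_tool
        self.calls = Counter()

    def charge(self, tool_name):
        """Count one call; fail loudly instead of letting the loop continue."""
        self.calls[tool_name] += 1
        if self.calls[tool_name] > self.max_per_tool:
            raise ToolBudgetExceeded(
                f"{tool_name} called {self.calls[tool_name]} times (cap {self.max_per_tool})"
            )
        if sum(self.calls.values()) > self.max_total:
            raise ToolBudgetExceeded(f"task exceeded {self.max_total} tool calls")

budget = ToolBudget(max_total=10, max_per_tool=3)
for _ in range(3):
    budget.charge("search_web")  # the fourth charge for this tool would raise
```

Create one budget per task and log the exception as a failed span, so the loop shows up in the trace rather than only in the invoice.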

Silent truncation. The model hits its output token limit, returns a partial answer, and the agent treats it as success. In logs: output_tokens at or suspiciously close to the maximum for that call. Fix: alert on output_tokens / max_output_tokens > 0.95, where max_output_tokens is the limit you set on the request.
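A check along these lines could run over each logged LLM event. The ratio test follows the rule above; the stop_reason field and the two values in TRUNCATION_STOP_REASONS are assumptions about what your provider returns and what you chose to log, so adjust them to your own schema.

```python
# Assumed stop/finish reason values that indicate the output limit was hit.
TRUNCATION_STOP_REASONS = {"max_tokens", "length"}

def looks_truncated(event: dict, max_output_tokens: int, threshold: float = 0.95) -> bool:
    """Flag an LLM event whose output likely hit the token limit."""
    # An explicit stop reason, if you log one, is the strongest signal.
    if event.get("stop_reason") in TRUNCATION_STOP_REASONS:
        return True
    # Otherwise fall back to the ratio heuristic.
    return event["output_tokens"] / max_output_tokens > threshold

suspicious = looks_truncated({"output_tokens": 4000}, max_output_tokens=4096)
healthy = looks_truncated({"output_tokens": 312}, max_output_tokens=4096)
```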

Tool flakiness masked by retries. A tool fails 30% of the time but the agent retries until it works. Every retry adds cost and latency (at a 30% failure rate, and assuming independent failures, the expected number of attempts is 1 / 0.7 ≈ 1.43 per call), and you never notice because the success rate looks fine. In metrics: retry_count distribution by tool_name. Fix: separate "tool succeeded first try" from "tool eventually succeeded" in your dashboards.
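The split between first-try and eventual success can be computed from the logged events. This sketch assumes one event per tool call carrying its final status and retry_count, as in the schema earlier; tool_reliability and the sample tool name are hypothetical.

```python
from collections import defaultdict

def tool_reliability(events):
    """Return first-try and eventual success rates per tool_name."""
    stats = defaultdict(lambda: {"calls": 0, "first_try": 0, "eventual": 0})
    for e in events:
        s = stats[e["tool_name"]]
        s["calls"] += 1
        if e["status"] == "ok":
            s["eventual"] += 1
            if e["retry_count"] == 0:
                s["first_try"] += 1
    return {
        tool: {
            "first_try_rate": s["first_try"] / s["calls"],
            "eventual_rate": s["eventual"] / s["calls"],
        }
        for tool, s in stats.items()
    }

# Made-up events: every call eventually succeeds, but half needed retries.
events = [
    {"tool_name": "crm_lookup", "status": "ok", "retry_count": 0},
    {"tool_name": "crm_lookup", "status": "ok", "retry_count": 2},
    {"tool_name": "crm_lookup", "status": "ok", "retry_count": 1},
    {"tool_name": "crm_lookup", "status": "ok", "retry_count": 0},
]
rates = tool_reliability(events)
```

A large gap between the two rates is the signal: the tool looks healthy on the eventual number and flaky on the first-try number.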

Prompt regression. You updated the system prompt on Tuesday. Cost per task jumps 40%. Without prompt_version in your logs you would chase this for a week. With it, the diff is a SQL query.
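The comparison itself is a simple group-by. Here is a sketch in Python with made-up costs, assuming you can pull one (prompt_version, task_cost) row per task; avg_cost_by_prompt_version and the version "qualifier_v6" are hypothetical.

```python
from collections import defaultdict

def avg_cost_by_prompt_version(task_rows):
    """task_rows: iterable of (prompt_version, task_cost_usd), one per task."""
    totals = defaultdict(lambda: [0.0, 0])  # version -> [cost sum, task count]
    for version, cost in task_rows:
        totals[version][0] += cost
        totals[version][1] += 1
    return {version: total / count for version, (total, count) in totals.items()}

# Made-up rows: the newer prompt costs more per task.
rows = [
    ("qualifier_v6", 0.10), ("qualifier_v6", 0.10),
    ("qualifier_v7", 0.14), ("qualifier_v7", 0.14),
]
by_version = avg_cost_by_prompt_version(rows)
increase = by_version["qualifier_v7"] / by_version["qualifier_v6"] - 1  # fractional change
```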

Cheap-model fallback never triggers. You designed a router that uses a small model for easy tasks and a large model for hard ones. In the logs: 98% of tasks went to the large model. The router is broken. You only notice because you are watching cost per task by model.
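A quick way to watch routing is the share of LLM calls per model. The sketch below assumes events shaped like the schema earlier; model_share and the names "small-model" and "large-model" are placeholders.

```python
from collections import Counter

def model_share(events):
    """Fraction of LLM calls handled by each model."""
    counts = Counter(e["model"] for e in events if e.get("step_type") == "llm_call")
    total = sum(counts.values())
    return {model: n / total for model, n in counts.items()} if total else {}

# Made-up events: 49 calls to the large model, 1 to the small one.
events = (
    [{"step_type": "llm_call", "model": "large-model"}] * 49
    + [{"step_type": "llm_call", "model": "small-model"}]
)
share = model_share(events)
```

If the large model's share sits far above what the router was designed for, that is the cue to inspect the routing logic.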

A pragmatic stack for a small team

You do not need a 10-person platform team to do this well. A working setup for a solopreneur or small ops team:

Layer	Pragmatic choice	Why
Instrumentation	OpenTelemetry SDK	Vendor-neutral, future-proof
LLM-aware tracing	Langfuse, Phoenix, Helicone, or LangSmith	Built for LLM spans, token/cost views included
Log storage	Postgres or ClickHouse	Cheap, queryable, your data stays yours
Dashboards	Grafana or Metabase	SQL-driven, no lock-in
Alerts	A handful of SQL queries on a cron	Slack webhook when p95 cost doubles

The mistake is reaching for a full observability platform on day one. Start with structured JSON logs to a file, ship them to Postgres nightly, write three dashboards. You will learn what you actually need before you pay for a tool.

If you want a single rule of thumb: if you cannot answer "what did task X cost and why" in under 60 seconds, your observability is broken.
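With JSON-lines logs, answering that question can be a few lines of code. This is a minimal sketch assuming events carry task_id, cost_usd, and either tool_name or model; explain_task_cost and the sample values are hypothetical.

```python
import json
from collections import defaultdict

def explain_task_cost(jsonl_lines, task_id):
    """Return (total_cost, breakdown) for one task, most expensive step first."""
    total = 0.0
    by_step = defaultdict(float)
    for line in jsonl_lines:
        event = json.loads(line)
        if event.get("task_id") != task_id:
            continue
        cost = event.get("cost_usd") or 0.0
        # Attribute cost to the tool if there is one, else the model, else the step type.
        key = event.get("tool_name") or event.get("model") or event.get("step_type")
        by_step[key] += cost
        total += cost
    breakdown = sorted(by_step.items(), key=lambda kv: kv[1], reverse=True)
    return total, breakdown

# Made-up log lines for two tasks.
log_lines = [
    json.dumps({"task_id": "tsk_a", "model": "example-model", "cost_usd": 0.01}),
    json.dumps({"task_id": "tsk_a", "tool_name": "search_web", "cost_usd": 0.03}),
    json.dumps({"task_id": "tsk_b", "model": "example-model", "cost_usd": 0.50}),
]
total, breakdown = explain_task_cost(log_lines, "tsk_a")
```

The same function works on an open file handle, since iterating a file yields one line at a time.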

How BizFlowAI approaches this

Every agent we build for a client is instrumented from day one — OpenTelemetry spans on every LLM call and tool invocation, per-task cost computed at write time, prompt versions tagged on every event, and a Grafana dashboard the client owns. We treat the observability layer as part of the deliverable, not an add-on. If we cannot show you cost per task and p95 latency on the day we ship, we have not finished shipping.

On a discovery call we will pull up a live dashboard from a running client agent and walk through a real task trace — the planner decision, the tool calls, the retries, the dollar amount at the bottom. It is the fastest way to see what "automation you can trust" actually looks like in production.

Where to start this week

If you have an agent in production with weak observability, do these in order:

1. Pick a task ID schema and propagate it. Every log line, every span, every database write tied to an agent action gets a task_id. Without this, nothing else works.
2. Switch to structured JSON logs. One event per line. Schema above is a good start.
3. Compute cost at write time. Pull the token counts from the model response, multiply by the price you paid at that moment, store the dollar figure.
4. Add OpenTelemetry spans around LLM calls and tool calls. Even before you pick a vendor, the spans will be portable.
5. Build one dashboard: cost per task by agent, last 7 days, with p50/p95. This is the dashboard you will look at every morning.
6. Set one alert: p95 cost per task doubles week-over-week. It will fire on real regressions.
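The week-over-week alert in the last step can start as a short script run on a schedule. The sketch below is one possible version, assuming you can load per-task costs for each week; p95, p95_regressed, and the min_tasks guard are hypothetical names and choices, and sending the result to a chat webhook is left out.

```python
from statistics import quantiles

def p95(costs):
    """95th percentile of per-task costs (needs at least two values)."""
    return quantiles(costs, n=100, method="inclusive")[94]

def p95_regressed(this_week, last_week, factor=2.0, min_tasks=20):
    """True when this week's p95 cost per task is at least factor x last week's."""
    # Skip the comparison on thin data, where one task can swing the percentile.
    if len(this_week) < min_tasks or len(last_week) < min_tasks:
        return False
    return p95(this_week) >= factor * p95(last_week)

# Made-up costs: two runaway tasks this week push the tail up.
last_week = [0.10] * 20
this_week = [0.10] * 18 + [0.50] * 2
should_alert = p95_regressed(this_week, last_week)
```

The min_tasks guard is a design choice: a low-volume week would otherwise fire the alert on a single expensive task.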

Everything else — fancy eval frameworks, replay tooling, fine-grained PII redaction — can wait. The six steps above are the difference between an agent you guess about and an agent you can defend in a pricing conversation.

Observability is not the glamorous part of building agents. It is the part that decides whether your automation survives contact with real customers and real invoices.

Frequently asked questions

What is agent observability and why does it matter?

Agent observability is the practice of instrumenting AI agents with logs, traces, and metrics so you can see every LLM call, tool invocation, retry, and cost inside a multi-step task. Unlike a traditional web service that handles one request and one response, an agent is a non-deterministic graph where a single task can trigger 50+ LLM calls over several minutes. Without observability you cannot debug failures, price the product, or catch loops and regressions. It is foundational, not a phase-two concern.

How do you calculate cost per task for an LLM agent?

Assign every event a task_id, compute cost_usd at write time for each LLM call using the input and output token counts and current prices, then sum all child spans grouped by task_id. Track the average, p95, and cost per successful task (avg_cost divided by success_rate). The p95 reveals pathological loops or runaway tool calls that destroy margins even when the median looks healthy.

What should you log for each step an AI agent takes?

Emit one structured JSON event per step with task_id, trace_id, span_id, parent_span_id, step_type, timestamp, duration_ms, model or tool_name, input and output tokens, cost_usd, status, retry_count, tenant_id, and prompt_version. Hash tool arguments instead of logging raw values to protect PII and save storage. The rule of thumb: if you cannot replay a task from your logs, you are not logging enough.

Should I use OpenTelemetry or a tool like Langfuse for LLM tracing?

Use both. Instrument your agent with the OpenTelemetry SDK so you are not locked into a vendor, then send the spans to an LLM-aware backend like Langfuse, Arize Phoenix, Helicone, or LangSmith, which add token, cost, and prompt views on top. This combination gives you portability plus the LLM-specific dashboards you need to debug agents.

What failure modes does agent observability help you catch?

Common patterns include tool-call loops (same tool called repeatedly with cost escalating), silent truncation (output_tokens near the model max), tool flakiness hidden by automatic retries, prompt regressions after a deploy (visible via prompt_version in logs), and broken model routers that send everything to the expensive model. Each becomes obvious once logs, traces, and cost per task live in one place.
