
How to calculate the real cost of an AI agent run beyond token count

May 2026 · 6 min read · Agent Opz Team

Most teams measure AI agent cost the same way they measure LLM cost: tokens in, tokens out, multiply by the per-token rate, and call it a day. That number is real but it is incomplete. In many production agent deployments, token cost is less than half of the true cost-per-run. The rest is invisible.

This matters for two reasons. First, invisible costs make it impossible to price your product or service correctly. Second, invisible costs make it impossible to optimize — you cannot reduce what you cannot see.

The true cost components of an agent run

1. Token cost (the visible part)

This is what everyone tracks. Input tokens multiplied by the input rate plus output tokens multiplied by the output rate. For GPT-4o at current pricing, that is roughly $2.50 per million input tokens and $10 per million output tokens. For Claude 3.5 Sonnet, it is $3 and $15. The exact numbers change, but this is the part of your cost that your LLM provider makes transparent.

What most teams miss: context accumulates across agent steps. In a ten-step agent run, each LLM call includes not just the current prompt but the full conversation history up to that point. If each step adds 500 tokens to the context, your tenth LLM call has 4,500 more input tokens than your first. Token cost is not linear across agent steps — it is superlinear.
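To make the superlinearity concrete, here is a minimal sketch of the input side of the bill. The prompt size, per-step growth, and rate are illustrative placeholders, not benchmarks.

def input_token_cost(base_prompt_tokens, tokens_added_per_step, steps, rate_per_million):
    total_input_tokens = 0
    for step in range(steps):
        # Each call carries the base prompt plus all context added so far.
        total_input_tokens += base_prompt_tokens + step * tokens_added_per_step
    return total_input_tokens * rate_per_million / 1_000_000

# 10 steps, 2,000-token base prompt, 500 tokens of growth per step, $2.50/M:
print(input_token_cost(2_000, 500, 10, 2.50))   # ~$0.106

With no context growth, the same run would cost $0.05 on the input side. The accumulation alone more than doubles it.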

2. Tool call cost

Agents call tools. Tools cost money. A web search API call might cost $0.003. A database query has infrastructure cost even if it is not billed per-query. A call to an external API may have per-call pricing. An embedding call for a retrieval step costs tokens at the embedding model's rate.

In a retrieval-augmented agent that does five retrieval steps per run, tool call cost can exceed the LLM token cost. Most teams do not track this at all because it is spread across different billing accounts and different services.
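One way to get visibility is to tally tool costs inside the run itself. A minimal sketch; the per-call prices below are assumptions, stand-ins for your own providers' rates.

TOOL_COST_PER_CALL = {
    "web_search": 0.003,    # paid search API, per-call pricing
    "embedding": 0.0001,    # rough per-call cost at embedding-model token rates
    "internal_db": 0.0005,  # amortized infrastructure estimate, not a bill
}

def tool_cost(tool_calls):
    # tool_calls: names of tools invoked during one run, in order.
    return sum(TOOL_COST_PER_CALL.get(name, 0.0) for name in tool_calls)

# Five retrieval steps, each doing one search and one embedding call:
print(tool_cost(["web_search", "embedding"] * 5))   # $0.0155 for this run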

3. Retry cost

Agents retry. They retry when tool calls fail. They retry when the LLM produces output that fails validation. They retry when a downstream service times out. Some agents have explicit retry loops in their orchestration logic. Others retry implicitly when they fail to achieve their goal and loop back to try again.

A retry doubles the cost of that step. A step that retries three times costs four times as much as a step that succeeds on the first try. If your agent has a 15% retry rate and you are not tracking it, you are paying roughly 15% more than your baseline calculation suggests, and you do not know why.
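You can fold retries into the model as a multiplier on base cost. A sketch, assuming each extra attempt re-pays the full cost of the step:

def retry_overhead_factor(retry_rate, avg_extra_attempts=1.0):
    # retry_rate: fraction of steps that retry at least once.
    # avg_extra_attempts: average number of extra attempts per retrying step.
    return 1.0 + retry_rate * avg_extra_attempts

print(retry_overhead_factor(0.15))   # 1.15, matching the 15% example above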

4. Infrastructure cost

Every agent run uses compute. The orchestration layer runs somewhere. Long-running agents hold a connection open and use memory for the duration. Vector databases and retrieval indexes have storage and query costs. If you are running agents on dedicated infrastructure, the fixed cost of that infrastructure needs to be amortized across runs.

For short, simple agents on serverless infrastructure, this cost is small. For long-running, complex agents on dedicated compute, it can be significant — and it does not show up on your LLM provider's invoice.
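A sketch of the amortization, with a hypothetical fixed cost and run volume:

def infra_cost_per_run(fixed_monthly_cost, runs_per_month, cost_per_second, avg_duration_seconds):
    amortized_fixed = fixed_monthly_cost / runs_per_month
    variable = cost_per_second * avg_duration_seconds
    return amortized_fixed + variable

# $400/month of dedicated compute spread over 50,000 runs, plus 45 seconds
# of serverless billing at $0.00002/s:
print(infra_cost_per_run(400, 50_000, 0.00002, 45))   # ~$0.0089 per run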

5. Failure cost

Failed runs still cost money. An agent that runs for eight steps and then hits a timeout or produces an invalid output has consumed token budget, tool call budget, and infrastructure budget for those eight steps. That cost does not disappear because the run did not succeed.

If your agent has a 5% failure rate and each failure consumes 60% of the cost of a successful run, you are paying roughly 3% more per successful run than your per-success cost implies. At scale, that adds up.
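The amortization is straightforward to compute. A sketch, using the numbers from the paragraph above:

def failure_overhead(success_cost, failure_rate, partial_cost_fraction):
    # Every success comes with failure_rate / (1 - failure_rate) failed
    # runs on average, each burning a fraction of a full run's cost.
    failures_per_success = failure_rate / (1.0 - failure_rate)
    return failures_per_success * partial_cost_fraction * success_cost

# 5% failure rate, failures consuming 60% of a successful run's cost:
print(failure_overhead(1.0, 0.05, 0.60))   # ~0.032, i.e. roughly 3% overhead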

A practical cost model

Here is a cost model you can apply to any agent run:

base_cost =
    token_cost(input_tokens_per_step × steps × context_growth_factor)
  + token_cost(output_tokens_per_step × steps)
  + tool_call_cost × avg_tool_calls_per_step × steps
  + token_cost(embedding_calls)              // if retrieval-augmented
  + infrastructure_cost_per_second × avg_duration_seconds

cost_per_run =
    base_cost × retry_overhead_factor        // typically 1.1–1.25
  + failure_rate × partial_run_cost          // amortized failure cost

The context growth factor is the most commonly missed input. If your agent runs for 10 steps and each step adds 300 tokens to context, your average input token count per step is not your first-step input count: it is the first-step input plus 4.5 × 300 = 1,350 extra tokens on average across the run.
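Put together, the model is a few lines of code. Here is a minimal sketch; every input is a per-run average you would measure in production, and the example values are illustrative.

def token_cost(tokens, rate_per_million):
    return tokens * rate_per_million / 1_000_000

def cost_per_run(steps, input_tokens_per_step, output_tokens_per_step,
                 context_growth_tokens, input_rate, output_rate,
                 tool_cost_per_call, tool_calls_per_step,
                 embedding_tokens, embedding_rate,
                 infra_cost_per_second, duration_seconds,
                 retry_factor, failure_rate, partial_run_cost):
    # Average extra context per step is (steps - 1) / 2 × growth per step.
    avg_extra_context = (steps - 1) / 2 * context_growth_tokens
    base = (
        token_cost((input_tokens_per_step + avg_extra_context) * steps, input_rate)
        + token_cost(output_tokens_per_step * steps, output_rate)
        + tool_cost_per_call * tool_calls_per_step * steps
        + token_cost(embedding_tokens, embedding_rate)
        + infra_cost_per_second * duration_seconds
    )
    return base * retry_factor + failure_rate * partial_run_cost

# 10 steps, 2,000/400 tokens in/out per step, 300 tokens of context growth,
# GPT-4o-class rates, and the overheads estimated in the sections above:
print(cost_per_run(10, 2_000, 400, 300, 2.50, 10.00,
                   0.003, 1.5, 20_000, 0.02,
                   0.00002, 45, 1.15, 0.05, 0.04))   # ~$0.20 per run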

What to track in production

Once you have built the cost model, you need to track it in production, per run and per agent. The metrics that matter:

- Cost per run, broken down by component: tokens, tool calls, retries, infrastructure
- Steps per run and input context size per step
- Retry rate, overall and per tool
- Failure rate and the average cost consumed by failed runs
- Run duration
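A per-run cost record is the natural unit to emit from your orchestration layer. A minimal sketch; the field names are illustrative, not a fixed schema.

from dataclasses import dataclass, asdict
import json

@dataclass
class RunCostRecord:
    run_id: str
    agent: str
    steps: int
    input_tokens: int
    output_tokens: int
    tool_calls: int
    retries: int
    duration_seconds: float
    succeeded: bool
    cost_usd: float   # output of the cost model above

def emit(record):
    # Ship to whatever structured logging or metrics pipeline you already run.
    print(json.dumps(asdict(record)))

emit(RunCostRecord("run-123", "support-triage", 10, 33_500, 4_000,
                   15, 2, 45.0, True, 0.198))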

Cost as a reliability signal

Here is something counterintuitive: a sudden spike in cost-per-run is often a reliability signal before it is a cost signal. If your agent's average run cost doubles overnight, that usually means something changed in the agent's behavior — it is taking more steps, retrying more often, or consuming more context. That behavioral change is worth investigating independent of the cost impact.

Cost observability is not just finance. It is ops. Treat your per-run cost like you treat your p99 latency — a metric that you monitor, alert on, and investigate when it moves unexpectedly.
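In practice, that can be as simple as comparing recent cost-per-run against a trailing baseline. A sketch; the 1.5x threshold is an assumption to tune, not a recommendation.

from statistics import mean

def cost_spike_alert(recent_costs, baseline_costs, threshold=1.5):
    # recent_costs: per-run costs from the last hour; baseline_costs: prior week.
    baseline = mean(baseline_costs)
    current = mean(recent_costs)
    if current > threshold * baseline:
        return (f"cost-per-run {current:.3f} is {current / baseline:.1f}x "
                f"baseline {baseline:.3f}: check steps, retries, context size")
    return None

print(cost_spike_alert([0.41, 0.39, 0.44], [0.19, 0.21, 0.20]))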

Track cost-per-run across your entire fleet

Agent Opz captures the full cost breakdown — tokens, tool calls, retries, infrastructure — for every agent run.

Get early access