What Is Token-Based Pricing? How LLM APIs Actually Charge You

Every major LLM API — OpenAI, Anthropic, Google Gemini, AWS Bedrock, Azure OpenAI — bills per million tokens, with asymmetric rates for input and output. A token is roughly 4 characters of English, so 1,000 tokens is about 750 words. A single request's cost is linear in its tokens, but multi-turn costs compound: output tokens are more expensive, and long-context requests re-send (and re-bill) the entire conversation history on every turn unless prompt caching is enabled.

Published April 18, 2026 · Updated April 18, 2026

~4 chars
One English token
3–5×
Output-to-input price ratio (typical)
~90%
Prompt-cache discount (Anthropic, OpenAI)
~50%
Batch-API discount (OpenAI, Anthropic)

What Is Token-Based Pricing?

Token-based pricing is a usage-metering model where a large language model API charges separately for input tokens (the prompt and context you send) and output tokens (the text the model generates). Rates differ by model and by direction: output tokens are typically 3–5× more expensive than input tokens. The total cost of a single request is (input_tokens × input_rate + output_tokens × output_rate) / 1,000,000, where both rates are quoted per million tokens.
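
As a minimal sketch of that arithmetic (the model names and rates below are illustrative placeholders, not live list prices):

```python
PRICES_PER_MILLION = {
    # model: (input_rate_usd, output_rate_usd) per 1M tokens -- placeholders
    "example-large": (2.50, 10.00),
    "example-small": (0.15, 0.60),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request: tokens times rate, scaled per million."""
    input_rate, output_rate = PRICES_PER_MILLION[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# A 2,000-token prompt with a 500-token answer on the large model:
# 2000 x $2.50/1M + 500 x $10.00/1M = $0.005 + $0.005 = $0.01
print(request_cost("example-large", 2_000, 500))  # 0.01
```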

How Token-Based Pricing Works

  1. Tokenization: Your prompt is split into tokens by the provider's tokenizer (tiktoken for OpenAI, a similar BPE scheme for Anthropic and Gemini). A token is roughly 4 English characters or ¾ of a word.

  2. Input billing: Every token in the prompt — system message, conversation history, retrieved context — counts as an input token and is billed at the model's input rate per million.

  3. Output billing: Every token the model generates counts as an output token, billed at a higher rate (commonly 3–5× the input rate).

  4. Request accounting: The provider returns token counts in the response (e.g., OpenAI's usage object, Anthropic's usage field). You never guess — you sum the reported numbers.

  5. Aggregation: Bills are computed per day or per billing cycle as sum(input_tokens × input_rate / 1e6) + sum(output_tokens × output_rate / 1e6), then rolled up per model (see the sketch after this list).

  6. Modifiers: Prompt caching (Anthropic, OpenAI), batch APIs, and committed-use discounts apply after the raw token count — they change the effective rate, not the underlying unit.
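
As a sketch of steps 4–5: the dict shapes below mirror OpenAI's usage object (prompt_tokens/completion_tokens) and Anthropic's usage field (input_tokens/output_tokens), while the rate table is a placeholder you would fill with real prices.

```python
from collections import defaultdict

RATES = {  # per-million USD rates -- placeholders, not live prices
    "gpt-4o": (2.50, 10.00),
    "claude-example": (3.00, 15.00),
}

def extract_usage(response: dict) -> tuple[int, int]:
    """Normalize provider-reported counts; never estimate them yourself."""
    u = response["usage"]
    if "prompt_tokens" in u:                      # OpenAI-style usage object
        return u["prompt_tokens"], u["completion_tokens"]
    return u["input_tokens"], u["output_tokens"]  # Anthropic-style usage field

def daily_bill(requests: list[tuple[str, dict]]) -> dict[str, float]:
    """Aggregate sum(tokens x rate / 1e6), rolled up per model."""
    totals = defaultdict(float)
    for model, response in requests:
        inp, out = extract_usage(response)
        in_rate, out_rate = RATES[model]
        totals[model] += (inp * in_rate + out * out_rate) / 1e6
    return dict(totals)
```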

Types of Token-Based Pricing

Flat per-million pricing

A single rate per million input tokens and another per million output tokens. Standard for OpenAI GPT-4o, GPT-4o-mini, and Claude Haiku/Sonnet/Opus baseline pricing.

Tiered pricing by context window

Some providers charge a higher rate when the prompt exceeds a threshold (e.g., Gemini 1.5 Pro doubles above 128k input tokens). Makes long-context RAG noticeably more expensive.
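
A sketch of how a threshold tier might be computed; the 2× jump at 128k mirrors the Gemini example, but the base rate here is an arbitrary placeholder, and real schemes vary in whether the whole prompt or just the overage is surcharged (whole-prompt shown):

```python
def tiered_input_cost(input_tokens: int,
                      base_rate: float = 1.25,   # placeholder $/1M
                      threshold: int = 128_000,
                      multiplier: float = 2.0) -> float:
    """Bill the entire prompt at the higher tier once it crosses the threshold."""
    rate = base_rate * multiplier if input_tokens > threshold else base_rate
    return input_tokens * rate / 1e6

print(tiered_input_cost(100_000))  # 0.125 (below threshold)
print(tiered_input_cost(200_000))  # 0.5   (whole prompt at the 2x rate)
```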

Cached-input pricing

Anthropic and OpenAI discount tokens served from their prompt cache (typically ~90% off input rate) when the same prefix is reused within a TTL. Huge on agent loops.
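
A rough sketch of the effective input rate under caching, assuming a flat ~90% discount on cache hits and ignoring the cache-write surcharge some providers add:

```python
def cached_input_cost(cached_tokens: int, fresh_tokens: int,
                      input_rate: float, discount: float = 0.90) -> float:
    """Cached prefix tokens bill at (1 - discount) x rate; the rest at full rate."""
    return (cached_tokens * (1 - discount) + fresh_tokens) * input_rate / 1e6

# A 10,000-token shared prefix plus 500 fresh tokens at $3/1M:
# (10000 x 0.1 + 500) x $3 / 1e6 = $0.0045, vs $0.0315 uncached.
print(cached_input_cost(10_000, 500, 3.00))  # 0.0045
```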

Batch pricing

Async batch APIs (OpenAI Batch, Anthropic Message Batches) cut both input and output rates roughly in half in exchange for up-to-24-hour latency. Best for overnight evals and offline summarization.
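
And the batch trade in one line (a flat ~50% discount assumed, matching the rough halving described above):

```python
def batch_cost(input_tokens: int, output_tokens: int,
               in_rate: float, out_rate: float, discount: float = 0.5) -> float:
    """Same token counts, roughly half the effective rate, up to 24h of latency."""
    return (input_tokens * in_rate + output_tokens * out_rate) * (1 - discount) / 1e6
```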

Image and tool-call tokens

Multimodal inputs (images, audio) are converted to tokens at model-specific rates. Tool-call metadata also counts as output tokens. Easy to miss in cost forecasts.

Common Use Cases

Forecasting monthly spend

Multiply average input and output tokens-per-request by requests-per-day by 30, then apply each direction's rate. Always split by model — blended averages lie.
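
A worked sketch of that multiplication (the traffic numbers are invented for illustration):

```python
def monthly_forecast(avg_input_tokens: float, avg_output_tokens: float,
                     requests_per_day: float, in_rate: float, out_rate: float,
                     days: int = 30) -> float:
    """tokens/request x requests/day x days, split by direction and rate."""
    per_request = (avg_input_tokens * in_rate + avg_output_tokens * out_rate) / 1e6
    return per_request * requests_per_day * days

# 3,000 input / 400 output tokens per request, 10k requests/day, $2.50/$10 per 1M:
# (3000 x 2.50 + 400 x 10) / 1e6 = $0.0115 per request
print(monthly_forecast(3_000, 400, 10_000, 2.50, 10.00))  # ~3450 USD/month
```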

Choosing between models

Translate latency and quality tradeoffs into dollar terms: GPT-4o at $2.50/$10 per M vs GPT-4o-mini at $0.15/$0.60 per M is roughly a 17× gap on both input and output. Use the smaller model where it meets the quality bar.

Detecting prompt bloat

A token count that creeps up week-over-week is usually accumulated context (few-shot examples, tool schemas) that a developer added without noticing the cost impact.

Justifying prompt caching

If >60% of your input tokens repeat across requests (system prompt + tool schemas + few-shots), prompt caching pays for itself within days.
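
A back-of-envelope check on that claim, assuming a ~90% discount on cached input tokens and ignoring cache-write surcharges:

```python
def input_savings_fraction(repeated_fraction: float, discount: float = 0.90) -> float:
    """Share of input spend removed: repeated share x cache discount."""
    return repeated_fraction * discount

# If 60% of input tokens repeat and cached tokens cost 10% of list price,
# input spend drops by about 54%.
print(input_savings_fraction(0.60))  # ~0.54
```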

Frequently Asked Questions

How many tokens is a word?

Roughly 1.3 tokens per English word. A 500-word email is about 650 tokens. Code, JSON, and non-English languages tokenize less efficiently — expect 1.5–2× the token count for the same visible content.
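
To count instead of estimate, you can run OpenAI's open-source tokenizer locally (a recent tiktoken release is needed for the gpt-4o encoding, and other providers' tokenizers will give somewhat different counts):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

for text in ["The quick brown fox jumps over the lazy dog.",
             '{"role": "user", "content": "hello"}']:
    n_tokens = len(enc.encode(text))
    print(f"{len(text)} chars -> {n_tokens} tokens")
```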

Are input and output tokens priced the same?

Almost never. Output tokens typically cost 3–5× more than input tokens on the same model. That's why max_tokens caps and cancelling a stream early are real cost levers, not just UX features.

Does the system prompt count toward my token bill?

Yes. Every token you send counts as an input token — system message, conversation history, retrieved documents, tool schemas, few-shot examples. Prompt caching can discount the repeated prefix, but those tokens are still counted; they are just billed at the lower cached rate.

Why does my actual bill exceed my forecasted token cost?

Four common reasons: (1) output tokens came in longer than you modeled, (2) retries on error doubled some requests, (3) tool-call round-trips multiplied effective input tokens, (4) the agent looped on a tool call more than you expected. A cost dashboard with per-request granularity usually finds all four.
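
A minimal per-request log that surfaces all four (the field names and the pre-normalized usage dict here are invented for the sketch):

```python
import json
import time

def log_request(model: str, usage: dict, retry_count: int,
                tool_round_trips: int, path: str = "costs.jsonl") -> None:
    """Append one line per API call, so retries and tool loops show up
    as extra rows instead of disappearing into a monthly total."""
    with open(path, "a") as f:
        f.write(json.dumps({
            "ts": time.time(),
            "model": model,
            "input_tokens": usage["input_tokens"],
            "output_tokens": usage["output_tokens"],
            "retry_count": retry_count,
            "tool_round_trips": tool_round_trips,
        }) + "\n")
```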

How do I compare model prices fairly?

Price per million tokens is a starting point, but you need tokens-per-request on your actual workload. A cheaper model that needs 2× the output tokens to hit the same quality is not cheaper. Run a sample of real requests against both and compare total cost, not sticker price.
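
A sketch of that comparison loop; run_model is a stand-in for whatever SDK call you use, and must return the provider-reported (input_tokens, output_tokens) pair:

```python
from typing import Callable

def compare_models(prompts: list[str],
                   run_model: Callable[[str, str], tuple[int, int]],
                   rates: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Run the same real prompts through each candidate and compare
    measured total cost, not per-token sticker price."""
    totals: dict[str, float] = {}
    for model, (in_rate, out_rate) in rates.items():
        cost = 0.0
        for prompt in prompts:
            inp, out = run_model(model, prompt)
            cost += (inp * in_rate + out * out_rate) / 1e6
        totals[model] = cost
    return totals
```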

Start tracking your AI costs

Unified view across 50+ AI providers — with zero impact on your inference path.

Start tracking your AI spend

Free tier available. Read-only ingestion. No changes to production.