LLM APIs charge by tokens, not requests, and the cost of a serious AI feature can swing by 100× depending on the model, prompt size, and how often you call it. A side-project chatbot can run on $5/month or $5,000/month with the same user volume. Here's how to actually predict and control LLM costs before the bill arrives.
How LLM Pricing Works
Every major LLM provider — Anthropic, OpenAI, Google, Mistral, AWS Bedrock, Azure — bills you per million tokens. They charge different rates for:
- Input tokens — what you send (the prompt, system message, conversation history, retrieved documents).
- Output tokens — what the model generates back.
Output tokens are typically 3–5× more expensive than input tokens. For Claude Opus 4.7, for example, you'll pay around 5× more per output token than per input token. This matters: if your prompt is 10,000 tokens and the response is 200 tokens, the input dominates the cost. If your prompt is 200 tokens and the response is 5,000 tokens, the output does.
What Is a Token?
A token is roughly a word or part of a word. The exact tokenization depends on the model's tokenizer, but common rules of thumb:
- English: 1 token ≈ 0.75 words, or about 4 characters
- Code: 1 token ≈ 3 characters (more dense)
- Other languages: often 2–3× more tokens than English for the same content
- JSON / XML: more tokens than plain text — punctuation and structural characters count
Quick estimate: 1,000 words ≈ 1,300–1,500 tokens. The exact count matters when you're near a context window limit, but for cost estimation, "words × 1.4" is good enough.
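If you want to sanity-check the heuristic, here's a minimal Python sketch; the exact-count portion uses tiktoken, OpenAI's tokenizer library, so treat it as an OpenAI-only check (other providers expose their own token-counting endpoints):

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rule-of-thumb estimate: ~4 chars/token for English, ~3 for code."""
    return round(len(text) / chars_per_token)

sample = "The quick brown fox jumps over the lazy dog. " * 100
print(estimate_tokens(sample))  # heuristic estimate

# Exact count under an OpenAI tokenizer (pip install tiktoken):
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
print(len(enc.encode(sample)))  # exact count for this tokenizer only
```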
A Simple Cost Formula
cost = (input_tokens / 1,000,000) × input_price
+ (output_tokens / 1,000,000) × output_price
That's the cost per request. Multiply by request volume to get monthly cost.
Example: A chat app sends 800 input tokens and gets 200 output tokens per turn. Each user has ~30 turns per day. The model charges $3 per million input tokens and $15 per million output tokens.
Per turn: (800/1M × $3) + (200/1M × $15) = $0.0024 + $0.003 = $0.0054
Per user per day: $0.0054 × 30 = $0.162
Per user per month (30 days): $4.86
1,000 daily active users: ~$4,860/month
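The same arithmetic as a few lines of Python, using the example's prices (substitute whatever your provider actually charges):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost of one request; prices are dollars per million tokens."""
    return (input_tokens / 1_000_000) * input_price \
        + (output_tokens / 1_000_000) * output_price

per_turn = request_cost(800, 200, input_price=3.00, output_price=15.00)
per_user_month = per_turn * 30 * 30  # 30 turns/day x 30 days
print(f"${per_turn:.4f}/turn -> ${per_user_month:.2f}/user/month")
print(f"${per_user_month * 1_000:,.0f}/month at 1,000 daily active users")
```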
That single calculation tells you whether your unit economics work. Most teams skip it and find out from the bill.
How to Estimate Costs Online
Use DevZone's LLM Cost Calculator to compare costs across providers without doing the math by hand:
- Pick a model (Claude, GPT, Gemini, Mistral, Llama, etc.).
- Enter expected input tokens and output tokens per request.
- Enter request volume (per day, per month).
- The calculator shows total cost and lets you compare two models side-by-side.
This is useful when deciding whether a given task needs the larger, more expensive model or can run on the smaller, cheaper one.
Where Token Costs Actually Hide
Most teams underestimate cost because they only count the obvious tokens. Watch for these:
1. System prompts. A 2,000-token system prompt sent on every request adds up fast. At 1M requests/month, you've spent $6,000 just on the system prompt at $3/M input tokens.
2. Conversation history. A 20-turn conversation may carry forward 10,000+ tokens of history with each new turn, so the cost per turn climbs steadily as the conversation grows (see the sketch after this list).
3. RAG retrieved chunks. Pulling 10 chunks of 500 tokens each = 5,000 input tokens per query, on top of the prompt and conversation.
4. Tool / function definitions. Tool schemas count as input tokens. A long list of tool definitions can add 1,000+ tokens per request.
5. Retries. Each retry re-sends the full prompt and regenerates the output, so a single retry doubles the cost of that interaction.
6. Streaming = same cost. Streaming changes the user experience, not the price. You're billed for the full output regardless of whether it's streamed or not.
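To see how history compounds, here's an illustrative sketch; the token counts are made up, assuming a 2,000-token system prompt re-sent every turn and ~500 tokens of new history carried forward per turn:

```python
# Illustrative only: each turn re-sends the system prompt plus all
# accumulated history, so cumulative input cost grows quadratically.

INPUT_PRICE = 3.00 / 1_000_000   # dollars per input token
SYSTEM_PROMPT = 2_000
TOKENS_PER_TURN = 500            # user message + reply carried forward

total_cost = 0.0
history = 0
for turn in range(1, 21):
    input_tokens = SYSTEM_PROMPT + history + TOKENS_PER_TURN
    total_cost += input_tokens * INPUT_PRICE
    history += TOKENS_PER_TURN   # this turn joins the history for the next

print(f"Input cost of one 20-turn conversation: ${total_cost:.3f}")
# Turn 1 costs ~$0.0075; turn 20 costs ~$0.036 -- almost 5x as much.
```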
Strategies to Cut LLM Costs
Use prompt caching. Anthropic, OpenAI, and Google all support caching the static prefix of a prompt (system message, tool definitions, RAG context). Cached input tokens are typically 10% of normal input price. For an app with a long, mostly-static system prompt, this can cut input costs by 90%.
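With Anthropic's SDK, for instance, marking the static prefix cacheable looks roughly like this (the model ID is illustrative; check the current docs for minimum cacheable sizes and pricing):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_STATIC_PREFIX = "..."  # placeholder for your ~2,000-token system prompt

response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model ID
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_STATIC_PREFIX,
            # Mark the static prefix cacheable: later requests that reuse
            # this exact prefix read it at the discounted cache rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "User's actual question here"}],
)
print(response.content[0].text)
```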
Choose the right model size. GPT-4o-mini, Claude Haiku 4.5, and Gemini Flash are 10–30× cheaper than the flagship models, often with little quality loss for simple tasks. Reserve flagship models for tasks that genuinely need them.
Cap the output. Set max_tokens to a reasonable ceiling. Without a cap, a model that wanders can produce 4,000 tokens when 200 would do.
Compress prompts. Remove filler instructions, redundant examples, and verbose formatting. Every removed token saves cost on every request forever.
Batch when possible. If you're processing offline (summarizing 10,000 emails, generating product descriptions), use the batch APIs — Anthropic and OpenAI both offer ~50% discount for non-real-time workloads.
Use a smaller model for routing. A common pattern: a cheap model decides which expensive model (or which tool) to invoke. The expensive model only runs on the queries that need it.
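A sketch of the routing pattern, again assuming the Anthropic SDK and illustrative model IDs; swap in whichever cheap/flagship pair fits your stack:

```python
import anthropic

client = anthropic.Anthropic()
CHEAP = "claude-haiku-4-5"    # illustrative model IDs; use whatever
FLAGSHIP = "claude-opus-4-1"  # cheap/flagship pair you actually run

def ask(model: str, prompt: str, max_tokens: int = 1024) -> str:
    msg = client.messages.create(
        model=model, max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

def answer(query: str) -> str:
    # The router costs a few input tokens plus one word of output.
    verdict = ask(
        CHEAP,
        "Reply with exactly one word, SIMPLE or COMPLEX: does answering "
        "this query need deep reasoning?\n\n" + query,
        max_tokens=5,
    )
    return ask(FLAGSHIP if "COMPLEX" in verdict.upper() else CHEAP, query)
```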
Cache responses. If users ask similar questions, cache responses by prompt hash. The cheapest LLM call is the one you don't make.
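A minimal exact-match cache, where completion_fn stands in for whatever SDK call you already have (in production you'd back this with Redis and a TTL rather than a process-local dict):

```python
import hashlib
from typing import Callable

_cache: dict[str, str] = {}  # in production: Redis or similar, with a TTL

def cached_completion(
    completion_fn: Callable[[str, str], str],  # e.g. an ask(model, prompt) helper
    model: str,
    prompt: str,
) -> str:
    """Exact-match cache keyed by a hash of (model, prompt)."""
    key = hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = completion_fn(model, prompt)
    return _cache[key]
```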
Free Tier vs Paid
Most LLM providers offer free tiers, but they're rate-limited and not suitable for production:
- Anthropic: $5 in free credit on signup, which expires after a limited period.
- OpenAI: limited free credits.
- Google AI Studio: free tier with daily/per-minute caps.
Free tiers are good for prototypes. The instant you have real users, expect a real bill.
Comparing Models on Cost-per-Quality
Cost alone isn't the right metric — the right metric is "cost per acceptable answer." A model that's 2× cheaper but answers correctly only 50% of the time isn't actually cheaper, because you're either retrying or shipping bad results.
Run an evaluation on a representative sample of your traffic with each candidate model. Score the outputs (manually, or with a higher-tier judge model). Then compare:
effective_cost = total_cost / (requests × acceptable_answer_rate)
Run the numbers: at a 2× price gap, the cheaper model's advantage disappears once its acceptance rate falls to about half the flagship's, and at a 5× gap the small model has to fail roughly four times out of five before the flagship wins on this metric. The headline price alone tells you very little.
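In code, with made-up eval numbers that mirror the 2× break-even above:

```python
def effective_cost(total_cost: float, requests: int, acceptance: float) -> float:
    """Cost per acceptable answer over an eval run."""
    return total_cost / (requests * acceptance)

# Made-up eval numbers: flagship at 2x the spend, much higher acceptance.
flagship = effective_cost(total_cost=20.0, requests=1_000, acceptance=0.95)
small = effective_cost(total_cost=10.0, requests=1_000, acceptance=0.50)
print(f"flagship ${flagship:.4f} vs small ${small:.4f} per acceptable answer")
# -> flagship $0.0211 vs small $0.0200: the 2x price gap has all but vanished
```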
Watching Costs in Production
Set hard limits early:
- Provider-side spending caps. Most providers let you set a monthly hard limit. Set one at 1.5× expected spend.
- Per-user throttling. Cap requests per user per day, especially during free trials.
- Budget alerts. 50%, 80%, 100% of the cap.
- Per-feature dashboards. Tag requests by feature so you can see which feature is eating the budget (a toy sketch follows below).
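A toy version of per-feature tracking plus threshold alerts; a real setup would emit these to your metrics pipeline instead of printing:

```python
from collections import defaultdict

MONTHLY_CAP = 500.00     # dollars; set at ~1.5x expected spend
ALERT_LEVELS = (0.5, 0.8, 1.0)

spend_by_feature: dict[str, float] = defaultdict(float)
_fired: set[float] = set()

def record(feature: str, cost: float) -> None:
    """Accumulate per-feature spend and fire each alert level once."""
    spend_by_feature[feature] += cost
    total = sum(spend_by_feature.values())
    for level in ALERT_LEVELS:
        if total >= MONTHLY_CAP * level and level not in _fired:
            _fired.add(level)
            top = max(spend_by_feature, key=spend_by_feature.get)
            print(f"ALERT: {level:.0%} of monthly cap (${total:.2f}); "
                  f"biggest spender: {top}")

record("chat", 0.0054)     # per-turn cost from the worked example earlier
record("summarize", 0.02)  # illustrative number
```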
A surprise $50,000 LLM bill is a story you only need to live once.
FAQ
Why does the same prompt produce different token counts on different models?
Each model uses its own tokenizer. GPT models use cl100k_base or o200k_base; Claude uses a different tokenizer; Gemini uses SentencePiece. The same English text might be 100 tokens in one tokenizer and 110 in another.
Are training tokens counted?
No. You're billed for inference (running the model). Training data and fine-tuning tokens are separately priced if you use the provider's fine-tuning service.
Is the API more expensive than ChatGPT/Claude.ai subscriptions?
For light personal use, the consumer subscriptions are cheaper. For application development with even modest scale, the API is cheaper because you only pay for what you use, not a flat monthly fee.
Do thinking / reasoning tokens cost extra?
For reasoning models (o1, o3, Claude with extended thinking, Gemini's thinking), you're billed for the model's reasoning tokens too — even though they're not visible in the response. This can dramatically increase the effective output token count. Check the provider docs for the exact billing rules.
Should I worry about input tokens or output tokens more?
It depends on your workload. Long-prompt, short-response (RAG, classification): input dominates. Short-prompt, long-response (creative generation, code generation): output dominates. Run the math on a sample of real traffic — the answer surprises most teams.