LLM cost cheat sheet

Practical ways to spend less on LLM APIs, grouped by where the money goes. Most bills shrink fast once you stop sending tokens you do not need and reuse the ones you do.

Start here: the biggest levers

In rough order of how much they usually move the bill:

Model choice. The gap between a budget and a frontier model is often 10x or more. Use the smallest one that passes your own test.
Output length. Output tokens cost several times more than input. Cap them and ask for less.
Repeated context. Cache it, or stop re-sending it. This is where long chats and agents quietly bleed money.
Number of calls. One good request beats a string of follow-ups, and anything that can wait should run in a batch.

1. Spend less per call

Right-size the model. Use the cheapest model that passes your own test, not the flagship by default. A budget model often handles classification, extraction, and routing fine. Recommender →
Cap the output. Output tokens cost several times more than input. Set a sensible max and ask for concise answers.
Trim the prompt. Drop boilerplate, unused examples, and repeated context. Every token in the prompt is billed on every call.
Use structured output. A schema or tool definition is cheaper and more reliable than a long "format it exactly like this" instruction.

2. Reuse context with caching

Turn on prompt caching. Repeated context (system prompt, instructions, examples, retrieved docs) bills at roughly a tenth of the input price on a cache hit.
Put stable content first. Caching matches a prefix, so keep the fixed parts at the top and the variable parts (the user's question) at the end.
Do not bust the cache. A timestamp, random ID, or reordered JSON near the top changes the prefix and throws the cache away.
See caching + batch savings →

3. Make fewer round-trips

Plan one request, not ten. Every follow-up re-sends the whole conversation as input. Asking once, well, can cost a fraction of a back-and-forth.
Batch async work. If results can wait, the Batch API is 50% off.
Bound your agents. Each agent step re-sends the growing context. Limit steps and cache the system and tools prompt.
Agent cost → · Plan vs. piecemeal →

4. Tame RAG and long context

Retrieve less. A smaller top-k and tighter chunks mean fewer tokens fed to the model on every query.
Cache the retrieved context when the same documents come back often.
Summarize long histories instead of re-sending the full transcript each turn.
RAG cost →

5. Measure before you guess

Count tokens first. Token counts are model-specific; estimate before you send rather than after the bill.
Watch the usage wall. Know how close you are to a plan's limit before it cuts you off.
Re-check prices. Model prices drop fast and new, cheaper models ship often. Compare before you commit. Matrix → · Head-to-head →

6. Subscription or API?

Light or programmatic use: the API is usually cheaper and more flexible.
Heavy chat use: a flat subscription keeps the bill predictable.
Mixed teams: often both, split by who builds vs who chats.
Subscription vs API →

Workflows that make saving automatic

Set these up once and the savings happen without you thinking about them.

Cheap-first routing

Send every request to a budget model first. Escalate to a bigger model only when a quick check (answer length, a confidence signal, or a validator) says the cheap answer is not good enough. Most traffic never needs the expensive model. Pick the pair →

Draft, then refine

Let a cheap model produce the draft and pass only the final candidate to a stronger model to polish. You pay frontier prices once, not on every intermediate step.

Cache-friendly prompt layout

Freeze a stable prefix (role, rules, examples, context) and change only the tail. Wire caching once and every later call rides the discount. Savings →

Nightly batch queue

Anything that does not need an instant answer goes into a queue and runs as one batch job overnight at half price: evals, summaries, tagging, backfills.

Summarize and continue

When a chat or agent run gets long, replace the old turns with a short running summary instead of re-sending the whole history every step. Agent cost →

Test, then downgrade

Keep a small set of real examples with expected answers. Each month, run a cheaper or newer model against it; if it passes, switch and pocket the difference. Compare →

Pre-flight token check

Count tokens before you send. Trim or reject oversized prompts at the door instead of paying for them and finding out on the bill.

Guardrails and alerts

Put a max on output tokens for every call, a monthly spend cap per feature, and an alert at 80% so a runaway loop or a viral day never surprises you.