Level 4 · Production Engineering
9 min

Cost, Latency & Caching Strategy

Three levers that determine whether your LLM feature is shippable.

The default LLM API call is slow and expensive. Production systems are built around making it fast and cheap.

The three levers

  1. Cache — don't compute what you've already computed
  2. Stream — start showing output before it's done
  3. Right-size — use the smallest model that works

Prompt caching (the biggest win)

Modern APIs (Anthropic, OpenAI, Gemini, DeepSeek) let you mark a prefix of your prompt as cacheable. The cache lives for ~5 minutes. On subsequent calls with the same prefix:

  • Anthropic: 90% off input tokens (10% normal cost)
  • OpenAI: 50% off
  • DeepSeek: 90% off

Cache the stable parts: long system prompts, retrieved RAG context, conversation history. The dynamic part (user's latest turn) stays outside the cache. For a chat app with a long system prompt, this is a 5-10× cost reduction overnight.

[CACHED PREFIX]
- System prompt (5K tokens)
- Few-shot examples (8K tokens)
- Conversation history (15K tokens)
[NOT CACHED]
- User's new message

Semantic caching

Different request, same answer? Use semantic caching: embed each query, check if a similar previous query is already in the cache, return its answer if similarity > threshold. Implementations: GPTCache, Vercel's AI SDK cache, or roll your own.

Works well for: FAQs, common queries, retrieval where the user keeps rephrasing. Doesn't work for: personalised responses, real-time data.

Streaming

Don't wait for the full response. APIs return token-by-token via SSE:

const stream = openai.chat.completions.create({ ..., stream: true });
for await (const chunk of stream) {
  print(chunk.choices[0].delta.content);
}

User-perceived latency drops 5-20× because they see output starting after ~200ms instead of waiting for the full 5-second generation. Always stream in user-facing apps.

Time-to-first-token (TTFT) vs throughput

Two different latency metrics:

  • TTFT — time until the first token appears. Driven by prompt size + provider's first-byte latency.
  • Throughput — tokens per second once generation starts. Driven by model architecture + hardware.

Groq and Cerebras win on throughput (700+ tok/s). Anthropic and OpenAI win on consistent TTFT. For chat, TTFT matters most. For background generation, throughput matters most.

Batch processing

For non-real-time workloads, use batch APIs: send 1000s of requests, get results within 24h, pay 50% of normal price. OpenAI Batch, Anthropic Batch, Gemini Batch all offer this. Perfect for: nightly evals, bulk classification, dataset generation.

Right-sizing

The single biggest cost mistake is using a model that's larger than necessary. A Llama 3.1 8B can handle 70% of customer support queries that teams default to GPT-4 for. Run an experiment: take your last 1000 production queries, run them through 8B and through GPT-4, compare outputs. You'll be shocked how often the cheap one wins.

Knowledge Check

Score 70% or higher to mark this chapter complete.

Q1.How much can prompt caching reduce input token cost on Anthropic?

Q2.Why should user-facing apps always stream?

Q3.Difference between TTFT and throughput?

Q4.When is batch API processing appropriate?

0 / 4 answered

LLMAtlas — The Open Ecosystem Workspace for LLMs