LLMAtlas — The Open Ecosystem Workspace for LLMs

The default LLM API call is slow and expensive. Production systems are built around making it fast and cheap.

The three levers

Cache — don't compute what you've already computed
Stream — start showing output before it's done
Right-size — use the smallest model that works

Prompt caching (the biggest win)

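Modern APIs (Anthropic, OpenAI, Gemini, DeepSeek) can reuse the already-processed prefix of a prompt across calls. Anthropic requires you to mark the cacheable prefix explicitly; OpenAI and DeepSeek cache matching prefixes automatically; Gemini supports both. Cache lifetime varies by provider: Anthropic's default is 5 minutes, refreshed on each hit. On subsequent calls with the same prefix: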

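Anthropic: 90% off cached input tokens (reads cost 10% of the normal price; writing to the cache costs 25% more than normal)
OpenAI: 50% off or more, depending on the model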
DeepSeek: 90% off

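Cache the stable parts: long system prompts, few-shot examples, retrieved RAG context, earlier conversation history. The dynamic part (the user's latest turn) stays outside the cache. Caching is prefix-based, so stable content must come first: any change invalidates everything after it. For a chat app where a long, stable prefix dominates the token count, this can cut input-token cost 5-10×. Output tokens are billed as usual.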
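To see where a multiple in that range comes from, here is a rough cost sketch. The base price, the 1.25× write surcharge, and the 0.1× read rate are illustrative assumptions modeled on Anthropic's scheme, and the function and field names are hypothetical. It also assumes every call after the first arrives before the cache expires.

```typescript
// Rough input-cost model for prompt caching. All prices are illustrative
// assumptions, not quotes from any provider's price list.
interface CachePricing {
  inputPerMTok: number;     // base price per million input tokens (USD)
  cacheWriteFactor: number; // multiplier for tokens written to the cache
  cacheReadFactor: number;  // multiplier for tokens read from the cache
}

// Input cost of `calls` requests sharing one prefix. With caching, the first
// call writes the prefix and every later call is assumed to hit the cache.
function inputCost(
  prefixTokens: number,
  dynamicTokens: number,
  calls: number,
  p: CachePricing,
  cached: boolean,
): number {
  const perTok = p.inputPerMTok / 1000000;
  if (!cached || calls === 0) {
    return calls * (prefixTokens + dynamicTokens) * perTok;
  }
  const write = prefixTokens * p.cacheWriteFactor * perTok;
  const reads = (calls - 1) * prefixTokens * p.cacheReadFactor * perTok;
  const dynamic = calls * dynamicTokens * perTok;
  return write + reads + dynamic;
}

// Hypothetical workload: 28K-token stable prefix, 200-token user turns, 100 calls.
const examplePricing: CachePricing = {
  inputPerMTok: 3,
  cacheWriteFactor: 1.25,
  cacheReadFactor: 0.1,
};
const uncachedCost = inputCost(28000, 200, 100, examplePricing, false);
const cachedCost = inputCost(28000, 200, 100, examplePricing, true);
const savingsRatio = uncachedCost / cachedCost;
```

With these made-up numbers the ratio comes out at about 8.5×. It shrinks as the dynamic part grows relative to the prefix, or when calls are spaced far enough apart that the cache expires and has to be rewritten.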

[CACHED PREFIX]
- System prompt (5K tokens)
- Few-shot examples (8K tokens)
- Conversation history (15K tokens)
[NOT CACHED]
- User's new message
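The layout above can be sketched as an Anthropic Messages request body, where a `cache_control` marker of type `ephemeral` ends a cacheable prefix. This sketch only builds the body and does not send it. The helper name `buildCachedRequest`, the model ID placeholder, and the `max_tokens` value are assumptions for illustration.

```typescript
// Minimal types for the parts of an Anthropic Messages request used here.
type CacheControl = { type: "ephemeral" };
interface TextBlock { type: "text"; text: string; cache_control?: CacheControl }
interface Message { role: "user" | "assistant"; content: TextBlock[] }
interface MessagesRequest {
  model: string;
  max_tokens: number;
  system: TextBlock[];
  messages: Message[];
}

function buildCachedRequest(
  model: string,
  systemPrompt: string,
  fewShot: string,
  priorTurns: Message[],
  userMessage: string,
): MessagesRequest {
  // Copy the history so the caller's objects are not mutated, and mark the
  // last block of the last prior turn as the end of the cacheable prefix.
  const markedTurns: Message[] = priorTurns.map((m, i) => {
    const isLastTurn = i === priorTurns.length - 1;
    return {
      ...m,
      content: m.content.map((b, j) =>
        isLastTurn && j === m.content.length - 1
          ? { ...b, cache_control: { type: "ephemeral" } as CacheControl }
          : b),
    };
  });
  return {
    model,
    max_tokens: 1024, // arbitrary value for the sketch
    system: [
      { type: "text", text: systemPrompt },
      // Breakpoint: everything up to and including this block is cacheable.
      { type: "text", text: fewShot, cache_control: { type: "ephemeral" } },
    ],
    // The new user message comes after the last breakpoint, so it is not cached.
    messages: [
      ...markedTurns,
      { role: "user", content: [{ type: "text", text: userMessage }] },
    ],
  };
}

const exampleRequest = buildCachedRequest(
  "your-model-id", // placeholder, not a real model name
  "You are a support assistant for ExampleCo.",
  "Q: How do I reset my password? A: Use the link on the login page.",
  [
    { role: "user", content: [{ type: "text", text: "Hi" }] },
    { role: "assistant", content: [{ type: "text", text: "Hello! How can I help?" }] },
  ],
  "Where is my invoice?",
);
```

Two breakpoints are used so that the system prompt and few-shot examples stay cached even when the conversation history changes. Providers generally enforce a minimum prefix length before caching applies, so very short prompts see no benefit.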

Semantic caching

Different request, same answer? Use semantic caching: embed each query, check if a similar previous query is already in the cache, return its answer if similarity > threshold. Implementations: GPTCache, Vercel's AI SDK cache, or roll your own.

Works well for: FAQs, common queries, retrieval where the user keeps rephrasing. Doesn't work for: personalised responses, real-time data.
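A roll-your-own version can be sketched in a few lines. This is a minimal in-memory sketch, assuming the caller supplies embeddings from whatever model it uses; the `SemanticCache` class, its method names, and the 0.9 default threshold are hypothetical, and a real deployment would likely use a vector index rather than a linear scan.

```typescript
// In-memory semantic cache. Embeddings are supplied by the caller, so any
// embedding model can be plugged in; nothing here calls a real API.
interface CacheEntry { embedding: number[]; answer: string }

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

class SemanticCache {
  private entries: CacheEntry[] = [];
  private threshold: number;

  constructor(threshold: number = 0.9) {
    this.threshold = threshold;
  }

  // Returns the most similar cached answer whose similarity is strictly
  // above the threshold, or undefined on a miss.
  lookup(queryEmbedding: number[]): string | undefined {
    let best: CacheEntry | undefined;
    let bestScore = this.threshold;
    for (const entry of this.entries) {
      const score = cosineSimilarity(queryEmbedding, entry.embedding);
      if (score > bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    return best?.answer;
  }

  store(embedding: number[], answer: string): void {
    this.entries.push({ embedding, answer });
  }
}

// Toy 3-dimensional vectors stand in for real embeddings.
const faqCache = new SemanticCache(0.9);
faqCache.store([1, 0, 0], "You can reset your password from the login page.");
const nearHit = faqCache.lookup([0.99, 0.05, 0]); // close to the stored query
const farMiss = faqCache.lookup([0, 1, 0]);       // orthogonal to it
```

The threshold is the main tuning knob: set it too low and users get answers to questions they did not ask; set it too high and the cache rarely hits.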

Streaming

Don't wait for the full response. APIs return token-by-token via SSE:

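import OpenAI from "openai";
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const stream = await openai.chat.completions.create({ ..., stream: true });
for await (const chunk of stream) {
  // Some chunks (such as the final one) carry no content.
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}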

Perceived latency can drop by an order of magnitude or more: the user sees output after ~200ms instead of waiting ~5 seconds for the full generation. Total generation time is unchanged; only the wait before anything appears shrinks. Always stream in user-facing apps.

Time-to-first-token (TTFT) vs throughput

Two different latency metrics:

TTFT — time until the first token appears. Driven by prompt size + provider's first-byte latency.
Throughput — tokens per second once generation starts. Driven by model architecture + hardware.

Groq and Cerebras, which serve models on custom inference hardware, lead on throughput (700+ tok/s on some models). Anthropic and OpenAI tend to offer more consistent TTFT. For chat, TTFT matters most. For background generation, throughput matters most.
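Both metrics can be derived from timestamps recorded while consuming a stream. The sketch below assumes you log the time the request was sent and the arrival time of each token (or chunk) in milliseconds; the type and function names are hypothetical.

```typescript
// Latency metrics from one streamed response. Timestamps are in milliseconds
// and would come from performance.now() or Date.now() in a real client.
interface StreamTiming {
  requestSentAt: number;
  tokenArrivals: number[]; // arrival time of each token, ascending
}

interface LatencyMetrics {
  ttftMs: number;          // time to first token
  tokensPerSecond: number; // throughput once generation has started
  totalMs: number;         // request sent to last token
}

function latencyMetrics(t: StreamTiming): LatencyMetrics {
  if (t.tokenArrivals.length === 0) throw new Error("no tokens received");
  const first = t.tokenArrivals[0];
  const last = t.tokenArrivals[t.tokenArrivals.length - 1];
  const generationMs = last - first;
  // Throughput counts the tokens that arrived after the first one, divided
  // by the time between the first and last token.
  const tokensPerSecond =
    generationMs > 0 ? ((t.tokenArrivals.length - 1) * 1000) / generationMs : 0;
  return {
    ttftMs: first - t.requestSentAt,
    tokensPerSecond,
    totalMs: last - t.requestSentAt,
  };
}

// Hypothetical trace: first token after 200 ms, then one token every 10 ms.
const exampleMetrics = latencyMetrics({
  requestSentAt: 0,
  tokenArrivals: [200, 210, 220, 230, 240],
});
```

Keeping the two numbers separate makes provider comparisons fairer: a provider with a slow first token can still finish a long generation sooner, and the reverse holds for short replies.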

Batch processing

For non-real-time workloads, use batch APIs: send 1000s of requests, get results within 24h, pay 50% of normal price. OpenAI Batch, Anthropic Batch, Gemini Batch all offer this. Perfect for: nightly evals, bulk classification, dataset generation.
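As an illustration, OpenAI's Batch API takes a JSONL input file with one request per line, each carrying a `custom_id`, `method`, `url`, and `body`. The sketch below only builds that file content for a bulk classification job; uploading it and creating the batch are left out. The helper name, the `item-N` ID scheme, and the model ID are assumptions.

```typescript
// One line of an OpenAI Batch input file targeting chat completions.
interface BatchLine {
  custom_id: string;
  method: "POST";
  url: "/v1/chat/completions";
  body: {
    model: string;
    messages: { role: "system" | "user"; content: string }[];
  };
}

// Builds JSONL content: one JSON object per line, one line per input text.
function toBatchJsonl(model: string, instruction: string, inputs: string[]): string {
  const lines = inputs.map((text, i): BatchLine => ({
    custom_id: `item-${i}`, // used to match each result back to its input
    method: "POST",
    url: "/v1/chat/completions",
    body: {
      model,
      messages: [
        { role: "system", content: instruction },
        { role: "user", content: text },
      ],
    },
  }));
  return lines.map((line) => JSON.stringify(line)).join("\n");
}

const exampleJsonl = toBatchJsonl(
  "your-model-id", // placeholder, not a real model name
  "Classify the sentiment of the message as positive, negative, or neutral.",
  ["Great service, thanks!", "My order never arrived."],
);
```

Results are not guaranteed to come back in input order, which is why each line carries its own `custom_id`.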

Right-sizing

The single biggest cost mistake is using a model that's larger than necessary. A small model such as Llama 3.1 8B can often handle a large share (perhaps 70%) of the customer support queries that teams default to GPT-4 for, though the exact figure depends on your traffic. Run an experiment: take your last 1000 production queries, run them through 8B and through GPT-4, and compare the outputs side by side. You'll be shocked how often the cheap one wins.
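Once the experiment has been graded, the results can be summarized along these lines. This is a sketch under stated assumptions: `smallAcceptable` comes from whatever grading you trust (human review, an LLM judge, exact-match checks), the per-million-token prices are made up, and the routed cost assumes a perfect router that sends each query to the small model exactly when it would have been acceptable, so it is a best case rather than a forecast.

```typescript
// One production query evaluated against both models.
interface EvalRecord {
  queryId: string;
  smallAcceptable: boolean; // was the small model's answer good enough?
  tokens: number;           // total tokens billed for the query
}

interface RoutingEstimate {
  smallShare: number;   // fraction of queries the small model handled
  costAllLarge: number; // cost if every query went to the large model
  costRouted: number;   // best-case cost with perfect routing
}

function estimateRouting(
  records: EvalRecord[],
  smallPerMTok: number,
  largePerMTok: number,
): RoutingEstimate {
  let smallCount = 0;
  let costAllLarge = 0;
  let costRouted = 0;
  for (const r of records) {
    const mtok = r.tokens / 1000000;
    costAllLarge += mtok * largePerMTok;
    if (r.smallAcceptable) {
      smallCount++;
      costRouted += mtok * smallPerMTok;
    } else {
      costRouted += mtok * largePerMTok;
    }
  }
  const smallShare = records.length > 0 ? smallCount / records.length : 0;
  return { smallShare, costAllLarge, costRouted };
}

// Toy data: four queries of one million tokens each, three handled by the small model.
const exampleEstimate = estimateRouting(
  [
    { queryId: "q1", smallAcceptable: true, tokens: 1000000 },
    { queryId: "q2", smallAcceptable: true, tokens: 1000000 },
    { queryId: "q3", smallAcceptable: false, tokens: 1000000 },
    { queryId: "q4", smallAcceptable: true, tokens: 1000000 },
  ],
  1,  // hypothetical small-model price per million tokens
  10, // hypothetical large-model price per million tokens
);
```

In practice a router misclassifies some queries, so the real saving sits somewhere between `costRouted` and `costAllLarge`.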

Cost, Latency & Caching Strategy