Streaming, Batching & Concurrency
Scaling LLM workloads without melting your wallet.
Once you ship an LLM feature to real users, throughput becomes a problem. How do you handle 1000 concurrent users? 100,000? Here's the playbook.
Concurrency limits
Every provider has rate limits — typically:
- Requests per minute (RPM) — how often you can call the API
- Tokens per minute (TPM) — total tokens (input + output) per minute
- Concurrent requests — how many can be in flight at once
Hitting these limits returns 429 (Too Many Requests). Your code must handle this gracefully with exponential backoff:
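const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function callWithBackoff(callLLM, maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await callLLM();
    } catch (e) {
      // Retry only on 429; rethrow other errors, and give up after the last attempt.
      if (e.status !== 429 || attempt === maxAttempts) throw e;
      // Exponential delay (2s, 4s, 8s, ...) plus up to 500 ms of random jitter.
      await sleep(2 ** attempt * 1000 + Math.random() * 500);
    }
  }
}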
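HTTP also defines a Retry-After response header, and some providers send it with a 429. The sketch below is one way to prefer that hint over a computed delay. The function name `retryDelayMs` and the way the header value reaches it are assumptions; how you read response headers depends on your client library.

```javascript
// Hypothetical helper: pick the wait before retrying a 429.
// `retryAfter` is the raw Retry-After header value, or undefined if absent.
function retryDelayMs(retryAfter, attempt, now = Date.now()) {
  const raw = retryAfter == null ? "" : String(retryAfter).trim();
  if (raw !== "") {
    // Form 1: a delay in seconds, e.g. "20".
    const seconds = Number(raw);
    if (Number.isFinite(seconds) && seconds >= 0) return seconds * 1000;
    // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    const date = Date.parse(raw);
    if (!Number.isNaN(date)) return Math.max(0, date - now);
  }
  // No usable header: exponential backoff with up to 500 ms of jitter.
  return 2 ** attempt * 1000 + Math.random() * 500;
}
```

The returned value can replace the fixed delay expression inside the retry loop.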
The queue pattern
For high-volume background work, don't call the LLM directly from your request handler. Use a queue:
- Request arrives → enqueue task → respond immediately with task ID
- Worker pool processes the queue, calling the LLM with rate-limited concurrency
- Client polls / subscribes for the result
This decouples user latency from LLM provider latency, and lets you control concurrency precisely.
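The three steps above can be sketched with an in-memory queue. This is a minimal illustration, not a production design: `TaskQueue`, `enqueue`, and `getStatus` are hypothetical names, and a real deployment would presumably use a durable queue and separate worker processes so that tasks survive restarts.

```javascript
// In-memory stand-in for a real queue (all names are hypothetical).
class TaskQueue {
  constructor(worker, concurrency) {
    this.worker = worker;           // async function that performs the LLM call
    this.concurrency = concurrency; // max calls in flight at once
    this.pending = [];              // task IDs waiting to run
    this.tasks = new Map();         // task ID -> { status, input, result, error }
    this.active = 0;
    this.nextId = 1;
  }

  // Called from the request handler: returns immediately with a task ID.
  enqueue(input) {
    const id = String(this.nextId++);
    this.tasks.set(id, { status: "queued", input });
    this.pending.push(id);
    this.drain();
    return id;
  }

  // Called when the client polls for the result.
  getStatus(id) {
    const task = this.tasks.get(id);
    if (!task) return null;
    return { status: task.status, result: task.result, error: task.error };
  }

  // Start queued tasks until the concurrency cap is reached.
  drain() {
    while (this.active < this.concurrency && this.pending.length > 0) {
      const task = this.tasks.get(this.pending.shift());
      task.status = "running";
      this.active++;
      Promise.resolve()
        .then(() => this.worker(task.input))
        .then(
          (result) => { task.status = "done"; task.result = result; },
          (error) => { task.status = "failed"; task.error = String(error); }
        )
        .finally(() => { this.active--; this.drain(); });
    }
  }
}
```

A request handler would call `enqueue(prompt)` and return the ID, while a status endpoint would call `getStatus(id)`. The `concurrency` argument is the knob that keeps in-flight calls under the provider's limit.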
Multi-provider failover
Single provider = single point of failure. Production systems route across multiple providers:
primary: Groq Llama 3.3 70B (free, fast)
fallback 1: Together AI Llama 3.3 70B (paid, same model)
fallback 2: OpenAI GPT-4.1 mini (different model)
When the primary 429s, 500s, or times out, fall back in order. OpenRouter does this automatically; you can also build it yourself with provider-agnostic clients.
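If you build it yourself, the fallback order can be expressed as a loop over providers. The sketch below assumes each provider is wrapped in an object with a `name` and an async `call(prompt)` method, and that errors carry an HTTP `status` field; both shapes are illustrative assumptions, not any particular SDK's API. Rethrowing non-retryable errors (such as a 400) is a design choice of this sketch.

```javascript
// Try each provider in order; move on after a 429, a 5xx, or a timeout.
async function callWithFailover(providers, prompt, timeoutMs = 10000) {
  const errors = [];
  for (const provider of providers) {
    try {
      return await withTimeout(provider.call(prompt), timeoutMs);
    } catch (e) {
      const retryable =
        e.timedOut === true || e.status === 429 || (e.status >= 500 && e.status < 600);
      if (!retryable) throw e; // assumption: other errors are not worth a fallback
      errors.push(`${provider.name}: ${e.message}`);
    }
  }
  throw new Error(`All providers failed: ${errors.join("; ")}`);
}

// Reject if `promise` has not settled within `ms`.
// Note: this only stops waiting; it does not cancel the underlying request.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => {
      const err = new Error(`timed out after ${ms} ms`);
      err.timedOut = true;
      reject(err);
    }, ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Because the second fallback in the list above is a different model, callers may want to record which provider actually answered.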
Streaming for batch jobs
Even non-interactive workloads benefit from streaming when generation is long. Streaming lets you monitor a 30-second batch generation while it runs: if the first chunk takes more than 5 seconds, cancel and retry. If the output degenerates mid-stream (repeating phrases, gibberish), cancel and retry. This saves money on doomed generations.
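Both checks can be sketched as a watchdog around the stream. The code assumes the response is exposed as an async iterable of text chunks, which is common but not universal; `collectWithWatchdog`, `looksDegenerate`, and the `onCancel` hook are hypothetical names, and the repetition test is a crude heuristic rather than a real quality detector.

```javascript
// Crude heuristic: does the text end with the same phrase repeated `times` in a row?
function looksDegenerate(text, minLen = 8, maxLen = 40, times = 3) {
  for (let len = minLen; len <= maxLen; len++) {
    if (text.length < len * times) break;
    const phrase = text.slice(-len);
    if (text.endsWith(phrase.repeat(times))) return true;
  }
  return false;
}

// Collect a stream of text chunks, giving up early on a slow start or repetition.
async function collectWithWatchdog(stream, { firstChunkMs = 5000, onCancel = () => {} } = {}) {
  const it = stream[Symbol.asyncIterator]();
  let text = "";
  let timer;
  const deadline = new Promise((resolve) => {
    timer = setTimeout(() => resolve("timeout"), firstChunkMs);
  });
  const cancel = (reason) => {
    onCancel(reason); // e.g. abort the HTTP request via an AbortController
    Promise.resolve(it.return?.()).catch(() => {}); // best effort: close the iterator
    return { ok: false, reason, text };
  };
  try {
    // Only the first chunk is raced against the deadline.
    let step = await Promise.race([it.next(), deadline]);
    if (step === "timeout") return cancel("first chunk too slow");
    while (!step.done) {
      text += step.value;
      if (looksDegenerate(text)) return cancel("repetition detected");
      step = await it.next();
    }
    return { ok: true, text };
  } finally {
    clearTimeout(timer);
  }
}
```

A caller would retry when `ok` is false. Whether cancelling actually stops billing depends on the provider, so treat the `onCancel` hook as the place to wire in real request cancellation.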
Parallel requests
When you need to call N models on the same input (e.g., for comparison or ensemble), do it in parallel, not sequentially:
const responses = await Promise.all([
callModel("model-a", prompt),
callModel("model-b", prompt),
callModel("model-c", prompt),
]);
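Total time is that of the slowest call, not the sum of all three. Note that Promise.all rejects as soon as any one call fails; use Promise.allSettled if partial results are acceptable. The LLMAtlas Compare Lab uses exactly this pattern.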
Per-user rate limiting
Don't let one user drain your provider quota. Implement per-user limits in your own service:
- 10 requests per minute per user
- 100K tokens per day per user
- Configurable per pricing tier
Token-bucket algorithms in Redis or Postgres handle this in <10 lines.
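To show the idea without a database, here is a minimal in-memory token bucket. It is a sketch only: `TokenBucket` and `allowRequest` are hypothetical names, and a real service would keep the bucket state in a shared store such as Redis or Postgres so that every instance sees the same counts. The numbers match the 10 requests per minute limit above.

```javascript
// In-memory token bucket (hypothetical sketch, single process only).
class TokenBucket {
  constructor(capacity, refillPerSecond, now = Date.now()) {
    this.capacity = capacity;               // maximum burst size
    this.refillPerSecond = refillPerSecond; // sustained rate
    this.tokens = capacity;                 // start full
    this.updatedAt = now;
  }

  // Refill based on elapsed time, then try to spend `cost` tokens.
  tryRemove(cost = 1, now = Date.now()) {
    const elapsedSec = Math.max(0, now - this.updatedAt) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSecond);
    this.updatedAt = now;
    if (this.tokens < cost) return false;
    this.tokens -= cost;
    return true;
  }
}

// One bucket per user: bursts of up to 10, refilled at 10 tokens per minute.
const buckets = new Map();
function allowRequest(userId, now = Date.now()) {
  if (!buckets.has(userId)) buckets.set(userId, new TokenBucket(10, 10 / 60, now));
  return buckets.get(userId).tryRemove(1, now);
}
```

The daily token limit can reuse the same class with a larger capacity and `cost` set to the tokens a request consumed; per-tier limits then become different constructor arguments.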
The capacity planning table
Rough numbers to sanity-check:
- Free tier providers (Groq, Gemini, Cerebras): 30-60 RPM. Demo only.
- Paid providers: 5,000-10,000 RPM, 1-10M TPM on default tier. Real product traffic.
- Enterprise tier: negotiated, 100K+ RPM.
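- Self-hosted: bounded by your GPU count. 1× H100 ≈ 30-50 simultaneous Llama 70B users, assuming quantized weights (a 70B model at 16-bit precision needs about 140 GB for weights alone, more than an 80 GB H100 holds).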
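To sanity-check your own traffic against these tiers, the arithmetic is a simple multiplication. The sketch below uses made-up inputs (1,000 active users, 2 requests per user per minute, 1,500 tokens per request); only the tier limits come from the list above, and `estimateLoad` and `fitsTier` are hypothetical helpers.

```javascript
// Estimate the request and token rates a workload generates.
function estimateLoad({ activeUsers, requestsPerUserPerMinute, tokensPerRequest }) {
  const rpm = activeUsers * requestsPerUserPerMinute;
  const tpm = rpm * tokensPerRequest; // input + output tokens combined
  return { rpm, tpm };
}

// A tier fits only if both limits hold.
function fitsTier(load, tier) {
  return load.rpm <= tier.rpm && load.tpm <= tier.tpm;
}

// Assumed workload, for illustration only.
const load = estimateLoad({
  activeUsers: 1000,
  requestsPerUserPerMinute: 2,
  tokensPerRequest: 1500,
});
// load.rpm is 2000 and load.tpm is 3000000.

const paidLow = { rpm: 5000, tpm: 1_000_000 };    // low end of the paid range
const paidHigh = { rpm: 10000, tpm: 10_000_000 }; // high end of the paid range
console.log(fitsTier(load, paidLow), fitsTier(load, paidHigh));
```

With these assumed inputs the RPM limit is comfortable on both ends of the paid range, but TPM is exceeded at the low end, which is a reminder to check both limits rather than RPM alone.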