Level 4 · Production Engineering
8 min

Streaming, Batching & Concurrency

Scaling LLM workloads without melting your wallet.

Once you ship an LLM feature to real users, throughput becomes a problem. How do you handle 1000 concurrent users? 100,000? Here's the playbook.

Concurrency limits

Every provider has rate limits — typically:

  • Requests per minute (RPM) — how often you can call the API
  • Tokens per minute (TPM) — total tokens (input + output) per minute
  • Concurrent requests — how many can be in flight at once

Exceeding any of these limits returns an HTTP 429 (Too Many Requests) response. Your code must handle this gracefully by retrying with exponential backoff:

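const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function callWithBackoff(maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await callLLM();
    } catch (e) {
      // Rethrow anything that isn't a 429, and give up once attempts run out.
      if (e.status !== 429 || attempt === maxAttempts) throw e;
      // Exponential backoff (2s, 4s, 8s, ...) plus jitter so clients don't retry in lockstep.
      await sleep(2 ** attempt * 1000 + Math.random() * 500);
    }
  }
}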

The queue pattern

For high-volume background work, don't call the LLM directly from your request handler. Use a queue:

  1. Request arrives → enqueue task → respond immediately with task ID
  2. Worker pool processes the queue, calling the LLM with rate-limited concurrency
  3. Client polls / subscribes for the result

This decouples user latency from LLM provider latency, and lets you control concurrency precisely.
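The three steps above can be sketched in a few lines. This is a minimal in-memory illustration, not a production design: the names `enqueue`, `drain`, and `getResult` are hypothetical, `callLLM` is whatever client function you already have, and a real system would use a durable queue and result store instead of an array and a Map.

```javascript
// In-memory stand-ins (assumption); swap for a durable queue and store in production.
const queue = [];            // pending tasks
const results = new Map();   // taskId -> { status, output?, error? }
let nextId = 1;

// Step 1: enqueue the task and hand back a task ID immediately.
function enqueue(prompt) {
  const id = String(nextId++);
  results.set(id, { status: "pending" });
  queue.push({ id, prompt });
  return id;
}

// Step 2: each worker pulls tasks until the queue is empty.
async function worker(callLLM) {
  while (queue.length > 0) {
    const { id, prompt } = queue.shift();
    try {
      results.set(id, { status: "done", output: await callLLM(prompt) });
    } catch (e) {
      results.set(id, { status: "failed", error: String(e) });
    }
  }
}

// The pool size is the concurrency cap: at most `concurrency` calls are in flight.
async function drain(callLLM, concurrency = 4) {
  const workers = Array.from({ length: concurrency }, () => worker(callLLM));
  await Promise.all(workers);
}

// Step 3: the client polls by task ID.
function getResult(id) {
  return results.get(id) ?? { status: "unknown" };
}
```

Because each worker handles one task at a time, raising or lowering `concurrency` is the only knob you need to stay under a provider's concurrent-request limit.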

Multi-provider failover

Single provider = single point of failure. Production systems route across multiple providers:

primary: Groq Llama 3.3 70B (free, fast)
fallback 1: Together AI Llama 3.3 70B (paid, same model)
fallback 2: OpenAI GPT-4.1 mini (different model)

When the primary 429s, 500s, or times out, fall back in order. OpenRouter does this automatically; you can also build it yourself with provider-agnostic clients.
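If you build it yourself, the fall-back-in-order logic might look roughly like the sketch below. The `{ name, call }` provider shape, the `status` field on errors, and the helper names are all assumptions made for illustration; adapt them to whatever your client library actually throws.

```javascript
// Errors worth failing over on: rate limits, server errors, and timeouts.
function isRetryable(err) {
  return err.name === "TimeoutError" || err.status === 429 || err.status >= 500;
}

// Rejects if `promise` has not settled within `ms`.
// Note: this does not cancel the underlying request; it only stops waiting for it.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => {
      const err = new Error(`timed out after ${ms} ms`);
      err.name = "TimeoutError";
      reject(err);
    }, ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// `providers` is an ordered array of { name, call(prompt) } objects (hypothetical shape).
async function callWithFailover(providers, prompt, timeoutMs = 30000) {
  let lastError;
  for (const provider of providers) {
    try {
      const text = await withTimeout(provider.call(prompt), timeoutMs);
      return { provider: provider.name, text };
    } catch (err) {
      // Anything else (e.g. a malformed request) is rethrown rather than retried elsewhere.
      if (!isRetryable(err)) throw err;
      lastError = err;
    }
  }
  throw lastError ?? new Error("no providers configured");
}
```

Returning the provider name alongside the text is a deliberate choice: when the fallback is a different model, downstream code and logs usually want to know which one answered.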

Streaming for batch jobs

Even non-interactive workloads benefit from streaming when generation is long, because a stream lets you monitor a 30-second batch generation while it runs. If the first chunk takes more than 5 seconds to arrive, cancel and retry. If the output degenerates mid-stream (repeating phrases, gibberish), cancel and retry as well. This saves money on doomed generations.
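Here is one way such a watchdog could look. It is a sketch under assumptions: `startStream(signal)` is a hypothetical function returning an async iterable of text chunks that stops when the signal is aborted, and `looksDegenerate` is a deliberately crude repetition check (it will not catch gibberish and can misfire on long runs of identical characters).

```javascript
// Crude heuristic: does the text end with the same 8-40 character unit repeated `times` times?
function looksDegenerate(text, minPeriod = 8, maxPeriod = 40, times = 3) {
  for (let p = minPeriod; p <= maxPeriod; p++) {
    if (text.length < p * times) break;
    const unit = text.slice(-p);
    if (text.endsWith(unit.repeat(times))) return true;
  }
  return false;
}

// Throws if the first chunk is too slow or the output starts looping.
// The caller decides whether and how to retry.
async function generateWithWatchdog(startStream, firstChunkMs = 5000) {
  const controller = new AbortController();
  const iterator = startStream(controller.signal)[Symbol.asyncIterator]();
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("first chunk too slow")), firstChunkMs);
  });
  let text = "";
  try {
    // Only the first chunk races against the deadline.
    let step = await Promise.race([iterator.next(), deadline]);
    clearTimeout(timer);
    while (!step.done) {
      text += step.value;
      if (looksDegenerate(text)) throw new Error("degenerate output");
      step = await iterator.next();
    }
    return text;
  } catch (err) {
    controller.abort(); // signal the provider call to stop generating
    throw err;
  } finally {
    clearTimeout(timer);
  }
}
```

Wrapping `generateWithWatchdog` in a retry loop gives you the cancel-and-retry behavior; keeping the retry policy outside the watchdog lets you cap total attempts in one place.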

Parallel requests

When you need to call N models on the same input (e.g., for comparison or ensemble), do it in parallel, not sequentially:

const responses = await Promise.all([
  callModel("model-a", prompt),
  callModel("model-b", prompt),
  callModel("model-c", prompt),
]);

Total time = max of the three, not sum. The LLMAtlas Compare Lab uses exactly this pattern.
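One caveat: `Promise.all` rejects as soon as any single call fails, and you lose the results from the models that succeeded. For a comparison view you usually want partial results, which `Promise.allSettled` gives you. A small sketch, where `compareModels` is a hypothetical wrapper and `callModel` is the same assumed helper as above:

```javascript
// Calls every model in parallel and reports success or failure per model.
async function compareModels(callModel, models, prompt) {
  const settled = await Promise.allSettled(models.map((m) => callModel(m, prompt)));
  return settled.map((result, i) =>
    result.status === "fulfilled"
      ? { model: models[i], ok: true, text: result.value }
      : { model: models[i], ok: false, error: String(result.reason) }
  );
}
```

Total time is still bounded by the slowest call, but one flaky model no longer blanks the whole comparison.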

Per-user rate limiting

Don't let one user drain your provider quota. Implement per-user limits in your own service:

  • 10 requests per minute per user
  • 100K tokens per day per user
  • Configurable per pricing tier

A token-bucket algorithm backed by Redis or Postgres handles this in a few lines of code.
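To make the algorithm concrete, here is an in-memory token bucket sized for the 10-requests-per-minute limit above. It is only a sketch of the logic: the class and function names are hypothetical, and the Map stands in for Redis or Postgres, which you would need so that limits are shared across server instances and survive restarts.

```javascript
class TokenBucket {
  constructor(capacity, refillPerSecond, now = Date.now()) {
    this.capacity = capacity;               // maximum burst size
    this.refillPerSecond = refillPerSecond; // steady-state rate
    this.tokens = capacity;                 // start full
    this.updatedAt = now;
  }

  // Refill based on elapsed time, then try to spend `cost` tokens.
  tryTake(cost = 1, now = Date.now()) {
    const elapsedSec = Math.max(0, now - this.updatedAt) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSecond);
    this.updatedAt = now;
    if (this.tokens < cost) return false;
    this.tokens -= cost;
    return true;
  }
}

// One bucket per user: 10 requests per minute, with bursts of up to 10.
const buckets = new Map();
function allowRequest(userId, now = Date.now()) {
  if (!buckets.has(userId)) buckets.set(userId, new TokenBucket(10, 10 / 60, now));
  return buckets.get(userId).tryTake(1, now);
}
```

The same class covers the daily token quota: create a bucket with a capacity of 100,000 and pass the request's token count as `cost`. Per-tier limits then become a lookup of `capacity` and `refillPerSecond` by pricing tier.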

The capacity planning table

Rough numbers to sanity-check:

  • Free tier providers (Groq, Gemini, Cerebras): 30-60 RPM. Demo only.
  • Paid providers: 5,000-10,000 RPM, 1-10M TPM on default tier. Real product traffic.
  • Enterprise tier: negotiated, 100K+ RPM.
  • Self-hosted: bounded by your GPU count. 1× H100 ≈ 30-50 simultaneous Llama 70B users.
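To turn these limits into a user count, check which limit binds first. The helper below is a back-of-envelope sketch: `sustainableUsers` is a hypothetical name, and the 2,000 tokens per request and 2 requests per user per minute in the example are assumed workload figures, not measurements.

```javascript
// Estimates how many active users a given RPM/TPM quota can sustain.
function sustainableUsers({ rpmLimit, tpmLimit, tokensPerRequest, requestsPerUserPerMinute }) {
  const rpmFromTokens = tpmLimit / tokensPerRequest; // requests/min the token budget allows
  const effectiveRpm = Math.min(rpmLimit, rpmFromTokens);
  return {
    effectiveRpm,
    boundBy: rpmFromTokens < rpmLimit ? "TPM" : "RPM",
    users: Math.floor(effectiveRpm / requestsPerUserPerMinute),
  };
}

// Paid default tier at the low end of the ranges above: 5,000 RPM and 1M TPM.
const estimate = sustainableUsers({
  rpmLimit: 5000,
  tpmLimit: 1_000_000,
  tokensPerRequest: 2000,       // assumption: input + output tokens per call
  requestsPerUserPerMinute: 2,  // assumption: an active chat user
});
console.log(estimate); // { effectiveRpm: 500, boundBy: 'TPM', users: 250 }
```

In this example the token budget runs out at 500 requests per minute, long before the 5,000 RPM cap, so TPM is the number to watch for prompt-heavy workloads.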

Knowledge Check

Q1. What's the right way to handle a 429 (Too Many Requests)?

Q2. Why route across multiple providers?

Q3. When calling 3 models on the same input for comparison, what's the right pattern?

LLMAtlas — The Open Ecosystem Workspace for LLMs