Level 3 · RAG & Agents
8 min

Multimodal: Vision, Audio & Documents

Text is no longer the only input. Models now see, hear, and read PDFs.

Modern frontier models accept multiple input types. The "M" in multimodal stands for the modalities a model can process: text, images, audio, video, documents.

Vision-language models

Pass an image alongside text. Use cases:

  • OCR on steroids — extract text from receipts, forms, handwriting, low-quality scans
  • Document understanding — feed PDFs page-by-page as images
  • Visual debugging — paste an error screenshot, ask for fixes
  • Image classification + analysis — describe content, count, identify issues

Vision-capable models in 2026: GPT-4.1, GPT-4o, Claude 4 family, Gemini 2.5 family, Llama 4 Scout/Maverick, Pixtral, Qwen2.5-VL, InternVL3.

Document AI: the killer app

Frontier vision model + PDF library is the most reliable document-processing pipeline ever built:

  1. Convert PDF pages to images
  2. Send each page to the model with a structured-output schema
  3. Get back structured JSON of the page's content

Beats traditional OCR + parsing for anything with tables, layouts, handwriting, or scanned originals. Gemini 2.5 with 1M context handles entire 1000-page documents in one call.

Audio in / audio out

Speech-to-text is built into many models. GPT-4o and Gemini 2.5 ingest audio directly — no separate Whisper step. Combined with TTS APIs you get real-time voice agents:

audio in → LLM → audio out

End-to-end latencies are under 300ms on the best providers. This is the foundation of the AI phone agent category.

Video

Still rare in production but growing. Gemini 2.5 Pro accepts video up to ~2 hours via timestamp-indexed frames. Use cases: video summarisation, scene search, surveillance review.

Practical considerations

  • Costs are higher. Images cost 150–1000 tokens. Audio priced by duration.
  • Resolution matters. Models downsample. For OCR work, source images must be crisp.
  • Modality bias. Some models trust the text over the image when they disagree.

The trend

The line between "vision model" and "text model" is dissolving. Every frontier model will accept images by default within 1–2 years. Building text-only leaves 30% of capability on the table.

Knowledge Check

Score 70% or higher to mark this chapter complete.

Q1.Most reliable approach for parsing complex PDFs with tables and layouts?

Q2.Roughly how many tokens does an image cost?

Q3.Why are voice agents now real-time?

0 / 3 answered

LLMAtlas — The Open Ecosystem Workspace for LLMs