Multimodal: Vision, Audio & Documents
Text is no longer the only input. Models now see, hear, and read PDFs.
Modern frontier models accept multiple input types. The "M" in multimodal stands for the modalities a model can process: text, images, audio, video, documents.
Vision-language models
Pass an image alongside text. Use cases:
- OCR on steroids — extract text from receipts, forms, handwriting, low-quality scans
- Document understanding — feed PDFs page-by-page as images
- Visual debugging — paste an error screenshot, ask for fixes
- Image classification + analysis — describe content, count, identify issues
Vision-capable models in 2026: GPT-4.1, GPT-4o, Claude 4 family, Gemini 2.5 family, Llama 4 Scout/Maverick, Pixtral, Qwen2.5-VL, InternVL3.
Document AI: the killer app
Frontier vision model + PDF library is the most reliable document-processing pipeline ever built:
- Convert PDF pages to images
- Send each page to the model with a structured-output schema
- Get back structured JSON of the page's content
Beats traditional OCR + parsing for anything with tables, layouts, handwriting, or scanned originals. Gemini 2.5 with 1M context handles entire 1000-page documents in one call.
Audio in / audio out
Speech-to-text is built into many models. GPT-4o and Gemini 2.5 ingest audio directly — no separate Whisper step. Combined with TTS APIs you get real-time voice agents:
audio in → LLM → audio out
End-to-end latencies are under 300ms on the best providers. This is the foundation of the AI phone agent category.
Video
Still rare in production but growing. Gemini 2.5 Pro accepts video up to ~2 hours via timestamp-indexed frames. Use cases: video summarisation, scene search, surveillance review.
Practical considerations
- Costs are higher. Images cost 150–1000 tokens. Audio priced by duration.
- Resolution matters. Models downsample. For OCR work, source images must be crisp.
- Modality bias. Some models trust the text over the image when they disagree.
The trend
The line between "vision model" and "text model" is dissolving. Every frontier model will accept images by default within 1–2 years. Building text-only leaves 30% of capability on the table.