Multimodal RAG retrieves answers grounded in figures, tables, and page layout - not just paragraphs. This guide compares caption-and-index, unified vision embeddings (Cohere Embed 4, voyage-multimodal-3), and page-as-image retrieval (ColPali, ColQwen2) with reference architectures on OpenSearch.
Text-only RAG silently discards a large fraction of the signal in real enterprise documents. The charts, tables, diagrams, and layout cues that often carry the actual answer never make it into the index. Multimodal RAG closes that gap by treating images, page renderings, and structured visuals as first-class retrieval objects alongside text.
This guide is for engineers building RAG over document-heavy corpora - financial filings with revenue charts, technical manuals with schematics, medical reports with embedded scans, product catalogs with photos. It covers the three architectures that actually work in 2026, the embedding and parsing models that matter, and a reference design on OpenSearch with honest cost and recall trade-offs.
Why Text-Only RAG Fails on Real Enterprise Documents
A 10-K filing answers the question "what drove the margin decline in Q3?" with a waterfall chart on page 34, not with any sentence in the document. A wiring diagram answers "which valve connects to the pressure relief line?" - no paragraph in the manual contains that information. When OCR linearizes a 2D layout into 1D text, table columns merge into nonsense, footnotes detach from their referents, and figure captions get retrieved without the figure they describe.
The symptoms in production are predictable. The LLM fabricates numbers when the retrieved chunk only paraphrased a table. Tables with merged cells produce garbled markdown that the model invents structure for. Diagram-dependent questions return "I don't have enough information" even though the page was indexed - because the page's information lived in pixels.
In figure-heavy enterprise corpora, a meaningful share of answers depend on visual content the text pipeline drops on the floor. The exact percentage varies by domain: low for press releases, high for engineering documentation and financial filings. Either way, the pattern is the same - text-only retrieval has an upper bound that no amount of chunking tuning can lift.
What Multimodal Means in a RAG Pipeline
Multimodality enters at four possible points: ingestion, embedding, retrieval, and generation. Decisions at each layer compound, and the architecture you pick is mostly a decision about which layers do which job.
There are three modalities worth distinguishing in practice:
- Pure text - paragraphs, headings, lists. Standard chunking applies.
- Raster images - photos, scanned pages, diagrams. No native text representation.
- Structured visuals - tables, charts, schematics. Mixed: extractable to text in good cases, lossy in others.
The four reference architectures map cleanly onto how they handle these modalities:
| Architecture | How it works | Strengths | Weaknesses |
|---|---|---|---|
| Caption-and-index | Extract images, caption with a VLM, embed captions as text | Simple; reuses existing text RAG | Lossy; caption hallucination becomes index pollution |
| Unified embeddings | Encode text and images into the same vector space (Cohere Embed 4, voyage-multimodal-3) | Single index, cross-modal queries, single-vector storage | API dependency or self-hosted GPU cost; coarser than late interaction |
| Page-as-image with late interaction | Render pages, embed with ColPali / ColQwen2, multi-vector retrieval | Highest recall on figure-heavy docs; no parsing pipeline | 100-1000x more vectors per page; needs custom scoring on most engines |
| Hybrid late-fusion | Parallel text + image indices; BM25 + dense; RRF fusion; VLM reranker | Most flexible; best recall on mixed corpora | Most complex to operate and evaluate |
Most production teams end up at architecture #2 or #4. Architecture #1 is fine as a quick win on small corpora. Architecture #3 is correct when answers really do live in pixels and recall is the binding constraint.
Embedding Models That Matter
CLIP and SigLIP - the baselines
CLIP (Radford et al., 2021) is the canonical contrastive image-text model and produces a 512-dim shared space. It still works for product-catalog image-image search and simple cross-modal retrieval on photos. It is weak on document-style visuals - charts, tables, dense text in images - because its training distribution was natural images with short captions.
SigLIP replaced the contrastive softmax with a sigmoid loss, removing the dependence on huge in-batch negatives; SigLIP 2 (Tschannen et al., Google DeepMind, February 2025) builds on that with multi-resolution support and is meaningfully better on text-in-image retrieval. For OCR-adjacent tasks where a page contains both visual structure and embedded text, SigLIP 2 is the better starting point.
Cohere Embed 4 and voyage-multimodal-3.5 - enterprise unified embeddings
Cohere Embed 4, released in April 2025, was built explicitly for enterprise document retrieval. It accepts interleaved text and images, processes raw PDF pages without a parsing pipeline, and supports a 128K context window so a single embedding call can cover roughly a 200-page document. It is available on the Cohere API, Amazon Bedrock, and Azure AI Foundry. Embed 4 also supports Matryoshka dimensions of 256, 512, 1024, and 1536 - useful for trimming storage cost without re-embedding.
voyage-multimodal-3 (November 2024) was the first major single-vector multimodal model that handled screenshots, slides, and figures without a parser, reporting strong gains over CLIP-large on table and screenshot retrieval. Its successor, voyage-multimodal-3.5 (January 2026), adds video frame support and Matryoshka dimensions - the most current option if your corpus also includes screen recordings or instructional video.
Both produce single-vector embeddings, which makes indexing simple and storage modest. The trade-offs are API dependency at scale and the fact that single-vector representations still trail multi-vector late interaction on the hardest figure-heavy queries.
Nomic Embed Multimodal - open-weights option
For teams that need to keep embeddings inside their VPC, Nomic Embed Multimodal ships two variants built on Qwen2.5-VL. The single-vector nomic-embed-multimodal-7b mirrors the Cohere/Voyage shape. The multi-vector colnomic-embed-multimodal-7b is ColPali-style: it produces patch-level embeddings with a late-interaction matching mechanism and reports 62.7 nDCG@5 on ViDoRe V2. Both are available on Hugging Face under permissive licensing and run on a single 24GB GPU with quantization.
Storage and cost economics
Single-vector embeddings store one vector per page. Multi-vector ColPali-class models store roughly 1024 patch vectors per page. At 10M pages, that is the difference between tens of GB and several TB - even before considering replication and indexing overhead. Matryoshka truncation (1024-dim down to 256-dim) typically costs under 2% recall and cuts storage four-fold; binary quantization can shrink it further at a modest recall cost.
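For intuition, here is a rough sketch of that arithmetic plus Matryoshka truncation. The page count, dimensions, and fp16/fp32 choices are illustrative assumptions, and real indices add HNSW graph and replication overhead on top.

```python
import numpy as np

# Illustrative storage arithmetic (assumed: raw vectors only, no index overhead).
PAGES = 10_000_000
single_vector = PAGES * 1024 * 4        # one 1024-dim fp32 vector per page
multi_vector = PAGES * 1024 * 128 * 2   # ~1024 patch vectors of 128 dims, fp16 (ColPali-style)
print(f"single-vector: {single_vector / 1e9:.0f} GB")   # ~41 GB
print(f"multi-vector:  {multi_vector / 1e12:.1f} TB")   # ~2.6 TB

def matryoshka_truncate(emb: np.ndarray, dim: int = 256) -> np.ndarray:
    """Keep the first `dim` components of a Matryoshka embedding and re-normalize.
    Only valid for models trained with Matryoshka representation learning."""
    truncated = emb[..., :dim]
    return truncated / np.linalg.norm(truncated, axis=-1, keepdims=True)

full = np.random.randn(4, 1024).astype(np.float32)   # stand-in for Embed-4-style vectors
small = matryoshka_truncate(full, 256)               # 4x less storage per vector
```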
ColPali and the Page-as-Image Approach
The core insight behind ColPali (Faysse et al., July 2024, accepted at ICLR 2025) is to skip parsing entirely. Render the PDF page at high DPI, feed it to a vision-language model, and produce patch-level embeddings. The model "sees" the page the way a human reader does. There is no OCR step that can fail, no layout detection that can confuse a multi-column abstract with a sidebar, and no table extractor that can lose a row.
The model family lives at github.com/illuin-tech/colpali and on the vidore Hugging Face org. ColPali itself uses a PaliGemma backbone. ColQwen2 and ColQwen2.5 use Qwen2-VL and Qwen2.5-VL backbones respectively, are typically stronger on multilingual content, and tend to top the ViDoRe V2 leaderboard. ColSmolVLM is the lighter-weight option for cost-sensitive deployments.
Late-interaction scoring uses MaxSim between query token embeddings and page patch embeddings. This is what gives ColPali-class models their recall edge on figure-heavy queries: the matching happens at token-patch granularity, not at single-vector granularity.
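To make the scoring concrete, a minimal MaxSim scorer in PyTorch looks like the sketch below; the 128-dim patch embeddings and the tensor shapes are illustrative assumptions rather than the exact ColPali configuration.

```python
import torch

def maxsim_score(query_tokens: torch.Tensor, page_patches: torch.Tensor) -> torch.Tensor:
    """ColBERT/ColPali-style late interaction.
    query_tokens: (num_query_tokens, dim); page_patches: (num_pages, num_patches, dim).
    Both are assumed L2-normalized, so dot products are cosine similarities."""
    # similarity of every query token against every patch of every page
    sims = torch.einsum("qd,npd->nqp", query_tokens, page_patches)
    # for each query token, take its best-matching patch, then sum over query tokens
    return sims.max(dim=-1).values.sum(dim=-1)   # shape: (num_pages,)

# toy example with made-up shapes
q = torch.nn.functional.normalize(torch.randn(12, 128), dim=-1)              # 12 query tokens
pages = torch.nn.functional.normalize(torch.randn(100, 1024, 128), dim=-1)   # 100 candidate pages
scores = maxsim_score(q, pages)
top_pages = scores.topk(10).indices
```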
A note on benchmarks: ViDoRe V1 is now saturated, with several models exceeding 90 nDCG@5. The newer ViDoRe V2 (May 2025) is harder, more diverse, and multilingual - and the gap between ColPali-style multi-vector models and well-tuned single-vector models (Cohere Embed 4, voyage-multimodal-3.x, Nomic single-vector) has narrowed. ColPali still wins on the hardest visual reasoning tasks, but a unified single-vector model is now competitive on most enterprise corpora at a fraction of the storage cost.
When ColPali pays off: figure-heavy corpora (research papers, engineering manuals, financial decks), high-stakes queries where missed retrieval is expensive (legal, medical, compliance), and cases where the parsing pipeline itself is the operational pain point you want to remove.
PDF Parsing When You Still Need Structured Text
Even with vision-first retrieval, structured text remains useful for BM25 hybrid search, audit trails, and query patterns that genuinely care about exact strings. The 2026 parser landscape:
- Unstructured.io - open-core, returns typed elements (Title, NarrativeText, Table, Image). Self-hostable.
- LlamaParse - LlamaIndex's managed service, strong on complex layouts, returns markdown.
- Reducto - closed-source SaaS API; fast on forms; not self-hostable, which matters for regulated workloads.
- Docling - IBM Research; open-source; strong on scientific papers and structured documents.
- Marker - PDF-to-markdown with the Surya OCR backbone; fast, open-source, a sensible default.
- MinerU - OpenDataLab, rising in 2025 benchmarks for table and figure extraction.
For tables specifically, Microsoft Table Transformer handles borderless and complex tables that rule-based extractors miss. For chart understanding, the choice is captioning ("This bar chart shows Q3 revenue of $4.2B, up 12% YoY") versus extracting underlying data into CSV/JSON. Captioning is fine for semantic search; data extraction is the right call when users ask for exact numbers.
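At ingest time the two options reduce to two prompt shapes. The sketch below assumes a hypothetical call_vlm helper standing in for whatever vision-capable model you use; the prompts are starting points, not tested templates.

```python
def call_vlm(prompt: str, image_bytes: bytes) -> str:
    """Hypothetical hook - wire this to whichever vision-capable model endpoint you use."""
    raise NotImplementedError

CAPTION_PROMPT = (
    "Describe this chart in two or three sentences for a search index. "
    "Include the metric, the time period, the headline numbers, and the trend."
)

EXTRACT_PROMPT = (
    "Extract the underlying data from this chart as JSON with keys "
    "'title', 'x_label', 'y_label', and 'series' (a list of {name, points}). "
    "If a value cannot be read precisely, use null instead of guessing."
)

def index_chart(image_bytes: bytes, need_exact_numbers: bool) -> dict:
    # captioning is enough for semantic search; extraction when users ask for exact numbers
    if need_exact_numbers:
        return {"modality": "chart_data", "content": call_vlm(EXTRACT_PROMPT, image_bytes)}
    return {"modality": "chart_caption", "content": call_vlm(CAPTION_PROMPT, image_bytes)}
```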
Reference Architecture on OpenSearch
A practical hybrid design on OpenSearch looks like this:
Schema - parallel fields per document/page:
```json
{
  "page_id": "...",
  "source_pdf": "...",
  "page_number": 34,
  "modality": "page",
  "text_chunk": "...",
  "text_embedding": [...],
  "image_embedding": [...],
  "page_image_uri": "s3://..."
}
```
The text and image vector fields are independent k-NN fields, indexed with HNSW. Filters on modality, source_pdf, and date metadata constrain retrieval cheaply.
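A minimal index definition along those lines, using the opensearch-py client, might look like the following sketch; the dimensions, engine choice, and field names are assumptions to adapt to your embedding models and cluster version.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])  # adjust auth/TLS for your cluster

index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "page_id": {"type": "keyword"},
            "source_pdf": {"type": "keyword"},
            "page_number": {"type": "integer"},
            "modality": {"type": "keyword"},
            "text_chunk": {"type": "text"},
            "page_image_uri": {"type": "keyword"},
            "text_embedding": {
                "type": "knn_vector",
                "dimension": 1024,  # match your text embedding model
                "method": {"name": "hnsw", "space_type": "cosinesimil", "engine": "lucene"},
            },
            "image_embedding": {
                "type": "knn_vector",
                "dimension": 1024,  # match your image/page embedding model
                "method": {"name": "hnsw", "space_type": "cosinesimil", "engine": "lucene"},
            },
        }
    },
}

client.indices.create(index="pages", body=index_body)
```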
Retrieval - three parallel paths:
- BM25 over `text_chunk` for keyword and exact-string queries.
- Dense k-NN over `text_embedding` for semantic text matching.
- Dense k-NN over `image_embedding` for visual queries.
Reciprocal Rank Fusion (RRF) merges the three ranked lists with a tunable k parameter. Per-query weighting helps when the query is obviously visual ("which chart shows...") or obviously textual ("define X"), but the parameter-free RRF default is a strong baseline.
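Whether fusion runs in an OpenSearch search pipeline or client-side, the RRF math is the same few lines. A minimal client-side sketch, assuming document ids as the join key and the conventional k=60 default:

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60, weights: list[float] | None = None) -> list[str]:
    """Reciprocal Rank Fusion: score(doc) = sum_i w_i / (k + rank_i(doc)).
    `ranked_lists` holds doc ids ordered best-first, one list per retrieval path."""
    weights = weights or [1.0] * len(ranked_lists)
    scores: dict[str, float] = {}
    for w, ranking in zip(weights, ranked_lists):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# three parallel paths: BM25, dense text k-NN, dense image k-NN (ids are illustrative)
fused = rrf_fuse(
    [["p34", "p12", "p7"], ["p34", "p88", "p12"], ["p99", "p34", "p7"]],
    k=60,
    weights=[1.0, 1.0, 1.0],  # bump the image weight for obviously visual queries
)
```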
Late-interaction caveat - if you want native ColPali-style MaxSim scoring, OpenSearch is not the easiest fit. Native late-interaction lives in Vespa today. On OpenSearch, the practical pattern is two-phase: coarse retrieval with a mean-pooled single vector, then a rerank stage that runs full MaxSim on the top 50-100 candidates. Custom scoring scripts work but are operationally heavy.
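A sketch of that two-phase pattern follows, where coarse_index_search and load_patches are hypothetical hooks into your k-NN index and patch-embedding store.

```python
import torch

def mean_pool(patch_embs: torch.Tensor) -> torch.Tensor:
    """Collapse (num_patches, dim) patch embeddings into one coarse vector for ANN retrieval."""
    pooled = patch_embs.mean(dim=0)
    return pooled / pooled.norm()

def maxsim(query_tokens: torch.Tensor, patch_embs: torch.Tensor) -> float:
    """Full late-interaction score for a single page."""
    return torch.einsum("qd,pd->qp", query_tokens, patch_embs).max(dim=-1).values.sum().item()

def two_phase_search(query_tokens, coarse_index_search, load_patches, rerank_depth: int = 100):
    """Phase 1: ANN over mean-pooled vectors (cheap). Phase 2: exact MaxSim on the survivors.
    `coarse_index_search(vector, top_k)` and `load_patches(page_id)` are hypothetical hooks
    into your k-NN index and patch-embedding store."""
    coarse_query = mean_pool(query_tokens)
    candidate_ids = coarse_index_search(coarse_query, top_k=rerank_depth)
    rescored = [(pid, maxsim(query_tokens, load_patches(pid))) for pid in candidate_ids]
    return sorted(rescored, key=lambda x: x[1], reverse=True)
```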
Reranking - pass the top 20 candidates through a cross-encoder reranker. Cohere Rerank 3.5 supports multimodal pairs; an open-source alternative is to use a vision-capable LLM as judge, prompting it with the query and each candidate page image and asking for a 0-10 relevance score with a one-line justification.
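A minimal VLM-as-judge sketch, assuming a hypothetical ask_vlm hook for your provider and a parseable "score: N" reply format:

```python
import re

JUDGE_PROMPT = (
    "Query: {query}\n"
    "Rate how well this page answers the query on a 0-10 scale. "
    "Reply as 'score: <number> - <one-line justification>'."
)

def vlm_rerank(query: str, candidates: list[dict], ask_vlm, top_n: int = 20) -> list[dict]:
    """Rerank retrieval candidates with a vision-capable LLM as judge.
    `ask_vlm(prompt, image_uri)` is a hypothetical hook for your VLM provider;
    each candidate dict is assumed to carry a 'page_image_uri' key."""
    scored = []
    for cand in candidates[:top_n]:
        reply = ask_vlm(JUDGE_PROMPT.format(query=query), cand["page_image_uri"])
        match = re.search(r"score:\s*(\d+(?:\.\d+)?)", reply, re.IGNORECASE)
        scored.append((float(match.group(1)) if match else 0.0, cand))
    return [cand for _, cand in sorted(scored, key=lambda x: x[0], reverse=True)]
```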
Chunking strategy - element-level chunks for text plus page-level embeddings for image content. Element-level catches paragraph and table-row precision; page-level catches the visual context. Deduplicate at the merge step so the same answer is not surfaced twice through different paths.
Generation with Vision-Capable LLMs
Once retrieval returns the right pages, the generation prompt sends the rendered page images directly to a vision LLM. Claude, GPT-4o, and Gemini all accept images inline. Token cost on images is non-trivial - a single 1024x1024 page image costs roughly 1500-1700 tokens depending on the provider - so passing 5 page images can add 8-12K tokens of input on top of the system prompt and text context.
A reliable prompt pattern: "Here are the relevant pages. Answer the question using only information visible on these pages. For numerical answers, cite the specific figure or table. If the answer is not on these pages, say so." The "cite the specific figure or table" instruction reduces hallucination materially because it forces the model to point to a concrete source.
Grounded citations should ideally point to a bounding box, not just a page number. Storing element bounding boxes at indexing time and surfacing them through the UI lets users verify a citation in one click. This pays back faster than any model-side hallucination mitigation.
The known failure mode here is chart hallucination. VLMs misread axis scales, confuse legend colors, and invent precise numbers when the chart resolution is poor or the labels are tight. Three mitigations work in practice:
- Ask the model to output a brief reasoning chain ("I see the bar labeled Q3 reaches approximately the 4.2 line on the y-axis...").
- Cross-check VLM-extracted numbers against structured data extracted at parse time, when available.
- Treat hedged answers as a soft signal for human review.
A small chart-QA eval set (50-200 questions with ground-truth numbers) is the only way to know how badly your specific corpus is affected. Build it before scaling.
Cost Per Query Across the Three Architectures
Rough order-of-magnitude per-query costs on a typical 10M-page enterprise corpus:
- Caption-and-index - cheapest. Text embedding plus BM25 plus text-only generation typically lands in the low single-digit cents.
- Unified embeddings (Cohere/Voyage) - mid-tier. Embedding API call plus retrieval plus VLM generation with a few page images typically costs a small multiple of caption-and-index.
- ColPali plus VLM generation - most expensive. Self-hosted GPU for retrieval plus multi-image generation puts per-query cost meaningfully higher.
Storage is the other axis. Single-vector indices for 10M pages fit comfortably in tens of GB; ColPali-class multi-vector indices land in single-digit TB. Both are tractable; both have very different infrastructure shapes.
The break-even depends on what wrong answers cost. For consumer search, the cheaper architecture usually wins. For legal due diligence, medical decision support, or financial research, the recall lift from late interaction is worth the storage and inference premium.
Use Cases Where Multimodal RAG Pays Off
- Financial analysis over 10-Ks and earnings decks - revenue charts, segment waterfalls, footnoted tables. ColPali plus a strong VLM is the natural fit; the per-query economics work because each query is high-value.
- Engineering documentation and technical manuals - wiring diagrams, P&IDs, CAD screenshots. Hybrid architectures win here because the text procedures and the diagrams both need to be retrievable.
- Medical imaging and pathology reports - radiology scans embedded in reports, annotated findings. Self-hosted is usually mandatory for PHI; Nomic Embed Multimodal plus a local VLM (PaliGemma 2, Gemma 3 vision) keeps everything in-VPC.
- E-commerce product catalogs - "find me a dress similar to this photo but in blue." Single-vector unified embeddings (SigLIP 2 or Cohere Embed 4) win on cost and latency at catalog scale.
Production Patterns and Anti-Patterns
Build an eval set before you ship anything. ViDoRe V2 and MMLongBench-Doc are the standard benchmarks for retrieval and end-to-end QA respectively, but a 200-500 question custom set on your actual corpus matters more. Mix figure-dependent, table-dependent, and text-only questions; track Recall@k, exact-match, F1, and citation quality separately.
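A per-category Recall@k computation is only a few lines. The sketch below assumes each eval item carries its relevant page ids and a category label, with retrieve as a hypothetical hook into your retrieval stack.

```python
def recall_at_k(eval_set: list[dict], retrieve, k: int = 10) -> dict[str, float]:
    """Recall@k per question category (figure / table / text-only).
    Each eval item is assumed to look like:
      {"question": ..., "relevant_page_ids": {...}, "category": "figure"}
    `retrieve(question, k)` is a hypothetical hook returning ranked page ids."""
    hits: dict[str, list[float]] = {}
    for item in eval_set:
        retrieved = set(retrieve(item["question"], k))
        found = len(retrieved & item["relevant_page_ids"]) / len(item["relevant_page_ids"])
        hits.setdefault(item["category"], []).append(found)
    return {cat: sum(v) / len(v) for cat, v in hits.items()}
```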
Latency budgets are tighter than they look. Render the page image once at ingest, not at query time. Pre-compute thumbnails. The query path should be: query embedding (20-50 ms hosted, 10 ms local), retrieval (30-80 ms), reranking on top 20 (200-500 ms), VLM generation streaming first token (1-2 s). End-to-end first-token latency under 2 seconds is achievable, but only if that ingest-time pre-computation is in place.
The most common failure modes are not model failures but operational ones. Caption drift - VLM-generated captions hallucinate details that become misinformation in the index. Modality leakage - image embeddings retrieve visually similar but semantically irrelevant pages, like a different page from the same template. Dominant-modality bias - text always outscores image in fusion unless you tune RRF weights per modality. And stale embeddings - a model upgrade requires full re-indexing, so plan for embedding versioning from day one.
Key Takeaways
- Text-only RAG has an upper bound on figure-heavy corpora that no chunking strategy can fix.
- Three architectures dominate in 2026: caption-and-index (simplest), unified vision embeddings (Cohere Embed 4, voyage-multimodal-3.5), and page-as-image with late interaction (ColPali, ColQwen2.5, ColNomic).
- Single-vector models are now competitive with ColPali on most enterprise corpora at a small fraction of the storage cost. Pick multi-vector late interaction when recall is the binding constraint and queries are visually hard.
- OpenSearch supports parallel text and image vector fields with RRF fusion natively; native ColBERT-style late interaction is a Vespa strength and a custom-scoring problem on OpenSearch.
- VLMs hallucinate chart numbers; build a chart-QA eval set, surface bounding-box citations, and prompt for explicit reasoning when numerical accuracy matters.
- Cost per query and storage cost both vary by an order of magnitude across the three architectures. Match the architecture to what wrong answers cost in your domain.
If you are evaluating multimodal RAG for production - especially over financial, technical, or regulated documents - our team has shipped these architectures across both managed and self-hosted stacks. We can help you pick the right design for your corpus and avoid the operational pitfalls that show up at scale.