Multimodal Retrieval for LLMs
Scope note: this article does not try to reconstruct the internal retrieval architecture of GPT, Claude, or Gemini. Public materials expose a different layer: the inference runtime and tools around the model - web search, file search, grounding, connectors, MCP, visual/PDF understanding, citations, and agentic search loops.
Imagine a RAG system for financial reports. A user asks:
Why did EBITDA margin decline in Q4, and where is that supported?
The corpus is typical for enterprise QA: an annual report in PDF, an investor deck, a spreadsheet appendix, and an earnings call transcript. The report contains a paragraph about EBITDA, an FX adjustment table on the next page, and a waterfall chart in the deck. The text retriever finds the EBITDA paragraph. OCR fails to extract the table with correct headers. The dense retriever promotes a passage about operating expenses: semantically close, but causally insufficient. The model answers confidently: margin declined because costs increased. There is a citation, but it points to nearby topical text rather than the actual evidence.
The mistake happened before generation: the system selected the wrong evidence.
A strong model will not fix a bad context pack. It will simply assemble a cleaner answer from incomplete or incorrect sources. In production LLM systems, the failure often does not happen at the decoding stage. It happens at the retrieval stage: what was found, what was filtered out, what was reordered, what was dropped because of budget, and what ultimately entered the context window.
The core tension
For documents, PDFs, screenshots, and tables, three common simplifications break down quickly.
Dense-only retrieval is not enough. Dense similarity is good at semantic paraphrases, but it is weaker on exact IDs, error codes, numbers, ticker-like identifiers, dates, row/column labels, and source authority.
OCR-only retrieval is not enough. OCR-only pipelines do not fail only when OCR itself makes a mistake. Even perfect OCR can destroy spatial relations: which arrow points to which object, which footnote applies to which row, which color in a chart corresponds to which series, where the column header ends and the value begins.
Agentic retrieval is expensive. Agentic search is useful when the search strategy is not known upfront. If the source and task are clear - for example, finding evidence in a report, contract, or policy - agentic behavior often adds cost and nondeterminism instead of quality. Anthropic shows that multi-agent research can deliver a large uplift on breadth-first tasks, but it also reports a substantial increase in token usage in agentic modes.
The right production frame looks like this:
high-recall retrieval
-> utility-aware reranking
-> source-aware context packing
-> citation-bound generation
-> eval / telemetry / security checks
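The frame above can be sketched as a single driver that chains pluggable stages. This is an illustrative skeleton, not a vendor API: the stage callables (`retrieve`, `rerank`, `pack`, `generate`, `verify`) and the `Evidence` shape are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    evidence_id: str
    text: str
    score: float = 0.0

def run_pipeline(query, retrieve, rerank, pack, generate, verify):
    # Each stage is a pluggable callable; swapping one does not change the frame.
    candidates = retrieve(query)          # high-recall retrieval
    ranked = rerank(query, candidates)    # utility-aware reranking
    context = pack(ranked)                # source-aware context packing
    answer = generate(query, context)     # citation-bound generation
    verify(answer, context)               # eval / telemetry / security checks
    return answer
```

Plugging in trivial stage functions is enough to exercise the contract; production stages only replace the callables, not the shape of the loop.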
Reference architecture: retrieval here is not a single vector search call. It is a governed evidence selection layer with routing, specialized indexes, reranking, context packing, citations, and an eval/security loop.
Running example: EBITDA margin and the FX table
A weak pipeline:
query
-> dense vector search over OCR chunks
-> top-5 text chunks
-> prompt concat
-> answer with citation
What breaks:
- the query “why did margin decline” requires causal evidence, not just similarity;
- the FX impact table is stored as broken OCR text;
- the waterfall chart was never found because there is no visual page index;
- the citation points to a page with general topical text, not the table or chart;
- the model gives a plausible but incomplete answer.
A better pipeline:
query
-> query understanding: financial report, numeric explanation, source citation required
-> BM25: EBITDA margin, Q4, FX, adjustment
-> dense retrieval: related management discussion chunks
-> visual page retrieval: pages with charts/tables
-> table extraction: FX impact rows, units, footnotes
-> fusion + reranking
-> context packing: text + normalized table + page crop
-> answer with page/table citations
On the same question, a weak OCR-only pipeline can look grounded while losing the causal evidence. A multimodal pipeline separately retrieves text chunks, visual pages, the table, footnotes, and page/table citations.
In this pipeline, the LLM is no longer answering from “similar text.” It is answering from an evidence pack: the primary report page, the table range, the visual chart crop, and the transcript quote. That is the practical meaning of multimodal retrieval around LLMs.
What this article means by retrieval
To avoid mixing architectural layers:
- search is the search infrastructure: BM25, ANN, dense vectors, metadata filters, web/enterprise search, SQL, graph traversal;
- retrieval is the selection of evidence for a specific query;
- RAG is generation with selected external context;
- grounding is verifiable attachment of claims to sources;
- memory is persistent state about a user or task, not runtime retrieval;
- context packing is packaging retrieved evidence into a limited context window;
- citations are not decorative; they are a traceability contract.
The focus of this article is documents, PDFs, screenshots, and tables. Video, audio, and product image search are treated as extensions: the same ideas apply, but the retrieval units and cost model are different.
How to choose a retrieval strategy
In production, it is more useful to choose the retrieval strategy first and draw the architecture second. A routing mistake is usually more expensive than a bad top_k value.
if query has exact ID / error code / API name / invoice number:
use BM25/exact index + metadata filters
add dense retrieval only as secondary signal
if query targets PDF with tables/charts:
use text/OCR index + visual page index + table extraction
rerank with source authority and numeric overlap
if query includes screenshot or UI state:
first parse visible UI state
then search docs/tickets/logs by extracted labels/errors
if query is open-ended research:
use agentic retrieval with explicit budgets
require citations and source diversity
if corpus is small and single-use:
consider long context
still keep page/source references
if freshness or ACL matters:
retrieval is mandatory
apply permissions before candidate generation
Key point: retrieval strategy is selected at the routing layer. It is not a top_k setting in a vector database.
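The decision list above can be made concrete as a small router. This is a sketch under stated assumptions: the strategy names and the ID-detection regex are illustrative, not a production ruleset.

```python
import re

# Hypothetical pattern for "exact ID / error code / invoice number" queries.
EXACT_ID = re.compile(r"\b([A-Z]{2,}-\d+|ERR_\w+|INV\d{4,})\b")

def route(query: str, has_screenshot: bool = False,
          targets_pdf: bool = False, open_ended: bool = False) -> str:
    # Order matters: exact-match signals win before semantic strategies.
    if EXACT_ID.search(query):
        return "exact+bm25"            # dense only as secondary signal
    if targets_pdf:
        return "text+visual+table"     # OCR + visual page index + tables
    if has_screenshot:
        return "ui-parse-then-search"  # parse UI state, then search by labels
    if open_ended:
        return "agentic-with-budget"   # explicit budgets, citations required
    return "hybrid-default"
```

Freshness and ACL constraints sit outside this function on purpose: they are mandatory filters applied to every strategy, not a route of their own.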
Minimal production baseline
| Level | What it includes | When it is enough |
|---|---|---|
| MVP | BM25 + dense text embeddings + metadata/ACL filters + citations | Plain text corpus, known sources, low visual complexity |
| MVP+ for PDF | MVP + page rendering + visual page embeddings + table crops | Reports, decks, scanned PDFs, screenshots, tables |
| Advanced | Multi-vector late interaction + cross-encoder/VLM reranking + context packing ablations | High-value enterprise QA, finance, legal, industrial documents |
| Research-grade | Human-labeled evidence, retrieval ablations, security harness, agentic router benchmark | Before changing production retrieval architecture at scale |
For enterprise document RAG, I would not start with an agentic framework. I would start with this baseline:
ingestion:
native text + OCR
page images
table/chart/figure crops
metadata + ACL + versioning
indexes:
BM25 over normalized text
dense text vectors
visual page vectors
optional region/table vectors
retrieval:
ACL before retrieval
parallel text + dense + visual search
RRF or score fusion
cross-encoder rerank top 50-200
VLM rerank only for top visual candidates
generation:
context pack with source/page/bbox IDs
claim-level citations
unsupported-claim checks
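The "RRF or score fusion" step in the baseline can be sketched in a few lines. Reciprocal rank fusion needs only the rank of each candidate in each result list; `k=60` is the commonly used damping constant.

```python
from collections import defaultdict

def rrf_fuse(result_lists, k: int = 60):
    # Each list is ordered best-first; a candidate's fused score is the
    # sum of 1/(k + rank) over every list it appears in.
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF uses ranks rather than raw scores, BM25, dense, and visual lists can be fused without score calibration, which is exactly why it works as the cheap deterministic stage before reranking.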
Multimodal indexing for PDF, screenshots, and tables
In text-only RAG, the retrieval unit is usually a chunk. In multimodal retrieval, the retrieval unit is no longer obvious: it can be a PDF page, a table crop, a screenshot region, a video keyframe, a table row, a chart legend, or a transcript segment.
Minimal index design:
Index A: text/OCR chunks
- BM25/exact matching
- dense text embeddings
- doc_id, page_id, section, heading_path
- ACL, timestamp, version
Index B: visual pages / regions
- page screenshots
- table/chart/figure crops
- visual embeddings
- bbox, page_id, source_id
Index C: structured tables
- normalized cells
- row/column headers
- units, periods, footnotes
- source page + bbox
For the EBITDA case, a single index will almost certainly lose. The text/OCR index stores management discussion, transcript passages, and exact terms: EBITDA, Q4, FX, margin. The visual page/region index stores the report page image, the waterfall chart slide, and the FX adjustment table crop. The structured table index stores normalized cells, row/column headers, units, footnotes, and page/bbox citation.
Why not one index? Different failure modes require different signals. BM25 catches exact names and numbers. Dense retrieval catches semantic match. Visual retrieval catches layout, diagrams, forms, labels, and neighboring elements. The structured table index matters when the answer depends on row/column semantics.
A practical evidence unit:
evidence_unit:
id: ev_fx_margin_impact_q4
source_id: annual_report_2026
modality: table_cell_range
page: 117
bbox: [82, 190, 1420, 890]
text: "FX impact reduced EBITDA margin by 1.2 percentage points"
table:
table_id: tbl_2026_q4_fx_impact
row_range: ["Europe", "APAC"]
column_range: ["FX impact", "EBITDA margin impact"]
units: ["USD millions", "percentage points"]
footnotes: ["Excludes one-time restructuring costs"]
retrieval_scores:
bm25: 12.4
dense: 0.73
visual: 0.81
rerank: 0.92
acl_scope: ["finance-team"]
version: "2026-04-18"
The research direction supports this. ColPali builds retrieval over page images with multi-vector late interaction. VisRAG shows that direct document-image retrieval can deliver 25-39% end-to-end gain over text-based RAG. ViDoRe V3 targets enterprise document retrieval directly: 26,000 pages, 3,099 queries, 6 languages, human-verified page relevance, and bounding boxes.
Tables as a separate evidence type
Tables should not be treated as “text with line breaks.” In financial, legal, medical, and operational documents, the table is often the primary evidence. A Markdown table is convenient for the LLM, but it does not replace a structured representation: a cell without row/column headers, units, and footnotes is close to useless.
Store tables in several forms:
- as a visual crop for layout and footnote verification;
- as normalized cells for numeric lookup;
- as Markdown/CSV for LLM input;
- as structured records for filters and aggregations;
- as source-linked evidence with page, bbox, and table_id.
For the EBITDA case, the table must answer more than “where does FX occur.” It must preserve the relationship between segment, period, unit, margin impact, and footnote. Otherwise the model sees a number but not what row, period, or qualification it belongs to.
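One of the storage forms above is Markdown for LLM input. A minimal rendering sketch shows the key requirement: units and footnotes travel with the cells, so a number never reaches the model detached from its meaning. The field names here are assumptions matching the schema used in this article.

```python
def table_to_markdown(column_headers, rows, units=None, footnotes=None):
    # Attach units to headers so each cell stays interpretable in isolation.
    header = column_headers
    if units:
        header = [f"{h} ({u})" if u else h
                  for h, u in zip(column_headers, units)]
    lines = ["| " + " | ".join(header) + " |",
             "|" + "---|" * len(header)]
    for row in rows:
        lines.append("| " + " | ".join(str(c) for c in row) + " |")
    # Footnotes are appended as text, not dropped: they can change the
    # meaning of every value in the table.
    for note in footnotes or []:
        lines.append(f"Note: {note}")
    return "\n".join(lines)
```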
Practical schema:
table_evidence:
table_id: tbl_2026_q4_fx_impact
source_id: annual_report_2026
doc_id: annual_report_2026_pdf
page_id: p_117
bbox: [82, 190, 1420, 890]
title: "FX impact by segment"
period: "Q4 2026"
normalized_cells_ref: s3://tables/tbl_2026_q4_fx_impact.parquet
row_headers: ["North America", "Europe", "APAC"]
column_headers: ["Q4 revenue", "FX impact", "EBITDA margin impact"]
cell_range: "B2:C4"
units: ["USD millions", "percentage points"]
footnotes: ["Excludes one-time restructuring costs"]
visual_crop_ref: s3://crops/tbl_2026_q4_fx_impact.png
extraction_confidence: 0.93
extraction_model: layout-vlm-2026-04
When to pass a table to the model:
- small table: Markdown;
- wide table: selected rows/columns + units + footnotes;
- numeric QA: structured rows + exact cell references;
- low extraction confidence: include visual crop;
- high-stakes answer: cite row/cell/range, not only page.
For charts, the same rule applies. A chart is not just an image. You usually need title, axes, legend, series, units, period, visual crop, extracted values, and citation coordinates.
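The "when to pass a table to the model" rules above can be encoded as a small planner. The thresholds (10x8 for "small", 0.85 for extraction confidence) are illustrative assumptions, not validated constants.

```python
def choose_table_representation(n_rows: int, n_cols: int,
                                extraction_confidence: float,
                                numeric_qa: bool, high_stakes: bool):
    # Returns an ordered plan of representations to include in the pack.
    plan = []
    if n_rows <= 10 and n_cols <= 8:
        plan.append("markdown")
    else:
        plan.append("selected_rows_columns_with_units_and_footnotes")
    if numeric_qa:
        plan.append("structured_rows_with_cell_refs")
    if extraction_confidence < 0.85:
        plan.append("visual_crop")       # let the model verify against layout
    if high_stakes:
        plan.append("cite_row_cell_range")  # cite cells, not only the page
    return plan
```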
Reranking cascade
Reranking is not a single stage in serious document retrieval. It is a cascade that spends expensive models only after cheaper stages have preserved recall.
| Stage | Candidate count | Tool/model | What it evaluates | Cost |
|---|---|---|---|---|
| BM25 / exact | 500-2000 | Inverted index | IDs, exact terms, numbers, acronyms | low |
| Dense retrieval | 500-2000 | Text embeddings | Semantic similarity and paraphrases | low |
| Visual retrieval | 100-500 | Page/region embeddings | Layout, chart/table/page similarity | medium |
| Fusion / RRF | 100-500 | Deterministic | Combine sparse, dense, visual, metadata signals | low |
| Cross-encoder rerank | 50-200 -> 20-50 | Text reranker | Usefulness of text evidence for the query | medium |
| VLM rerank | visual 10-30 -> 3-8 | Vision-language model | Whether visual region supports the claim | high |
| LLM evidence judge | final 5-15 | LLM | Sufficiency, conflict, citation support | high |
The important distinction is between topical relevance and answer usefulness. A passage can be about EBITDA and still not explain the Q4 margin decline. For the EBITDA case, the reranker should prefer evidence that jointly covers Q4, margin delta, FX impact, primary source status, and table/footnote support.
Reranking is better designed as a cascade: cheap high-recall retrieval first, expensive VLM/LLM judgment only on a narrow candidate set, and a final context pack that carries source IDs, page/bbox/offset, and citation support.
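The cascade in the table above reduces to one control loop: each stage rescores the survivors of the previous stage and keeps fewer of them. A minimal sketch, assuming stage scorers are hypothetical callables:

```python
def run_cascade(query, candidates, stages):
    """stages: list of (score_fn, keep_n) pairs, cheapest first.

    Expensive scorers (cross-encoder, VLM, LLM judge) only ever see what
    the cheap stages let through, which is the whole point of a cascade.
    """
    for score_fn, keep_n in stages:
        scored = sorted(candidates, key=lambda c: score_fn(query, c),
                        reverse=True)
        candidates = scored[:keep_n]
    return candidates
```

A typical configuration mirrors the table: fusion output of a few hundred candidates, cross-encoder down to a few dozen, VLM/LLM judgment down to single digits.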
Context packing algorithm
Context packing decides what the model sees, in what order, with which IDs, and under what budget. It is a decision layer after retrieval and reranking, not “take top-k and concatenate.”
def build_context_pack(query, ranked, budget):
ranked = filter_acl(ranked, query.user)
ranked = remove_deleted_or_stale_versions(ranked)
ranked = dedup_by_source_version_and_region(ranked)
ranked = dedup_by_semantic_hash(ranked)
primary = select_primary_sources(
ranked,
require_direct_answer=True,
limit=4,
)
numeric = select_numeric_evidence(
ranked,
require_units=True,
require_row_and_column_headers=True,
limit=3,
)
visual = select_visual_evidence(
ranked,
max_pages=2,
max_crops=4,
prefer_regions=True,
resolution="medium",
)
conflicts = detect_conflicts(ranked)
pack = []
pack += compress_text(primary, target_tokens=budget.text)
pack += render_tables(numeric, max_rows=30, include_units=True)
pack += attach_visual_crops(visual, max_media_tokens=budget.media)
pack += summarize_conflicts(conflicts, max_tokens=budget.conflicts)
return order_for_generation(pack)
For the EBITDA case, the context pack should not include the whole annual report. It should include one management discussion paragraph, the normalized FX table, the waterfall chart crop, and a transcript quote. If the packer drops the table crop because of budget, the final answer is already reasoning over an incomplete world.
Ordering matters. Models often overweight earlier context. The pack should begin with primary evidence, not a random dense top-1 result.
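The `order_for_generation` step referenced in the packer can be as simple as a categorical sort. A sketch, where the category ranks and field names are assumptions:

```python
# Primary evidence first, because models tend to overweight earlier context.
ORDER = {"primary": 0, "numeric": 1, "visual": 2, "conflict": 3}

def order_for_generation(pack):
    # pack: list of dicts carrying "category" and "rerank_score".
    # Unknown categories sink to the end; ties break by rerank score.
    return sorted(pack, key=lambda ev: (ORDER.get(ev["category"], 9),
                                        -ev["rerank_score"]))
```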
Minimum evidence unit contract:
type EvidenceUnit = {
evidence_id: string
source_id: string
source_uri?: string
doc_id?: string
page?: number
section?: string
bbox?: [number, number, number, number]
timestamp?: string
modality: "text" | "pdf_page" | "image_crop" | "table" | "video_frame" | "audio"
acl_scope: string[]
retrieved_score: number
rerank_score?: number
content_ref: string
checksum: string
}
Citation contract should be strict:
class CitationViolation(Exception):
    """Raised when an answer's citations break the traceability contract."""

def validate_citations(answer, packed_evidence):
    allowed = {ev.evidence_id for ev in packed_evidence}
    for claim in answer.claims:
        if not claim.evidence_ids:
            raise CitationViolation("claim has no evidence")
        if any(eid not in allowed for eid in claim.evidence_ids):
            raise CitationViolation("claim cites evidence outside packed context")
    return True
An answer with correct prose and the wrong citation is a production bug.
Cost, storage, and latency
Visual retrieval is not free. If an architecture says “keep a visual page index,” it should also admit the cost.
Track at least:
- page rendering DPI and image store size;
- batch embedding cost for page images and crops;
- single-vector vs multi-vector storage growth;
- late-interaction latency;
- re-indexing after OCR/layout model upgrades;
- cache invalidation for connector/web sources;
- TTL for fresh data;
- versioning for PDF updates;
- media token cost when passing page/crop evidence to the model.
Google’s Gemini 3 media resolution docs expose the token-cost geometry clearly: PDF default/medium is around 560 tokens/page, image high is 1120 tokens, video high is 280 tokens/frame, and video low/medium is 70 tokens/frame (Google AI docs). That means visual context should be passed coarse-to-fine:
page-level retrieval
-> region crop
-> high resolution only for selected evidence
-> generation with page/crop citations
Do not send every retrieved page to the model “just in case.” That turns a retrieval problem into a token budget problem.
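The coarse-to-fine rule can be enforced with a trivial budget check built from the per-unit costs quoted above. Treat the constants as placeholders mirroring the cited figures; they may change and should be read from the vendor docs, not hardcoded.

```python
# Approximate media token costs as quoted in this article (assumptions).
MEDIA_TOKENS = {
    ("pdf_page", "medium"): 560,
    ("image", "high"): 1120,
    ("video_frame", "high"): 280,
    ("video_frame", "low"): 70,
}

def media_token_cost(items):
    """items: list of (kind, resolution, count) tuples."""
    return sum(MEDIA_TOKENS[(kind, res)] * n for kind, res, n in items)

def fits_budget(items, budget_tokens):
    # Gate before attaching media: if the pack overflows, drop resolution
    # or crops rather than text evidence.
    return media_token_cost(items) <= budget_tokens
```

Two medium pages plus one high-resolution crop already cost more than two thousand tokens, which is why "send every retrieved page" quietly becomes the dominant cost term.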
Security constraints
Retrieval around LLMs is a security boundary. The system ingests untrusted text from web pages, PDFs, OCR layers, captions, MCP tools, connector results, and memory stores.
Separate three groups of risk.
Retrieval security
- ACL before retrieval, not after;
- tenant isolation;
- source trust levels;
- stale/deleted document handling;
- audit trail for retrieved, reranked, and rejected evidence;
- no private source in logs/traces without redaction.
Prompt injection
- untrusted retrieved text;
- hidden PDF text;
- malicious OCR layer;
- webpage instructions;
- citation spoofing;
- tool result injection.
Agent security
- tool-call allowlist;
- argument validation;
- public-web/private-connector separation;
- no write tools without explicit approval;
- rate limits and cost limits;
- exfiltration constraints.
Security test pack:
HTML/PDF injection:
hidden instructions in footer/OCR layer
expected: no unsafe tool call, no exfiltration
Cross-tenant retrieval:
forbidden doc exists in index
expected: no hit before rerank
Citation mismatch:
answer cites source not in packed context
expected: hard failure
Memory poisoning:
untrusted web content tries to persist instruction
expected: read-only or approval gate blocks it
Tool schema smuggling:
payload hidden in tool args
expected: schema and allowlist reject it
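Two of the checks above (citation mismatch, cross-tenant retrieval) are cheap enough to run as plain assertions in CI. A sketch with illustrative data shapes:

```python
def check_citation_integrity(claim_evidence_ids, packed_ids):
    # Citation mismatch case: every cited evidence_id must come from
    # the packed context; anything else is a hard failure.
    return all(eid in packed_ids for eid in claim_evidence_ids)

def check_tenant_isolation(results, user_tenant):
    # Cross-tenant case: no foreign-tenant hit may survive candidate
    # generation, i.e. before rerank ever sees it.
    return all(r["tenant"] == user_tenant for r in results)
```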
What OpenAI, Anthropic, and Google publicly expose
Public docs do not prove internal retriever topology. They show the surface around the model.
| Vendor | Public surface | How to interpret it |
|---|---|---|
| OpenAI | Responses API, web search, file search, deep research, code interpreter, computer use, remote MCP | Unified agentic runtime with tools and retrieval surfaces around the model |
| Anthropic | Contextual Retrieval, advanced tool use, web search, tool search, citations, multi-agent research | Strong public framing for context engineering, tool ergonomics, and budgeted research |
| Google / Gemini | Grounding with Google Search / Vertex AI Search, RAG Engine, Gemini Embedding 2, media resolution | The best-documented grounding stack and multimodal embedding/search surface |
What can be said:
- all three expose public search/grounding/citation/tool surfaces;
- retrieval sits on the runtime/tooling layer;
- citations are part of the system contract, but not proof that hallucination is impossible;
- visual/PDF understanding is becoming a production primitive.
What should not be claimed:
- that GPT, Claude, or Gemini internally use a specific retriever;
- that each vendor has one hidden unified multimodal index;
- that web/file search is built into the model weights;
- that public API behavior fully describes the internal inference stack.
When not to use multimodal or agentic retrieval
Strong architecture says not only when to use a technique, but also when not to.
Visual retrieval is usually not needed if:
- the corpus is plain text;
- there are no tables, diagrams, forms, or layout semantics;
- queries are mostly exact lookup;
- PDFs have a high-quality native text layer and reliable table extraction;
- latency/cost matters more than marginal visual recall;
- visual evidence is not used in citations or eval.
Agentic retrieval is usually not needed if:
- the task is deterministic;
- sources are known upfront;
- the answer must be low-latency;
- there is a good hybrid index;
- the question does not require exploration;
- tool calls do not add new evidence.
Long context may be enough if:
- the corpus is under roughly 200k tokens;
- there is one document and it must be read as a whole;
- there are no complex ACL/freshness constraints;
- cost and latency are acceptable;
- the task needs synthesis and cross-section reasoning.
But long context does not replace retrieval for large, dynamic, and permissioned corpora. Retrieval is still needed for selection, freshness, permissions, citations, observability, and cost control.
Practical conclusion
Multimodal retrieval is not about “supporting images.” It is necessary because, in real documents, evidence often lives in a table, diagram, layout, screenshot, or spatial relation - exactly where text-only retrieval is blind.
A good LLM system does not just call a strong model. It builds an evidence pipeline:
find the right evidence
-> rank it for answer utility
-> pack it with source IDs
-> generate with citations
-> verify claims
-> log retrieval path
If retrieval fails, the prompt will not save you. If context packing drops the critical table crop, model reasoning is already operating on the wrong world. If citations are not checked as a contract, a “grounded answer” remains a UI illusion.
Appendix A: engineering anchors
These numbers are not universal constants, but they are useful engineering anchors: retrieval quality improves through a cascade, tool surfaces should be loaded on demand, visual media has token cost, and agentic search needs a budget controller.
| Source | Number | What it supports |
|---|---|---|
| Anthropic Contextual Retrieval | top-20 retrieval failure rate 5.7% -> 3.7% with contextual embeddings; 2.9% with contextual BM25; 1.9% with reranking | Contextual chunks, BM25, and reranking move retrieval quality. |
| Anthropic Advanced Tool Use | 58 tools from 5 MCP servers around 55K tokens; Tool Search around 8.7K tokens and 85% reduction | Tool search matters when tool definitions consume the context budget. |
| Anthropic Multi-Agent Research | multi-agent systems around 15x tokens relative to chat; single agent around 4x | Agentic retrieval needs explicit budget control. |
| Gemini media resolution | Gemini 3 PDF medium 560 tokens/page; image high 1120; video high 280/frame | Visual context has real token geometry. |
| VisRAG | 25-39% end-to-end gain over text-based RAG | Parsing/OCR loss is measurable. |
| ViDoRe V3 | 26,000 pages, 3,099 queries, 6 languages | Enterprise document retrieval should be evaluated on page-level evidence, not only chunks. |
| MMDocIR | VLM-text roughly 1.5x OCR-text for tables and 3.8x for figures | OCR loses multimodal signal before retrieval. |
| From BM25 to Corrective RAG | hybrid + neural rerank: Recall@5 = 0.816, MRR@3 = 0.605 | Mixed text/table QA needs hybrid retrieval and reranking. |
Appendix B: vendor surfaces
| Surface | OpenAI | Anthropic | Google / Gemini |
|---|---|---|---|
| Web retrieval | Web search, deep research | Web search with filtering | Google Search grounding |
| Private retrieval | File search over vector stores | Files API, search result blocks, citations | Vertex AI Search, File Search API preview, RAG Engine |
| Tool retrieval | Tool search, MCP | Tool Search Tool, deferred loading | Function calling and platform tool surfaces |
| Citations | URL and file citations | Search result blocks, document/page/character citations | Grounding metadata and source links |
| Multimodal docs | Text/images, file/image analysis, cookbook patterns | PDFs/images/files plus citations | Native PDF/document understanding, media resolution |
| Agentic research | Deep research, background mode, MCP search/fetch | Multi-agent research system | Deep Research Agent / grounding surfaces |
The coverage is asymmetric. Google publicly exposes more of the multimodal embedding/grounding layer. Anthropic publicly explains context engineering and tool ergonomics more clearly. OpenAI packages a unified agentic runtime through Responses/deep research surfaces.
Appendix C: experiments before production rollout
| Priority | Experiment | Minimal protocol |
|---|---|---|
| P0 | Page-image vs OCR-text vs VLM-text vs hybrid | ViDoRe V3 + MMDocIR + internal PDFs; Recall@K, nDCG@10, evidence-page recall, latency, storage |
| P0 | Hybrid + reranker ablation | BM25, dense, hybrid RRF, hybrid + cross-encoder; MRR@3, Recall@5, nDCG@10, cost |
| P0 | Context packing ablation | naive concat vs dedup/diversity/source-order/page-crops; correctness, faithfulness, citation precision |
| P0 | Agentic vs non-agentic routing | simple facts, targeted QA, open-ended research; tool calls, tokens, wall-clock, timeout |
| P0 | Security harness | prompt injection, ACL leakage, stale source handling, citation mismatch, tool result injection |
| P1 | Long context vs retrieval routing | full-document, retrieved, hybrid; cost, latency, faithfulness |
| P1 | Visual reranker vs textual reranker | same candidate pool, compare text reranker and VLM reranker |
Without these ablations, the article remains a strong engineering synthesis. With them, it becomes an architecture note you can use to change a production pipeline.
Appendix D: failure modes
| Failure mode | Symptom | Mitigation |
|---|---|---|
| OCR lost critical signal | Model does not see table/legend/caption | visual page index, region crops, VLM rerank |
| Visual retriever found a similar page with the wrong meaning | Similar chart, different entity/date | metadata filters, text+visual fusion |
| Dense retriever missed exact ID | Error code/order ID not found | BM25/exact index |
| Reranker chose secondary source | Blog/SEO page ranks above official doc | source authority features |
| Context packing dropped the key evidence | Correct doc retrieved, wrong answer generated | packing eval, evidence ablation |
| Citation mismatch | Claim cites source outside context | citation contract hard failure |
| Stale index | Answer uses old version | versioning, freshness routing |
| ACL leakage | Evidence from another tenant | pre-retrieval ACL filtering |
| Prompt injection from source | Model follows instructions from PDF/web page | source sandboxing, tool-call validation |
| Agentic loop burns budget | Many search/fetch calls, no new facts | budget controller, early stopping |
Appendix E: sources
OpenAI
- Migrate to the Responses API
- Web search
- File search
- Deep research
- Image Understanding with RAG
- Tool search
Anthropic
- How we built our multi-agent research system
- Contextual Retrieval
- Introducing advanced tool use
- Web search tool
- Tool search tool
- Writing effective tools for AI agents
- Effective context engineering for AI agents
- Files API
- Citations
Google / Gemini
- Grounding overview
- Grounding with Vertex AI Search
- Gemini Embedding 2
- Gemini 3 Developer Guide
- Gemini media resolution
- Document understanding
- Use Vertex AI RAG Engine in Gemini Live API
Research
- ColPali
- VisRAG
- ViDoRe V3
- MMDocIR
- From BM25 to Corrective RAG
FAQ
What is multimodal retrieval for LLMs?
It is the evidence selection layer before generation. It searches and packages not only text, but also PDF pages, images, tables, charts, screenshots, video/audio segments, and structured records. Its job is to give the model a minimal, relevant, and verifiable context pack.
How is multimodal retrieval different from regular RAG?
Regular RAG is often reduced to text chunks and vector search. Multimodal retrieval works across multiple modalities, multiple indexes, visual/page embeddings, crops, table structures, keyframes, OCR, metadata, ACLs, reranking, and context packing.
Why is retrieval not the same as LLM memory?
Memory is persistent state. Retrieval is runtime selection of external evidence. A model can have memory and use retrieval at the same time, but they are different mechanisms.
Why is vector search top-k not enough?
Dense similarity does not guarantee usefulness for the answer. It is weak on exact IDs, numbers, table semantics, source authority, freshness, ACLs, and visual layout. A production pipeline needs hybrid search, reranking, context packing, and eval.
Does long context replace retrieval?
No. Long context is useful for synthesis and small corpora, but retrieval is still needed for selection, freshness, ACLs, cost, latency, citations, and observability.
Where is visual document retrieval needed?
In PDFs, scans, slide decks, financial reports, scientific papers, contracts with tables, UI screenshots, diagrams, charts, forms, and product images - anywhere the meaning lives in layout or visual structure rather than only in extracted text.
What metrics are needed to evaluate multimodal retrieval?
At minimum: Recall@K, MRR, nDCG@10, evidence page recall, visual region recall, answer correctness, faithfulness, groundedness, citation precision/recall, unsupported claim rate, latency, token usage, cost, and security evals.
When do you need agentic retrieval?
Agentic retrieval is useful for open-ended research, multi-hop investigation, and breadth-first tasks where the search strategy is not known upfront. For simple fact lookup and targeted document QA, hybrid search, reranking, and citation-aware context packing are usually enough.