Multimodal Retrieval for LLMs
Scope note: this article does not try to reconstruct the internal retrieval architecture of GPT, Claude, or Gemini. Public materials expose a different layer: the inference runtime and tools around the model - web search, file search, grounding, connectors, MCP, visual/PDF understanding, citations, and agentic search loops.
Imagine a RAG system for financial reports. A user asks:
Why did EBITDA margin decline in Q4, and where is that supported?
The corpus is typical for enterprise QA: an annual report in PDF, an investor deck, a spreadsheet appendix, and an earnings call transcript. The report contains a paragraph about EBITDA, an FX adjustment table on the next page, and a waterfall chart in the deck. The text retriever finds the EBITDA paragraph. OCR fails to extract the table with correct headers. The dense retriever promotes a passage about operating expenses: semantically close, but causally insufficient. The model answers confidently: margin declined because costs increased. There is a citation, but it points to nearby topical text rather than the actual evidence.
The mistake happened before generation: the system selected the wrong evidence.
A strong model will not fix a bad context pack. It will simply assemble a cleaner answer from incomplete or incorrect sources. In production LLM systems, the failure often does not happen at the decoding stage. It happens at the retrieval stage: what was found, what was filtered out, what was reordered, what was dropped because of budget, and what ultimately entered the context window.
The core tension
For documents, PDFs, screenshots, and tables, three common simplifications break down quickly.
Dense-only retrieval is not enough. Dense similarity is good at semantic paraphrases, but it is weaker on exact IDs, error codes, numbers, ticker-like identifiers, dates, row/column labels, and source authority.
OCR-only retrieval is not enough. OCR-only pipelines do not fail only when OCR itself makes a mistake. Even perfect OCR can destroy spatial relations: which arrow points to which object, which footnote applies to which row, which color in a chart corresponds to which series, where the column header ends and the value begins.
Agentic retrieval is expensive. Agentic search is useful when the search strategy is not known upfront. If the source and task are clear - for example, finding evidence in a report, contract, or policy - agentic behavior often adds cost and nondeterminism instead of quality. Anthropic shows that multi-agent research can deliver a large uplift on breadth-first tasks, but it also reports a substantial increase in token usage in agentic modes.
The right production frame looks like this:
high-recall retrieval
-> utility-aware reranking
-> source-aware context packing
-> citation-bound generation
-> eval / telemetry / security checks
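The frame above can be sketched as a single driver that chains pluggable stages. This is an illustrative skeleton, not a vendor API: the stage callables (`retrieve`, `rerank`, `pack`, `generate`, `verify`) and the `Evidence` shape are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    evidence_id: str
    text: str
    score: float = 0.0

def run_pipeline(query, retrieve, rerank, pack, generate, verify):
    # Each stage is a pluggable callable; swapping one does not change the frame.
    candidates = retrieve(query)          # high-recall retrieval
    ranked = rerank(query, candidates)    # utility-aware reranking
    context = pack(ranked)                # source-aware context packing
    answer = generate(query, context)     # citation-bound generation
    verify(answer, context)               # eval / telemetry / security checks
    return answer
```

Plugging in trivial stage functions is enough to exercise the contract; production stages only replace the callables, not the shape of the loop.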
Reference architecture: retrieval here is not a single vector search call. It is a governed evidence selection layer with routing, specialized indexes, reranking, context packing, citations, and an eval/security loop.
Running example: EBITDA margin and the FX table
A weak pipeline:
query
-> dense vector search over OCR chunks
-> top-5 text chunks
-> prompt concat
-> answer with citation
What breaks:
- the query “why did margin decline” requires causal evidence, not just similarity;
- the FX impact table is stored as broken OCR text;
- the waterfall chart was never found because there is no visual page index;
- the citation points to a page with general topical text, not the table or chart;
- the model gives a plausible but incomplete answer.
A better pipeline:
query
-> query understanding: financial report, numeric explanation, source citation required
-> BM25: EBITDA margin, Q4, FX, adjustment
-> dense retrieval: related management discussion chunks
-> visual page retrieval: pages with charts/tables
-> table extraction: FX impact rows, units, footnotes
-> fusion + reranking
-> context packing: text + normalized table + page crop
-> answer with page/table citations
On the same question, a weak OCR-only pipeline can look grounded while losing the causal evidence. A multimodal pipeline separately retrieves text chunks, visual pages, the table, footnotes, and page/table citations.
In this pipeline, the LLM is no longer answering from “similar text.” It is answering from an evidence pack: the primary report page, the table range, the visual chart crop, and the transcript quote. That is the practical meaning of multimodal retrieval around LLMs.
What this article means by retrieval
To avoid mixing architectural layers:
- search is the search infrastructure: BM25, ANN, dense vectors, metadata filters, web/enterprise search, SQL, graph traversal;
- retrieval is the selection of evidence for a specific query;
- RAG is generation with selected external context;
- grounding is verifiable attachment of claims to sources;
- memory is persistent state about a user or task, not runtime retrieval;
- context packing is packaging retrieved evidence into a limited context window;
- citations are not decorative; they are a traceability contract.
The focus of this article is documents, PDFs, screenshots, and tables. Video, audio, and product image search are treated as extensions: the same ideas apply, but the retrieval units and cost model are different.
How to choose a retrieval strategy
In production, it is more useful to choose the retrieval strategy first and draw the architecture second. A routing mistake is usually more expensive than a bad top_k value.
if query has exact ID / error code / API name / invoice number:
use BM25/exact index + metadata filters
add dense retrieval only as secondary signal
if query targets PDF with tables/charts:
use text/OCR index + visual page index + table extraction
rerank with source authority and numeric overlap
if query includes screenshot or UI state:
first parse visible UI state
then search docs/tickets/logs by extracted labels/errors
if query is open-ended research:
use agentic retrieval with explicit budgets
require citations and source diversity
if corpus is small and single-use:
consider long context
still keep page/source references
if freshness or ACL matters:
retrieval is mandatory
apply permissions before candidate generation
Key point: retrieval strategy is selected at the routing layer. It is not a top_k setting in a vector database.
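The decision list above can be made concrete as a small router. This is a sketch under stated assumptions: the strategy names and the ID-detection regex are illustrative, not a production ruleset.

```python
import re

# Hypothetical pattern for "exact ID / error code / invoice number" queries.
EXACT_ID = re.compile(r"\b([A-Z]{2,}-\d+|ERR_\w+|INV\d{4,})\b")

def route(query: str, has_screenshot: bool = False,
          targets_pdf: bool = False, open_ended: bool = False) -> str:
    # Order matters: exact-match signals win before semantic strategies.
    if EXACT_ID.search(query):
        return "exact+bm25"            # dense only as secondary signal
    if targets_pdf:
        return "text+visual+table"     # OCR + visual page index + tables
    if has_screenshot:
        return "ui-parse-then-search"  # parse UI state, then search by labels
    if open_ended:
        return "agentic-with-budget"   # explicit budgets, citations required
    return "hybrid-default"
```

Freshness and ACL constraints sit outside this function on purpose: they are mandatory filters applied to every strategy, not a route of their own.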
Minimal production baseline
| Level | What it includes | When it is enough |
|---|---|---|
| MVP | BM25 + dense text embeddings + metadata/ACL filters + citations | Plain text corpus, known sources, low visual complexity |
| MVP+ for PDF | MVP + page rendering + visual page embeddings + table crops | Reports, decks, scanned PDFs, screenshots, tables |
| Advanced | Multi-vector late interaction + cross-encoder/VLM reranking + context packing ablations | High-value enterprise QA, finance, legal, industrial documents |
| Research-grade | Human-labeled evidence, retrieval ablations, security harness, agentic router benchmark | Before changing production retrieval architecture at scale |
For enterprise document RAG, I would not start with an agentic framework. I would start with this baseline:
ingestion:
native text + OCR
page images
table/chart/figure crops
metadata + ACL + versioning
indexes:
BM25 over normalized text
dense text vectors
visual page vectors
optional region/table vectors
retrieval:
ACL before retrieval
parallel text + dense + visual search
RRF or score fusion
cross-encoder rerank top 50-200
VLM rerank only for top visual candidates
generation:
context pack with source/page/bbox IDs
claim-level citations
unsupported-claim checks
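The "RRF or score fusion" step in the baseline can be sketched in a few lines. Reciprocal rank fusion needs only the rank of each candidate in each result list; `k=60` is the commonly used damping constant.

```python
from collections import defaultdict

def rrf_fuse(result_lists, k: int = 60):
    # Each list is ordered best-first; a candidate's fused score is the
    # sum of 1/(k + rank) over every list it appears in.
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF uses ranks rather than raw scores, BM25, dense, and visual lists can be fused without score calibration, which is exactly why it works as the cheap deterministic stage before reranking.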
Multimodal indexing for PDF, screenshots, and tables
In text-only RAG, the retrieval unit is usually a chunk. In multimodal retrieval, the retrieval unit is no longer obvious: it can be a PDF page, a table crop, a screenshot region, a video keyframe, a table row, a chart legend, or a transcript segment.
Minimal index design:
Index A: text/OCR chunks
- BM25/exact matching
- dense text embeddings
- doc_id, page_id, section, heading_path
- ACL, timestamp, version
Index B: visual pages / regions
- page screenshots
- table/chart/figure crops
- visual embeddings
- bbox, page_id, source_id
Index C: structured tables
- normalized cells
- row/column headers
- units, periods, footnotes
- source page + bbox
For the EBITDA case, a single index will almost certainly lose. The text/OCR index stores management discussion, transcript passages, and exact terms: EBITDA, Q4, FX, margin. The visual page/region index stores the report page image, the waterfall chart slide, and the FX adjustment table crop. The structured table index stores normalized cells, row/column headers, units, footnotes, and page/bbox citation.
Why not one index? Different failure modes require different signals. BM25 catches exact names and numbers. Dense retrieval catches semantic match. Visual retrieval catches layout, diagrams, forms, labels, and neighboring elements. The structured table index matters when the answer depends on row/column semantics.
A practical evidence unit:
evidence_unit:
id: ev_fx_margin_impact_q4
source_id: annual_report_2026
modality: table_cell_range
page: 117
bbox: [82, 190, 1420, 890]
text: "FX impact reduced EBITDA margin by 1.2 percentage points"
table:
table_id: tbl_2026_q4_fx_impact
row_range: ["Europe", "APAC"]
column_range: ["FX impact", "EBITDA margin impact"]
units: ["USD millions", "percentage points"]
footnotes: ["Excludes one-time restructuring costs"]
retrieval_scores:
bm25: 12.4
dense: 0.73
visual: 0.81
rerank: 0.92
acl_scope: ["finance-team"]
version: "2026-04-18"
The research direction supports this. ColPali builds retrieval over page images with multi-vector late interaction. VisRAG shows that direct document-image retrieval can deliver 25-39% end-to-end gain over text-based RAG. ViDoRe V3 targets enterprise document retrieval directly: 26,000 pages, 3,099 queries, 6 languages, human-verified page relevance, and bounding boxes.
Tables as a separate evidence type
Tables should not be treated as “text with line breaks.” In financial, legal, medical, and operational documents, the table is often the primary evidence. A Markdown table is convenient for the LLM, but it does not replace a structured representation: a cell without row/column headers, units, and footnotes is close to useless.
Store tables in several forms:
- as a visual crop for layout and footnote verification;
- as normalized cells for numeric lookup;
- as Markdown/CSV for LLM input;
- as structured records for filters and aggregations;
- as source-linked evidence with page, bbox, and table_id.
For the EBITDA case, the table must answer more than “where does FX occur.” It must preserve the relationship between segment, period, unit, margin impact, and footnote. Otherwise the model sees a number but not what row, period, or qualification it belongs to.
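One of the storage forms above is Markdown for LLM input. A minimal rendering sketch shows the key requirement: units and footnotes travel with the cells, so a number never reaches the model detached from its meaning. The field names here are assumptions matching the schema used in this article.

```python
def table_to_markdown(column_headers, rows, units=None, footnotes=None):
    # Attach units to headers so each cell stays interpretable in isolation.
    header = column_headers
    if units:
        header = [f"{h} ({u})" if u else h
                  for h, u in zip(column_headers, units)]
    lines = ["| " + " | ".join(header) + " |",
             "|" + "---|" * len(header)]
    for row in rows:
        lines.append("| " + " | ".join(str(c) for c in row) + " |")
    # Footnotes are appended as text, not dropped: they can change the
    # meaning of every value in the table.
    for note in footnotes or []:
        lines.append(f"Note: {note}")
    return "\n".join(lines)
```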
Practical schema:
table_evidence:
table_id: tbl_2026_q4_fx_impact
source_id: annual_report_2026
doc_id: annual_report_2026_pdf
page_id: p_117
bbox: [82, 190, 1420, 890]
title: "FX impact by segment"
period: "Q4 2026"
normalized_cells_ref: s3://tables/tbl_2026_q4_fx_impact.parquet
row_headers: ["North America", "Europe", "APAC"]
column_headers: ["Q4 revenue", "FX impact", "EBITDA margin impact"]
cell_range: "B2:C4"
units: ["USD millions", "percentage points"]
footnotes: ["Excludes one-time restructuring costs"]
visual_crop_ref: s3://crops/tbl_2026_q4_fx_impact.png
extraction_confidence: 0.93
extraction_model: layout-vlm-2026-04
When to pass a table to the model:
- small table: Markdown;
- wide table: selected rows/columns + units + footnotes;
- numeric QA: structured rows + exact cell references;
- low extraction confidence: include visual crop;
- high-stakes answer: cite row/cell/range, not only page.
For charts, the same rule applies. A chart is not just an image. You usually need title, axes, legend, series, units, period, visual crop, extracted values, and citation coordinates.
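The "when to pass a table to the model" rules above can be encoded as a small planner. The thresholds (10x8 for "small", 0.85 for extraction confidence) are illustrative assumptions, not validated constants.

```python
def choose_table_representation(n_rows: int, n_cols: int,
                                extraction_confidence: float,
                                numeric_qa: bool, high_stakes: bool):
    # Returns an ordered plan of representations to include in the pack.
    plan = []
    if n_rows <= 10 and n_cols <= 8:
        plan.append("markdown")
    else:
        plan.append("selected_rows_columns_with_units_and_footnotes")
    if numeric_qa:
        plan.append("structured_rows_with_cell_refs")
    if extraction_confidence < 0.85:
        plan.append("visual_crop")       # let the model verify against layout
    if high_stakes:
        plan.append("cite_row_cell_range")  # cite cells, not only the page
    return plan
```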
Reranking cascade
Reranking is not a single stage in serious document retrieval. It is a cascade that spends expensive models only after cheaper stages have preserved recall.
| Stage | Candidate count | Tool/model | What it evaluates | Cost |
|---|---|---|---|---|
| BM25 / exact | 500-2000 | Inverted index | IDs, exact terms, numbers, acronyms | low |
| Dense retrieval | 500-2000 | Text embeddings | Semantic similarity and paraphrases | low |
| Visual retrieval | 100-500 | Page/region embeddings | Layout, chart/table/page similarity | medium |
| Fusion / RRF | 100-500 | Deterministic | Combine sparse, dense, visual, metadata signals | low |
| Cross-encoder rerank | 50-200 -> 20-50 | Text reranker | Usefulness of text evidence for the query | medium |
| VLM rerank | visual 10-30 -> 3-8 | Vision-language model | Whether visual region supports the claim | high |
| LLM evidence judge | final 5-15 | LLM | Sufficiency, conflict, citation support | high |
The important distinction is between topical relevance and answer usefulness. A passage can be about EBITDA and still not explain the Q4 margin decline. For the EBITDA case, the reranker should prefer evidence that jointly covers Q4, margin delta, FX impact, primary source status, and table/footnote support.
Reranking is better designed as a cascade: cheap high-recall retrieval first, expensive VLM/LLM judgment only on a narrow candidate set, and a final context pack that carries source IDs, page/bbox/offset, and citation support.
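The cascade in the table above reduces to one control loop: each stage rescores the survivors of the previous stage and keeps fewer of them. A minimal sketch, assuming stage scorers are hypothetical callables:

```python
def run_cascade(query, candidates, stages):
    """stages: list of (score_fn, keep_n) pairs, cheapest first.

    Expensive scorers (cross-encoder, VLM, LLM judge) only ever see what
    the cheap stages let through, which is the whole point of a cascade.
    """
    for score_fn, keep_n in stages:
        scored = sorted(candidates, key=lambda c: score_fn(query, c),
                        reverse=True)
        candidates = scored[:keep_n]
    return candidates
```

A typical configuration mirrors the table: fusion output of a few hundred candidates, cross-encoder down to a few dozen, VLM/LLM judgment down to single digits.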
Context packing algorithm
Context packing decides what the model sees, in what order, with which IDs, and under what budget. It is a decision layer after retrieval and reranking, not “take top-k and concatenate.”
def build_context_pack(query, ranked, budget):
ranked = filter_acl(ranked, query.user)
ranked = remove_deleted_or_stale_versions(ranked)
ranked = dedup_by_source_version_and_region(ranked)
ranked = dedup_by_semantic_hash(ranked)
primary = select_primary_sources(
ranked,
require_direct_answer=True,
limit=4,
)
numeric = select_numeric_evidence(
ranked,
require_units=True,
require_row_and_column_headers=True,
limit=3,
)
visual = select_visual_evidence(
ranked,
max_pages=2,
max_crops=4,
prefer_regions=True,
resolution="medium",
)
conflicts = detect_conflicts(ranked)
pack = []
pack += compress_text(primary, target_tokens=budget.text)
pack += render_tables(numeric, max_rows=30, include_units=True)
pack += attach_visual_crops(visual, max_media_tokens=budget.media)
pack += summarize_conflicts(conflicts, max_tokens=budget.conflicts)
return order_for_generation(pack)
For the EBITDA case, the context pack should not include the whole annual report. It should include one management discussion paragraph, the normalized FX table, the waterfall chart crop, and a transcript quote. If the packer drops the table crop because of budget, the final answer is already reasoning over an incomplete world.
Ordering matters. Models often overweight earlier context. The pack should begin with primary evidence, not a random dense top-1 result.
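The `order_for_generation` step referenced in the packer can be as simple as a categorical sort. A sketch, where the category ranks and field names are assumptions:

```python
# Primary evidence first, because models tend to overweight earlier context.
ORDER = {"primary": 0, "numeric": 1, "visual": 2, "conflict": 3}

def order_for_generation(pack):
    # pack: list of dicts carrying "category" and "rerank_score".
    # Unknown categories sink to the end; ties break by rerank score.
    return sorted(pack, key=lambda ev: (ORDER.get(ev["category"], 9),
                                        -ev["rerank_score"]))
```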
Minimum evidence unit contract:
type EvidenceUnit = {
evidence_id: string
source_id: string
source_uri?: string
doc_id?: string
page?: number
section?: string
bbox?: [number, number, number, number]
timestamp?: string
modality: "text" | "pdf_page" | "image_crop" | "table" | "video_frame" | "audio"
acl_scope: string[]
retrieved_score: number
rerank_score?: number
content_ref: string
checksum: string
}
Citation contract should be strict:
class CitationViolation(Exception):
    """Raised when an answer's citations break the traceability contract."""

def validate_citations(answer, packed_evidence):
    allowed = {ev.evidence_id for ev in packed_evidence}
    for claim in answer.claims:
        if not claim.evidence_ids:
            raise CitationViolation("claim has no evidence")
        if any(eid not in allowed for eid in claim.evidence_ids):
            raise CitationViolation("claim cites evidence outside packed context")
    return True
An answer with correct prose and the wrong citation is a production bug.
Cost, storage, and latency
Visual retrieval is not free. If an architecture says “keep a visual page index,” it should also admit the cost.
Track at least:
- page rendering DPI and image store size;
- batch embedding cost for page images and crops;
- single-vector vs multi-vector storage growth;
- late-interaction latency;
- re-indexing after OCR/layout model upgrades;
- cache invalidation for connector/web sources;
- TTL for fresh data;
- versioning for PDF updates;
- media token cost when passing page/crop evidence to the model.
Google’s Gemini 3 media resolution docs expose the token-cost geometry clearly: PDF default/medium is around 560 tokens/page, image high is 1120 tokens, video high is 280 tokens/frame, and video low/medium is 70 tokens/frame (Google AI docs). That means visual context should be passed coarse-to-fine:
page-level retrieval
-> region crop
-> high resolution only for selected evidence
-> generation with page/crop citations
Do not send every retrieved page to the model “just in case.” That turns a retrieval problem into a token budget problem.
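The coarse-to-fine rule can be enforced with a trivial budget check built from the per-unit costs quoted above. Treat the constants as placeholders mirroring the cited figures; they may change and should be read from the vendor docs, not hardcoded.

```python
# Approximate media token costs as quoted in this article (assumptions).
MEDIA_TOKENS = {
    ("pdf_page", "medium"): 560,
    ("image", "high"): 1120,
    ("video_frame", "high"): 280,
    ("video_frame", "low"): 70,
}

def media_token_cost(items):
    """items: list of (kind, resolution, count) tuples."""
    return sum(MEDIA_TOKENS[(kind, res)] * n for kind, res, n in items)

def fits_budget(items, budget_tokens):
    # Gate before attaching media: if the pack overflows, drop resolution
    # or crops rather than text evidence.
    return media_token_cost(items) <= budget_tokens
```

Two medium pages plus one high-resolution crop already cost more than two thousand tokens, which is why "send every retrieved page" quietly becomes the dominant cost term.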
Security constraints
Retrieval around LLMs is a security boundary. The system ingests untrusted text from web pages, PDFs, OCR layers, captions, MCP tools, connector results, and memory stores.
Separate three groups of risk.
Retrieval security
- ACL before retrieval, not after;
- tenant isolation;
- source trust levels;
- stale/deleted document handling;
- audit trail for retrieved, reranked, and rejected evidence;
- no private source in logs/traces without redaction.
Prompt injection
- untrusted retrieved text;
- hidden PDF text;
- malicious OCR layer;
- webpage instructions;
- citation spoofing;
- tool result injection.
Agent security
- tool-call allowlist;
- argument validation;
- public-web/private-connector separation;
- no write tools without explicit approval;
- rate limits and cost limits;
- exfiltration constraints.
Security test pack:
HTML/PDF injection:
hidden instructions in footer/OCR layer
expected: no unsafe tool call, no exfiltration
Cross-tenant retrieval:
forbidden doc exists in index
expected: no hit before rerank
Citation mismatch:
answer cites source not in packed context
expected: hard failure
Memory poisoning:
untrusted web content tries to persist instruction
expected: read-only or approval gate blocks it
Tool schema smuggling:
payload hidden in tool args
expected: schema and allowlist reject it
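Two of the checks above (citation mismatch, cross-tenant retrieval) are cheap enough to run as plain assertions in CI. A sketch with illustrative data shapes:

```python
def check_citation_integrity(claim_evidence_ids, packed_ids):
    # Citation mismatch case: every cited evidence_id must come from
    # the packed context; anything else is a hard failure.
    return all(eid in packed_ids for eid in claim_evidence_ids)

def check_tenant_isolation(results, user_tenant):
    # Cross-tenant case: no foreign-tenant hit may survive candidate
    # generation, i.e. before rerank ever sees it.
    return all(r["tenant"] == user_tenant for r in results)
```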
What OpenAI, Anthropic, and Google publicly expose
Public docs do not prove internal retriever topology. They show the surface around the model.
| Vendor | Public surface | How to interpret it |
|---|---|---|
| OpenAI | Responses API, web search, file search, deep research, code interpreter, computer use, remote MCP | Unified agentic runtime with tools and retrieval surfaces around the model |
| Anthropic | Contextual Retrieval, advanced tool use, web search, tool search, citations, multi-agent research | Strong public framing for context engineering, tool ergonomics, and budgeted research |
| Google / Gemini | Grounding with Google Search / Vertex AI Search, RAG Engine, Gemini Embedding 2, media resolution | The best-documented grounding stack and multimodal embedding/search surface |
What can be said:
- all three expose public search/grounding/citation/tool surfaces;
- retrieval sits on the runtime/tooling layer;
- citations are part of the system contract, but not proof that hallucination is impossible;
- visual/PDF understanding is becoming a production primitive.
What should not be claimed:
- that GPT, Claude, or Gemini internally use a specific retriever;
- that each vendor has one hidden unified multimodal index;
- that web/file search is built into the model weights;
- that public API behavior fully describes the internal inference stack.
When not to use multimodal or agentic retrieval
Strong architecture says not only when to use a technique, but also when not to.
Visual retrieval is usually not needed if:
- the corpus is plain text;
- there are no tables, diagrams, forms, or layout semantics;
- queries are mostly exact lookup;
- PDFs have a high-quality native text layer and reliable table extraction;
- latency/cost matters more than marginal visual recall;
- visual evidence is not used in citations or eval.
Agentic retrieval is usually not needed if:
- the task is deterministic;
- sources are known upfront;
- the answer must be low-latency;
- there is a good hybrid index;
- the question does not require exploration;
- tool calls do not add new evidence.
Long context may be enough if:
- the corpus is under roughly 200k tokens;
- there is one document and it must be read as a whole;
- there are no complex ACL/freshness constraints;
- cost and latency are acceptable;
- the task needs synthesis and cross-section reasoning.
But long context does not replace retrieval for large, dynamic, and permissioned corpora. Retrieval is still needed for selection, freshness, permissions, citations, observability, and cost control.
Practical conclusion
Multimodal retrieval is not about “supporting images.” It is necessary because, in real documents, evidence often lives in a table, diagram, layout, screenshot, or spatial relation - exactly where text-only retrieval is blind.
A good LLM system does not just call a strong model. It builds an evidence pipeline:
find the right evidence
-> rank it for answer utility
-> pack it with source IDs
-> generate with citations
-> verify claims
-> log retrieval path
If retrieval fails, the prompt will not save you. If context packing drops the critical table crop, model reasoning is already operating on the wrong world. If citations are not checked as a contract, a “grounded answer” remains a UI illusion.
Appendix A: engineering anchors
These numbers are not universal constants, but they are useful engineering anchors: retrieval quality improves through a cascade, tool surfaces should be loaded on demand, visual media has token cost, and agentic search needs a budget controller.
| Source | Number | What it supports |
|---|---|---|
| Anthropic Contextual Retrieval | top-20 retrieval failure rate 5.7% -> 3.7% with contextual embeddings; 2.9% with contextual BM25; 1.9% with reranking | Contextual chunks, BM25, and reranking move retrieval quality. |
| Anthropic Advanced Tool Use | 58 tools from 5 MCP servers around 55K tokens; Tool Search around 8.7K tokens and 85% reduction | Tool search matters when tool definitions consume the context budget. |
| Anthropic Multi-Agent Research | multi-agent systems around 15x tokens relative to chat; single agent around 4x | Agentic retrieval needs explicit budget control. |
| Gemini media resolution | Gemini 3 PDF medium 560 tokens/page; image high 1120; video high 280/frame | Visual context has real token geometry. |
| VisRAG | 25-39% end-to-end gain over text-based RAG | Parsing/OCR loss is measurable. |
| ViDoRe V3 | 26,000 pages, 3,099 queries, 6 languages | Enterprise document retrieval should be evaluated on page-level evidence, not only chunks. |
| MMDocIR | VLM-text roughly 1.5x OCR-text for tables and 3.8x for figures | OCR loses multimodal signal before retrieval. |
| From BM25 to Corrective RAG | hybrid + neural rerank: Recall@5 = 0.816, MRR@3 = 0.605 | Mixed text/table QA needs hybrid retrieval and reranking. |
Appendix B: vendor surfaces
| Surface | OpenAI | Anthropic | Google / Gemini |
|---|---|---|---|
| Web retrieval | Web search, deep research | Web search with filtering | Google Search grounding |
| Private retrieval | File search over vector stores | Files API, search result blocks, citations | Vertex AI Search, File Search API preview, RAG Engine |
| Tool retrieval | Tool search, MCP | Tool Search Tool, deferred loading | Function calling and platform tool surfaces |
| Citations | URL and file citations | Search result blocks, document/page/character citations | Grounding metadata and source links |
| Multimodal docs | Text/images, file/image analysis, cookbook patterns | PDFs/images/files plus citations | Native PDF/document understanding, media resolution |
| Agentic research | Deep research, background mode, MCP search/fetch | Multi-agent research system | Deep Research Agent / grounding surfaces |
The coverage is asymmetric. Google publicly exposes more of the multimodal embedding/grounding layer. Anthropic publicly explains context engineering and tool ergonomics more clearly. OpenAI packages a unified agentic runtime through Responses/deep research surfaces.
Appendix C: experiments before production rollout
| Priority | Experiment | Minimal protocol |
|---|---|---|
| P0 | Page-image vs OCR-text vs VLM-text vs hybrid | ViDoRe V3 + MMDocIR + internal PDFs; Recall@K, nDCG@10, evidence-page recall, latency, storage |
| P0 | Hybrid + reranker ablation | BM25, dense, hybrid RRF, hybrid + cross-encoder; MRR@3, Recall@5, nDCG@10, cost |
| P0 | Context packing ablation | naive concat vs dedup/diversity/source-order/page-crops; correctness, faithfulness, citation precision |
| P0 | Agentic vs non-agentic routing | simple facts, targeted QA, open-ended research; tool calls, tokens, wall-clock, timeout |
| P0 | Security harness | prompt injection, ACL leakage, stale source handling, citation mismatch, tool result injection |
| P1 | Long context vs retrieval routing | full-document, retrieved, hybrid; cost, latency, faithfulness |
| P1 | Visual reranker vs textual reranker | same candidate pool, compare text reranker and VLM reranker |
Without these ablations, the article remains a strong engineering synthesis. With them, it becomes an architecture note you can use to change a production pipeline.
Appendix D: failure modes
| Failure mode | Symptom | Mitigation |
|---|---|---|
| OCR lost critical signal | Model does not see table/legend/caption | visual page index, region crops, VLM rerank |
| Visual retriever found a similar page with the wrong meaning | Similar chart, different entity/date | metadata filters, text+visual fusion |
| Dense retriever missed exact ID | Error code/order ID not found | BM25/exact index |
| Reranker chose secondary source | Blog/SEO page ranks above official doc | source authority features |
| Context packing dropped the key evidence | Correct doc retrieved, wrong answer generated | packing eval, evidence ablation |
| Citation mismatch | Claim cites source outside context | citation contract hard failure |
| Stale index | Answer uses old version | versioning, freshness routing |
| ACL leakage | Evidence from another tenant | pre-retrieval ACL filtering |
| Prompt injection from source | Model follows instructions from PDF/web page | source sandboxing, tool-call validation |
| Agentic loop burns budget | Many search/fetch calls, no new facts | budget controller, early stopping |
Appendix E: sources
OpenAI
- Migrate to the Responses API
- Web search
- File search
- Deep research
- Image Understanding with RAG
- Tool search
Anthropic
- How we built our multi-agent research system
- Contextual Retrieval
- Introducing advanced tool use
- Web search tool
- Tool search tool
- Writing effective tools for AI agents
- Effective context engineering for AI agents
- Files API
- Citations
Google / Gemini
- Grounding overview
- Grounding with Vertex AI Search
- Gemini Embedding 2
- Gemini 3 Developer Guide
- Gemini media resolution
- Document understanding
- Use Vertex AI RAG Engine in Gemini Live API
Research
- ColPali
- VisRAG
- ViDoRe V3
- MMDocIR
- From BM25 to Corrective RAG
FAQ
What is multimodal retrieval for LLMs?
It is the evidence selection layer before generation. It searches and packages not only text, but also PDF pages, images, tables, charts, screenshots, video/audio segments, and structured records. Its job is to give the model a minimal, relevant, and verifiable context pack.
How is multimodal retrieval different from regular RAG?
Regular RAG is often reduced to text chunks and vector search. Multimodal retrieval works across multiple modalities, multiple indexes, visual/page embeddings, crops, table structures, keyframes, OCR, metadata, ACLs, reranking, and context packing.
Why is retrieval not the same as LLM memory?
Memory is persistent state. Retrieval is runtime selection of external evidence. A model can have memory and use retrieval at the same time, but they are different mechanisms.
Why is vector search top-k not enough?
Dense similarity does not guarantee usefulness for the answer. It is weak on exact IDs, numbers, table semantics, source authority, freshness, ACLs, and visual layout. A production pipeline needs hybrid search, reranking, context packing, and eval.
Does long context replace retrieval?
No. Long context is useful for synthesis and small corpora, but retrieval is still needed for selection, freshness, ACLs, cost, latency, citations, and observability.
Where is visual document retrieval needed?
In PDFs, scans, slide decks, financial reports, scientific papers, contracts with tables, UI screenshots, diagrams, charts, forms, and product images - anywhere the meaning lives in layout or visual structure rather than only in extracted text.
What metrics are needed to evaluate multimodal retrieval?
At minimum: Recall@K, MRR, nDCG@10, evidence page recall, visual region recall, answer correctness, faithfulness, groundedness, citation precision/recall, unsupported claim rate, latency, token usage, cost, and security evals.
When do you need agentic retrieval?
Agentic retrieval is useful for open-ended research, multi-hop investigation, and breadth-first tasks where the search strategy is not known upfront. For simple fact lookup and targeted document QA, hybrid search, reranking, and citation-aware context packing are usually enough.