Multimodal Retrieval for LLMs


Scope note: this article does not try to reconstruct the internal retrieval architecture of GPT, Claude, or Gemini. Public materials expose a different layer: the inference runtime and tools around the model - web search, file search, grounding, connectors, MCP, visual/PDF understanding, citations, and agentic search loops.

Imagine a RAG system for financial reports. A user asks:

Why did EBITDA margin decline in Q4, and where is that supported?

The corpus is typical for enterprise QA: an annual report in PDF, an investor deck, a spreadsheet appendix, and an earnings call transcript. The report contains a paragraph about EBITDA, an FX adjustment table on the next page, and a waterfall chart in the deck. The text retriever finds the EBITDA paragraph. OCR fails to extract the table with correct headers. The dense retriever promotes a passage about operating expenses: semantically close, but causally insufficient. The model answers confidently: margin declined because costs increased. There is a citation, but it points to nearby topical text rather than the actual evidence.

The mistake happened before generation: the system selected the wrong evidence.

A strong model will not fix a bad context pack. It will simply assemble a cleaner answer from incomplete or incorrect sources. In production LLM systems, the failure often does not happen at the decoding stage. It happens at the retrieval stage: what was found, what was filtered out, what was reordered, what was dropped because of budget, and what ultimately entered the context window.

The core tension

For documents, PDFs, screenshots, and tables, three common simplifications break down quickly.

Dense-only retrieval is not enough. Dense similarity is good at semantic paraphrases, but it is weaker on exact IDs, error codes, numbers, ticker-like identifiers, dates, row/column labels, and source authority.

OCR-only retrieval is not enough. OCR-only pipelines do not fail only when OCR itself makes a mistake. Even perfect OCR can destroy spatial relations: which arrow points to which object, which footnote applies to which row, which color in a chart corresponds to which series, where the column header ends and the value begins.

Agentic retrieval is expensive. Agentic search is useful when the search strategy is not known upfront. If the source and task are clear - for example, finding evidence in a report, contract, or policy - agentic behavior often adds cost and nondeterminism instead of quality. Anthropic shows that multi-agent research can deliver a large uplift on breadth-first tasks, but it also reports a substantial increase in token usage in agentic modes.

The right production frame looks like this:

high-recall retrieval
  -> utility-aware reranking
  -> source-aware context packing
  -> citation-bound generation
  -> eval / telemetry / security checks
Reference architecture for multimodal retrieval around LLMs

Reference architecture: retrieval here is not a single vector search call. It is a governed evidence selection layer with routing, specialized indexes, reranking, context packing, citations, and an eval/security loop.

Running example: EBITDA margin and the FX table

A weak pipeline:

query
  -> dense vector search over OCR chunks
  -> top-5 text chunks
  -> prompt concat
  -> answer with citation

What breaks:

  • the query “why did margin decline” requires causal evidence, not just similarity;
  • the FX impact table is stored as broken OCR text;
  • the waterfall chart was never found because there is no visual page index;
  • the citation points to a page with general topical text, not the table or chart;
  • the model gives a plausible but incomplete answer.

A better pipeline:

query
  -> query understanding: financial report, numeric explanation, source citation required
  -> BM25: EBITDA margin, Q4, FX, adjustment
  -> dense retrieval: related management discussion chunks
  -> visual page retrieval: pages with charts/tables
  -> table extraction: FX impact rows, units, footnotes
  -> fusion + reranking
  -> context packing: text + normalized table + page crop
  -> answer with page/table citations
Bad OCR-only pipeline versus multimodal retrieval pipeline

On the same question, a weak OCR-only pipeline can look grounded while losing the causal evidence. A multimodal pipeline separately retrieves text chunks, visual pages, the table, footnotes, and page/table citations.

In this pipeline, the LLM is no longer answering from “similar text.” It is answering from an evidence pack: the primary report page, the table range, the visual chart crop, and the transcript quote. That is the practical meaning of multimodal retrieval around LLMs.

What this article means by retrieval

To avoid mixing architectural layers:

  • search is the search infrastructure: BM25, ANN, dense vectors, metadata filters, web/enterprise search, SQL, graph traversal;
  • retrieval is the selection of evidence for a specific query;
  • RAG is generation with selected external context;
  • grounding is verifiable attachment of claims to sources;
  • memory is persistent state about a user or task, not runtime retrieval;
  • context packing is packaging retrieved evidence into a limited context window;
  • citations are not decorative; they are a traceability contract.

The focus of this article is documents, PDFs, screenshots, and tables. Video, audio, and product image search are treated as extensions: the same ideas apply, but the retrieval units and cost model are different.

How to choose a retrieval strategy

In production, it is more useful to choose the retrieval strategy first and draw the architecture second. A routing mistake is usually more expensive than a bad top_k value.

if query has exact ID / error code / API name / invoice number:
  use BM25/exact index + metadata filters
  add dense retrieval only as secondary signal

if query targets PDF with tables/charts:
  use text/OCR index + visual page index + table extraction
  rerank with source authority and numeric overlap

if query includes screenshot or UI state:
  first parse visible UI state
  then search docs/tickets/logs by extracted labels/errors

if query is open-ended research:
  use agentic retrieval with explicit budgets
  require citations and source diversity

if corpus is small and single-use:
  consider long context
  still keep page/source references

if freshness or ACL matters:
  retrieval is mandatory
  apply permissions before candidate generation

Key point: retrieval strategy is selected at the routing layer. It is not a top_k setting in a vector database.
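The routing rules above can be sketched as a deterministic pre-retrieval step. This is a minimal illustration, not a production router: the regexes, flags, and strategy names are assumptions for the example, and a real system would return a richer retrieval plan rather than a label.

```python
import re

# Hypothetical ID patterns; tune to your corpus (error codes, invoices, tickets).
EXACT_ID = re.compile(r"\b([A-Z]{2,}-\d+|ERR_\w+)\b")

def route_query(query: str, has_screenshot: bool = False,
                targets_pdf: bool = False, open_ended: bool = False) -> str:
    """Pick a retrieval strategy before touching any index."""
    if EXACT_ID.search(query):
        return "bm25_exact_plus_metadata"    # dense only as secondary signal
    if has_screenshot:
        return "parse_ui_then_search"        # extract labels/errors first
    if targets_pdf:
        return "text_plus_visual_plus_table" # OCR + page + table indexes
    if open_ended:
        return "agentic_with_budget"         # citations + source diversity
    return "hybrid_default"

print(route_query("Why did order INV-20831 fail?"))             # exact-ID branch
print(route_query("Summarize the FX table", targets_pdf=True))  # PDF branch
```

The point of keeping this layer trivial and deterministic is that it can be unit-tested and audited, unlike a prompt-driven router.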

Minimal production baseline

| Level | What it includes | When it is enough |
|---|---|---|
| MVP | BM25 + dense text embeddings + metadata/ACL filters + citations | Plain text corpus, known sources, low visual complexity |
| MVP+ for PDF | MVP + page rendering + visual page embeddings + table crops | Reports, decks, scanned PDFs, screenshots, tables |
| Advanced | Multi-vector late interaction + cross-encoder/VLM reranking + context packing ablations | High-value enterprise QA, finance, legal, industrial documents |
| Research-grade | Human-labeled evidence, retrieval ablations, security harness, agentic router benchmark | Before changing production retrieval architecture at scale |

For enterprise document RAG, I would not start with an agentic framework. I would start with this baseline:

ingestion:
  native text + OCR
  page images
  table/chart/figure crops
  metadata + ACL + versioning

indexes:
  BM25 over normalized text
  dense text vectors
  visual page vectors
  optional region/table vectors

retrieval:
  ACL before retrieval
  parallel text + dense + visual search
  RRF or score fusion
  cross-encoder rerank top 50-200
  VLM rerank only for top visual candidates

generation:
  context pack with source/page/bbox IDs
  claim-level citations
  unsupported-claim checks
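The RRF fusion step in the baseline is small enough to show in full. Reciprocal Rank Fusion combines ranked lists from BM25, dense, and visual search without any score calibration; `k = 60` is the conventional constant from the original RRF formulation. The document IDs below are illustrative.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank).
    Robust to incomparable score scales across sparse/dense/visual systems."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25   = ["fx_table_p117", "mgmt_discussion_p42", "opex_note_p88"]
dense  = ["mgmt_discussion_p42", "opex_note_p88", "fx_table_p117"]
visual = ["fx_table_p117", "waterfall_slide_9"]

print(rrf_fuse([bm25, dense, visual]))
# ['fx_table_p117', 'mgmt_discussion_p42', 'opex_note_p88', 'waterfall_slide_9']
```

Because RRF only uses ranks, a candidate that appears near the top of two cheap lists beats a candidate that is top-1 in only one, which is usually the behavior you want before the cross-encoder stage.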

Multimodal indexing for PDF, screenshots, and tables

In text-only RAG, the retrieval unit is usually a chunk. In multimodal retrieval, the retrieval unit is no longer obvious: it can be a PDF page, a table crop, a screenshot region, a video keyframe, a table row, a chart legend, or a transcript segment.

Minimal index design:

Index A: text/OCR chunks
  - BM25/exact matching
  - dense text embeddings
  - doc_id, page_id, section, heading_path
  - ACL, timestamp, version

Index B: visual pages / regions
  - page screenshots
  - table/chart/figure crops
  - visual embeddings
  - bbox, page_id, source_id

Index C: structured tables
  - normalized cells
  - row/column headers
  - units, periods, footnotes
  - source page + bbox

For the EBITDA case, a single index will almost certainly lose. The text/OCR index stores management discussion, transcript passages, and exact terms: EBITDA, Q4, FX, margin. The visual page/region index stores the report page image, the waterfall chart slide, and the FX adjustment table crop. The structured table index stores normalized cells, row/column headers, units, footnotes, and page/bbox citation.

Why not one index? Different failure modes require different signals. BM25 catches exact names and numbers. Dense retrieval catches semantic match. Visual retrieval catches layout, diagrams, forms, labels, and neighboring elements. The structured table index matters when the answer depends on row/column semantics.

A practical evidence unit:

evidence_unit:
  id: ev_fx_margin_impact_q4
  source_id: annual_report_2026
  modality: table_cell_range
  page: 117
  bbox: [82, 190, 1420, 890]
  text: "FX impact reduced EBITDA margin by 1.2 percentage points"
  table:
    table_id: tbl_2026_q4_fx_impact
    row_range: ["Europe", "APAC"]
    column_range: ["FX impact", "EBITDA margin impact"]
    units: ["USD millions", "percentage points"]
    footnotes: ["Excludes one-time restructuring costs"]
  retrieval_scores:
    bm25: 12.4
    dense: 0.73
    visual: 0.81
    rerank: 0.92
  acl_scope: ["finance-team"]
  version: "2026-04-18"

The research direction supports this. ColPali builds retrieval over page images with multi-vector late interaction. VisRAG shows that direct document-image retrieval can deliver 25-39% end-to-end gain over text-based RAG. ViDoRe V3 targets enterprise document retrieval directly: 26,000 pages, 3,099 queries, 6 languages, human-verified page relevance, and bounding boxes.

Tables as a separate evidence type

Tables should not be treated as “text with line breaks.” In financial, legal, medical, and operational documents, the table is often the primary evidence. A Markdown table is convenient for the LLM, but it does not replace a structured representation: a cell without row/column headers, units, and footnotes is close to useless.

Store tables in several forms:

  • as a visual crop for layout and footnote verification;
  • as normalized cells for numeric lookup;
  • as Markdown/CSV for LLM input;
  • as structured records for filters and aggregations;
  • as source-linked evidence with page, bbox, and table_id.

For the EBITDA case, the table must answer more than “where does FX occur.” It must preserve the relationship between segment, period, unit, margin impact, and footnote. Otherwise the model sees a number but not what row, period, or qualification it belongs to.

Practical schema:

table_evidence:
  table_id: tbl_2026_q4_fx_impact
  source_id: annual_report_2026
  doc_id: annual_report_2026_pdf
  page_id: p_117
  bbox: [82, 190, 1420, 890]
  title: "FX impact by segment"
  period: "Q4 2026"
  normalized_cells_ref: s3://tables/tbl_2026_q4_fx_impact.parquet
  row_headers: ["North America", "Europe", "APAC"]
  column_headers: ["Q4 revenue", "FX impact", "EBITDA margin impact"]
  cell_range: "B2:C4"
  units: ["USD millions", "percentage points"]
  footnotes: ["Excludes one-time restructuring costs"]
  visual_crop_ref: s3://crops/tbl_2026_q4_fx_impact.png
  extraction_confidence: 0.93
  extraction_model: layout-vlm-2026-04

When to pass a table to the model:

  • small table: Markdown;
  • wide table: selected rows/columns + units + footnotes;
  • numeric QA: structured rows + exact cell references;
  • low extraction confidence: include visual crop;
  • high-stakes answer: cite row/cell/range, not only page.

For charts, the same rule applies. A chart is not just an image. You usually need title, axes, legend, series, units, period, visual crop, extracted values, and citation coordinates.
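Under those rules, a small renderer can turn the structured table record into model input. This is a sketch against the `table_evidence` schema above: field names follow that example, the truncation policy is an assumption, and the cell values are placeholder numbers, not figures from any real report.

```python
def render_table_for_llm(title, period, column_headers, row_headers,
                         rows, units, footnotes, max_rows=30):
    """Render selected rows as Markdown, keeping units and footnotes
    attached so a cell never travels without its semantics."""
    lines = [f"### {title} ({period})"]
    lines.append("| " + " | ".join(["Segment"] + column_headers) + " |")
    lines.append("|" + "---|" * (len(column_headers) + 1))
    for header, row in list(zip(row_headers, rows))[:max_rows]:
        lines.append("| " + " | ".join([header] + [str(v) for v in row]) + " |")
    lines.append("Units: " + "; ".join(units))
    for note in footnotes:
        lines.append(f"Footnote: {note}")
    return "\n".join(lines)

md = render_table_for_llm(
    title="FX impact by segment", period="Q4 2026",
    column_headers=["FX impact", "EBITDA margin impact"],
    row_headers=["Europe", "APAC"],
    rows=[[-14.2, -0.8], [-6.1, -0.4]],   # placeholder values
    units=["USD millions", "percentage points"],
    footnotes=["Excludes one-time restructuring costs"],
)
print(md)
```

The key design choice is that units and footnotes are emitted unconditionally: dropping them to save a few tokens is exactly how a model ends up citing a number with the wrong qualification.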

Reranking cascade

Reranking is not one stage in serious document retrieval. It is a cascade that spends more expensive models only after cheaper retrieval has preserved recall.

| Stage | Candidate count | Tool/model | What it evaluates | Cost |
|---|---|---|---|---|
| BM25 / exact | 500-2000 | Inverted index | IDs, exact terms, numbers, acronyms | low |
| Dense retrieval | 500-2000 | Text embeddings | Semantic similarity and paraphrases | low |
| Visual retrieval | 100-500 | Page/region embeddings | Layout, chart/table/page similarity | medium |
| Fusion / RRF | 100-500 | Deterministic | Combine sparse, dense, visual, metadata signals | low |
| Cross-encoder rerank | 50-200 -> 20-50 | Text reranker | Usefulness of text evidence for the query | medium |
| VLM rerank | visual 10-30 -> 3-8 | Vision-language model | Whether visual region supports the claim | high |
| LLM evidence judge | final 5-15 | LLM | Sufficiency, conflict, citation support | high |

The important distinction is between topical relevance and answer usefulness. A passage can be about EBITDA and still not explain the Q4 margin decline. For the EBITDA case, the reranker should prefer evidence that jointly covers Q4, margin delta, FX impact, primary source status, and table/footnote support.

Reranking cascade and context packing contract

Reranking is better designed as a cascade: cheap high-recall retrieval first, expensive VLM/LLM judgment only on a narrow candidate set, and a final context pack that carries source IDs, page/bbox/offset, and citation support.

Context packing algorithm

Context packing decides what the model sees, in what order, with which IDs, and under what budget. It is a decision layer after retrieval and reranking, not “take top-k and concatenate.”

def build_context_pack(query, ranked, budget):
    ranked = filter_acl(ranked, query.user)
    ranked = remove_deleted_or_stale_versions(ranked)
    ranked = dedup_by_source_version_and_region(ranked)
    ranked = dedup_by_semantic_hash(ranked)

    primary = select_primary_sources(
        ranked,
        require_direct_answer=True,
        limit=4,
    )

    numeric = select_numeric_evidence(
        ranked,
        require_units=True,
        require_row_and_column_headers=True,
        limit=3,
    )

    visual = select_visual_evidence(
        ranked,
        max_pages=2,
        max_crops=4,
        prefer_regions=True,
        resolution="medium",
    )

    conflicts = detect_conflicts(ranked)

    pack = []
    pack += compress_text(primary, target_tokens=budget.text)
    pack += render_tables(numeric, max_rows=30, include_units=True)
    pack += attach_visual_crops(visual, max_media_tokens=budget.media)
    pack += summarize_conflicts(conflicts, max_tokens=budget.conflicts)

    return order_for_generation(pack)

For the EBITDA case, the context pack should not include the whole annual report. It should include one management discussion paragraph, the normalized FX table, the waterfall chart crop, and a transcript quote. If the packer drops the table crop because of budget, the final answer is already reasoning over an incomplete world.

Ordering matters. Models often overweight earlier context. The pack should begin with primary evidence, not a random dense top-1 result.
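One possible implementation of the `order_for_generation` step in the packer above: sort by evidence role first and rerank score second, so a primary source always precedes a high-scoring but secondary chunk. The role labels are assumptions for this sketch.

```python
# Hypothetical role labels assigned during context packing.
ROLE_PRIORITY = {"primary": 0, "numeric": 1, "visual": 2, "conflict_note": 3}

def order_for_generation(pack: list[dict]) -> list[dict]:
    """Primary evidence first; within a role, higher rerank score first."""
    return sorted(pack, key=lambda ev: (ROLE_PRIORITY.get(ev["role"], 99),
                                        -ev.get("rerank_score", 0.0)))

pack = [
    {"id": "waterfall_crop", "role": "visual",  "rerank_score": 0.95},
    {"id": "mgmt_para",      "role": "primary", "rerank_score": 0.71},
    {"id": "fx_table",       "role": "numeric", "rerank_score": 0.92},
]
print([ev["id"] for ev in order_for_generation(pack)])
# ['mgmt_para', 'fx_table', 'waterfall_crop']
```

Note that the dense top-1 (`waterfall_crop` here) deliberately loses its leading position: its score reflects similarity, not its role in the evidence pack.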

Minimum evidence unit contract:

type EvidenceUnit = {
  evidence_id: string
  source_id: string
  source_uri?: string
  doc_id?: string
  page?: number
  section?: string
  bbox?: [number, number, number, number]
  timestamp?: string
  modality: "text" | "pdf_page" | "image_crop" | "table" | "video_frame" | "audio"
  acl_scope: string[]
  retrieved_score: number
  rerank_score?: number
  content_ref: string
  checksum: string
}

Citation contract should be strict:

def validate_citations(answer, packed_evidence):
    allowed = {ev.evidence_id for ev in packed_evidence}

    for claim in answer.claims:
        if not claim.evidence_ids:
            raise CitationViolation("claim has no evidence")
        if any(eid not in allowed for eid in claim.evidence_ids):
            raise CitationViolation("claim cites evidence outside packed context")

    return True

An answer with correct prose and the wrong citation is a production bug.

Cost, storage, and latency

Visual retrieval is not free. If an architecture says “keep a visual page index,” it should also admit the cost.

Track at least:

  • page rendering DPI and image store size;
  • batch embedding cost for page images and crops;
  • single-vector vs multi-vector storage growth;
  • late-interaction latency;
  • re-indexing after OCR/layout model upgrades;
  • cache invalidation for connector/web sources;
  • TTL for fresh data;
  • versioning for PDF updates;
  • media token cost when passing page/crop evidence to the model.

Google’s Gemini 3 media resolution docs expose the token-cost geometry clearly: PDF default/medium is around 560 tokens/page, image high is 1120 tokens, video high is 280 tokens/frame, and video low/medium is 70 tokens/frame (Google AI docs). That means visual context should be passed coarse-to-fine:

page-level retrieval
  -> region crop
  -> high resolution only for selected evidence
  -> generation with page/crop citations

Do not send every retrieved page to the model “just in case.” That turns a retrieval problem into a token budget problem.
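With the per-unit numbers above, the coarse-to-fine discipline is easy to quantify. A rough calculator, using the token counts quoted from the Gemini media resolution docs; the page and crop counts are an illustrative scenario, not a benchmark:

```python
# Per-unit media token costs as quoted above (Gemini 3, Google AI docs).
PDF_PAGE_MEDIUM = 560   # tokens per PDF page at default/medium resolution
IMAGE_HIGH = 1120       # tokens per image at high resolution

def media_tokens(pages_medium: int = 0, crops_high: int = 0) -> int:
    """Total media token cost for a context pack."""
    return pages_medium * PDF_PAGE_MEDIUM + crops_high * IMAGE_HIGH

# "Just in case": ship all 20 retrieved pages at medium resolution.
naive = media_tokens(pages_medium=20)                 # 11200 tokens
# Coarse-to-fine: 2 context pages + 3 high-res evidence crops.
packed = media_tokens(pages_medium=2, crops_high=3)   # 4480 tokens
print(naive, packed)
```

The naive pack costs 2.5x the coarse-to-fine pack here, before any text tokens, and the gap widens with corpus size because the naive cost scales with retrieved pages rather than selected evidence.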

Security constraints

Retrieval around LLMs is a security boundary. The system ingests untrusted text from web pages, PDFs, OCR layers, captions, MCP tools, connector results, and memory stores.

Separate three groups of risk.

Retrieval security

  • ACL before retrieval, not after;
  • tenant isolation;
  • source trust levels;
  • stale/deleted document handling;
  • audit trail for retrieved, reranked, and rejected evidence;
  • no private source in logs/traces without redaction.
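"ACL before retrieval" means the permission scope constrains candidate generation, not a post-filter on results. A minimal in-memory sketch; the scope model is an assumption, and a production system would push this predicate into the index-level filter of the search engine rather than scanning entries in Python:

```python
def acl_prefilter(index_entries: list[dict], user_scopes: set[str]) -> list[dict]:
    """Drop entries the user cannot see *before* similarity search runs,
    so forbidden documents never become candidates, scores, or log lines."""
    return [entry for entry in index_entries
            if set(entry["acl_scope"]) & user_scopes]

index = [
    {"id": "fx_table",  "acl_scope": ["finance-team"]},
    {"id": "hr_review", "acl_scope": ["hr-team"]},
]
visible = acl_prefilter(index, user_scopes={"finance-team"})
print([entry["id"] for entry in visible])  # ['fx_table']
```

The security property worth testing is negative: the forbidden document must be absent from the candidate set, not merely demoted by the reranker.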

Prompt injection

  • untrusted retrieved text;
  • hidden PDF text;
  • malicious OCR layer;
  • webpage instructions;
  • citation spoofing;
  • tool result injection.

Agent security

  • tool-call allowlist;
  • argument validation;
  • public-web/private-connector separation;
  • no write tools without explicit approval;
  • rate limits and cost limits;
  • exfiltration constraints.

Security test pack:

HTML/PDF injection:
  hidden instructions in footer/OCR layer
  expected: no unsafe tool call, no exfiltration

Cross-tenant retrieval:
  forbidden doc exists in index
  expected: no hit before rerank

Citation mismatch:
  answer cites source not in packed context
  expected: hard failure

Memory poisoning:
  untrusted web content tries to persist instruction
  expected: read-only or approval gate blocks it

Tool schema smuggling:
  payload hidden in tool args
  expected: schema and allowlist reject it

What OpenAI, Anthropic, and Google publicly expose

Public docs do not prove internal retriever topology. They show the surface around the model.

| Vendor | Public surface | How to interpret it |
|---|---|---|
| OpenAI | Responses API, web search, file search, deep research, code interpreter, computer use, remote MCP | Unified agentic runtime with tools and retrieval surfaces around the model |
| Anthropic | Contextual Retrieval, advanced tool use, web search, tool search, citations, multi-agent research | Strong public framing for context engineering, tool ergonomics, and budgeted research |
| Google / Gemini | Grounding with Google Search / Vertex AI Search, RAG Engine, Gemini Embedding 2, media resolution | The best-documented grounding stack and multimodal embedding/search surface |

What can be said:

  • all three expose public search/grounding/citation/tool surfaces;
  • retrieval sits on the runtime/tooling layer;
  • citations are part of the system contract, but not proof that hallucination is impossible;
  • visual/PDF understanding is becoming a production primitive.

What should not be claimed:

  • that GPT, Claude, or Gemini internally use a specific retriever;
  • that each vendor has one hidden unified multimodal index;
  • that web/file search is built into the model weights;
  • that public API behavior fully describes the internal inference stack.

When not to use multimodal or agentic retrieval

Strong architecture says not only when to use a technique, but also when not to.

Visual retrieval is usually not needed if:

  • the corpus is plain text;
  • there are no tables, diagrams, forms, or layout semantics;
  • queries are mostly exact lookup;
  • PDFs have a high-quality native text layer and reliable table extraction;
  • latency/cost matters more than marginal visual recall;
  • visual evidence is not used in citations or eval.

Agentic retrieval is usually not needed if:

  • the task is deterministic;
  • sources are known upfront;
  • the answer must be low-latency;
  • there is a good hybrid index;
  • the question does not require exploration;
  • tool calls do not add new evidence.

Long context may be enough if:

  • the corpus is under roughly 200k tokens;
  • there is one document and it must be read as a whole;
  • there are no complex ACL/freshness constraints;
  • cost and latency are acceptable;
  • the task needs synthesis and cross-section reasoning.

But long context does not replace retrieval for large, dynamic, and permissioned corpora. Retrieval is still needed for selection, freshness, permissions, citations, observability, and cost control.

Practical conclusion

Multimodal retrieval is not about “supporting images.” It is necessary because, in real documents, evidence often lives in a table, diagram, layout, screenshot, or spatial relation - exactly where text-only retrieval is blind.

A good LLM system does not just call a strong model. It builds an evidence pipeline:

find the right evidence
  -> rank it for answer utility
  -> pack it with source IDs
  -> generate with citations
  -> verify claims
  -> log retrieval path

If retrieval fails, the prompt will not save you. If context packing drops the critical table crop, model reasoning is already operating on the wrong world. If citations are not checked as a contract, a “grounded answer” remains a UI illusion.

Appendix A: engineering anchors

Engineering numbers for multimodal retrieval

These numbers are not universal constants, but they are useful engineering anchors: retrieval quality improves through a cascade, tool surfaces should be loaded on demand, visual media has token cost, and agentic search needs a budget controller.

| Source | Number | What it supports |
|---|---|---|
| Anthropic Contextual Retrieval | top-20 retrieval failure rate 5.7% -> 3.7% with contextual embeddings; 2.9% with contextual BM25; 1.9% with reranking | Contextual chunks, BM25, and reranking move retrieval quality. |
| Anthropic Advanced Tool Use | 58 tools from 5 MCP servers around 55K tokens; Tool Search around 8.7K tokens and 85% reduction | Tool search matters when tool definitions consume the context budget. |
| Anthropic Multi-Agent Research | multi-agent systems around 15x tokens relative to chat; single agent around 4x | Agentic retrieval needs explicit budget control. |
| Gemini media resolution | Gemini 3 PDF medium 560 tokens/page; image high 1120; video high 280/frame | Visual context has real token geometry. |
| VisRAG | 25-39% end-to-end gain over text-based RAG | Parsing/OCR loss is measurable. |
| ViDoRe V3 | 26,000 pages, 3,099 queries, 6 languages | Enterprise document retrieval should be evaluated on page-level evidence, not only chunks. |
| MMDocIR | VLM-text roughly 1.5x OCR-text for tables and 3.8x for figures | OCR loses multimodal signal before retrieval. |
| From BM25 to Corrective RAG | hybrid + neural rerank: Recall@5 = 0.816, MRR@3 = 0.605 | Mixed text/table QA needs hybrid retrieval and reranking. |

Appendix B: vendor surfaces

| Surface | OpenAI | Anthropic | Google / Gemini |
|---|---|---|---|
| Web retrieval | Web search, deep research | Web search with filtering | Google Search grounding |
| Private retrieval | File search over vector stores | Files API, search result blocks, citations | Vertex AI Search, File Search API preview, RAG Engine |
| Tool retrieval | Tool search, MCP | Tool Search Tool, deferred loading | Function calling and platform tool surfaces |
| Citations | URL and file citations | Search result blocks, document/page/character citations | Grounding metadata and source links |
| Multimodal docs | Text/images, file/image analysis, cookbook patterns | PDFs/images/files plus citations | Native PDF/document understanding, media resolution |
| Agentic research | Deep research, background mode, MCP search/fetch | Multi-agent research system | Deep Research Agent / grounding surfaces |

The public surfaces are asymmetric. Google publicly exposes more of the multimodal embedding/grounding layer. Anthropic publicly explains context engineering and tool ergonomics more clearly. OpenAI packages a unified agentic runtime through Responses/deep research surfaces.

Appendix C: experiments before production rollout

| Priority | Experiment | Minimal protocol |
|---|---|---|
| P0 | Page-image vs OCR-text vs VLM-text vs hybrid | ViDoRe V3 + MMDocIR + internal PDFs; Recall@K, nDCG@10, evidence-page recall, latency, storage |
| P0 | Hybrid + reranker ablation | BM25, dense, hybrid RRF, hybrid + cross-encoder; MRR@3, Recall@5, nDCG@10, cost |
| P0 | Context packing ablation | naive concat vs dedup/diversity/source-order/page-crops; correctness, faithfulness, citation precision |
| P0 | Agentic vs non-agentic routing | simple facts, targeted QA, open-ended research; tool calls, tokens, wall-clock, timeout |
| P0 | Security harness | prompt injection, ACL leakage, stale source handling, citation mismatch, tool result injection |
| P1 | Long context vs retrieval routing | full-document, retrieved, hybrid; cost, latency, faithfulness |
| P1 | Visual reranker vs textual reranker | same candidate pool, compare text reranker and VLM reranker |

Without these ablations, the article remains a strong engineering synthesis. With them, it becomes an architecture note you can use to change a production pipeline.
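The retrieval metrics these protocols rely on are cheap to compute once evidence pages are labeled. A minimal sketch of Recall@K and MRR over ranked evidence IDs; the IDs below are illustrative:

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant evidence IDs that appear in the top-k results."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr(ranked: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result; 0 if none is found."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

ranked = ["opex_note", "fx_table_p117", "waterfall_slide", "mgmt_para"]
relevant = {"fx_table_p117", "waterfall_slide"}
print(recall_at_k(ranked, relevant, k=3))  # 1.0
print(mrr(ranked, relevant))               # 0.5
```

Running these per query and averaging gives the Recall@5 / MRR@3 style numbers reported in Appendix A; the harder and more valuable part is the evidence labeling, not the arithmetic.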

Appendix D: failure modes

| Failure mode | Symptom | Mitigation |
|---|---|---|
| OCR lost critical signal | Model does not see table/legend/caption | visual page index, region crops, VLM rerank |
| Visual retriever found a similar page with the wrong meaning | Similar chart, different entity/date | metadata filters, text+visual fusion |
| Dense retriever missed exact ID | Error code/order ID not found | BM25/exact index |
| Reranker chose secondary source | Blog/SEO page ranks above official doc | source authority features |
| Context packing dropped the key evidence | Correct doc retrieved, wrong answer generated | packing eval, evidence ablation |
| Citation mismatch | Claim cites source outside context | citation contract hard failure |
| Stale index | Answer uses old version | versioning, freshness routing |
| ACL leakage | Evidence from another tenant | pre-retrieval ACL filtering |
| Prompt injection from source | Model follows instructions from PDF/web page | source sandboxing, tool-call validation |
| Agentic loop burns budget | Many search/fetch calls, no new facts | budget controller, early stopping |


FAQ

What is multimodal retrieval for LLMs?

It is the evidence selection layer before generation. It searches and packages not only text, but also PDF pages, images, tables, charts, screenshots, video/audio segments, and structured records. Its job is to give the model a minimal, relevant, and verifiable context pack.

How is multimodal retrieval different from regular RAG?

Regular RAG is often reduced to text chunks and vector search. Multimodal retrieval works across multiple modalities, multiple indexes, visual/page embeddings, crops, table structures, keyframes, OCR, metadata, ACLs, reranking, and context packing.

Why is retrieval not the same as LLM memory?

Memory is persistent state. Retrieval is runtime selection of external evidence. A model can have memory and use retrieval at the same time, but they are different mechanisms.

Why is vector search top-k not enough?

Dense similarity does not guarantee usefulness for the answer. It is weak on exact IDs, numbers, table semantics, source authority, freshness, ACLs, and visual layout. A production pipeline needs hybrid search, reranking, context packing, and eval.

Does long context replace retrieval?

No. Long context is useful for synthesis and small corpora, but retrieval is still needed for selection, freshness, ACLs, cost, latency, citations, and observability.

Where is visual document retrieval needed?

In PDFs, scans, slide decks, financial reports, scientific papers, contracts with tables, UI screenshots, diagrams, charts, forms, and product images - anywhere the meaning lives in layout or visual structure rather than only in extracted text.

What metrics are needed to evaluate multimodal retrieval?

At minimum: Recall@K, MRR, nDCG@10, evidence page recall, visual region recall, answer correctness, faithfulness, groundedness, citation precision/recall, unsupported claim rate, latency, token usage, cost, and security evals.

When do you need agentic retrieval?

Agentic retrieval is useful for open-ended research, multi-hop investigation, and breadth-first tasks where the search strategy is not known upfront. For simple fact lookup and targeted document QA, hybrid search, reranking, and citation-aware context packing are usually enough.