Tool role	Default failure policy	What must never happen	User-visible behavior
Permissions or entitlement lookup before retrieval	Fail closed on widening scope; optionally fail open to a narrower safe subset	Never broaden access because a lookup failed	Fewer results is acceptable; unauthorized results are not
Query interpretation before retrieval	Fail open to the raw-query baseline	Never replace a healthy baseline with an unvalidated rewrite	Baseline results remain available
Top-K enrichment after reranking	Fail open to the reranked baseline with claim suppression	Never let missing tool features silently reshuffle the result set	Answer without the live claim; keep the baseline order
Post-answer action	Fail closed behind approval	Never execute side effects on timeout, ambiguity, or denied approval	Keep the answer path; disable the action path
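
The failure policies in the table above can be sketched as a small wrapper around each tool call. This is a minimal illustration, not a reference implementation: `ToolRole`, `Outcome`, and `run_with_policy` are hypothetical names, and the caller is assumed to supply a fallback that is already safe for the role (a narrower scope, the raw query, or the reranked baseline).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable


class ToolRole(Enum):
    PERMISSIONS = "permissions"
    QUERY_INTERPRETATION = "query_interpretation"
    TOPK_ENRICHMENT = "topk_enrichment"
    POST_ANSWER_ACTION = "post_answer_action"


@dataclass
class Outcome:
    value: Any       # tool result, safe fallback, or None when the action is disabled
    degraded: bool   # True when the tool failed and a policy was applied
    note: str        # short reason, suitable for logging


def run_with_policy(role: ToolRole, tool_call: Callable[[], Any], safe_fallback: Any) -> Outcome:
    """Run tool_call; on any failure apply the role's default failure policy.

    Assumption: safe_fallback never widens access. For PERMISSIONS it should be
    a narrower scope, for QUERY_INTERPRETATION the raw query, and for
    TOPK_ENRICHMENT the reranked baseline with live claims suppressed.
    """
    try:
        return Outcome(tool_call(), degraded=False, note="ok")
    except Exception as exc:  # timeout, invalid output, denied approval, ...
        if role is ToolRole.POST_ANSWER_ACTION:
            # Fail closed: never execute a side effect after a failure.
            return Outcome(None, degraded=True, note=f"action disabled: {exc}")
        # Fail open, but only to the caller-supplied safe baseline.
        return Outcome(safe_fallback, degraded=True, note=f"fallback used: {exc}")
```

The `degraded` flag gives the answer path a single signal for suppressing live claims or hiding the action control, without the tool layer deciding anything about ranking.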

Pattern	When to use	Main benefit	Main risk	Latency impact	Control requirement	Fallback strategy
Structured pre-retrieval interpretation	Noisy queries, permissions-aware search, filter extraction, multi-index routing	Better query structure without changing the search contract	Query rewriting can drift from original intent	Low to medium	Strict schemas, bounded outputs, log original and interpreted query	Revert to raw query plus baseline retrieval
Retrieval-first with gated tool escalation	Most queries succeed with baseline search, but some need freshness or verification	Preserves the main retrieval path and spends tool budget only on ambiguous cases	Tool escalation becomes the happy path over time	Low for most traffic, medium on escalated queries	Explicit escalation rules, per-request tool budgets, clear fallback thresholds	Answer from baseline results with reduced confidence or narrower claims
Post-ranking top-K enrichment	Search quality depends on dynamic features such as inventory, entitlement, or policy	Fresh signals improve precision where they matter most	Partial tool failures distort reranking	Medium	Per-tool timeouts, default feature handling, rerank robustness checks	Keep reranked baseline and mark enrichment unavailable
Post-answer action with approval	Users need the system to act after selecting a result	Higher task completion without leaving the search flow	Side effects, access abuse, duplicate actions	Medium to high	Approval gates, least privilege, idempotency keys, audit logs	Keep the answer path, disable the action path, and hand off to a deterministic workflow
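
As one illustration of the "retrieval-first with gated tool escalation" row, the sketch below combines an explicit escalation rule with a per-request tool budget. The thresholds (`min_score`, `min_margin`) and the names `ToolBudget`, `should_escalate`, and `plan` are assumptions made for this example; real thresholds would come from offline evaluation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToolBudget:
    """Per-request cap on tool calls (hypothetical helper)."""
    max_calls: int
    used: int = 0

    def try_spend(self) -> bool:
        if self.used >= self.max_calls:
            return False
        self.used += 1
        return True


def should_escalate(top_score: float, margin: float, needs_freshness: bool,
                    *, min_score: float = 0.6, min_margin: float = 0.05) -> bool:
    """Escalate only when baseline evidence is weak, ambiguous, or stale."""
    return needs_freshness or top_score < min_score or margin < min_margin


def plan(results: List[Tuple[str, float]], needs_freshness: bool, budget: ToolBudget) -> str:
    """results: (doc_id, score) pairs sorted by descending score."""
    top = results[0][1] if results else 0.0
    margin = top - results[1][1] if len(results) > 1 else top
    if not should_escalate(top, margin, needs_freshness):
        return "baseline"
    if budget.try_spend():
        return "escalate"
    # Budget exhausted: answer from baseline with narrower claims.
    return "baseline_reduced_confidence"
```

Logging the share of requests that return `"escalate"` is one way to notice when escalation is turning into the default path.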

Metric group	Example metrics	What it tells you	Release gate example
Retrieval	Recall@K, nDCG@K, rewritten-vs-original query delta	Whether interpretation or tool-assisted rewriting changed evidence quality	Block if retrieval quality drops beyond the agreed tolerance
Reranking	Win rate, relevance delta, feature coverage	Whether the reranker still improves the candidate set under tool enrichment	Block if reranker win rate falls or feature coverage becomes unstable
Tools	Tool-call rate, redundant tool-call rate, invalid-parameter rate, timeout rate	Whether tool placement is bounded and reliable	Block if call rate or timeout rate exceeds budget
Latency	p95, p99, tool overhead delta, cancellation rate	Whether the second control loop is breaking the main SLO	Block if tool overhead breaches latency budget
Control	Fallback rate, approval rate, unsafe action attempts	Whether the system stays governable under real traffic	Block if fallback spikes or unsafe attempts rise
Economics	Cost per successful session, tool cost share, token growth	Whether added capability is worth the operational cost	Block if session economics worsen without offsetting quality gains
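
The release gates in the table above amount to comparing a candidate build against the tool-free baseline with an agreed tolerance per metric. A minimal sketch follows, assuming metrics arrive as plain dictionaries; the `Gate` type, the metric names, and the tolerances are illustrative only.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Gate:
    metric: str
    higher_is_better: bool
    tolerance: float  # largest acceptable regression, in the metric's own units


def evaluate_gates(baseline: Dict[str, float], candidate: Dict[str, float],
                   gates: List[Gate]) -> List[str]:
    """Return one message per breached gate; an empty list means release may proceed."""
    failures = []
    for gate in gates:
        delta = candidate[gate.metric] - baseline[gate.metric]
        # Express every change as a regression, whatever the metric's direction.
        regression = -delta if gate.higher_is_better else delta
        if regression > gate.tolerance:
            failures.append(
                f"{gate.metric}: regressed by {regression:.4f} (tolerance {gate.tolerance})"
            )
    return failures


# Example values are made up for illustration.
gates = [
    Gate("ndcg@10", higher_is_better=True, tolerance=0.02),
    Gate("p95_ms", higher_is_better=False, tolerance=50.0),
    Gate("timeout_rate", higher_is_better=False, tolerance=0.005),
]
baseline = {"ndcg@10": 0.70, "p95_ms": 400.0, "timeout_rate": 0.010}
candidate = {"ndcg@10": 0.69, "p95_ms": 480.0, "timeout_rate": 0.010}
failures = evaluate_gates(baseline, candidate, gates)
```

In this example only the latency gate is breached, so the rollout would be blocked on tool overhead even though retrieval quality stayed within tolerance.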

Failure mode	Symptom	Why it happens	Mitigation
Tool spam	Tool-call rate, token usage, and p95 rise without better relevance	Tools become the default path instead of an escalation path	Bound where tools can run, restrict `tool_choice`, allowlist tools, and enforce per-request tool budgets; see the OpenAI function calling controls
Latency explosion	p99 grows, cancellations rise, and downstream pressure spreads	Multi-hop tool chains multiply tail latency and retries	Propagate deadlines, cap retries, and follow the Google SRE guidance on cascading failures and overload handling
Retrieval drift	Offline retrieval metrics shift after query interpretation changes	Rewriting changes what the retriever is actually solving for	Log original and interpreted queries, evaluate both, and gate rollout on rewritten-vs-original deltas
Reranking distortion	Items with successful enrichments outrank more relevant items with missing features	Partial tool failures leak into reranking decisions	Use timeout budgets, default feature handling, and a deterministic fallback rerank path
Tool contract drift	Invalid-parameter rate rises, structured outputs stop matching downstream expectations, and reranker feature coverage drops	Tool description, schema, or output shape changed without replay tests or eval updates	Version contracts, validate outputs, pin schemas, and run per-tool replay tests before rollout
Authorization drift	Missing or inconsistent results, approval-denial spikes, or unexplained access mismatches across environments	Scope changes, token audience bugs, approval path regressions, or MCP trust misconfiguration	Separate read and write scopes, validate token audience, log approvals and denials, and run auth regression tests per tool
Reasoning-budget creep	p95, token usage, and cost per successful session rise without meaningful quality gain	Planning or tool escalation quietly becomes the default path instead of the exception path	Cap reasoning effort, cap tool-call budgets, and compare enriched and baseline paths at session level
Indirect prompt injection	Retrieved content shapes tool parameters or triggers unsafe actions	Search corpora often contain untrusted text	Treat retrieved text as untrusted and apply the controls described in NIST’s Generative AI Profile and OpenAI’s agent safety guidance
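
The "reranking distortion" mitigation above (default feature handling plus a deterministic fallback) can be sketched as follows. This is one possible approach under stated assumptions: enrichment features are floats in [0, 1], missing features receive a neutral default, and the function name, `min_coverage`, `weight`, and `neutral` are all hypothetical.

```python
from typing import Dict, List, Tuple


def rerank_with_enrichment(
    baseline: List[Tuple[str, float]],
    features: Dict[str, float],
    *,
    min_coverage: float = 0.8,
    weight: float = 0.2,
    neutral: float = 0.5,
) -> Tuple[List[str], bool]:
    """Return (ordered doc ids, enrichment_applied).

    baseline: (doc_id, score) pairs in reranked order.
    features: doc_id -> enrichment value in [0, 1]; absent when the tool failed.
    """
    if not baseline:
        return [], False
    covered = sum(1 for doc_id, _ in baseline if doc_id in features)
    if covered / len(baseline) < min_coverage:
        # Deterministic fallback: too many features missing, keep baseline order.
        return [doc_id for doc_id, _ in baseline], False
    # A missing feature gets the neutral default, so it neither helps nor hurts.
    adjusted = [
        (doc_id, score + weight * (features.get(doc_id, neutral) - neutral))
        for doc_id, score in baseline
    ]
    # sorted() is stable, so ties keep their baseline order.
    adjusted.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in adjusted], True
```

Returning the `enrichment_applied` flag lets the answer path mark enrichment as unavailable and suppress live claims whenever the baseline order was kept.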

What agentic search means in a production retrieval stack

Retrieval and reranking remain the primary control loop

What the tool layer is not allowed to change

Treat tool outputs as typed features, not free-form authority

Before retrieval, after reranking, or after answer selection?

Before retrieval

After reranking

After answer selection

Fail-open versus fail-closed by tool role

Four production patterns for retrieval, reranking, and tool orchestration

Production controls for tool use

1. Strict schemas and bounded outputs

2. Tool allowlisting and call discipline

3. Approval gates and MCP trust boundaries

3.5 Authorization, consent, and token boundaries are part of ranking control

4. Trace grading and runtime metrics

4.5 Telemetry contract for the second control loop

5. Kill switches, circuit breakers, and degraded mode

Search-specific evals and release gates

Worked example: e-commerce search with top-K enrichment

Step 1. Structured interpretation

Step 2. Retrieval and reranking

Step 3. Top-K enrichment

Step 4. Answer path

Step 5. Action path

Fallback behavior

Monitoring signals for this example

A common failure pattern in practice

Failure modes that break search quality first

Release discipline: stage 0 to stage 3

Stage 0. Keep a tool-free baseline

Stage 1. Add one bounded tool role

Stage 2. Add trace-based evals and release gates

Stage 3. Add action tools behind approval

Practical checklist

Final point

Further watching

Standards and references

FAQ

What is agentic search in a production retrieval stack?

What is the safest way to place tool calling in a search stack?

When should tool calls happen before retrieval versus after reranking?

Should tool outputs ever change final ordering in search?

Which metrics show that tool use is hurting search quality?

How should MCP tools be governed in production search systems?