What made this RAG assistant production-ready instead of demo-ready?

The system shipped with clear retrieval fallback paths, reranking, observability, and rollout controls from day one. It was designed to protect search quality and operating cost at the same time.

Why did hybrid retrieval matter in this case?

Semantic retrieval improved recall on messy natural-language catalog questions, while BM25 fallback and structured filters preserved coverage on exact brand, price, and attribute constraints.

How did the project stay within a strict cost envelope?

The serving stack was tuned around a single-GPU operating target, with model quantization, batching controls, and selective use of the generative layer instead of treating the LLM as the primary search engine.

Production RAG Catalog Assistant Case Study: Cost, Latency, CTR

One-liner: In six weeks, this RAG catalog assistant cut zero-result searches from 30% to 9%, improved CTR by 1.6 percentage points, and held the production path inside a one-GPU budget.

Executive summary

The client was a U.S. retailer with roughly 9 million products and a search experience that failed too often on natural-language queries. About 30% of catalog searches ended in an empty result page. When the system did respond, p90 latency was about 1.6 seconds, which was slow enough to hurt sessions and conversion. On top of that, the team was burning about $45,000 per month on a cloud LLM setup that had not yet produced a durable production architecture.

The mandate was practical: build a retrieval-first catalog assistant that could improve discovery without turning search into an unbounded chat demo. The resulting system combined semantic retrieval, lexical fallback, structured filters, reranking, and a fine-tuned Mistral-7B response layer. It reached the business targets inside the one-quarter window and created a path toward broader search modernization without inflating infrastructure cost.

Metric	Before	After	Why it moved
Zero-result rate	30%	9%	Hybrid retrieval plus BM25 fallback recovered queries that previously died on lexical mismatch
p90 latency	1.6 s	0.8 s	Quantized serving, controlled batching, and a retrieval-first path reduced response time
CTR	12.4%	14.0%	Better candidate quality and reranking improved what users saw first
Cost per 1,000 queries	$2.20	$1.20	Self-hosted inference and budget-aware serving cut the cloud premium
GMV on 10% A/B slice	baseline	+6.3%	Search quality improvements propagated to commercial outcome

For the broader design philosophy behind this kind of retrieval-first system, the closest companion on the site is Agentic Search in Production. For the operational side, the rollout discipline matches the production patterns described in MLOps for a Support RAG Agent in 2026.

Why the client needed an MVP fast

This was not a greenfield AI initiative. It was a rescue project for a search surface that was already leaking value.

The catalog had grown large enough that lexical search alone no longer covered the real query distribution.
Shoppers were increasingly typing natural-language questions instead of short keyword strings.
Empty-result pages were suppressing sessions that should have converted.
The existing cloud LLM setup was expensive but still failed to behave like a disciplined retrieval system.

The business goal was not to “add chat.” It was to prove, within one quarter, that a production RAG layer could reduce empty searches, improve click behavior, and do so under a hard operating-cost ceiling. That framing matters because it changes the architecture. The LLM becomes a bounded interpretation and response layer on top of retrieval, not the authority that decides what exists in the catalog.

Success criteria

The MVP had five targets from the start:

Goal	Target	Result
Zero-result rate	below 10%	9%
p90 latency	below 1 second	0.8 s
Cost reduction	at least 30%	45%
CTR lift	measurable positive delta	+1.6 pp
Production reliability	SLA 99.5%	99.6%

These targets forced a very specific operating model. Retrieval had to do most of the heavy lifting. The reranker had to improve candidate order without destabilizing latency. The generative layer had to stay within a tight runtime budget. That is exactly the kind of constraint set that separates an actual product case study from an AI-flavored prototype.
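The headline deltas in the table above follow from the reported before and after figures. The short check below uses only numbers stated in this case study; the variable and function names are illustrative.

```python
# Before/after figures as reported in this case study.
cost_before, cost_after = 2.20, 1.20   # USD per 1,000 queries
ctr_before, ctr_after = 12.4, 14.0     # percent
zero_before, zero_after = 30.0, 9.0    # percent of searches


def relative_change(before: float, after: float) -> float:
    """Signed relative change, as a fraction of the starting value."""
    return (after - before) / before


cost_reduction_pct = -relative_change(cost_before, cost_after) * 100
ctr_delta_pp = ctr_after - ctr_before
zero_result_delta_pp = zero_before - zero_after

print(f"cost reduction: {cost_reduction_pct:.0f}%")
print(f"CTR delta: {ctr_delta_pp:.1f} pp")
print(f"zero-result drop: {zero_result_delta_pp:.0f} pp")
```

The cost reduction is relative (a percentage of the starting cost), while the CTR and zero-result changes are absolute differences in percentage points.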

System architecture

The system was built as a retrieval-first catalog assistant rather than a free-form conversational app.

Figure: Query path covering query interpretation, semantic retrieval, BM25 fallback, reranking, and grounded answer generation.

1. Ingestion and indexing

Catalog records were chunked by product content and operational attributes through an Airflow pipeline. Embeddings were generated with an E5-small family encoder and written into a three-shard Weaviate cluster using HNSW. A full reindex of roughly 10 million records could complete in about 45 minutes without taking the system offline.

This mattered because the assistant needed fresher coverage than a traditional nightly batch search stack. If indexing lags, the generative layer becomes a liability: it speaks fluently about stale evidence.
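As a rough illustration of the chunking step, the sketch below turns one catalog record into passages ready for an E5-family encoder, which expects a "passage: " prefix on indexed text and a "query: " prefix on queries. The `Product` fields, the `to_passages` helper, and the 400-character limit are assumptions for illustration; the actual schema and chunk size are not described here.

```python
from dataclasses import dataclass, field


@dataclass
class Product:
    # Hypothetical record shape; the real catalog schema is not given in the source.
    sku: str
    title: str
    brand: str
    price_usd: float
    description: str = ""
    attributes: dict = field(default_factory=dict)


def to_passages(p: Product, max_chars: int = 400) -> list:
    """Split one product into E5-style passages, each carrying brand and title."""
    header = f"{p.brand} {p.title}".strip()
    attrs = "; ".join(f"{k}: {v}" for k, v in sorted(p.attributes.items()))
    body = " ".join(part for part in (p.description, attrs) if part)

    # Greedy word-boundary chunking; a chunk only exceeds max_chars
    # if a single word is longer than the limit.
    chunks, current = [], ""
    for word in body.split():
        if current and len(current) + 1 + len(word) > max_chars:
            chunks.append(current)
            current = word
        else:
            current = f"{current} {word}".strip()
    if current or not chunks:
        chunks.append(current)

    return [f"passage: {header}. {c}".strip() for c in chunks]
```

Repeating the brand and title in every chunk keeps each passage self-describing, so a hit on a late chunk can still be matched back to the right product context.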

2. Retrieval

The retrieval path was intentionally hybrid.

Semantic retrieval handled natural-language intent and loose wording.
BM25 fallback protected exact brand, SKU, and attribute lookup cases.
Structured filters applied price and brand constraints where they were extractable.
A cross-encoder reranker compressed the candidate set to the most commercially and semantically useful results.

That combination is why the system could reduce zero-result pages without handing control to the LLM. On this site, the same principle appears again in the Search topic cluster: retrieval and ranking remain the contract, while language models extend interpretation and explanation.
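A minimal sketch of that hybrid path, assuming the semantic and lexical backends are exposed as plain callables. The hit shape, the `min_score` and `min_hits` thresholds, and the function name are hypothetical; real thresholds would be tuned offline against the query log.

```python
from typing import Callable, Optional

# Assumed hit shape: {"sku": str, "score": float, "brand": str, "price_usd": float}
Retriever = Callable[[str, int], list]


def hybrid_search(
    query: str,
    semantic: Retriever,
    lexical: Retriever,
    k: int = 20,
    min_score: float = 0.5,   # assumed semantic-confidence threshold
    min_hits: int = 3,        # assumed minimum coverage before falling back
    brand: Optional[str] = None,
    max_price: Optional[float] = None,
) -> list:
    def passes_filters(hit: dict) -> bool:
        if brand is not None and hit["brand"].lower() != brand.lower():
            return False
        if max_price is not None and hit["price_usd"] > max_price:
            return False
        return True

    # Semantic first: keep only confident hits that satisfy the structured filters.
    hits = [h for h in semantic(query, k)
            if passes_filters(h) and h["score"] >= min_score]

    # Fall back to lexical retrieval only when semantic coverage is weak.
    if len(hits) < min_hits:
        seen = {h["sku"] for h in hits}
        for h in lexical(query, k):
            if passes_filters(h) and h["sku"] not in seen:
                hits.append(h)
                seen.add(h["sku"])

    return hits[:k]
```

Semantic and BM25 scores are not on the same scale, so the sketch appends lexical hits rather than merging by score and leaves final ordering to the reranker.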

3. Response layer

Ray Serve hosted an int8 Mistral-7B model in a tightly controlled serving path. First-token streaming started at about 250 ms on the healthy path. The model had been fine-tuned on about 50,000 e-commerce dialogs, which improved response behavior, but the real production win came from discipline around when the model was allowed to answer.

It was not allowed to invent a catalog view outside the retrieved evidence set. The generation layer summarized, clarified, and packaged results. It did not replace search.
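One way to enforce that boundary is at prompt construction: the model only ever sees retrieved products, and no prompt is built when nothing was retrieved. The sketch below is an assumption about how this could look; the prompt wording, field names, and `build_grounded_prompt` helper are hypothetical and not the project's actual templates.

```python
from typing import Optional


def build_grounded_prompt(question: str, evidence: list, max_items: int = 5) -> Optional[str]:
    """Return a prompt restricted to retrieved products, or None without evidence."""
    if not evidence:
        # No retrieved evidence: the caller should skip generation entirely.
        return None

    lines = [
        f"[{i}] {e['title']} | {e['brand']} | ${e['price_usd']:.2f}"
        for i, e in enumerate(evidence[:max_items], start=1)
    ]
    return (
        "Answer using only the products listed below. "
        "If they do not answer the question, say so.\n"
        "Products:\n" + "\n".join(lines) + "\n"
        f"Question: {question}\nAnswer:"
    )
```

Capping the evidence list also bounds prompt length, which helps keep generation inside a fixed latency budget.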

Figure: Serving architecture built around a one-GPU operating envelope, covering the UI and BFF, retrieval layer, generation layer, rollout control, and observability.

Critical engineering decisions

Retrieval before generation

The easiest way to ship a more impressive demo would have been to let the model “reason” more and search less. That would have been the wrong production choice. The final design kept the search stack authoritative and used the LLM as a bounded layer on top.

Reranking over bigger generation

A cross-encoder reranker improved the top of the result set more reliably than spending more tokens in the response stage. This was the same kind of trade-off discussed in The Offline-to-Online Gap in Deep Learning Recommender Systems: better ranking control tends to survive production better than more sophisticated narration.
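The reranking step can be sketched as scoring each (query, candidate) pair jointly and keeping the top few. In the sketch below, `score_pair` is a placeholder for the cross-encoder; a toy token-overlap scorer stands in so the example runs without model weights. Both names and the candidate shape are assumptions.

```python
from typing import Callable


def rerank(query: str, candidates: list,
           score_pair: Callable[[str, str], float], top_n: int = 10) -> list:
    """Order candidates by pairwise relevance score, highest first."""
    scored = [(score_pair(query, c["text"]), i, c) for i, c in enumerate(candidates)]
    # Ties keep the original retrieval order via the index.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [c for _, _, c in scored[:top_n]]


def token_overlap(query: str, text: str) -> float:
    """Toy stand-in scorer: fraction of query tokens that appear in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0
```

Because the scorer runs once per candidate, the size of the candidate set passed in is the main lever on reranking latency.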

Cost-aware serving

The operating envelope was real. The GPU budget was not a guideline. It was a hard limit. The team therefore used quantization, batching discipline, and cautious warm-path behavior to preserve latency without moving back to a costly cloud-heavy approach.
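Batching discipline on a single GPU usually means capping both the number of requests and the total tokens per batch. The sketch below shows one greedy, order-preserving way to do that; the limits and the `make_batches` name are illustrative assumptions, not the project's settings.

```python
def make_batches(requests: list, max_batch_size: int = 8,
                 max_batch_tokens: int = 2048) -> list:
    """Group (request_id, prompt_tokens) pairs into batches under both caps."""
    batches, current, current_tokens = [], [], 0
    for req_id, tokens in requests:
        over_size = len(current) >= max_batch_size
        over_tokens = current_tokens + tokens > max_batch_tokens
        # Close the current batch before it would exceed either cap.
        if current and (over_size or over_tokens):
            batches.append(current)
            current, current_tokens = [], 0
        current.append(req_id)
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches
```

A single request larger than the token cap still gets its own batch, so nothing is dropped; rejecting or truncating such requests would be a separate policy decision.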

Guardrails against hallucination

The system used a simple but effective confidence policy:

no response without retrieved evidence
fallback to lexical retrieval when semantic coverage was weak
suppression of low-confidence generative claims
rollback-ready serving via canary controls

Customer complaints related to hallucinated answers stayed below 1%.
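The first and third rules of that policy can be expressed as a small decision function. This is a sketch under assumed thresholds; the `Action` names and the 0.6 cutoff are hypothetical.

```python
from enum import Enum


class Action(Enum):
    GENERATE = "generate"          # evidence is strong enough for a generated answer
    RESULTS_ONLY = "results_only"  # show ranked results, suppress generated claims
    NO_MATCH = "no_match"          # nothing retrieved, no response from the model


def choose_action(evidence_scores: list, min_evidence: int = 1,
                  min_top_score: float = 0.6) -> Action:
    """Pick the response mode from the scores of the retrieved evidence."""
    if not evidence_scores or len(evidence_scores) < min_evidence:
        return Action.NO_MATCH
    if max(evidence_scores) < min_top_score:
        return Action.RESULTS_ONLY
    return Action.GENERATE
```

Keeping the decision outside the model means the policy can be audited and tuned without retraining anything.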

Rollout and observability

The serving layer used a FastAPI BFF, gRPC communication into Ray Serve, and KServe autoscaling, with Argo Rollouts handling progressive deployment. During evaluation the canary ran at a roughly 50/50 traffic split, and rollback took under one minute when the live path deviated from budget or relevance expectations.
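The rollback trigger can be thought of as a gate over live canary metrics. The sketch below uses the p90 latency and zero-result targets from the success criteria as default budgets; the function names and the nearest-rank percentile choice are assumptions, and in practice the check would live in the rollout tooling rather than application code.

```python
def p90(samples: list) -> float:
    """Nearest-rank 90th percentile of a non-empty sample list."""
    if not samples:
        raise ValueError("p90 requires at least one sample")
    ordered = sorted(samples)
    rank = (9 * len(ordered) + 9) // 10  # ceil(0.9 * n) using integer math
    return ordered[rank - 1]


def should_rollback(latencies_ms: list, zero_result_rate: float,
                    p90_budget_ms: float = 1000.0,
                    max_zero_result_rate: float = 0.10) -> bool:
    """True when the canary breaches the latency or coverage budget."""
    return (p90(latencies_ms) > p90_budget_ms
            or zero_result_rate > max_zero_result_rate)
```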

The observability stack included:

Prometheus and Grafana for latency, throughput, and token cost
Loki for logs
Evidently drift checks on live query behavior
production alerts around latency and degraded coverage

One of the first post-launch issues was a spike in latency caused by overly aggressive batching that overloaded the GPU path. The fix was operational rather than theoretical: throttle concurrency, add warm-up behavior, and keep the cost discipline intact. That is a useful pattern in its own right. Production systems usually fail first at the runtime boundary, not in the slide deck.
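A concurrency throttle with a warm-up pass can be sketched with a semaphore around the inference call. Everything here is illustrative: `ThrottledModel`, the limit of 2, and `fake_infer` (a stand-in for the real model call) are assumptions.

```python
import asyncio


class ThrottledModel:
    """Caps in-flight inference calls and supports a warm-up pass."""

    def __init__(self, infer, max_concurrency: int = 4):
        self._infer = infer
        self._sem = asyncio.Semaphore(max_concurrency)
        self._active = 0
        self.peak = 0  # highest concurrency observed, for monitoring

    async def warm_up(self, n: int = 3, prompt: str = "warm-up"):
        # Run a few sequential calls before taking live traffic.
        for _ in range(n):
            await self(prompt)

    async def __call__(self, prompt: str):
        async with self._sem:
            self._active += 1
            self.peak = max(self.peak, self._active)
            try:
                return await self._infer(prompt)
            finally:
                self._active -= 1


async def fake_infer(prompt: str) -> str:
    await asyncio.sleep(0.01)  # stands in for GPU inference time
    return f"echo:{prompt}"


async def demo():
    model = ThrottledModel(fake_infer, max_concurrency=2)
    await model.warm_up()
    results = await asyncio.gather(*(model(f"q{i}") for i in range(6)))
    return results, model.peak


results, peak = asyncio.run(demo())
```

Requests beyond the cap wait on the semaphore instead of piling onto the GPU, which trades a little queueing delay for a stable tail latency.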

Business impact and economics

The commercial value of the system came from recovering previously lost search sessions and making those sessions more actionable.

Cutting the zero-result rate from 30% to 9% recovered about 240,000 sessions per month.
CTR improved by 1.6 percentage points.
GMV lifted by 6.3% on the A/B-tested traffic slice.
Cost per 1,000 queries dropped from $2.20 to $1.20.
GPU CAPEX paid back in about 70 days.
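A payback period of this kind is the one-time hardware spend divided by the daily operating savings. The sketch below shows the arithmetic only. The per-1,000-query costs come from this case study, but the query volume and CAPEX are hypothetical placeholders chosen for round numbers, and they are not meant to reproduce the 70-day figure.

```python
def daily_savings(queries_per_day: float, cost_before_per_1k: float,
                  cost_after_per_1k: float) -> float:
    """Daily saving in USD from a lower cost per 1,000 queries."""
    return queries_per_day / 1000 * (cost_before_per_1k - cost_after_per_1k)


def payback_days(capex_usd: float, savings_per_day: float) -> float:
    """Days until cumulative savings cover the one-time hardware spend."""
    if savings_per_day <= 0:
        raise ValueError("savings must be positive for a payback period to exist")
    return capex_usd / savings_per_day


# Costs per 1,000 queries are from the case study.
# HYPOTHETICAL: 100,000 queries/day and 10,000 USD CAPEX are placeholders.
savings = daily_savings(100_000, 2.20, 1.20)
days = payback_days(10_000, savings)
print(f"{savings:.2f} USD/day, payback in {days:.0f} days")
```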

The important point is not just that cost went down. It is that cost went down while the system became more useful. Cheap retrieval that does not help the business is trivia. Expensive retrieval that does not survive production is theater. This case landed in the narrow band where search quality, user behavior, and unit economics all moved in the same direction.

My role

I owned the core ML and productionization decisions on the retrieval and serving path:

tuned Weaviate indexing and HNSW parameters
prepared the dataset and fine-tuned Mistral-7B with LoRA in an int8 deployment path
implemented the MiniLM-based reranking layer and prompt templates
wired metrics into Prometheus and drift signals into Evidently
analyzed the A/B results and turned them into a client-ready operating and ROI readout

This was not just model work. It was architecture, operating discipline, and measurement.

What this case proves

This project proved that a catalog assistant can be both commercially useful and operationally disciplined if retrieval remains primary.

It also created a credible next phase:

personalization features based on purchase history
upsell and cross-sell logic on top of the same retrieval core
mobile SDK support
reuse of the Helm-based deployment package for a second pilot environment

That is the right shape of a production AI case. The MVP is not an endpoint. It is a pressure-tested first operating model.

Bottom line

This RAG assistant did three things that matter in production: it recovered coverage, improved the top of the result set, and made the economics viable. The final system was not “agentic” in the loose demo sense. It was a bounded retrieval product with measurable gains in search quality, CTR, GMV, and operating cost.

For a CTO or Head of AI, that is the real signal: the architecture improved the business without giving up control.

Contact

Igor Yakushev
Senior ML Engineer
