Search and Recommendation System
A multimodal search and recommendation platform for 10M+ products that improved CTR, cut latency, and lowered serving cost through disciplined production architecture.
One-liner: This platform cut p95 search latency from 620 ms to 175 ms, reduced cost per QPS by 33%, and lifted CTR by 54% across a 10M+ SKU commerce catalog.
Executive summary
This project started as a search performance problem and became a broader discovery-platform redesign. The client had more than 10 million SKUs, rising retrieval cost, and a search surface slow enough to hurt revenue. At p95 above 500 ms, the system was already outside a healthy user-experience envelope. At the same time, the multimodal retrieval path was too expensive to scale comfortably, and incremental LLM-heavy fixes were driving token spend in the wrong direction.
The final platform combined multimodal embeddings, ANN retrieval, reranking, and disciplined serving into one production system used for both search and recommendation. The important outcome was not just better relevance. It was a platform that could improve discovery quality while staying inside a controlled latency, cost, and rollout envelope.
| Metric | Before | After | What changed |
|---|---|---|---|
| p95 latency | 620 ms | 175 ms | CLIP optimization, FAISS HNSW, Triton, and tighter serving controls |
| Cost per QPS | $0.28 | $0.19 | GPU efficiency and LLM cascade reduced the expensive path |
| Precision@10 | 0.54 | 0.66 | Cross-encoder reranking improved the top of the candidate set |
| Zero-result rate | 28% | 4% | Fallback coverage and better retrieval resilience reduced empty pages |
| CTR | baseline | +54% | Faster and more relevant results changed user behavior |
| GMV | baseline | +$8.4M | Search quality gains propagated into commercial lift |
The engineering ideas behind this platform connect naturally to Semantic IDs for LLM Recommender Systems and The Offline-to-Online Gap in Deep Learning Recommender Systems. This case is the production counterpart: what happens when those modeling ideas hit latency, rollout, and infrastructure constraints.
Why this was business-critical
The search system was directly constraining revenue.
- Slow retrieval was hurting product discovery at scale.
- Catalog growth was outpacing the previous indexing strategy.
- Cloud-heavy LLM expansion was becoming too expensive to treat as the default fix.
- Recommendation quality and search quality were drifting apart because different teams were optimizing different stacks.
That last point mattered more than it may seem. Search and recommendation are often treated as different product surfaces, but in this environment they shared the same real problem: product discovery was not fast enough, coherent enough, or cheap enough. Unifying them as one platform made the architecture cleaner and the operating model more defensible.
Success criteria
The project was measured against both user-facing and platform-facing targets.
| Metric | Target | Result |
|---|---|---|
| p95 latency | under 180 ms | 175 ms |
| Cost per QPS | down 30% | down 33% |
| Precision@10 | +10 percentage points | +12 pp |
| CTR | positive commercial lift | +54% |
| SLA | 99.95% | 99.97% |
These targets forced the platform to optimize for more than relevance. The system had to survive production traffic, keep rollout risk bounded, and stay economically sane.
Architecture
Figure: Platform architecture for shared embeddings, ANN retrieval, reranking, and the search and recommendation surfaces.
1. Ingestion and feature layer
The ingestion path ran on Delta Lake over S3 with Dagster-driven ETL into a feature layer. Catalog change SLA was about 15 minutes, which kept downstream retrieval fresher than a coarse daily batch. That mattered because relevance and business trust both deteriorate quickly when the recommendation layer is operating on stale catalog state.
2. Embeddings
The core multimodal representation used a fine-tuned CLIP ViT-L/14 model trained on about 42 million pairs. Quantization-aware training plus TensorRT INT8 optimization improved throughput substantially without an unacceptable accuracy hit. This was one of the central trade-offs in the project: push the encoder hard enough to make latency and cost work, but not so hard that retrieval quality collapses.
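The quantization trade-off can be made concrete with a toy example: symmetric INT8 quantization of an embedding, then a check of how much cosine similarity drifts after round-tripping. This is a stand-in for what TensorRT does, with made-up values, not the production calibration flow.

```python
import math

def quantize_int8(vec):
    """Symmetric per-vector INT8 quantization: map floats to [-127, 127]."""
    scale = max(abs(x) for x in vec) / 127.0
    q = [max(-127, min(127, round(x / scale))) for x in vec]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Stand-in for a slice of a CLIP embedding.
emb = [0.12, -0.48, 0.33, 0.90, -0.05, 0.27]
q, scale = quantize_int8(emb)
# Round-trip similarity stays very close to 1.0 for well-scaled vectors.
print(round(cosine(emb, dequantize(q, scale)), 4))
```

The production equivalent of this check is an offline eval over a held-out retrieval set, comparing recall for the FP32 and INT8 encoders before promoting the optimized model.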
3. ANN retrieval
The vector layer used FAISS HNSW with category sharding and spill-to-SSD behavior where needed. Snapshot restore completed in about 90 seconds, which helped keep recovery operationally simple. The retrieval stage pulled a broad candidate set quickly enough that the reranker could spend its budget where it actually mattered: on the top of the distribution.
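The sharded-retrieval shape can be sketched as: route the query to the relevant category shards, search each, then merge per-shard candidates. A brute-force scorer stands in for the FAISS HNSW index inside each shard; the shard names and vectors are invented for the example.

```python
import heapq

# Hypothetical category shards; each maps SKU -> embedding.
shards = {
    "shoes":   {"sku1": [0.9, 0.1], "sku2": [0.8, 0.2]},
    "apparel": {"sku3": [0.1, 0.9], "sku4": [0.2, 0.8]},
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def search_shard(shard, query, k):
    # Stand-in for an HNSW index query inside one shard.
    return heapq.nlargest(k, ((dot(v, query), sku) for sku, v in shard.items()))

def sharded_search(query, categories, k=3):
    candidates = []
    for cat in categories:
        candidates.extend(search_shard(shards[cat], query, k))
    # Merge per-shard results into a single top-k list.
    return [sku for _, sku in heapq.nlargest(k, candidates)]

print(sharded_search([1.0, 0.0], ["shoes", "apparel"], k=2))  # ['sku1', 'sku2']
```

The merge step is where sharding pays off operationally: each shard stays small enough to rebuild and snapshot quickly, which is what keeps the 90-second restore realistic.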
4. Reranking and fallback
The platform used a cross-encoder reranker to improve the top of the candidate set and a fallback lexical path to suppress empty-result failure. This is the same broader operating principle behind the Search topic cluster: semantic retrieval improves coverage, but robust search still needs explicit fallback and ranking control.
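The retrieve-then-rerank flow with a lexical fallback can be sketched as below. The `ann_search`, `cross_encoder_score`, and `lexical_fallback` functions are placeholders standing in for the real FAISS retrieval, cross-encoder, and keyword path.

```python
def ann_search(query):
    # Pretend semantic retrieval; returns nothing for unseen queries.
    index = {"red sneakers": ["sku1", "sku2", "sku3"]}
    return index.get(query, [])

def cross_encoder_score(query, sku):
    # Stand-in relevance score; the real system scores (query, product) pairs.
    return {"sku1": 0.9, "sku2": 0.95, "sku3": 0.4}.get(sku, 0.0)

def lexical_fallback(query):
    # Keyword path that suppresses empty-result pages.
    return ["sku_lex_1", "sku_lex_2"]

def search(query, k=2):
    candidates = ann_search(query)
    if not candidates:
        # Explicit fallback: never return an empty page.
        return lexical_fallback(query)[:k]
    ranked = sorted(candidates, key=lambda s: cross_encoder_score(query, s), reverse=True)
    return ranked[:k]

print(search("red sneakers"))   # ['sku2', 'sku1']
print(search("obscure query"))  # ['sku_lex_1', 'sku_lex_2']
```

The fallback branch is what drives the zero-result rate toward zero: the semantic path handles coverage, but the lexical path guarantees a floor.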
5. LLM cascade
The system used an LLM cascade rather than a single always-expensive model path. Claude 3 handled the high-value or harder rewrite cases, while Mistral-7B absorbed the lower-cost path. That decision materially changed the economics. The platform stopped paying premium inference prices for every query when many queries did not need that level of model power.
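The routing rule at the heart of the cascade can be sketched in a few lines: send only queries that look hard (long, ambiguous, or low retrieval confidence) to the expensive model. The heuristic and threshold here are assumptions for illustration, not the production routing policy.

```python
CHEAP_MODEL = "mistral-7b"
EXPENSIVE_MODEL = "claude-3"

def route(query: str, retrieval_confidence: float) -> str:
    # Hypothetical hardness rule: long queries or weak retrieval signal
    # justify the premium model; everything else takes the cheap path.
    hard = len(query.split()) > 6 or retrieval_confidence < 0.5
    return EXPENSIVE_MODEL if hard else CHEAP_MODEL

print(route("red shoes", 0.9))
# -> mistral-7b
print(route("gift for someone who likes hiking but hates rain", 0.3))
# -> claude-3
```

The economic effect comes entirely from the base rate: if most traffic is easy, the average cost per query converges toward the cheap path.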
Production path and release discipline
Figure: Release loop for benchmarking, canary rollout, live observation, and controlled traffic widening.
The serving layer used FastAPI async fan-out, Triton for optimized model serving, and a controlled deployment path based on canary rollout and A/B evaluation.
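The async fan-out pattern can be sketched with plain asyncio: launch the retrieval branches concurrently, bound them with a timeout, and merge. The branch coroutines are placeholders, not the real FastAPI handlers.

```python
import asyncio

async def fetch_semantic(query):
    # Placeholder for the ANN retrieval branch.
    await asyncio.sleep(0.01)
    return ["sku1", "sku2"]

async def fetch_lexical(query):
    # Placeholder for the lexical/fallback branch.
    await asyncio.sleep(0.01)
    return ["sku3"]

async def fan_out(query, timeout=0.25):
    # Run both branches concurrently under a shared deadline.
    tasks = [fetch_semantic(query), fetch_lexical(query)]
    results = await asyncio.wait_for(asyncio.gather(*tasks), timeout=timeout)
    merged = []
    for branch in results:
        merged.extend(branch)
    return merged

print(asyncio.run(fan_out("red sneakers")))  # ['sku1', 'sku2', 'sku3']
```

Because the branches run concurrently, total latency tracks the slowest branch rather than the sum, which is what makes multi-path retrieval affordable inside a tight p95 budget.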
The production release loop included:
- infrastructure-as-code through Terraform and Atlantis
- blue/green or canary promotion through Argo Rollouts / Flagger-style controls
- observability through Prometheus and Cortex-backed metrics
- drift checks via Evidently
- policy enforcement around deployment and access
This mattered because the platform was changing both ranking quality and infrastructure behavior. Without a disciplined release loop, the team would have been flying blind between offline metrics and production impact.
Key engineering trade-offs
Latency versus accuracy
Increasing ANN search depth improved recall but pushed against the latency budget. The final system used dynamic search parameters so that the platform could preserve enough accuracy without letting p99 behavior drift out of control.
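One way to implement dynamic search depth is to scale HNSW's `efSearch` with the remaining latency budget: search deep when the request is early, shallow when it is near the deadline. The budget and bounds below are illustrative, not the production values.

```python
EF_MIN, EF_MAX = 32, 256  # assumed bounds on HNSW search depth

def choose_ef_search(elapsed_ms: float, budget_ms: float = 150.0) -> int:
    """Scale efSearch linearly with remaining budget, clamped to [EF_MIN, EF_MAX]."""
    remaining = max(0.0, budget_ms - elapsed_ms)
    ef = EF_MIN + (EF_MAX - EF_MIN) * (remaining / budget_ms)
    return int(max(EF_MIN, min(EF_MAX, ef)))

print(choose_ef_search(10))   # plenty of budget left: deep search
print(choose_ef_search(140))  # near the deadline: shallow search
```

The point of the clamp is tail control: even a request that arrives with no budget still gets EF_MIN, so recall degrades gracefully instead of the query failing.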
Recommendation quality versus serving cost
The expensive path was reserved for the subset of requests where it created real value. That sounds obvious, but it is one of the most important production patterns in applied ML: the platform needs a cheap path, a good path, and a rule for when to use each.
Catalog freshness versus operational simplicity
The team kept a 15-minute catalog-change SLA and fast recovery mechanisms rather than building an over-engineered freshness system that would be harder to operate. This is another recurring production lesson: better freshness is useful only if the system remains reliable.
Search and recommendation unification
Keeping search and recommendation on one platform reduced duplication in embeddings, serving, and rollout logic. It also meant that platform improvements had leverage across multiple product surfaces.
Operational outcomes
The impact was visible in both platform and business metrics.
| Outcome | Before | After |
|---|---|---|
| GMV | $105.2M | $113.6M |
| CSAT | 4.0 | 4.6 |
| Support tickets | 1,240 | 597 |
| SaaS fee as % of GMV | 0.86% | 0.78% |
The sharp drop in support tickets is especially telling. Slow search and irrelevant recommendations create user pain that rarely shows up in one clean KPI. Once the platform got faster and the top of the result set improved, the support load came down with it.
Risks and how they were handled
This project had three production risks that could have sunk it if the architecture had been looser.
1. ANN tuning could easily blow the latency budget
The fix was dynamic search depth rather than a single static retrieval configuration.
2. GPU contention could destabilize the tail
Dedicated Triton model instances and a cleaner serving layout tightened p99 latency and stopped the worst contention path from dominating user experience.
3. LLM-heavy query expansion could become a cost trap
The cascade design let the team protect the budget without throwing away the quality gains of a smarter interpretation path.
These are the kinds of choices that make the difference between an impressive offline experiment and a production platform that finance, infra, and product can all support.
My role
I drove the core ML platform and serving design:
- designed the HNSW sharding and retrieval strategy
- introduced the LLM cascade and its budget-aware routing
- optimized the CLIP serving path into TensorRT INT8
- set up the rollout and policy path through Terraform, Atlantis, and canary deployment
- ran the A/B evaluation over about 12 million sessions and turned the results into a production rollout decision
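The kind of significance check behind an A/B readout like this can be sketched as a two-proportion z-test on CTR. The counts below are invented for illustration; they are not the real 12-million-session experiment data.

```python
import math

def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """z-statistic for the difference between two click-through rates."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Made-up control vs. treatment counts with a visible CTR lift.
z = two_proportion_z(clicks_a=50_000, n_a=1_000_000,
                     clicks_b=77_000, n_b=1_000_000)
print(z > 1.96)  # True: significant at p < 0.05 (two-sided threshold)
```

At sample sizes in the millions, even small CTR deltas clear the threshold easily, which is why the rollout decision also weighed latency, cost, and guardrail metrics rather than significance alone.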
This was not a modeling-only project. It was architecture, production control, and measurable outcome design.
What this case proves
This case proves that search and recommendation quality can improve dramatically without accepting a permanently more expensive operating model. It also proves that multimodal systems become much stronger when the organization treats them as platforms, not isolated model deployments.
Most importantly, it shows that discovery systems should be judged by the full production contract:
- relevance
- latency
- cost
- rollout safety
- commercial outcome
If one of those five breaks, the system is not actually better. This platform improved all five.
Bottom line
The final search and recommendation stack was faster, cheaper, more relevant, and easier to operate than what came before. That is why the case matters. It was not a showcase model. It was a production discovery platform that pushed directly into CTR, GMV, CSAT, and operating cost while staying governable under real traffic.
FAQ
Why treat search and recommendation as one platform instead of two separate systems?
The same product embeddings, ranking controls, serving constraints, and experiment loop informed both discovery surfaces. Treating them together reduced duplication and made relevance decisions more consistent.
What changed the economics most?
The combination of TensorRT optimization, GPU cost control, and an LLM cascade reduced the expensive path without sacrificing retrieval quality on the core experience.
How was production risk controlled during rollout?
The platform used canary deployment, A/B measurement, rollback thresholds, and explicit quality plus latency monitoring so improvements did not rely on a single offline metric.