Search and Recommendation System

A multimodal search and recommendation platform for 10M+ products that improved CTR, cut latency, and lowered serving cost through disciplined production architecture.

One-liner: This platform cut p95 search latency from 620 ms to 175 ms, reduced cost per QPS by 33%, and lifted CTR by 54% across a 10M+ SKU commerce catalog.

Executive summary

This project started as a search performance problem and became a broader discovery-platform redesign. The client had more than 10 million SKUs, rising retrieval cost, and a search surface slow enough to hurt revenue. At p95 above 500 ms, the system was already outside a healthy user-experience envelope. At the same time, the multimodal retrieval path was too expensive to scale comfortably, and incremental LLM-heavy fixes were driving token spend in the wrong direction.

The final platform combined multimodal embeddings, ANN retrieval, reranking, and disciplined serving into one production system used for both search and recommendation. The important outcome was not just better relevance. It was a platform that could improve discovery quality while staying inside a controlled latency, cost, and rollout envelope.

| Metric | Before | After | What changed |
| --- | --- | --- | --- |
| p95 latency | 620 ms | 175 ms | CLIP optimization, FAISS HNSW, Triton, and tighter serving controls |
| Cost per QPS | $0.28 | $0.19 | GPU efficiency and LLM cascade reduced the expensive path |
| Precision@10 | 0.54 | 0.66 | Cross-encoder reranking improved the top of the candidate set |
| Zero-result rate | 28% | 4% | Fallback coverage and better retrieval resilience reduced empty pages |
| CTR | baseline | +54% | Faster and more relevant results changed user behavior |
| GMV | baseline | +$8.4M | Search quality gains propagated into commercial lift |

The engineering ideas behind this platform connect naturally to Semantic IDs for LLM Recommender Systems and The Offline-to-Online Gap in Deep Learning Recommender Systems. This case is the production counterpart: what happens when those modeling ideas hit latency, rollout, and infrastructure constraints.

Why this was business-critical

The search system was directly constraining revenue.

  • Slow retrieval was hurting product discovery at scale.
  • Catalog growth was outpacing the previous indexing strategy.
  • Cloud-heavy LLM expansion was becoming too expensive to treat as the default fix.
  • Recommendation quality and search quality were drifting apart because different teams were optimizing different stacks.

That last point mattered more than it may seem. Search and recommendation are often treated as different product surfaces, but in this environment they shared the same real problem: product discovery was not fast enough, coherent enough, or cheap enough. Unifying them as one platform made the architecture cleaner and the operating model more defensible.

Success criteria

The project was measured against both user-facing and platform-facing targets.

| Metric | Target | Result |
| --- | --- | --- |
| p90 latency | under 180 ms | 178 ms |
| Cost per QPS | down 30% | down 33% |
| Precision@10 | +10 percentage points | +12 pp |
| CTR | positive commercial lift | +54% |
| SLA | 99.95% | 99.97% |

These targets forced the platform to optimize for more than relevance. The system had to survive production traffic, keep rollout risk bounded, and stay economically sane.

Architecture

Figure: Platform architecture for shared embeddings, ANN retrieval, reranking, and search plus recommendation surfaces.

1. Ingestion and feature layer

The ingestion path ran on Delta Lake over S3 with Dagster-driven ETL into a feature layer. Catalog change SLA was about 15 minutes, which kept downstream retrieval fresher than a coarse daily batch. That mattered because relevance and business trust both deteriorate quickly when the recommendation layer is operating on stale catalog state.

2. Embeddings

The core multimodal representation used a fine-tuned CLIP ViT-L/14 model trained on about 42 million pairs. Quantization-aware training plus TensorRT INT8 optimization improved throughput substantially without an unacceptable accuracy hit. This was one of the central trade-offs in the project: push the encoder hard enough to make latency and cost work, but not so hard that retrieval quality collapses.
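
The throughput-versus-accuracy trade can be seen in a minimal symmetric INT8 round trip. This is an illustrative numpy sketch of what quantization does to an embedding, not the TensorRT pipeline itself; the 768-dim random vector stands in for a CLIP embedding:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map floats onto [-127, 127]."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
emb = rng.standard_normal(768).astype(np.float32)  # stand-in for a CLIP embedding
q, scale = quantize_int8(emb)
recovered = dequantize(q, scale)

# Cosine similarity between original and round-tripped embedding stays near 1.0,
# which is why INT8 can quadruple throughput without collapsing retrieval quality.
cos = float(emb @ recovered / (np.linalg.norm(emb) * np.linalg.norm(recovered)))
```

The production version pushes this further with quantization-aware training, so the encoder learns around the rounding error instead of merely tolerating it.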

3. ANN retrieval

The vector layer used FAISS HNSW with category sharding and spill-to-SSD behavior where needed. Snapshot restore completed in about 90 seconds, which helped keep recovery operationally simple. The retrieval stage pulled a broad candidate set quickly enough that the reranker could spend its budget where it actually mattered: on the top of the distribution.

4. Reranking and fallback

The platform used a cross-encoder reranker to improve the top of the candidate set and a fallback lexical path to suppress empty-result failure. This is the same broader operating principle behind the Search topic cluster: semantic retrieval improves coverage, but robust search still needs explicit fallback and ranking control.
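
The rerank-plus-fallback contract can be sketched as follows. The `rerank_score` callable and the naive token-overlap fallback are illustrative stand-ins; the production system used a learned cross-encoder and a real lexical engine:

```python
from typing import Callable

def search_with_fallback(
    query: str,
    semantic_candidates: list[str],
    rerank_score: Callable[[str, str], float],
    catalog: list[str],
    k: int = 10,
) -> list[str]:
    """Rerank semantic candidates; fall back to lexical match on empty results."""
    if semantic_candidates:
        ranked = sorted(semantic_candidates,
                        key=lambda doc: rerank_score(query, doc),
                        reverse=True)
        return ranked[:k]
    # Lexical fallback: token overlap, so the result page is never empty
    terms = set(query.lower().split())
    scored = [(len(terms & set(doc.lower().split())), doc) for doc in catalog]
    return [doc for overlap, doc in sorted(scored, reverse=True) if overlap > 0][:k]
```

The design point is that the fallback is an explicit code path with its own ranking rule, not an accident of the semantic retriever returning something weak.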

5. LLM cascade

The system used an LLM cascade rather than a single always-expensive model path. Claude 3 handled the high-value or harder rewrite cases, while Mistral-7B absorbed the lower-cost path. That decision materially changed the economics. The platform stopped paying premium inference prices for every query when many queries did not need that level of model power.
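
A budget-aware router for such a cascade can be sketched as below. The model names come from the text; the routing heuristics and cost figures are illustrative assumptions, not the production rules:

```python
from dataclasses import dataclass

@dataclass
class Route:
    model: str
    est_cost: float  # illustrative relative cost per call

CHEAP = Route("mistral-7b", est_cost=1.0)
PREMIUM = Route("claude-3", est_cost=12.0)

def route_query(query: str, zero_result_history: bool,
                daily_premium_budget: float, premium_spend: float) -> Route:
    """Send hard or high-value rewrites to the premium model, everything else cheap."""
    over_budget = premium_spend + PREMIUM.est_cost > daily_premium_budget
    is_hard = zero_result_history or len(query.split()) >= 6  # toy difficulty signal
    if is_hard and not over_budget:
        return PREMIUM
    return CHEAP
```

The budget check is what turns the cascade from a quality feature into a cost control: premium spend is capped by construction rather than by hope.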

Production path and release discipline

Figure: Release loop for benchmarking, canary rollout, live observation, and controlled traffic widening.

The serving layer used FastAPI async fan-out, Triton for optimized model serving, and a controlled deployment path based on canary rollout and A/B evaluation.
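
The async fan-out pattern can be sketched with plain asyncio; the two stub stages stand in for the real FAISS and lexical calls behind Triton:

```python
import asyncio

async def ann_candidates(query: str) -> list[str]:
    await asyncio.sleep(0.01)  # stand-in for the ANN retrieval call
    return [f"{query}-doc{i}" for i in range(3)]

async def lexical_candidates(query: str) -> list[str]:
    await asyncio.sleep(0.01)  # stand-in for the lexical engine call
    return [f"{query}-lex{i}" for i in range(2)]

async def search(query: str, timeout: float = 0.5) -> list[str]:
    # Fan out to both retrievers concurrently, bounded by a hard timeout,
    # then merge while preserving order and deduplicating
    ann, lex = await asyncio.wait_for(
        asyncio.gather(ann_candidates(query), lexical_candidates(query)),
        timeout=timeout,
    )
    return list(dict.fromkeys(ann + lex))

results = asyncio.run(search("red shoes"))
```

Because the two retrieval calls overlap in time, the fan-out stage costs roughly the slower of the two calls rather than their sum, which is where much of the latency budget is recovered.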

The production release loop included:

  • infrastructure-as-code through Terraform and Atlantis
  • blue/green or canary promotion through Argo Rollouts / Flagger-style controls
  • observability through Prometheus and Cortex-backed metrics
  • drift checks via Evidently
  • policy enforcement around deployment and access

This mattered because the platform was changing both ranking quality and infrastructure behavior. Without a disciplined release loop, the team would have been flying blind between offline metrics and production impact.

Key engineering trade-offs

Latency versus accuracy

Increasing ANN search depth improved recall but pushed against the latency budget. The final system used dynamic search parameters so that the platform could preserve enough accuracy without letting p99 behavior drift out of control.
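
A minimal controller for the dynamic search-depth idea might look like this (the thresholds, step sizes, and 150 ms budget are illustrative, not the production values):

```python
def adjust_ef_search(ef: int, observed_p99_ms: float,
                     budget_ms: float = 150.0,
                     ef_min: int = 16, ef_max: int = 256) -> int:
    """Shrink search depth when tail latency breaches the budget,
    widen it gradually when there is headroom."""
    if observed_p99_ms > budget_ms:
        return max(ef_min, ef // 2)   # back off aggressively to protect p99
    if observed_p99_ms < 0.7 * budget_ms:
        return min(ef_max, ef + 16)   # recover recall slowly
    return ef                          # inside the comfort band: hold steady
```

The asymmetry is deliberate: halving on breach and adding a small fixed step on headroom means the system escapes a bad tail quickly but re-approaches the recall ceiling cautiously.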

Recommendation quality versus serving cost

The expensive path was reserved for the subset of requests where it created real value. That sounds obvious, but it is one of the most important production patterns in applied ML: the platform needs a cheap path, a good path, and a rule for when to use each.

Catalog freshness versus operational simplicity

The team kept a 15-minute catalog-change SLA and fast recovery mechanisms rather than building an over-engineered freshness system that would be harder to operate. This is another recurring production lesson: better freshness is useful only if the system remains reliable.

Search and recommendation unification

Keeping search and recommendation on one platform reduced duplication in embeddings, serving, and rollout logic. It also meant that platform improvements had leverage across multiple product surfaces.

Operational outcomes

The impact was visible in both platform and business metrics.

| Outcome | Before | After |
| --- | --- | --- |
| GMV | $105.2M | $113.6M |
| CSAT | 4.0 | 4.6 |
| Support tickets | 1,240 | 597 |
| SaaS fee as % of GMV | 0.86 | 0.78 |

The sharp drop in support tickets is especially telling. Slow search and irrelevant recommendations create user pain that rarely shows up in one clean KPI. Once the platform got faster and the top of the result set improved, the support load came down with it.

Risks and how they were handled

This project had three production risks that could have sunk it if the architecture had been looser.

1. ANN tuning could easily blow the latency budget

The fix was dynamic search depth rather than a single static retrieval configuration.

2. GPU contention could destabilize the tail

Dedicated Triton model instances and a cleaner serving layout tightened p99 latency and stopped the worst contention path from dominating the user experience.

3. LLM-heavy query expansion could become a cost trap

The cascade design let the team protect the budget without throwing away the quality gains of a smarter interpretation path.

These are the kinds of choices that make the difference between an impressive offline experiment and a production platform that finance, infra, and product can all support.

My role

I drove the core ML platform and serving design:

  • designed the HNSW sharding and retrieval strategy
  • introduced the LLM cascade and its budget-aware routing
  • optimized the CLIP serving path into TensorRT INT8
  • set up the rollout and policy path through Terraform, Atlantis, and canary deployment
  • ran the A/B evaluation over about 12 million sessions and turned the results into a production rollout decision

This was not a modeling-only project. It was architecture, production control, and measurable outcome design.

What this case proves

This case proves that search and recommendation quality can improve dramatically without accepting a permanently more expensive operating model. It also proves that multimodal systems become much stronger when the organization treats them as platforms, not isolated model deployments.

Most importantly, it shows that discovery systems should be judged by the full production contract:

  • relevance
  • latency
  • cost
  • rollout safety
  • commercial outcome

If one of those five breaks, the system is not actually better. This platform improved all five.

Bottom line

The final search and recommendation stack was faster, cheaper, more relevant, and easier to operate than what came before. That is why the case matters. It was not a showcase model. It was a production discovery platform that pushed directly into CTR, GMV, CSAT, and operating cost while staying governable under real traffic.

FAQ

Why treat search and recommendation as one platform instead of two separate systems?

The same product embeddings, ranking controls, serving constraints, and experiment loop informed both discovery surfaces. Treating them together reduced duplication and made relevance decisions more consistent.

What changed the economics most?

The combination of TensorRT optimization, GPU cost control, and an LLM cascade reduced the expensive path without sacrificing retrieval quality on the core experience.

How was production risk controlled during rollout?

The platform used canary deployment, A/B measurement, rollback thresholds, and explicit quality plus latency monitoring so improvements did not rely on a single offline metric.

Contact

If you have a production ML problem worth fixing, send the context and I'll reply directly.