ML Inference Latency and Cost Evaluation Platform
Internal tool for profiling latency, throughput, and $/req of models in production
One-liner: Built an internal platform that reduced $/req by 43% and stabilized p99 latency by standardizing model profiling and cost monitoring.
What the system does in simple terms
Problem: Teams deployed models without a unified monitoring standard. GPUs sat idle, latency fluctuated, and costs were not tracked. There was no visibility into $/req per model.
Solution: A platform built on Prometheus, Kubecost, and PyTorch/ONNX profiling provides per-model visibility into latency, throughput, load, and $/req, together with a standardized deployment process that includes cost tracking.
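Below is a minimal instrumentation sketch of how per-model latency and throughput could be exported with the prometheus_client library. The metric names, labels, wrapper function, and port are illustrative assumptions, not the platform's actual schema.

```python
# Minimal per-model instrumentation sketch using prometheus_client.
# Metric names, labels, and the wrapper are illustrative assumptions,
# not the platform's actual schema.
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "model_requests_total",
    "Total inference requests per model",
    ["model", "version"],
)
LATENCY = Histogram(
    "model_request_latency_seconds",
    "Inference latency per model",
    ["model", "version"],
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5),
)


def instrumented_predict(model, inputs, model_name, version):
    """Run one inference call, recording latency and request count."""
    start = time.perf_counter()
    try:
        return model(inputs)
    finally:
        LATENCY.labels(model_name, version).observe(time.perf_counter() - start)
        REQUESTS.labels(model_name, version).inc()


if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus; port is an example

    def fake_model(x):       # stand-in model so the sketch runs end to end
        return x

    while True:
        instrumented_predict(fake_model, {"x": 1}, "resnet50", "v1")
        time.sleep(0.1)
```

p99 latency would then come from a histogram query in Grafana, for example histogram_quantile(0.99, sum(rate(model_request_latency_seconds_bucket[5m])) by (le, model)).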
Savings: $/req decreased by 43% and p99 latency stabilized, driven by better GPU utilization and per-model cost transparency.
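The $/req metric itself is per-model cost divided by per-model request volume over the same window. A rough sketch of that join follows, assuming Kubecost's Allocation API and the request counter from the previous sketch; the service URLs, label names, and response shape are assumptions about the deployment, not confirmed details.

```python
# Rough sketch: derive $/req per model by joining Kubecost allocation cost
# with Prometheus request counts over the same window. URLs, label names,
# and the assumed response shapes are placeholders for this deployment.
import requests

KUBECOST = "http://kubecost-cost-analyzer.kubecost:9090"   # assumed service address
PROMETHEUS = "http://prometheus.monitoring:9090"           # assumed service address
WINDOW = "24h"


def total_cost(model):
    """Cost attributed to a model over the window via Kubecost's Allocation API."""
    resp = requests.get(
        f"{KUBECOST}/model/allocation",
        params={"window": WINDOW, "aggregate": "label:model"},
        timeout=30,
    )
    resp.raise_for_status()
    # Assumed response shape: data is a list of allocation sets keyed by label value.
    allocations = resp.json()["data"][0]
    return allocations[model]["totalCost"]


def total_requests(model):
    """Request count over the same window, from the Prometheus counter above."""
    query = f'sum(increase(model_requests_total{{model="{model}"}}[{WINDOW}]))'
    resp = requests.get(
        f"{PROMETHEUS}/api/v1/query", params={"query": query}, timeout=30
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


def cost_per_request(model):
    served = total_requests(model)
    return total_cost(model) / served if served else 0.0
```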
ML part: The system uses PyTorch/ONNX profiling for model-level performance analysis, Prometheus for metrics, Kubecost for cost tracking, and Grafana for visualization.
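For offline profiling before a model is rolled out, ONNX Runtime's built-in profiler produces a per-operator latency breakdown; the sketch below is one way to drive it, with the model path, input name/shape, and execution providers as placeholder assumptions. On the PyTorch side, torch.profiler's key_averages() table serves the same purpose for eager-mode models.

```python
# Offline profiling sketch: time an exported ONNX model with onnxruntime's
# built-in profiler. Model path, input name/shape, and providers are placeholders.
import numpy as np
import onnxruntime as ort

options = ort.SessionOptions()
options.enable_profiling = True  # writes a Chrome-trace JSON with per-op timings

session = ort.InferenceSession(
    "model.onnx",
    sess_options=options,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

dummy = {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}
for _ in range(100):  # repeated runs so the profile covers steady-state latency
    session.run(None, dummy)

trace_path = session.end_profiling()  # path of the generated profiling JSON
print("profile written to", trace_path)
```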