Telegram Antifraud Analytics for Media Plans
A production antifraud analytics system for Telegram media buying that cut inefficient spend, compressed batch review from 25 hours to 12 minutes, and kept explainability attached to every verdict.
One-liner: This Telegram antifraud system reduced inefficient media spend by 24%, compressed review of 100 channels from 25 hours to 12 minutes, and delivered a fraud-verification path with explicit explainability.
Executive summary
Telegram media buying is an ugly decision environment. Subscriber counts, views, reactions, and post timing can all be manipulated, and the more sophisticated channels do not look obviously fake. Media planners therefore end up paying for channels that look healthy on the surface but burn budget through inflated engagement and fabricated reach. At the same time, manual verification does not scale. Reviewing 100 channels could take about 25 analyst hours, which is too slow for a real media-planning cadence.
This project replaced that manual bottleneck with a hybrid fraud analytics system that scored channels in batch, generated clear buy / hold / avoid verdicts, and attached a transparent explanation to every decision. The system reduced inefficient spend by 24%, held Precision@Fraud at 0.90 with Recall@Fraud around 0.70, and brought the p95 turnaround for 100-channel review down to 12 minutes.
| Metric | Before | After | Why it mattered |
|---|---|---|---|
| Inefficient spend | baseline | -24% | Media plans shifted away from manipulated channels |
| Review time for 100 channels | 25 hours | 12 min p95 | Channel audit became operationally usable during planning windows |
| Precision@Fraud | n/a | 0.90 | High-risk flags became credible enough for business decisions |
| Recall@Fraud | n/a | 0.70 | The system still captured enough fraud to matter commercially |
| False-positive rate | 22% | 9% | Better thresholds reduced waste from over-blocking good inventory |
| Explainability coverage | fragmented | 100% | Every verdict carried a top-driver explanation |
This case aligns most closely with Production ML Release Gates and the MLOps and Reliability topic page, because the hard part was not only fraud detection. It was building a production decision system that could explain itself, recalibrate safely, and remain economically useful.
Why the problem was worth solving
The business pain was straightforward and measurable.
- Agencies were losing roughly 27% to 30% of media budget on low-quality or manipulated channels.
- Review teams were too slow to keep up with the volume and velocity of media planning.
- Existing market analytics tools exposed descriptive metrics but did not provide a calibrated fraud operating point tuned to the client’s cost of error.
- There was no unified “channel quality” score that a planner could actually use in a buy decision.
The system therefore needed to do more than detect anomalies. It needed to produce a decision artifact that a media planner could trust under deadline pressure.
Operating targets
The project targets were deliberately set as a mix of model quality, operational speed, and product usability.
| Metric | Target | Result |
|---|---|---|
| Precision@Fraud | at least 0.90 | 0.90 |
| Recall@Fraud | around 0.70 | 0.70 |
| MCC | at least 0.55 | 0.58 |
| False-positive rate | below 12% | 9% |
| 100-channel audit p95 | at most 15 min | 12 min |
| Data freshness | at most 24 h | 12 h or better |
| Feedback recalibration p95 | at most 5 min | under 5 min |
| Explainability coverage | 100% | 100% |
These targets mattered because the system was not allowed to be clever but opaque. It had to be fast, precise enough, and interpretable.
How the system worked
Production pipeline from public-signal collection through scoring, verdict generation, and analyst feedback.
The runtime followed six major stages.
1. Data collection
The collector pulled only public Telegram channel signals. It used rate-limit-aware queues, batching, and backpressure to survive the source constraints without breaking review SLAs.
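The rate-limit-aware queueing described above can be sketched with a token bucket plus a backpressure-friendly drain loop. This is a minimal illustration, not the production collector; class and function names are invented for the example.

```python
import time
from collections import deque

class TokenBucket:
    """Simple token bucket: allows `rate` requests per second with burst `capacity`."""
    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def drain_with_backpressure(queue: deque, bucket: TokenBucket, fetch, max_batch: int = 10):
    """Pull channel IDs from the queue while the rate limit allows it.

    Items that cannot be fetched yet stay queued (backpressure) instead of
    being dropped, so the review SLA degrades gracefully under throttling.
    """
    results = []
    while queue and len(results) < max_batch:
        if not bucket.try_acquire():
            break  # out of tokens: stop and let the caller retry later
        results.append(fetch(queue.popleft()))
    return results
```

The key property is that running out of budget pauses work rather than losing it, which is what keeps a batch review predictable against source constraints.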
2. Feature engineering
The system built channel-level features around engagement behavior, growth behavior, variance patterns, reaction patterns, late-view behavior, and additional anomaly flags. This is where the business advantage started. The pipeline did not treat “fraud” as one vague indicator. It decomposed channel quality into interpretable signals.
3. Rules and anomaly scoring
A hybrid scoring layer combined deterministic rules with anomaly-detection logic. This gave the team the control of an explainable system without giving up the ability to catch less obvious patterns.
4. Topic-aware thresholding
Channel behavior is not uniform across domains. Entertainment, news, and finance channels have different normal engagement ranges. The system therefore used a topic classifier plus size-aware baselines so that “suspicious” meant suspicious relative to the correct peer group.
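The peer-group idea can be shown in a few lines: an anomaly score is computed relative to the expected engagement-rate band for the channel's topic, not a global cutoff. The bands echo the topic table later in this case; the exact ranges, the size bucketing, and the scoring formula here are simplified illustrations.

```python
# Illustrative peer-group baselines: topic -> (low, high) expected ER range.
# The production system also bucketed by channel size.
ER_BASELINES = {
    "entertainment": (0.08, 0.30),
    "news": (0.03, 0.12),
    "finance": (0.04, 0.18),
}

def er_anomaly(topic: str, engagement_rate: float, default=(0.03, 0.30)) -> float:
    """Return 0.0 inside the peer range, otherwise a score that grows
    with the relative distance outside the band."""
    low, high = ER_BASELINES.get(topic, default)
    if engagement_rate < low:
        return (low - engagement_rate) / low
    if engagement_rate > high:
        return (engagement_rate - high) / high
    return 0.0
```

A 24% ER is perfectly normal for an entertainment channel and a strong red flag for a news channel, which is exactly why one global baseline fails.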
5. Verdict generation
Channels were scored on a 0 to 100 FraudScore and mapped to buy / hold / avoid decisions. Every verdict included the strongest contributing signals rather than a mysterious single score.
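The verdict step can be sketched as a score-to-decision mapping that always carries its strongest drivers. The 40/70 cutoffs here are illustrative placeholders; the real thresholds were tuned per topic and size segment under the cost-sensitive policy described below.

```python
def verdict(fraud_score: float, drivers: dict, top_n: int = 3) -> dict:
    """Map a 0-100 FraudScore to buy / hold / avoid and attach the
    strongest contributing signals (driver names -> contribution)."""
    if fraud_score < 40:
        decision = "buy"
    elif fraud_score < 70:
        decision = "hold"
    else:
        decision = "avoid"
    # Sort drivers by contribution so every verdict is explainable.
    top = sorted(drivers.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return {"score": fraud_score, "decision": decision,
            "reasons": [name for name, _ in top]}
```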
6. Feedback loop
Analysts reviewed outcomes through a Telegram bot and could confirm or reject a verdict with one click. That feedback updated recalibration logic and fed model review without turning the system into an opaque self-learning black box.
FraudScore design
FraudScore was built from six interpretable signals and then thresholded under a cost-sensitive policy.
The system used six interpretable submetrics:
- engagement-rate anomalies
- subscriber-growth anomalies
- coverage stability and coefficient-of-variation behavior
- reactions versus expected response
- late views after 24 hours
- residual anomalies such as zero-engagement or duplicate-content patterns
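The aggregation of these six submetrics into a 0 to 100 score can be sketched as a clipped weighted blend. The weights below are hypothetical; the production blend and normalization were tuned on the labeled dataset, not fixed like this.

```python
# Hypothetical weights over the six interpretable submetrics.
WEIGHTS = {
    "er_anomaly": 0.25,
    "growth_anomaly": 0.20,
    "coverage_cv": 0.15,
    "reaction_mismatch": 0.15,
    "late_views": 0.15,
    "residual_flags": 0.10,
}

def fraud_score(submetrics: dict) -> float:
    """Combine [0, 1] submetrics into a 0-100 FraudScore.
    Missing submetrics count as 0; values are clipped into [0, 1]."""
    s = sum(WEIGHTS[k] * min(max(submetrics.get(k, 0.0), 0.0), 1.0)
            for k in WEIGHTS)
    return round(100.0 * s, 1)
```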
The final score was not just a sum of weird heuristics. It was a production operating point tuned under asymmetric error cost. In practice, the team optimized for expected loss rather than raw accuracy because false negatives and false positives do not cost the business the same amount.
That is the right production framing:
- a false negative means buying fraudulent inventory
- a false positive means discarding legitimate reach
The threshold was therefore calibrated under a cost-sensitive objective instead of a vanity metric.
Reporting and planner workflow
Review workflow from batch output through planner export, analyst dispute handling, and safe recalibration.
The output was intentionally operational:
- XLSX reports for media-planning workflows
- JSON API for integration into internal systems
- channel-level verdicts with top contributing reasons
- estimated waste for budget planning
That is what turned the model into a workflow capability. Media teams did not need to inspect raw feature tables. They needed a review artifact they could use quickly and then contest or confirm through a feedback path.
Validation and evidence
The validation layer was strong enough to defend the system in front of skeptical operators.
- more than 850 channels in the labeled dataset
- dual-expert labeling with Cohen’s kappa of 0.78
- group-aware splits to reduce leakage
- bootstrap confidence intervals for precision, recall, and MCC
- probability calibration with Brier and ECE monitoring
- time-aware validation windows so the system did not overfit one historical snapshot
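The bootstrap confidence intervals mentioned above can be sketched as a percentile bootstrap over (label, flag) pairs. This is a simplified stand-in for the actual validation code.

```python
import random

def bootstrap_precision_ci(labels, flagged, n_boot=2000, alpha=0.05, seed=7):
    """Percentile bootstrap CI for Precision@Fraud.
    Resamples labeled decisions with replacement and reads off the
    alpha/2 and 1 - alpha/2 quantiles of the precision distribution."""
    rng = random.Random(seed)
    pairs = list(zip(labels, flagged))
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(pairs) for _ in pairs]
        tp = sum(1 for y, f in sample if f and y == 1)
        fp = sum(1 for y, f in sample if f and y == 0)
        if tp + fp:
            stats.append(tp / (tp + fp))
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi
```

Reporting the interval rather than the point estimate is what lets a 0.90 precision claim survive scrutiny on an 850-channel dataset.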
The holdout operating point delivered:
- Precision@Fraud 0.90
- Recall@Fraud 0.70
- MCC 0.58
- false-positive rate 9%
This is the kind of evidence that actually matters in production. It is not “the model did well on a benchmark.” It is “the decision policy landed where the business needed it to land.”
Topic-aware modeling
One of the better production decisions in the system was to avoid one global fraud baseline.
| Topic | Typical ER range | Operational implication |
|---|---|---|
| Entertainment | 8% to 30% | High interaction is not automatically suspicious |
| News | 3% to 12% | Stable reach with lower reaction volume is normal |
| Finance | 4% to 18% | Interaction patterns are lower and more concentrated |
A lightweight TF-IDF plus logistic-regression topic classifier supported this layer. It reached macro-F1 of about 0.84 with latency under 50 ms, which was good enough for routing and adaptive thresholding without introducing a heavy inference dependency.
This is a strong example of production pragmatism. The topic classifier did not need to be state of the art. It needed to be cheap, stable, and good enough to stop the fraud system from overgeneralizing across very different channel categories.
Economics
The case became commercially compelling because the scoring system connected directly to budget waste.
For a typical RUB 9M monthly media plan:
| Scenario | Fraud share | Budget loss | Savings vs. pre-system |
|---|---|---|---|
| Before deployment | 27% | RUB 2.43M | baseline |
| After deployment, conservative | 15% | RUB 1.35M | RUB 1.08M |
| After deployment, observed case | 11% | RUB 990k | RUB 1.44M |
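The table's arithmetic is a straightforward function of budget and fraud share, which makes the savings claim easy to audit. A minimal worked example using the figures above:

```python
def plan_economics(budget_rub: float, fraud_before: float, fraud_after: float):
    """Budget waste before/after deployment and the resulting savings.
    Waste is modeled simply as budget * fraud share, as in the table."""
    loss_before = budget_rub * fraud_before
    loss_after = budget_rub * fraud_after
    return loss_before, loss_after, loss_before - loss_after

# RUB 9M plan: 27% fraud share before, 11% observed after deployment.
before, after, saved = plan_economics(9_000_000, 0.27, 0.11)
```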
The unit economics were also clear:
- about RUB 8 to 20 per channel at batch volumes above 5,000
- about RUB 30 to 80 per channel below 1,000, where collection and caching overhead dominate more of the cost structure
That level of clarity matters when the buyer is not a research team but a media operation deciding whether the system pays for itself. Here it clearly did.
Impact after four weeks
| Metric | Before | After | Delta |
|---|---|---|---|
| Fraud share in media plans | 27% | 11% | -59% |
| Analyst hours for 100 channels | 25 h | 0.2 h (12 min) | -99% |
| False-positive rate | 22% | 9% | -59% |
| Time to report | 24 to 48 h | 0.08 h (about 5 min) | -99% |
| Disputed cases per week | 18 | 4 | -78% |
The most telling part is not just speed. It is that disputed cases went down while the system got faster. That means the model was not simply pushing more automated noise downstream.
By the time the case was documented, the system had analyzed more than 10,000 channels and helped agencies avoid tens of millions of rubles in ineffective placement.
One-click feedback and safe self-improvement
The feedback loop deserves separate attention because it is where many fraud systems become unstable.
Analysts could upvote or downvote a verdict directly in the Telegram bot. That signal was written into the feedback layer and used for:
- threshold recalibration by topic and size segment
- weekly topic-classifier refits in shadow mode first
- promotion only after quality checks
- rollback if MCC or PR-AUC slipped below baseline
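The promotion/rollback rule above can be sketched as a simple gate over shadow-mode metrics. Metric names, the `min_delta` margin, and the three-way outcome are illustrative assumptions, not the exact production policy.

```python
def promote_candidate(baseline: dict, candidate: dict, min_delta: float = 0.0) -> str:
    """Decide whether a shadow-mode refit is promoted, held, or rolled back.
    Roll back if MCC or PR-AUC slips below the live baseline; promote only
    when both metrics clear the baseline by at least `min_delta`."""
    regressed = (candidate["mcc"] < baseline["mcc"]
                 or candidate["pr_auc"] < baseline["pr_auc"])
    if regressed:
        return "rollback"
    improved = (candidate["mcc"] - baseline["mcc"] >= min_delta
                and candidate["pr_auc"] - baseline["pr_auc"] >= min_delta)
    return "promote" if improved else "hold"
```

The gate is deliberately boring: a refit that does not clearly beat the baseline never reaches production traffic.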
The system also included anti-poisoning protections:
- weighting based on labeler trust history
- quorum on contested changes
- collusion-pattern detection
- per-user rate limiting
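Three of those protections, trust weighting, quorum, and per-user rate limiting, can be combined in one small aggregation sketch. Parameter names and defaults are illustrative.

```python
from collections import Counter

def accept_feedback(votes, trust, quorum_weight=2.0, max_per_user=5):
    """Aggregate analyst votes on a verdict with basic anti-poisoning guards.

    votes: list of (user_id, vote) with vote = +1 confirm / -1 reject.
    Each user's votes are capped (rate limiting) and weighted by trust
    history; a change is accepted only if the winning side clears a
    weight quorum. Returns +1, -1, or 0 (no change, quorum not met).
    """
    per_user = Counter()
    tally = {+1: 0.0, -1: 0.0}
    for user, vote in votes:
        if per_user[user] >= max_per_user:
            continue  # per-user rate limiting: extra votes are ignored
        per_user[user] += 1
        tally[vote] += trust.get(user, 0.1)  # low default trust for unknowns
    winner = max(tally, key=tally.get)
    margin = tally[winner] - tally[-winner]
    return winner if margin >= quorum_weight else 0
```

A single low-trust account spamming confirmations never clears quorum, while two trusted analysts agreeing does, which is the behavior that keeps the feedback loop from becoming a corruption path.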
This is important. “Self-learning” is usually where production language gets sloppy. In practice, safe improvement requires gating, rate limits, promotion rules, and rollback. Otherwise the feedback loop becomes a corruption path.
My role
I owned the ML and production decision layer for the system:
- designed the six-part FraudScore and the thresholding policy
- implemented topic-aware ranges that reduced false positives materially
- optimized the batch path to hit the 100-channels-in-12-minutes objective
- trained and deployed the topic classifier
- built the recalibration and drift-monitoring loop
- packaged the system into planner-ready reporting and ROI logic
This was not a notebook-only fraud model. It was a production analytics system designed to support money-moving decisions.
Technical annex
Six submetrics in more detail
The six interpretable submetrics were:
- engagement-rate anomalies relative to topic and size
- subscriber-growth spikes against expected baseline
- suspiciously low or high coverage stability
- reactions misaligned with expected view patterns
- unusually high late-view share after 24 hours
- residual anomalies such as duplicate content and zero-engagement behavior
The score was then calibrated into a probability and mapped into an operating decision. If explainability was incomplete, the system would not issue the strongest avoid verdict.
Validation policy
The validation path used holdout splits by channel identity, time-aware windows, and bootstrap confidence intervals. The business threshold was set by expected-cost minimization rather than a one-size-fits-all accuracy metric.
Infrastructure and API path
The system ran on Python 3.11 with FastAPI, PostgreSQL, Redis, Celery, and structured logging. Reports were generated as XLSX for media teams, while JSON responses supported system integration. Docker-based deployment kept the runtime simple enough for a small product team to operate.
Governance and retention
The system used only public data, stored decision logs for 90 days, and retained aggregate data longer for operational analysis. Explainability and audit trail were treated as product requirements, not internal debugging conveniences.
What this case proves
This case proves that antifraud analytics becomes strategically valuable only when it is attached to planner workflow, explainability, and cost-aware thresholds. A model that only says “this looks suspicious” is not enough. A production system needs to tell a team what to do, how confident to be, and how to learn safely from disagreement.
That is what this project delivered: not just fraud detection, but a decision system for media planning under uncertainty.
Bottom line
The platform reduced budget waste, accelerated review speed, and made fraud decisions explainable enough for real operations. It combined rules, anomaly logic, feedback, and planner workflow into one production path. For an adtech environment where bad inventory can quietly destroy ROI, that is the difference between analytics and real control.
FAQ
What did the system actually classify?
It scored Telegram channels for purchase risk by combining behavioral metrics, anomaly patterns, topic-aware baselines, and transparent verdict rules that media planners could review.
Why not use a single end-to-end ML classifier?
The client needed explainable production decisions and fast iteration on failure cases. A hybrid rules-plus-anomaly design gave better control over precision, feedback, and cost-sensitive thresholding.
How did the system keep improving after launch?
Analyst feedback from the Telegram bot fed recalibration and weekly model updates, while drift and quality metrics were monitored so threshold changes did not silently degrade performance.