Telegram Antifraud Analytics for Media Plans
A production antifraud analytics system for Telegram media buying that cut inefficient spend, compressed batch review from 25 hours to 12 minutes, and kept explainability attached to every verdict.
One-liner: This Telegram antifraud system reduced inefficient media spend by 24%, compressed review of 100 channels from 25 hours to 12 minutes, and delivered a fraud-verification path with explicit explainability.
Executive summary
Telegram media buying is an ugly decision environment. Subscriber counts, views, reactions, and post timing can all be manipulated, and the more sophisticated channels do not look obviously fake. Media planners therefore end up paying for channels that look healthy on the surface but burn budget through inflated engagement and fabricated reach. At the same time, manual verification does not scale. Reviewing 100 channels could take about 25 analyst hours, which is too slow for a real media-planning cadence.
This project replaced that manual bottleneck with a hybrid fraud analytics system that scored channels in batch, generated clear buy / hold / avoid verdicts, and attached a transparent explanation to every decision. The system reduced inefficient spend by 24%, held Precision@Fraud at 0.90 with Recall@Fraud around 0.70, and brought the p95 turnaround for 100-channel review down to 12 minutes.
| Metric | Before | After | Why it mattered |
|---|---|---|---|
| Inefficient spend | baseline | -24% | Media plans shifted away from manipulated channels |
| Review time for 100 channels | 25 hours | 12 min p95 | Channel audit became operationally usable during planning windows |
| Precision@Fraud | n/a | 0.90 | High-risk flags became credible enough for business decisions |
| Recall@Fraud | n/a | 0.70 | The system still captured enough fraud to matter commercially |
| False-positive rate | 22% | 9% | Better thresholds reduced waste from over-blocking good inventory |
| Explainability coverage | fragmented | 100% | Every verdict carried a top-driver explanation |
This case aligns most closely with Production ML Release Gates and the MLOps and Reliability topic page, because the hard part was not only fraud detection. It was building a production decision system that could explain itself, recalibrate safely, and remain economically useful.
Why the problem was worth solving
The business pain was straightforward and measurable.
- Agencies were losing roughly 27% to 30% of media budget on low-quality or manipulated channels.
- Review teams were too slow to keep up with the volume and velocity of media planning.
- Existing market analytics tools exposed descriptive metrics but did not provide a calibrated fraud operating point tuned to the client’s cost of error.
- There was no unified “channel quality” score that a planner could actually use in a buy decision.
The system therefore needed to do more than detect anomalies. It needed to produce a decision artifact that a media planner could trust under deadline pressure.
Operating targets
The project targets were deliberately set as a mix of model quality, operational speed, and product usability.
| Metric | Target | Result |
|---|---|---|
| Precision@Fraud | at least 0.90 | 0.90 |
| Recall@Fraud | around 0.70 | 0.70 |
| MCC | at least 0.55 | 0.58 |
| False-positive rate | below 12% | 9% |
| 100-channel audit p95 | at most 15 min | 12 min |
| Data freshness | at most 24 h | 12 h or better |
| Feedback recalibration p95 | at most 5 min | under 5 min |
| Explainability coverage | 100% | 100% |
These targets mattered because the system was not allowed to be clever but opaque. It had to be fast, precise enough, and interpretable.
How the system worked
Production pipeline from public-signal collection through scoring, verdict generation, and analyst feedback.
The runtime followed six major stages.
1. Data collection
The collector pulled only public Telegram channel signals. It used rate-limit-aware queues, batching, and backpressure to survive the source constraints without breaking review SLAs.
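The rate-limit-aware queueing described above can be sketched with a token bucket plus a backpressure-friendly drain loop. This is a minimal illustration, not the production collector; class and function names are invented for the example.

```python
import time
from collections import deque

class TokenBucket:
    """Simple token bucket: allows `rate` requests per second with burst `capacity`."""
    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def drain_with_backpressure(queue: deque, bucket: TokenBucket, fetch, max_batch: int = 10):
    """Pull channel IDs from the queue while the rate limit allows it.

    Items that cannot be fetched yet stay queued (backpressure) instead of
    being dropped, so the review SLA degrades gracefully under throttling.
    """
    results = []
    while queue and len(results) < max_batch:
        if not bucket.try_acquire():
            break  # out of tokens: stop and let the caller retry later
        results.append(fetch(queue.popleft()))
    return results
```

The key property is that running out of budget pauses work rather than losing it, which is what keeps a batch review predictable against source constraints.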
2. Feature engineering
The system built channel-level features around engagement behavior, growth behavior, variance patterns, reaction patterns, late-view behavior, and additional anomaly flags. This is where the business advantage started. The pipeline did not treat “fraud” as one vague indicator. It decomposed channel quality into interpretable signals.
3. Rules and anomaly scoring
A hybrid scoring layer combined deterministic rules with anomaly-detection logic. This gave the team the control of an explainable system without giving up the ability to catch less obvious patterns.
4. Topic-aware thresholding
Channel behavior is not uniform across domains. Entertainment, news, and finance channels have different normal engagement ranges. The system therefore used a topic classifier plus size-aware baselines so that “suspicious” meant suspicious relative to the correct peer group.
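The peer-group idea can be shown in a few lines: an anomaly score is computed relative to the expected engagement-rate band for the channel's topic, not a global cutoff. The bands echo the topic table later in this case; the exact ranges, the size bucketing, and the scoring formula here are simplified illustrations.

```python
# Illustrative peer-group baselines: topic -> (low, high) expected ER range.
# The production system also bucketed by channel size.
ER_BASELINES = {
    "entertainment": (0.08, 0.30),
    "news": (0.03, 0.12),
    "finance": (0.04, 0.18),
}

def er_anomaly(topic: str, engagement_rate: float, default=(0.03, 0.30)) -> float:
    """Return 0.0 inside the peer range, otherwise a score that grows
    with the relative distance outside the band."""
    low, high = ER_BASELINES.get(topic, default)
    if engagement_rate < low:
        return (low - engagement_rate) / low
    if engagement_rate > high:
        return (engagement_rate - high) / high
    return 0.0
```

A 24% ER is perfectly normal for an entertainment channel and a strong red flag for a news channel, which is exactly why one global baseline fails.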
5. Verdict generation
Channels were scored on a 0 to 100 FraudScore and mapped to buy / hold / avoid decisions. Every verdict included the strongest contributing signals rather than a mysterious single score.
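The verdict step can be sketched as a score-to-decision mapping that always carries its strongest drivers. The 40/70 cutoffs here are illustrative placeholders; the real thresholds were tuned per topic and size segment under the cost-sensitive policy described below.

```python
def verdict(fraud_score: float, drivers: dict, top_n: int = 3) -> dict:
    """Map a 0-100 FraudScore to buy / hold / avoid and attach the
    strongest contributing signals (driver names -> contribution)."""
    if fraud_score < 40:
        decision = "buy"
    elif fraud_score < 70:
        decision = "hold"
    else:
        decision = "avoid"
    # Sort drivers by contribution so every verdict is explainable.
    top = sorted(drivers.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return {"score": fraud_score, "decision": decision,
            "reasons": [name for name, _ in top]}
```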
6. Feedback loop
Analysts reviewed outcomes through a Telegram bot and could confirm or reject a verdict with one click. That feedback updated recalibration logic and fed model review without turning the system into an opaque self-learning black box.
FraudScore design
FraudScore was built from six interpretable signals and then thresholded under a cost-sensitive policy.
The system used six interpretable submetrics:
- engagement-rate anomalies
- subscriber-growth anomalies
- coverage stability and coefficient-of-variation behavior
- reactions versus expected response
- late views after 24 hours
- residual anomalies such as zero-engagement or duplicate-content patterns
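The aggregation of these six submetrics into a 0 to 100 score can be sketched as a clipped weighted blend. The weights below are hypothetical; the production blend and normalization were tuned on the labeled dataset, not fixed like this.

```python
# Hypothetical weights over the six interpretable submetrics.
WEIGHTS = {
    "er_anomaly": 0.25,
    "growth_anomaly": 0.20,
    "coverage_cv": 0.15,
    "reaction_mismatch": 0.15,
    "late_views": 0.15,
    "residual_flags": 0.10,
}

def fraud_score(submetrics: dict) -> float:
    """Combine [0, 1] submetrics into a 0-100 FraudScore.
    Missing submetrics count as 0; values are clipped into [0, 1]."""
    s = sum(WEIGHTS[k] * min(max(submetrics.get(k, 0.0), 0.0), 1.0)
            for k in WEIGHTS)
    return round(100.0 * s, 1)
```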
The final score was not just a sum of weird heuristics. It was a production operating point tuned under asymmetric error cost. In practice, the team optimized for expected loss rather than raw accuracy because false negatives and false positives do not cost the business the same amount.
That is the right production framing:
- a false negative means buying fraudulent inventory
- a false positive means discarding legitimate reach
The threshold was therefore calibrated under a cost-sensitive objective instead of a vanity metric.
Reporting and planner workflow
Review workflow from batch output through planner export, analyst dispute handling, and safe recalibration.
The output was intentionally operational:
- XLSX reports for media-planning workflows
- JSON API for integration into internal systems
- channel-level verdicts with top contributing reasons
- estimated waste for budget planning
That is what turned the model into a workflow capability. Media teams did not need to inspect raw feature tables. They needed a review artifact they could use quickly and then contest or confirm through a feedback path.
Validation and evidence
The validation layer was strong enough to defend the system in front of skeptical operators.
- more than 850 channels in the labeled dataset
- dual-expert labeling with Cohen’s kappa of 0.78
- group-aware splits to reduce leakage
- bootstrap confidence intervals for precision, recall, and MCC
- probability calibration with Brier and ECE monitoring
- time-aware validation windows so the system did not overfit one historical snapshot
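The bootstrap confidence intervals mentioned above can be sketched as a percentile bootstrap over (label, flag) pairs. This is a simplified stand-in for the actual validation code.

```python
import random

def bootstrap_precision_ci(labels, flagged, n_boot=2000, alpha=0.05, seed=7):
    """Percentile bootstrap CI for Precision@Fraud.
    Resamples labeled decisions with replacement and reads off the
    alpha/2 and 1 - alpha/2 quantiles of the precision distribution."""
    rng = random.Random(seed)
    pairs = list(zip(labels, flagged))
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(pairs) for _ in pairs]
        tp = sum(1 for y, f in sample if f and y == 1)
        fp = sum(1 for y, f in sample if f and y == 0)
        if tp + fp:
            stats.append(tp / (tp + fp))
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi
```

Reporting the interval rather than the point estimate is what lets a 0.90 precision claim survive scrutiny on an 850-channel dataset.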
The holdout operating point delivered:
- Precision@Fraud 0.90
- Recall@Fraud 0.70
- MCC 0.58
- false-positive rate 9%
This is the kind of evidence that actually matters in production. It is not “the model did well on a benchmark.” It is “the decision policy landed where the business needed it to land.”
Topic-aware modeling
One of the better production decisions in the system was to avoid one global fraud baseline.
| Topic | Typical ER range | Operational implication |
|---|---|---|
| Entertainment | 8% to 30% | High interaction is not automatically suspicious |
| News | 3% to 12% | Stable reach with lower reaction volume is normal |
| Finance | 4% to 18% | Interaction patterns are lower and more concentrated |
A lightweight TF-IDF plus logistic-regression topic classifier supported this layer. It reached macro-F1 of about 0.84 with latency under 50 ms, which was good enough for routing and adaptive thresholding without introducing a heavy inference dependency.
This is a strong example of production pragmatism. The topic classifier did not need to be state of the art. It needed to be cheap, stable, and good enough to stop the fraud system from overgeneralizing across very different channel categories.
Economics
The case became commercially compelling because the scoring system connected directly to budget waste.
For a typical RUB 9M monthly media plan:
| Scenario | Fraud share | Budget loss | Savings vs. pre-system |
|---|---|---|---|
| Before deployment | 27% | RUB 2.43M | baseline |
| After deployment, conservative | 15% | RUB 1.35M | RUB 1.08M |
| After deployment, observed case | 11% | RUB 990k | RUB 1.44M |
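The table's arithmetic is a straightforward function of budget and fraud share, which makes the savings claim easy to audit. A minimal worked example using the figures above:

```python
def plan_economics(budget_rub: float, fraud_before: float, fraud_after: float):
    """Budget waste before/after deployment and the resulting savings.
    Waste is modeled simply as budget * fraud share, as in the table."""
    loss_before = budget_rub * fraud_before
    loss_after = budget_rub * fraud_after
    return loss_before, loss_after, loss_before - loss_after

# RUB 9M plan: 27% fraud share before, 11% observed after deployment.
before, after, saved = plan_economics(9_000_000, 0.27, 0.11)
```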
The unit economics were also clear:
- about RUB 8 to 20 per channel at batch volumes above 5,000
- about RUB 30 to 80 per channel below 1,000, where collection and caching overhead dominate more of the cost structure
That level of clarity matters when the buyer is not a research team but a media operation deciding whether the system pays for itself. Here it clearly did.
Impact after four weeks
| Metric | Before | After | Delta |
|---|---|---|---|
| Fraud share in media plans | 27% | 11% | -59% |
| Analyst hours for 100 channels | 25 h | 0.2 h (12 min) | -99% |
| False-positive rate | 22% | 9% | -59% |
| Time to report | 24 to 48 h | 0.08 h (about 5 min) | -99% |
| Disputed cases per week | 18 | 4 | -78% |
The most telling part is not just speed. It is that disputed cases went down while the system got faster. That means the model was not simply pushing more automated noise downstream.
By the time the case was documented, the system had analyzed more than 10,000 channels and helped agencies avoid tens of millions of rubles in ineffective placement.
One-click feedback and safe self-improvement
The feedback loop deserves separate attention because it is where many fraud systems become unstable.
Analysts could upvote or downvote a verdict directly in the Telegram bot. That signal was written into the feedback layer and used for:
- threshold recalibration by topic and size segment
- weekly topic-classifier refits in shadow mode first
- promotion only after quality checks
- rollback if MCC or PR-AUC slipped below baseline
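The promotion/rollback rule above can be sketched as a simple gate over shadow-mode metrics. Metric names, the `min_delta` margin, and the three-way outcome are illustrative assumptions, not the exact production policy.

```python
def promote_candidate(baseline: dict, candidate: dict, min_delta: float = 0.0) -> str:
    """Decide whether a shadow-mode refit is promoted, held, or rolled back.
    Roll back if MCC or PR-AUC slips below the live baseline; promote only
    when both metrics clear the baseline by at least `min_delta`."""
    regressed = (candidate["mcc"] < baseline["mcc"]
                 or candidate["pr_auc"] < baseline["pr_auc"])
    if regressed:
        return "rollback"
    improved = (candidate["mcc"] - baseline["mcc"] >= min_delta
                and candidate["pr_auc"] - baseline["pr_auc"] >= min_delta)
    return "promote" if improved else "hold"
```

The gate is deliberately boring: a refit that does not clearly beat the baseline never reaches production traffic.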
The system also included anti-poisoning protections:
- weighting based on labeler trust history
- quorum on contested changes
- collusion-pattern detection
- per-user rate limiting
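Three of those protections, trust weighting, quorum, and per-user rate limiting, can be combined in one small aggregation sketch. Parameter names and defaults are illustrative.

```python
from collections import Counter

def accept_feedback(votes, trust, quorum_weight=2.0, max_per_user=5):
    """Aggregate analyst votes on a verdict with basic anti-poisoning guards.

    votes: list of (user_id, vote) with vote = +1 confirm / -1 reject.
    Each user's votes are capped (rate limiting) and weighted by trust
    history; a change is accepted only if the winning side clears a
    weight quorum. Returns +1, -1, or 0 (no change, quorum not met).
    """
    per_user = Counter()
    tally = {+1: 0.0, -1: 0.0}
    for user, vote in votes:
        if per_user[user] >= max_per_user:
            continue  # per-user rate limiting: extra votes are ignored
        per_user[user] += 1
        tally[vote] += trust.get(user, 0.1)  # low default trust for unknowns
    winner = max(tally, key=tally.get)
    margin = tally[winner] - tally[-winner]
    return winner if margin >= quorum_weight else 0
```

A single low-trust account spamming confirmations never clears quorum, while two trusted analysts agreeing does, which is the behavior that keeps the feedback loop from becoming a corruption path.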
This is important. “Self-learning” is usually where production language gets sloppy. In practice, safe improvement requires gating, rate limits, promotion rules, and rollback. Otherwise the feedback loop becomes a corruption path.
My role
I owned the ML and production decision layer for the system:
- designed the six-part FraudScore and the thresholding policy
- implemented topic-aware ranges that reduced false positives materially
- optimized the batch path to hit the 100-channels-in-12-minutes objective
- trained and deployed the topic classifier
- built the recalibration and drift-monitoring loop
- packaged the system into planner-ready reporting and ROI logic
This was not a notebook-only fraud model. It was a production analytics system designed to support money-moving decisions.
Technical annex
Six submetrics in more detail
The six interpretable submetrics were:
- engagement-rate anomalies relative to topic and size
- subscriber-growth spikes against expected baseline
- suspiciously low or high coverage stability
- reactions misaligned with expected view patterns
- unusually high late-view share after 24 hours
- residual anomalies such as duplicate content and zero-engagement behavior
The score was then calibrated into a probability and mapped into an operating decision. If explainability was incomplete, the system would not issue the strongest avoid verdict.
Validation policy
The validation path used holdout splits by channel identity, time-aware windows, and bootstrap confidence intervals. The business threshold was set by expected-cost minimization rather than a one-size-fits-all accuracy metric.
Infrastructure and API path
The system ran on Python 3.11 with FastAPI, PostgreSQL, Redis, Celery, and structured logging. Reports were generated as XLSX for media teams, while JSON responses supported system integration. Docker-based deployment kept the runtime simple enough for a small product team to operate.
Governance and retention
The system used only public data, stored decision logs for 90 days, and retained aggregate data longer for operational analysis. Explainability and audit trail were treated as product requirements, not internal debugging conveniences.
What this case proves
This case proves that antifraud analytics becomes strategically valuable only when it is attached to planner workflow, explainability, and cost-aware thresholds. A model that only says “this looks suspicious” is not enough. A production system needs to tell a team what to do, how confident to be, and how to learn safely from disagreement.
That is what this project delivered: not just fraud detection, but a decision system for media planning under uncertainty.
Bottom line
The platform reduced budget waste, accelerated review speed, and made fraud decisions explainable enough for real operations. It combined rules, anomaly logic, feedback, and planner workflow into one production path. For an adtech environment where bad inventory can quietly destroy ROI, that is the difference between analytics and real control.
FAQ
What did the system actually classify?
It scored Telegram channels for purchase risk by combining behavioral metrics, anomaly patterns, topic-aware baselines, and transparent verdict rules that media planners could review.
Why not use a single end-to-end ML classifier?
The client needed explainable production decisions and fast iteration on failure cases. A hybrid rules-plus-anomaly design gave better control over precision, feedback, and cost-sensitive thresholding.
How did the system keep improving after launch?
Analyst feedback from the Telegram bot fed recalibration and weekly model updates, while drift and quality metrics were monitored so threshold changes did not silently degrade performance.