Gaming the Metric, Not the Harm: Certifying Safety Audits against Strategic Platform Manipulation
Florian A. D. Burnat, Brittany I. Davidson

TL;DR
This paper analyzes how online safety metrics can be manipulated by strategic platforms and proposes a semantic-envelope approach to certify genuine harm reduction despite such manipulation.
Contribution
It introduces a formal framework for understanding metric manipulation and proposes a class-stratified certification method that resists strategic gaming.
Findings
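Fragile metrics, which score variants directly, are manipulable as soon as two equivalent variants in a harmful class receive different scores, so they fail invariance across semantically equivalent content.
Semantic-envelope metrics, which assign each variant the maximum score in its class, resist this manipulation and maintain certification.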
Experimental results show the proposed method outperforms fragile metrics.
Abstract
Online-safety regulation under the UK Online Safety Act and the EU Digital Services Act increasingly treats scalar metrics as compliance evidence. Once announced, such a metric also becomes an optimization target: a strategic platform can improve its score by routing recommendations through semantically equivalent content variants, without reducing true harm. We ask when such an audit metric can still certify a genuine reduction in harm. The protocol is modeled as a published transformation graph whose connected components form semantic classes, and the metric itself is treated as a security object. Three results follow. First, any metric that scores variants directly is manipulable as soon as two equivalent variants in a harmful class disagree in score. Second, the semantic-envelope lift, which assigns each variant the maximum score in its class, is the unique pointwise minimum among…
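The construction described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the toy scores, and the exposure-weighted mean used as the audit statistic are all assumptions made for illustration. The sketch treats transformation edges as undirected, takes connected components as semantic classes, and compares a direct (fragile) score against its envelope lift when a platform reroutes exposure to a lower-scoring equivalent variant.

```python
from collections import defaultdict


def semantic_classes(variants, edges):
    """Connected components of the transformation graph (edges treated as undirected)."""
    parent = {v: v for v in variants}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)

    classes = defaultdict(set)
    for v in variants:
        classes[find(v)].add(v)
    return list(classes.values())


def envelope_lift(score, classes):
    """Assign every variant the maximum score found in its semantic class."""
    lifted = {}
    for cls in classes:
        top = max(score[v] for v in cls)
        for v in cls:
            lifted[v] = top
    return lifted


def audit(metric, exposure):
    """Hypothetical audit statistic: exposure-weighted mean score."""
    total = sum(exposure.values())
    return sum(metric[v] * n for v, n in exposure.items()) / total


# Toy data (assumed): a1 and a2 are equivalent harmful variants, b1 is benign.
variants = ["a1", "a2", "b1"]
edges = [("a1", "a2")]
raw = {"a1": 0.75, "a2": 0.25, "b1": 0.0}

classes = semantic_classes(variants, edges)
lifted = envelope_lift(raw, classes)

honest = {"a1": 100, "b1": 100}  # exposure before gaming
gamed = {"a2": 100, "b1": 100}   # same harm, rerouted through a2

raw_honest, raw_gamed = audit(raw, honest), audit(raw, gamed)
lifted_honest, lifted_gamed = audit(lifted, honest), audit(lifted, gamed)
```

In this toy setting the direct score drops after rerouting even though exposure to the harmful class is unchanged, while the lifted score stays the same, which is the invariance property the abstract attributes to the semantic-envelope lift.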
