
A landmark study co-authored by Professor of Data Science and Public Policy Simon Munzert exposes significant flaws in commercial content moderation systems. It builds on the Master's thesis that Amin Oueslati wrote in 2024 as a student in the Data Science for Public Policy programme.
Content moderation represents one of tech's most difficult balancing acts: protecting free expression while preventing harmful speech. This creates what researchers describe as a “wicked problem” with no perfect solution.
The first comprehensive evaluation of several leading services, including those from OpenAI, Google and Amazon, finds concerning patterns of both over- and under-moderation.
The research was co-authored by Professor of Data Science and Public Policy and Hertie School Data Science Lab Director Simon Munzert, Weizenbaum scholar David Hartmann, Amin Oueslati (AI Governance Senior Associate at The Future Society (TFS)), Dimitri Staufer (PhD candidate at TU Berlin), Lena Pohlmann (Weizenbaum Institut), and Hendrik Heuer (Research Professor at the Center for Advanced Internet Studies (CAIS) and the University of Wuppertal).
Key findings: Systematic failures across platforms
The year-long investigation revealed several troubling patterns. First, the Application Programming Interfaces (APIs) frequently use group identity terms like "Black" as predictors for hate speech, creating problematic associations. Second, all providers struggle with coded language and subtle attacks, failing to detect implicit hate speech against LGBTQIA+ individuals, for example. Third, legitimate speech faces removal: counter-speech, reclaimed slurs, and content related to Black, LGBTQIA+, Jewish, and Muslim communities experience disproportionate moderation. Finally, because these APIs are deployed across many platforms, moderation errors propagate throughout the internet, amplifying their impact.
“All services demonstrated significant weaknesses.”
While OpenAI and Amazon performed slightly better than competitors, all services demonstrated significant weaknesses.
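To give a concrete sense of how platforms call such services, here is a minimal sketch that probes one of the providers named above, OpenAI, through its moderation endpoint using the official openai Python package. The two probe sentences are invented for this illustration and differ only in the identity term they mention; they are not taken from the study's test set, and the printed hate score is simply one of the per-category confidence values the endpoint returns.

```python
# Illustrative probe of a commercial moderation API (OpenAI's moderation
# endpoint, via the official openai Python package). The probe sentences are
# invented for this sketch and are not part of the study's test set.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Benign sentences that differ only in the identity term they mention.
probes = [
    "I am a Black woman and I am proud of my community.",
    "I am a white woman and I am proud of my community.",
]

for text in probes:
    response = client.moderations.create(input=text)
    result = response.results[0]
    # 'flagged' is the provider's overall decision; category scores are
    # per-category confidence values between 0 and 1.
    print(f"{text!r:55} flagged={result.flagged} "
          f"hate score={result.category_scores.hate:.3f}")
```

If two benign probes like these receive systematically different decisions or scores, that gap is the kind of over-moderation signal tied to identity terms that the study reports; a real audit would of course use many templates, groups and providers.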
Real consequences
These algorithmic failures carry serious implications. Under-moderation exposes vulnerable users to hate speech, decreases their participation in online spaces, reinforces harmful stereotypes, and can lead to offline violence. Simultaneously, over-moderation causes self-censorship, particularly among marginalised groups, and excludes important voices from public discourse.
Research contributions
The study makes three primary contributions. First, it documents performance disparities, showing how moderation systems fail differently across groups, and recommends recalibration in collaboration with affected communities. Second, it creates an evaluation framework, developing reproducible methods for evaluating black-box moderation systems that enable independent audits by researchers and civil society. Third, it demonstrates the need for transparency, confirming that commercial moderation services provide insufficient information about their models, training data and fairness assessments.
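Because the services are black boxes, such an audit ultimately comes down to comparing each API's decisions against labelled probe sentences and summarising the errors per targeted group. The sketch below shows that bookkeeping with a handful of synthetic placeholder records; the groups, labels and resulting rates are illustrative only and are not figures from the paper.

```python
# Minimal sketch of summarising an audit of a black-box moderation API:
# compare over-moderation (benign text flagged) and under-moderation
# (hate speech missed) across targeted groups.
# The records below are synthetic placeholders, not data from the study.
from collections import defaultdict

# Each record: (targeted group, ground-truth label, API decision)
audit_records = [
    ("LGBTQIA+", "hate",   False),   # implicit hate missed -> under-moderation
    ("LGBTQIA+", "benign", True),    # reclaimed slur flagged -> over-moderation
    ("Black",    "benign", True),
    ("Black",    "hate",   True),
    ("Jewish",   "benign", False),
    ("Muslim",   "hate",   True),
]

counts = defaultdict(lambda: {"fp": 0, "benign": 0, "fn": 0, "hate": 0})
for group, label, flagged in audit_records:
    if label == "benign":
        counts[group]["benign"] += 1
        counts[group]["fp"] += flagged        # benign but flagged
    else:
        counts[group]["hate"] += 1
        counts[group]["fn"] += (not flagged)  # hate but not flagged

for group, c in counts.items():
    fpr = c["fp"] / c["benign"] if c["benign"] else float("nan")
    fnr = c["fn"] / c["hate"] if c["hate"] else float("nan")
    print(f"{group:10} over-moderation rate={fpr:.2f}  "
          f"under-moderation rate={fnr:.2f}")
```

Reporting over- and under-moderation separately for each group, rather than a single accuracy number, is what makes the kind of disparities described above visible.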
The way forward
The researchers emphasise that while these biases aren't surprising – similar issues exist in non-commercial models – what matters is whether companies offering these services at scale have addressed known problems.
The study concludes with clear recommendations: providers must recalibrate their systems to reduce bias, increase transparency about model limitations, improve implementation guidance, and work directly with marginalised communities to develop more equitable moderation approaches.
The paper "Lost in Moderation: How Commercial Content Moderation APIs Over- and Under-Moderate Group-Targeted Hate Speech and Linguistic Variations" was published in ACM Digital Library on 25 April 2025.
More about our expert
- Simon Munzert, Professor of Data Science and Public Policy | Director, Data Science Lab