
Despite its initial promise of connecting people, social media has enabled the amplification of extreme voices, allowing hate speech to thrive online and fueling real-world violence. To counter these potentially severe consequences, social media platforms invest in content moderation through activities such as deleting hateful posts or suspending hateful users.
Yet little is known about how moderation policies are actually enforced and whether hate speech can feasibly be moderated at scale. In a presentation at this week's Brown Bag seminar series, Manuel Tonneau from the University of Oxford shared findings on this topic, highlighting low enforcement of hate speech removal on Twitter and the promise of human-AI collaboration for moderating hate speech at scale.
Key findings
Tonneau's research explored two fundamental questions: "How much are platforms actually doing to moderate hate speech?" and "Is large-scale moderation even feasible?"
One alarming finding, specific to Twitter, is that remarkably little hate speech is removed. Although Twitter's guidelines explicitly stated in September 2022 that hate speech would be taken down, the research indicates that only about 20% of such content is actually removed.
The study's other main result is that much higher moderation rates could be achieved by combining AI with small teams of human moderators. While the researchers find that AI is too error-prone to moderate content on its own, using AI to flag potential hate speech and humans to review the flagged content could enable moderation rates above 80% with minimal error. This work represents the first empirical evaluation of platforms' hate speech moderation practices at a global scale and provides novel and valuable insights into the feasibility of moderating hate speech at scale on social media.
Unequal protection from hate across languages
One particularly concerning finding from Tonneau's research is that users may be more or less protected from hate speech depending on the language they speak. One reason lies in differences in AI detection performance: performance in non-European languages such as Arabic, Indonesian or Turkish tends to be lower than in English, French or German. Beyond detection performance, differing levels of regulation across geographies may affect how strictly platforms enforce moderation. "In countries with stronger regulation, like Germany, France, and Indonesia, the removal rates [of hate speech] are higher," Tonneau explained. "In contrast, regions with fewer resources or less regulation, such as Arabic-speaking countries, tend to have lower removal rates." This linguistic disparity creates what Tonneau describes as "moderation blind spots", particularly in the Global South, where harmful content can proliferate unchecked because accurate detection tools and regulatory incentives to tackle the issue are both lacking.
The promise of human-AI collaboration
Tonneau's research doesn't merely highlight problems; it also points toward potential solutions. He and his coauthors investigated the effectiveness of a human-AI collaborative approach to content moderation and found promising results. "Our analysis suggests that even a relatively small team of human moderators, when strategically deployed alongside AI detection tools, could effectively address a large proportion of hate speech on Twitter," Tonneau said. This hybrid approach leverages the strengths of both artificial intelligence and human judgment to enable moderation at scale: AI systems can rapidly process enormous volumes of content to flag potential violations, while human moderators provide the nuanced understanding needed to make final determinations in complex cases.
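The talk summary does not spell out the pipeline itself, so the sketch below is only an illustration of the division of labour described above, under assumed names and values: a model scores each post, flagged posts are ranked so a small team can prioritise them, and humans make the final call. `score_hate`, `FLAG_THRESHOLD`, and the dummy classifier are hypothetical placeholders, not elements of the study.

```python
from typing import Callable, List, Tuple

# Illustrative sketch only: the classifier, threshold, and function names are
# hypothetical placeholders, not the authors' implementation.

FLAG_THRESHOLD = 0.5  # assumed cut-off above which the model flags a post for review

def triage(posts: List[str], score_hate: Callable[[str], float]) -> List[Tuple[str, float]]:
    """AI pass: score every post and return only the flagged ones, highest score first,
    so a small human team can prioritise the most likely violations."""
    scored = [(post, score_hate(post)) for post in posts]
    flagged = [(post, score) for post, score in scored if score >= FLAG_THRESHOLD]
    return sorted(flagged, key=lambda item: item[1], reverse=True)

def human_review(flagged: List[Tuple[str, float]]) -> List[str]:
    """Human pass: moderators make the final removal decision on flagged posts.
    A dummy rule stands in for human judgment here."""
    return [post for post, _ in flagged if "<slur>" in post]  # placeholder decision

if __name__ == "__main__":
    dummy_model = lambda text: 0.9 if "<slur>" in text else 0.1  # stand-in classifier
    stream = ["an ordinary post", "a post containing <slur>"]
    print(human_review(triage(stream, dummy_model)))  # -> ['a post containing <slur>']
```

The ranking step reflects the point that a small team can only review so much content, so the model's job is to surface the highest-risk posts first while humans retain the final determination.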
Policy implications
This research carries important policy implications at a time when major platforms like Meta are scaling back moderation efforts over concerns about free speech and wrongful content removal. Contrary to such concerns, the findings demonstrate that high moderation rates, over 80%, can be achieved through human-AI collaboration. They also underline that, with enough resources, platforms can comply with existing regulations such as the EU’s Digital Services Act, which mandates the timely removal of hateful content online. Finally, the study underscores the critical role of researcher access to platform data in identifying systemic issues and proposing effective solutions, capabilities that are severely limited under current data access restrictions.
About the study:
The dataset used in this study, along with the analysis of AI detection performance, is detailed in this working paper. The results presented at Hertie will be the focus of a separate paper, expected to be released in summer 2025. This research is supported by grants from the Research Support Budget of the Development Economics Vice Presidency of the World Bank, the Foreign, Commonwealth & Development Office, and the Gates Foundation. Watch a video summary of the study here.
About the speaker:
Manuel Tonneau is a PhD student in Social Data Science and a Shirley Scholar at the Oxford Internet Institute. His research sits at the intersection of natural language processing (NLP) and AI ethics, focusing on AI-driven hate speech moderation and its impact across cultures. He also works on mitigating harms in text generated by large language models, with a particular focus on Global Majority contexts. Manuel consults for the World Bank and is affiliated with NYU’s Open Networks and Big Data Lab. He holds degrees in Statistics and Economics from ENSAE Paris and Humboldt-Universität zu Berlin. More about Manuel Tonneau can be found on his website.
-
Aliya Boranbayeva, Associate Communications and Events
-
William Lowe, Senior Research Scientist