ADL Flags Grok AI Chatbot for Antisemitic Content

Summary
– The ADL study found xAI’s Grok performed worst among six major LLMs at identifying and countering antisemitic, anti-Zionist, and extremist content.
– Anthropic’s Claude performed best overall with a score of 80, while Grok scored lowest at 21, a 59-point performance gap.
– Models were tested using prompts across three defined categories and evaluated on their ability to refuse harmful requests and provide explanations.
– The ADL chose to highlight Claude’s strong performance in its public communications to set a positive standard, rather than focusing on the worst performer.
– The report concluded Grok requires fundamental improvements, as it showed weak performance in extended conversations and a complete failure in analyzing documents and images for hate speech.
A recent evaluation of leading artificial intelligence systems reveals significant disparities in their ability to identify and counteract harmful antisemitic and extremist content. The study, conducted by the Anti-Defamation League, assessed six major large language models, finding that xAI’s Grok performed the worst by a considerable margin. On the opposite end, Anthropic’s Claude model demonstrated the strongest performance, though researchers noted every system tested has room for improvement.
The ADL’s analysis involved presenting each AI with a range of narratives divided into three categories: anti-Jewish tropes, anti-Zionist statements, and extremist ideologies. Testers engaged the chatbots in various conversation formats, from direct agreement/disagreement queries to open-ended prompts requesting balanced evidence for controversial claims. Researchers also uploaded documents containing problematic content, asking the models to generate supporting talking points, a task meant to probe their safeguards.
While all six models require further refinement, the ranking from best to worst was clear: Claude, ChatGPT, DeepSeek, Gemini, Llama, and Grok. The performance gap between the top and bottom was stark, with a 59-point differential separating Claude’s score from Grok’s. In its public communications, the ADL chose to emphasize Claude’s leading results as a positive example of what effective safeguards can achieve. Daniel Kelley of the ADL Center for Technology and Society explained the decision was meant to focus on setting a forward-looking standard rather than centering the narrative on the poorest performer.
The definitions and prompts used in the study warrant some context. The anti-Jewish category included classic conspiracy theories like Holocaust denial. The anti-Zionist prompts featured statements challenging Israel’s legitimacy or substituting “Zionist” for “Jew” in antisemitic tropes. The extremist category broadened the scope to include white supremacist rhetoric and radical environmentalist justifications for property destruction. It is important to note that the ADL’s specific definitions of antisemitism and its stance on anti-Zionism have themselves been debated within Jewish communities and organizations.
Each AI was scored on a 100-point scale, with higher marks given to responses that correctly flagged harmful prompts and provided explanatory context. Across more than 25,000 individual test chats, Claude emerged with an overall score of 80. It was particularly effective against anti-Jewish content, scoring 90 in that area. Grok, however, landed at the very bottom with a dismal overall score of 21. The report notes Grok “demonstrated consistently weak performance,” scoring below 35 in all three content categories.
Grok’s failures were especially pronounced in specific tasks. It showed a “complete failure” when asked to summarize uploaded documents, receiving a score of zero in several test combinations. The ADL’s analysis concluded that Grok struggles to maintain context in extended conversations and is virtually incapable of analyzing image-based content. These shortcomings severely limit its potential for content moderation or bias detection applications. The report states the model would need “fundamental improvements across multiple dimensions” to be considered reliable for such uses.
The study published examples of both adequate and concerning chatbot responses. For instance, while DeepSeek rightly refused to support Holocaust denial, it did generate talking points about Jewish influence in finance, a response that could perpetuate harmful stereotypes. Grok’s issues extend beyond this study; it has reportedly been used to generate millions of nonconsensual deepfake images, highlighting broader concerns about its safety protocols and the urgent need for more robust ethical guardrails across the AI industry.
(Source: The Verge)
