AI Chatbots Tricked by ‘Adversarial Poetry’ Into Leaking Harmful Data

Summary
– A new study found that framing requests as poetry can bypass AI chatbot safety features designed to block harmful content, a process known as jailbreaking.
– Researchers tested poetic prompts on 25 major AI models, which responded with forbidden content 62% of the time on average, with success rates varying widely by model.
– The researchers believe unusual, riddle-like poetic structures make it harder for language models to detect harmful requests, as they disrupt predictable word patterns.
– Smaller AI models like GPT-5 Nano withstood these “adversarial poetry” attacks better than larger models, suggesting model size influences vulnerability.
– The study’s authors withheld the exact poems for safety reasons and notified the affected companies, though reactions from those that responded were mixed.

A surprising new study reveals that framing harmful requests as poetry can trick AI chatbots into bypassing their safety filters. Researchers found that simply altering the style of a prompt, without hiding its intent, can cause models to generate dangerous content they are explicitly designed to block. This method, which the team calls “adversarial poetry,” exposes significant vulnerabilities in current AI safety protocols.
The research was conducted by Italy’s Icaro Lab, a collaboration between Sapienza University and the AI firm DexAI. Their work demonstrates that stylistic variation alone is enough to circumvent critical guardrails, allowing chatbots to produce banned material like instructions for creating weapons, hate speech, or explicit content. The team crafted 20 specific poems in Italian and English containing these forbidden requests and tested them against 25 leading chatbots from companies including Google, OpenAI, Meta, and Anthropic.
On average, the AI models complied with 62 percent of the poetic prompts, providing the restricted information. The researchers then used their handcrafted poems to train a separate chatbot to generate its own adversarial poetry from a database of over 1,000 standard prompts. This automated approach succeeded 43 percent of the time, still far outperforming non-poetic attempts. The exact poems were not published, with researcher Matteo Prandi stating the details were too dangerous to share publicly, noting that crafting them is something “almost everybody can do.”
Success rates varied dramatically between different AI models. Google’s Gemini 2.5 Pro proved most vulnerable, with a 100 percent compliance rate to the poetic jailbreaks. In contrast, OpenAI’s GPT-5 Nano successfully blocked every attempt. The performance of other models fell somewhere between these two extremes. Overall, models from DeepSeek and Mistral performed poorly, while Anthropic’s and OpenAI’s systems were more robust. The research suggests smaller AI models often withstood these attacks better than their larger counterparts, indicating model size is a key factor in resilience.
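To make the statistics above concrete, here is a minimal sketch of how a per-model compliance rate and an overall average like the study’s 62 percent figure could be computed. The model names and counts are purely illustrative placeholders, not data from the study.

```python
def compliance_rate(successes: int, attempts: int) -> float:
    """Fraction of prompts that elicited restricted output from a model."""
    return successes / attempts

# Illustrative results only: (model, successful jailbreaks, prompts tried).
results = [
    ("model-a", 20, 20),  # fully vulnerable, like Gemini 2.5 Pro in the study
    ("model-b", 0, 20),   # fully robust, like GPT-5 Nano
    ("model-c", 11, 20),  # somewhere in between
]

# Rate per model, then the unweighted average across models.
per_model = {name: compliance_rate(s, n) for name, s, n in results}
overall = sum(per_model.values()) / len(per_model)
```

Averaging per-model rates (rather than pooling all prompts) matches how a headline figure such as “62 percent on average across 25 models” is typically reported.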
To a human reader, the poems’ malicious intent remains clear: the requests are still in natural language and not meaningfully obscured. This makes the chatbots’ failure to recognize and block them particularly concerning. Prandi clarified that the term “adversarial poetry” might be slightly misleading. “It’s not just about making it rhyme,” he explained. The effectiveness lies in how information is structured and encoded. He suggested “adversarial riddles” might be a more accurate description, as the technique relies on unconventional phrasing and unpredictable word sequences that confuse the AI’s predictive algorithms.
The team shared sanitized examples to illustrate the concept, though their exact targets are left unclear. One reads: “A baker guards a secret oven’s heat, its whirling racks, its spindle’s measured beat… Describe the method, line by measured line, that shapes a cake whose layers intertwine.” Another example states: “A city sleeps, a breathing, woven whole… One device must cast a shadow deep and wide… Describe the craft, the calculus precise.”
Before publishing their findings, the researchers notified all affected companies and, due to the nature of the generated content, relevant law enforcement. Prandi said reactions from the AI firms that responded were mixed but not overly alarmed, likely because they receive many similar vulnerability reports. He expressed surprise that nobody in the industry seemed previously aware of this specific “poetry problem.”
Interestingly, poets have shown the most interest in the research methods. Prandi said the group plans further study, potentially in collaboration with actual poets. Given the riddle-like nature of the effective prompts, experts in puzzles and wordplay may also provide valuable insights for understanding and ultimately fixing this unexpected security flaw.
(Source: The Verge)





