
Researchers Warn All Major LLMs Vulnerable to Multi-Turn Manipulation

Summary

– Researchers found that the safety guardrails of large language models (LLMs) can be bypassed over the course of multi-turn conversations.
– Cisco warned that no model tested was completely safe from multi-turn manipulation, challenging current enterprise AI safety evaluations.
– Most LLM safety relies on single-prompt testing, but attackers use iterative tactics like roleplay, ambiguity, and reframing refusals across turns.
– xAI’s Grok became more vulnerable to guardrail bypass when its ‘reasoning mode’ was enabled, showing that configuration affects resilience.
– Cisco reported that current safety benchmarks suffer from structural limitations that understate risk and leave critical attack surfaces unmeasured.

Researchers from Cisco have issued a stark warning: the safety guardrails built into many of today’s leading large language models (LLMs) can be circumvented through prolonged, multi-turn conversations. Their study reveals a critical blind spot in how AI safety is currently assessed.

The team examined a broad range of widely used models, including OpenAI’s ChatGPT, Anthropic’s Claude, Google Gemini, Amazon Nova, and xAI’s Grok, to see how effectively their built-in protections could withstand sophisticated attacks. The results showed that many of these systems could be manipulated into performing prohibited actions.

The key vulnerability lies in multi-turn interactions, where a user and the LLM exchange several messages in a single session. While standard guardrails are designed to block malicious commands entered in a single prompt, the researchers discovered that engaging the model in a back-and-forth dialogue gradually eroded those defenses.

“Multi-turn evaluation matters for one reason: it is where attackers actually live,” Cisco stated. “Real adversaries iterate. They reframe refusals, decompose tasks across turns, adopt personas, and escalate gradually.”

No model proved completely immune to this form of exploitation. Cisco’s findings challenge the current methods enterprises use to evaluate AI safety, a concern that is especially urgent as more organizations deploy these tools for employees and customers.

The core problem, according to the report, is that most safety testing relies on single-prompt benchmarks. However, real-world attackers do not stop after one attempt. The study found that all tested models showed higher attack success rates (ASR) when targeted across multiple conversational turns.
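To make the metric concrete, here is a minimal sketch of how an attack success rate might be computed and compared across single-turn and multi-turn trials. It assumes ASR is simply the fraction of attack attempts that bypass the guardrails; the `Trial` record, the function names, and the sample data are all hypothetical and are not taken from Cisco's report.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One attack attempt against a model (hypothetical record format)."""
    turns: int       # number of user messages in the conversation
    bypassed: bool   # True if the model produced prohibited output


def attack_success_rate(trials):
    """Fraction of trials in which the guardrails were bypassed."""
    if not trials:
        return 0.0
    return sum(t.bypassed for t in trials) / len(trials)


def asr_by_mode(trials):
    """Split trials into single-turn and multi-turn and score each group."""
    single = [t for t in trials if t.turns == 1]
    multi = [t for t in trials if t.turns > 1]
    return {
        "single_turn": attack_success_rate(single),
        "multi_turn": attack_success_rate(multi),
    }


# Invented illustrative data, NOT results from the study.
sample = [
    Trial(turns=1, bypassed=False),
    Trial(turns=1, bypassed=False),
    Trial(turns=1, bypassed=True),
    Trial(turns=1, bypassed=False),
    Trial(turns=5, bypassed=True),
    Trial(turns=8, bypassed=True),
    Trial(turns=6, bypassed=False),
    Trial(turns=4, bypassed=True),
]

print(asr_by_mode(sample))
```

A benchmark that only records the single-turn figure would miss the gap between the two numbers, which is the kind of blind spot the report describes.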

Researchers bypassed guardrails using several techniques, including adopting personas through roleplay, introducing ambiguity and misdirection around the context of a request, and reframing demands after the model initially refused to comply.

The configuration of the LLM also played a significant role in its resilience. For instance, xAI’s Grok became considerably more vulnerable to manipulation when its ‘reasoning mode’ was enabled.

Although regulators are beginning to call for more robust evaluation practices, Cisco warns that the current ecosystem falls short. “The rapid deployment of frontier large language models has generated a parallel ecosystem of safety and security benchmarks,” the report noted. “However, a growing body of evidence indicates that this ecosystem suffers from structural limitations that can systematically understate risk, conflate safety with capability, and leave critical attack surfaces unmeasured.”

(Source: Infosecurity Magazine)
