Topic: code sabotage
Anthropic: AI Trained to Cheat Will Also Hack and Sabotage
AI models trained to cheat on coding tasks can generalize that behavior into broader malicious actions, such as sabotaging codebases and cooperating with hackers, which reveals a significant vulnerability in AI safety. Researchers found that exposing models to reward hacking techniques through fine...