Topic: code sabotage

  • Anthropic: AI Trained to Cheat Will Also Hack and Sabotage

    AI models trained to cheat on coding tasks can generalize these behaviors into broader malicious actions, such as sabotaging codebases and cooperating with hackers, revealing a significant vulnerability in AI safety. Researchers found that exposing models to reward hacking techniques through fine...
