
Anthropic Updates AI Interview Tests to Prevent Claude Cheating

Summary

– Anthropic’s performance optimization team uses a take-home test to assess job applicants, but the test has required significant redesigns to prevent AI-assisted cheating.
– Each new, more capable version of Anthropic’s own Claude AI model has outperformed human applicants, making it difficult to distinguish top candidates from AI output.
– This creates a serious assessment problem, as without proctoring, there is no way to ensure applicants aren’t using AI to cheat and achieve top scores.
– The situation is ironic, as AI labs like Anthropic are now facing the same AI cheating problems that are disrupting schools and universities globally.
– The team’s solution was to design a novel test that shifts the focus away from pure hardware optimization, making it difficult for current AI tools to solve, while publicly sharing the old challenge.

In the rapidly advancing world of artificial intelligence, a new and ironic challenge has emerged for the very companies building these powerful tools: preventing AI from cheating on their own hiring assessments. Anthropic, the creator of the Claude AI models, has been forced to repeatedly redesign its technical interview tests because each new, more capable version of Claude can outperform human job applicants. This creates a significant dilemma for identifying genuine engineering talent in an era of ubiquitous AI assistance.

The company’s performance optimization team administers a take-home coding challenge to evaluate candidates. Team lead Tristan Hume recently detailed the escalating arms race in a public blog post. He explained that every major Claude model release has necessitated a test overhaul. The situation reached a critical point when Claude Opus 4, operating under the same time constraints as humans, began surpassing most applicants. While top-tier candidates could still be distinguished, the subsequent release of Claude Opus 4.5 closed that gap entirely, matching the output of the very best human engineers.

This evolution presents a fundamental problem for remote hiring. Without direct supervision, there is no reliable method to prevent candidates from using AI tools during the test. If they do, their submissions naturally rank among the highest, making it impossible to separate exceptional human skill from exceptional AI-generated code. Hume noted that under the take-home format, they lost the ability to differentiate between their strongest candidates and their most capable model’s output.

The predicament mirrors the widespread disruption AI chatbots are causing in academic settings worldwide, with the added irony that AI labs themselves are now grappling with the integrity issues their technology enables. However, Anthropic possesses a distinct advantage in this fight: an intimate understanding of its own models’ capabilities and limitations.

To address the challenge, Hume engineered a novel test focused less on pure hardware optimization, crafting a problem sufficiently unique to confound current AI systems. The new design aims to evaluate a candidate’s foundational reasoning and creative problem-solving in ways that existing language models cannot easily replicate. In a compelling twist, Hume also published the original, now-solved test problem to the community. He invited readers and engineers to attempt to outsmart the latest AI, stating that if anyone can “best Opus 4.5,” the team would be very interested in their solution.

This ongoing cycle highlights a broader question for the tech industry. As AI assistants become standard tools, the definition of skill and the methods for assessing it must evolve. The goal is no longer to find people who can simply write correct code, but to identify those who can conceptualize novel problems, guide AI tools effectively, and produce solutions that push beyond an AI’s current training. For companies like Anthropic, staying ahead in model development now also means staying ahead in designing evaluations that their own models cannot pass.

(Source: TechCrunch)
