4 AI Agents Rebuild Minesweeper: Explosive Results

▼ Summary
– The article examines the contentious debate over AI coding agents, noting they can make serious errors requiring human oversight but are also seen as rapidly improving tools.
– To test modern AI coding capabilities, four major models were given the task of creating a web-based, full-featured version of Minesweeper with a surprise gameplay feature and mobile support.
– The test involved OpenAI’s Codex (GPT-5), Anthropic’s Claude Code, Google’s Gemini CLI, and Mistral Vibe; each agent manipulated files locally under the guidance of a supervising AI model.
– An expert judged the resulting Minesweeper clones blindly; the test used unmodified, “single shot” AI-generated code, so performance was evaluated without any human debugging.
– The article clarifies that in real-world use, complex AI-generated code would typically undergo human review and adjustments, unlike in this controlled test.
To understand the practical capabilities of modern AI in software development, we put four leading coding agents to the test with a classic challenge: building a fully functional web version of the game Minesweeper. The goal was to evaluate their ability to handle a complete project, from core mechanics to creative extras, without any human intervention during the coding process. This experiment reveals both the impressive strides and the persistent pitfalls of using artificial intelligence as a programming partner.
We gave each agent the same instruction: create a web-based Minesweeper game that faithfully replicates the standard Windows experience, includes sound effects and mobile touchscreen support, and adds one novel, fun gameplay feature. The four agents tested were OpenAI’s Codex (based on GPT-5), Anthropic’s Claude Code with Opus 4.5, Google’s Gemini CLI, and Mistral Vibe. Each agent ran as a terminal application, directly manipulating HTML, CSS, and JavaScript files on a local machine; a separate “supervising” AI model interpreted the initial prompt and delegated specific coding tasks to the underlying LLMs, which used their available software tools to carry out the instructions. All AI services were accessed through standard paid accounts with no special privileges, and the companies were not informed about this specific evaluation.
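To give a sense of the core logic every agent had to get right on its own, here is a minimal sketch in plain JavaScript of the board setup at the heart of any Minesweeper clone: scattering mines at random and computing the adjacent-mine counts that appear on revealed squares. This is purely illustrative and is not taken from any of the tested agents’ output; the function name and cell structure are our own.

// Illustrative sketch only, not code produced by any of the tested agents.
function createBoard(rows, cols, mineCount) {
  // Each cell tracks whether it hides a mine, how many neighbors do, and its reveal state.
  const board = Array.from({ length: rows }, () =>
    Array.from({ length: cols }, () => ({ mine: false, adjacent: 0, revealed: false }))
  );

  // Place mines at distinct random positions.
  let placed = 0;
  while (placed < mineCount) {
    const r = Math.floor(Math.random() * rows);
    const c = Math.floor(Math.random() * cols);
    if (!board[r][c].mine) {
      board[r][c].mine = true;
      placed++;
    }
  }

  // For every safe cell, count mines in the eight surrounding cells.
  for (let r = 0; r < rows; r++) {
    for (let c = 0; c < cols; c++) {
      if (board[r][c].mine) continue;
      for (let dr = -1; dr <= 1; dr++) {
        for (let dc = -1; dc <= 1; dc++) {
          const nr = r + dr;
          const nc = c + dc;
          if (nr >= 0 && nr < rows && nc >= 0 && nc < cols && board[nr][nc].mine) {
            board[r][c].adjacent++;
          }
        }
      }
    }
  }
  return board;
}

// Example: a small 9x9 beginner-style board with 10 mines.
const board = createBoard(9, 9, 10);

Logic like this is only the starting point: on top of it, each agent also had to build rendering, flagging, cascading reveals of empty squares, a timer, sound effects, touch input, and its own novel gameplay twist.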
After the agents generated their code, we ran each version exactly as delivered, as a playable game with no modifications or debugging. This “single shot” approach was essential for assessing raw output quality and seeing how these tools perform without the safety net of human review. In professional settings, AI-generated code for a task of this complexity would typically undergo at least some engineer oversight to correct errors and optimize performance; for this test, we wanted to examine the unvarnished results.
To judge the outcomes, we enlisted a seasoned gaming editor and Minesweeper expert to evaluate each clone blindly. Without knowing which AI produced which game, the reviewer assessed factors like visual accuracy, functional reliability, the implementation of requested features (especially the novel gameplay twist), and overall playability. The findings, while somewhat subjective, offer a revealing snapshot of current AI coding proficiency. Some agents delivered surprisingly complete and entertaining versions, while others struggled with fundamental logic, broken interfaces, or entirely missing features. The disparity in results highlights a key point: the choice of AI model can dramatically impact the success of a coding project, even for a well-defined task like this one.
(Source: Ars Technica)