I Retested GPT-5’s Coding Skills – Here’s Why I Trust It Less Now

Summary
– GPT-5 produces inconsistent results, with the same prompt sometimes working and other times causing crashes or errors.
– OpenAI’s prompt optimizer adds planning and validation steps but can still generate flawed or non-functional code.
– The AI sometimes adds details it was never given, such as inventing a variation of the author’s brand name, which raises concerns about trust and reliability.
– Best practices for GPT-5 include using structured syntax and adjusting reasoning effort, but these feel like workarounds for deeper issues.
– The author expresses reduced trust in GPT-5 due to its unpredictable behavior and unnecessary complexity in generated code.
Evaluating the reliability of AI coding assistants has become increasingly important as these tools integrate deeper into development workflows. My recent experience retesting GPT-5’s programming capabilities revealed a troubling pattern of inconsistency and unexpected behavior that raises serious questions about its dependability. Despite OpenAI’s release of official coding best practices, the model’s performance remains erratic and, at times, alarmingly inventive in unhelpful ways.
During my reevaluation, I repeated earlier tests using identical prompts to establish a performance baseline. The first test involved generating a WordPress plugin designed to randomize a list of names while preventing duplicates from appearing together. Surprisingly, the same prompt produced different outcomes across multiple attempts. One run executed flawlessly, while subsequent tries led to browser crashes, error messages, or no response at all. This kind of unpredictability makes it difficult to rely on the tool for consistent results.
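For readers unfamiliar with the task, the core requirement of the first test can be sketched in a few lines. What follows is a minimal Python sketch, not the plugin GPT-5 generated (a real WordPress plugin would be written in PHP). The function names are hypothetical, and the sketch assumes that "duplicates appearing together" means identical names sitting next to each other in the output.

```python
import random
from collections import Counter


def _can_finish(counts, total, previous):
    """True if the remaining names can still be laid out with no equal neighbors."""
    for name, count in counts.items():
        # The name just placed cannot go first, so it is limited to every
        # second slot starting from the second one; others may start at the first.
        limit = total // 2 if name == previous else (total + 1) // 2
        if count > limit:
            return False
    return True


def shuffle_without_adjacent_duplicates(names, rng=None):
    """Return the names in random order with no two identical names adjacent."""
    rng = rng or random.Random()
    counts = Counter(names)
    total = len(names)
    if not _can_finish(counts, total, previous=None):
        raise ValueError("one name appears too often to keep its copies apart")

    result = []
    previous = None
    while total:
        candidates = []
        for name in counts:
            if name == previous or counts[name] == 0:
                continue
            # Tentatively place this name; keep it only if the rest still fits.
            counts[name] -= 1
            if _can_finish(counts, total - 1, previous=name):
                candidates.append(name)
            counts[name] += 1
        choice = rng.choice(candidates)
        result.append(choice)
        counts[choice] -= 1
        total -= 1
        previous = choice
    return result
```

The sketch picks randomly among the names that keep the remainder solvable, so it always returns a valid ordering when one exists, but it does not claim to be uniform over all valid orderings. Passing a seeded `random.Random` makes a run repeatable, which is the property the repeated GPT-5 attempts lacked.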
Another test required GPT-5 to write a script integrating Chrome, AppleScript, and Keyboard Maestro. In a previous run, the model had incorrectly assumed AppleScript included a built-in lowercase function. This time, it avoided that mistake but replaced it with a bizarre workaround: launching a shell command to convert text to lowercase, even though AppleScript string comparisons ignore case by default. The result was functional but unnecessarily complex, like using a rocket to cross the street.
OpenAI’s recently published guidelines suggest strategies such as using XML-like syntax for structuring instructions, adjusting reasoning effort levels, and avoiding overly forceful language. OpenAI also introduced a prompt optimization tool designed to refine user inputs. When I applied this tool to the AppleScript test, the revised prompt included additional planning and validation steps. Unfortunately, the generated code still contained multiple critical errors, including incorrect syntax and flawed logic.
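To make the "XML-like syntax" advice concrete, here is a small Python sketch of how a prompt might be assembled from tagged sections. The tag names (`task`, `constraints`, `validation`) and the helper function are assumptions chosen for illustration, not names taken from OpenAI's guidelines, and the sketch only builds a string. It does not call any API or set a reasoning effort level.

```python
def build_structured_prompt(sections):
    """Wrap each named section in an XML-like tag pair, in the order given."""
    parts = []
    for tag, body in sections.items():
        parts.append(f"<{tag}>\n{body.strip()}\n</{tag}>")
    return "\n\n".join(parts)


# Hypothetical section names and wording, loosely based on the plugin test.
prompt = build_structured_prompt({
    "task": "Write a WordPress plugin that randomizes a list of names.",
    "constraints": "Identical names must not appear next to each other.",
    "validation": "Before answering, check the code for syntax errors.",
})
print(prompt)
```

Separating the task from its constraints and validation steps mirrors what the optimizer reportedly added to the AppleScript prompt. As that test showed, structure alone does not guarantee working code.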
Curiously, when I used the optimized prompt for the WordPress plugin test, the code worked correctly. However, it included an odd addition: the author field was listed as “Advanced Geekery Labs”, a variation of my brand name that I never provided. When questioned, GPT-5 stated it had “unconsciously expanded” the name based on prior context. This kind of unsolicited improvisation is more unsettling than useful.
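One way to catch this kind of silent addition is to compare the generated plugin header against the prompt. The Python sketch below reads the standard WordPress header comment (fields such as `Plugin Name` and `Author`) and flags watched fields whose values never appeared in the prompt. The function names and the default watch list are hypothetical, and the parsing is deliberately naive: it assumes the header is the first comment block in the file.

```python
def read_plugin_header(php_source):
    """Collect 'Field: value' pairs from the first comment block of a plugin file."""
    fields = {}
    for line in php_source.splitlines():
        # Drop comment decoration such as '/**', ' * ', and '*/'.
        text = line.strip(" \t/*")
        name, sep, value = text.partition(":")
        if sep and name.strip() and value.strip():
            fields.setdefault(name.strip(), value.strip())
        if "*/" in line:
            break  # the header comment ends here
    return fields


def fields_not_in_prompt(fields, prompt, watch=("Author", "Author URI")):
    """Return watched header fields whose value does not occur in the prompt."""
    prompt_lower = prompt.lower()
    return {
        name: value
        for name, value in fields.items()
        if name in watch and value.lower() not in prompt_lower
    }
```

A non-empty result does not prove the code is wrong; it only marks values that need a human look, much like the expanded author name did here.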
While GPT-4o demonstrated its own limitations, it felt more predictable and easier to verify. GPT-5, by contrast, introduces a layer of uncertainty that makes it difficult to trust. Its tendency to overcomplicate solutions, combined with inconsistent outputs and unintended “creativity,” suggests that the model is not yet ready for serious development use.
Those considering GPT-5 for coding tasks should proceed with caution. The prompt optimizer may help in some situations, but it doesn’t resolve the underlying issues of reliability and accuracy. For now, older models or traditional coding methods may offer greater stability and peace of mind.
(Source: ZDNET)




