ChatGPT Agent Tests: Only 1 Near-Perfect Result, Many Errors

Summary
– OpenAI launched ChatGPT Agent, a tool combining Deep Research and Operator capabilities that can interact directly with on-screen interfaces, currently available to $200/mo Pro tier subscribers with 400 interactions/month.
– Testing revealed Agent often requires follow-up queries, reducing effective project capacity, and struggles with large-scale tasks, web page scrolling, and sites that restrict AI crawlers via robots.txt.
– Agent showed mixed performance in tasks like Amazon product selection (hallucinated links), PowerPoint creation (poor graphics), and article categorization (session time limits), but excelled in analyzing building codes.
– The tool includes connectors for services like Gmail and Google Drive, but testing was avoided due to concerns over hallucinations and unreliable behavior with account access.
– While Agent demonstrates potential with some accurate results, its current unreliability and high cost make it unsuitable for most users, though future improvements could enhance its utility.
OpenAI’s ChatGPT Agent shows flashes of brilliance but struggles with consistency in real-world testing. The newly launched tool combines deep research capabilities with computer interaction skills, allowing it to navigate interfaces and complete complex tasks. Currently available to Pro tier subscribers at $200/month, it promises to revolutionize how we interact with AI assistants – when it works correctly.
During extensive testing across eight different scenarios, the Agent demonstrated both impressive capabilities and frustrating limitations. While it occasionally delivered near-perfect results, most attempts revealed significant room for improvement in accuracy, reliability, and output quality.
Pricing and availability currently make this a tool for early adopters only. Pro users get 400 interactions monthly, while Plus subscribers will receive just 40 when access rolls out. In practice, most tasks required multiple follow-up queries, meaning users will likely exhaust their monthly allotment faster than expected.
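A quick back-of-the-envelope calculation shows how fast those quotas shrink. The Python sketch below is purely illustrative; the average of three interactions per task is an assumption based on the follow-up queries seen in testing, not a published figure.

```python
# Illustrative only: estimates how many complete tasks a monthly
# interaction quota covers once follow-up queries are counted.
# The 3-interactions-per-task average is an assumption, not OpenAI's figure.

def effective_tasks(monthly_quota: int, interactions_per_task: float = 3.0) -> int:
    """Return the number of full tasks the quota realistically covers."""
    return int(monthly_quota // interactions_per_task)

for tier, quota in (("Pro", 400), ("Plus", 40)):
    print(f"{tier}: {quota} interactions -> ~{effective_tasks(quota)} tasks/month")
```

Under that assumption, Pro's 400 interactions cover roughly 133 finished tasks per month, and Plus's 40 cover about 13.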
The testing process revealed several key findings:
Shopping assistance showed mixed results. When asked to find networking tools on Amazon, the Agent successfully identified a budget kit but fabricated product listings for mid-range and premium options. It generated plausible-looking Amazon links that led nowhere, suggesting it pulled information from other sources before creating fictional product pages.
Price comparison tasks worked better with precise instructions. Requesting egg price comparisons across all Instacart stores yielded unusable results from distant locations. Narrowing the search radius produced more practical data, though the Agent sometimes selected higher-priced options without explanation.
Presentation creation needs refinement. While the Agent understood PowerPoint formatting requirements, its graphic output quality fell short of professional standards. It successfully added new data points to existing slides but struggled with font consistency, text placement, and visual hierarchy.
Content analysis hit processing limits. Attempting to categorize 300 newsletter articles exceeded the system’s session time allowance. This limitation raises questions about handling larger projects – a key use case for such automation tools.
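One plausible workaround, which the testing did not cover, is to split a large job into batches small enough to finish within a single session. A minimal sketch, assuming the articles are already collected as a list of titles and that 50 per session fits within the time allowance (both assumptions, not documented limits):

```python
# Hypothetical workaround: chunk a large workload into smaller batches,
# each submitted as its own Agent session to stay under the time limit.
# The batch size of 50 is a guess, not a documented limit.

def batches(items: list[str], size: int = 50):
    """Yield successive fixed-size slices of `items`."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

articles = [f"article_{n}" for n in range(300)]  # stand-ins for the 300 newsletters
for i, batch in enumerate(batches(articles), start=1):
    prompt = f"Categorize these {len(batch)} articles by topic:\n" + "\n".join(batch)
    print(f"session {i}: {len(batch)} articles queued")  # each prompt = one Agent session
```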
Video transcript extraction worked with persistence. The Agent initially provided analysis instead of verbatim transcripts but delivered accurate results when specifically instructed. This highlights the importance of clear, repeated prompts.
Research presentations showed promise with caveats. The Agent compiled a comprehensive remote work trends report but included numerous unverified statistics. When asked to validate its own claims, it identified many as unconfirmed, a stark contrast to GPT-4o's more optimistic assessment.
The standout success came with building code analysis. In just four minutes, the Agent produced accurate, detailed guidance on fence installation regulations – complete with diagrams. This demonstrated the tool’s potential when working with structured, verifiable information.
Current limitations include:
- Inability to handle large-scale projects
- Session time restrictions
- Graphic quality issues in presentations
- Frequent hallucinations and unverified claims
- Limited multitasking capabilities
For now, ChatGPT Agent remains more promise than product. While the building code analysis shows what’s possible, most tests revealed an assistant that’s more frustrating than helpful. The $200/month price tag seems difficult to justify given the inconsistent performance, though future improvements could change that calculation.
The technology clearly has potential, particularly for structured tasks with verifiable data sources. However, widespread adoption may depend on addressing accuracy concerns, improving output quality, and expanding processing capabilities. As content providers increasingly block AI access, the tool’s web research functionality could also face challenges.
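That blocking happens at the robots.txt level: OpenAI publishes the user-agent strings its crawlers identify as, including GPTBot and ChatGPT-User, and Python's standard library can check whether a given site shuts them out. The sketch below uses example.com as a placeholder domain:

```python
# Checks whether a site's robots.txt blocks OpenAI's published
# crawler user-agents. example.com is a placeholder; substitute a real site.
from urllib.robotparser import RobotFileParser

SITE = "https://example.com"
rp = RobotFileParser()
rp.set_url(f"{SITE}/robots.txt")
rp.read()  # fetches and parses the site's robots.txt

for agent in ("GPTBot", "ChatGPT-User"):
    verdict = "allowed" if rp.can_fetch(agent, f"{SITE}/some-article") else "blocked"
    print(f"{agent}: {verdict}")
```

Sites that return a blanket Disallow for these agents are exactly the ones the Agent's web research cannot reach.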
Early adopters should approach with realistic expectations – prepared for both moments of brilliance and frequent frustrations. The foundation appears solid, but significant refinement is needed before ChatGPT Agent becomes an indispensable productivity tool.
(Source: ZDNET)