Study: AI Recommendations Are 99% Unique

▼ Summary
– A new study found that generative AI tools like ChatGPT, Claude, and Google’s AI almost never return the same list or order of brand/product recommendations twice, with odds below 1% for identical lists.
– The inherent randomness is by design, as large language models are probability engines built to generate variation, not stable, ordered results like traditional search.
– While exact ranking positions are meaningless, a brand’s visibility (how often it appears across many responses) can be a meaningful metric, especially for clear intents.
– Results are more stable in smaller, niche markets but become highly scattered and random in large categories with many options.
– The study concludes that marketers should track visibility at scale rather than rank, as AI recommendations are inconsistent and treating them like search rankings produces bad metrics.
When seeking brand or product suggestions from tools like ChatGPT, Claude, or Google’s AI, you are highly unlikely to receive the same list twice, let alone in an identical order. This fundamental inconsistency is the core finding of a recent investigation by Rand Fishkin of SparkToro and Patrick O’Donnell of Gumshoe.ai. Their research aimed to determine whether generative AI recommendations possess enough consistency to be reliably measured, revealing that AI-generated recommendation lists are almost entirely unique, with repetition rates below one percent.
The study involved a substantial test: six hundred volunteers executed twelve identical prompts through the three major AI platforms, generating nearly three thousand responses. Each answer was standardized into an ordered list of brands or products. Researchers then meticulously compared these lists for overlap, sequence, and repetition. The results were striking. The probability of receiving the same list twice across different tools and prompts was less than one in a hundred. The odds of getting an identical list in the exact same order plummeted to roughly one in a thousand. Even the length of responses showed dramatic variation, with some lists containing just a couple of items and others extending beyond ten. A practical takeaway emerges: if you are dissatisfied with an initial result, simply asking again will likely yield a different answer.
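The pairwise comparison described above (checking whether any two responses contain the same brands, and whether any two match in exact order) reduces to a simple count over response pairs. A minimal sketch, using made-up response lists as placeholders rather than the study's actual data:

```python
from itertools import combinations

# Hypothetical standardized responses: each an ordered list of brands.
# (Illustrative placeholders, not data from the study.)
responses = [
    ["Bose", "Sony", "Apple"],
    ["Sony", "Bose", "Apple", "Sennheiser"],
    ["Sony", "Bose", "Apple"],
    ["Apple", "Bose"],
]

def repetition_rates(lists):
    """Share of response pairs containing the same brands (any order),
    and share that are identical in both content and order."""
    pairs = list(combinations(lists, 2))
    same_set = sum(set(a) == set(b) for a, b in pairs)
    same_order = sum(a == b for a, b in pairs)
    return same_set / len(pairs), same_order / len(pairs)

set_rate, order_rate = repetition_rates(responses)
print(f"same brands: {set_rate:.0%}, identical order: {order_rate:.0%}")
# → same brands: 17%, identical order: 0%
```

The distinction between the two rates mirrors the study's two headline numbers: matching brand sets were rare (under one percent), and matching ordered lists rarer still (roughly one in a thousand).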
This variability is not an accidental flaw but a deliberate feature of the technology. Large language models function as probability engines, engineered to produce diverse outputs rather than stable, ordered sets of results. Approaching them with the same expectations as traditional search engine results pages (the familiar “blue links”) is a fundamental misunderstanding that leads to poor metrics and flawed analysis. While traditional ranking positions collapse under scrutiny, the study identified one metric with more substance: visibility percentage. Certain brands appeared repeatedly across numerous query runs, even as their specific placement fluctuated wildly. For some queries related to hospitals, agencies, or consumer goods, prominent names surfaced in 60% to 90% of responses for a given search intent. This indicates that consistent presence across many AI queries holds meaning, while exact rank is virtually meaningless.
The stability of results is also influenced by market size. In narrower, well-defined sectors, such as regional service providers or specialized B2B software, AI responses tended to cluster around a recognizable set of established names. Conversely, in vast, open-ended categories like novels or creative agencies, the results exhibited far greater randomness and scatter. More available options within a category inherently generate more output variation.
An intriguing layer of the research examined real-world human prompts, which were, unsurprisingly, messy and unique. Even when people sought the same information, their phrasing showed extremely low semantic similarity. Despite this chaotic input, the AI tools demonstrated a capacity to discern underlying intent. For example, hundreds of distinct prompts asking for headphone recommendations consistently surfaced leading brands like Bose, Sony, Apple, and Sennheiser. When the intent shifted (to gaming, podcasting, or noise cancellation), the core set of recommended brands shifted accordingly. This suggests AI systems can effectively capture user intent, even when prompts are idiosyncratically phrased.
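The "low semantic similarity" observation can be illustrated with a crude proxy: average token-overlap (Jaccard) similarity across prompt pairs. This is a sketch under stated assumptions; the prompts below are invented stand-ins, and the study presumably used proper embedding-based similarity rather than word overlap:

```python
from itertools import combinations

# Hypothetical phrasings of the same headphone-recommendation intent
# (illustrative, not the study's actual prompts).
prompts = [
    "what are the best headphones right now",
    "recommend me good over-ear headphones",
    "top headphone brands worth buying",
    "which headphones should I get",
]

def mean_jaccard(texts):
    """Average word-overlap similarity across all prompt pairs;
    a rough stand-in for the semantic similarity the study measured."""
    sets = [set(t.lower().split()) for t in texts]
    pairs = list(combinations(sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

print(f"mean pairwise similarity: {mean_jaccard(prompts):.2f}")
# → mean pairwise similarity: 0.05
```

Even these four prompts, all expressing one intent, share almost no wording, which is the pattern the researchers found at scale.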
The study delivers a blunt assessment on what not to do: tracking “position” within AI answers is futile. Ranking positions are so unstable that they are effectively without value. Any service claiming to track and improve AI ranking movement is essentially selling a fiction. A more viable, though imperfect, approach is to track how frequently a brand appears across a large volume of prompts run multiple times. This method is messy and requires scale, but it aligns more closely with the chaotic reality of AI outputs than pretending they conform to traditional SEO ranking models.
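The visibility metric the study favors (how often a brand appears at all across many runs, ignoring position) amounts to a frequency count over responses. A minimal sketch, again with placeholder lists rather than the study's data:

```python
from collections import Counter

# Hypothetical responses for one search intent (illustrative placeholders).
responses = [
    ["Bose", "Sony", "Apple"],
    ["Sony", "Sennheiser", "Bose"],
    ["Apple", "Bose", "Sony", "Sennheiser"],
    ["Bose", "Apple"],
]

def visibility(lists):
    """Fraction of responses in which each brand appears,
    regardless of its rank within any single list."""
    counts = Counter(brand for lst in lists for brand in set(lst))
    return {brand: n / len(lists) for brand, n in counts.most_common()}

for brand, share in visibility(responses).items():
    print(f"{brand}: {share:.0%}")
```

Here "Bose" would score 100% visibility despite never holding a stable position, which is exactly the kind of signal the study argues is worth tracking, while per-run rank is noise.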
Significant open questions remain. Researchers note the need to determine how many query runs are necessary to establish reliable visibility metrics, whether API interactions mirror the behavior of real user prompts, and how many sample prompts can accurately represent an entire market. The essential conclusion is clear: AI recommendation lists are inherently random. Carefully measured visibility at scale might still offer actionable insights, but it is crucial not to conflate this with the stable, ordered concept of search engine ranking.
(Source: Search Engine Land)
