ChatGPT uses just 15% of its retrieved data: Study

Summary
– ChatGPT retrieves far more webpages for research than it actually cites, with 85% of discovered sources never appearing in its final answers.
– Earning a citation requires more than just being retrieved; a page must be selected during the AI’s synthesis process based on how well it matches the prompt.
– ChatGPT frequently expands user prompts with additional internal searches, known as “fan-out,” which generates a significant secondary source for citations.
– There is a strong correlation with Google rankings, as pages ranking higher on Google are significantly more likely to be cited by ChatGPT.
– Citation rates vary by query type, with product discovery queries having the highest rate and validation searches having the lowest.
When aiming for visibility in AI-generated content, simply having your webpage discovered is no longer sufficient. A recent analysis reveals that ChatGPT retrieves a vast number of webpages but only cites a small fraction of them in its final answers. This means the majority of content pulled during its research phase never reaches the end user, fundamentally changing how we think about optimization for AI tools.
The core insight is straightforward: retrieval does not guarantee citation. Your page might rank well and be found by the AI’s search process, yet still lose out in the final answer to another source that more closely aligns with the user’s specific prompt or provides better supporting context. This shifts the strategic focus from merely appearing in search results to earning selection during the AI’s internal synthesis process, where it decides which information to present.
The study's numbers bear this out. Of the more than half a million pages retrieved, only 82,108 citations made it into the final responses. That is a citation rate of roughly 15%, leaving about 85% of surfaced pages absent from the answers users see. The rate also varied by query type. Product discovery queries had the highest citation rate at 18.3%, followed by how-to queries at 16.9%. Validation searches, where users seek to confirm facts, had the lowest rate at 11.3%.
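As a rough sanity check on these figures, the sketch below back-calculates the size of the retrieved pool implied by 82,108 citations at a 15% rate. It assumes the 15% is a rounded value and that each citation corresponds to one retrieved page, neither of which the article confirms. The function and variable names are illustrative only.

```python
# Figures as reported in the article; names are hypothetical.
CITATIONS = 82_108
CITATION_RATE = 0.15  # assumed to be a rounded figure


def implied_retrieved(citations: int, rate: float) -> int:
    """Estimate the retrieved pool size implied by a citation count and rate."""
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    return round(citations / rate)


retrieved = implied_retrieved(CITATIONS, CITATION_RATE)
uncited = retrieved - CITATIONS  # pages surfaced but never cited

print(f"implied retrieved pages: ~{retrieved:,}")
print(f"implied uncited pages:   ~{uncited:,}")
```

The implied pool of roughly 547,000 pages is in line with the "over half a million pages" the study is said to have examined.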
Researchers also observed a behavior they term “fan-out queries.” While generating an answer, ChatGPT frequently expands the original user prompt with additional internal searches. This creates a secondary opportunity for citation that was not part of the initial request. The practice is common: nearly 90% of prompts triggered two or more of these follow-up searches, and 15,000 user prompts grew into over 43,000 internal queries. Almost a third of all cited pages appeared exclusively in these fan-out results, not in response to the original prompt. In addition, 95% of the fan-out queries had no measurable traditional search volume, which suggests they are generated by the AI’s reasoning process and do not mirror what people type into search engines.
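To put the fan-out figures in per-prompt terms, here is a minimal sketch that divides the reported internal queries by the reported user prompts. Because the article says "over 43,000", the result is a lower bound, and it is an assumption that the 43,000 figure counts every search issued for those 15,000 prompts. The names are illustrative.

```python
# Figures as reported in the article; names are hypothetical.
USER_PROMPTS = 15_000
INTERNAL_QUERIES = 43_000  # reported as "over 43,000", so a lower bound


def queries_per_prompt(queries: int, prompts: int) -> float:
    """Average number of internal searches issued per user prompt."""
    if prompts <= 0:
        raise ValueError("prompts must be positive")
    return queries / prompts


avg_fanout = queries_per_prompt(INTERNAL_QUERIES, USER_PROMPTS)
print(f"at least {avg_fanout:.2f} internal queries per prompt on average")
```

An average near three searches per prompt fits the claim that nearly 90% of prompts triggered two or more follow-up searches.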
The analysis also found a strong link between traditional search performance and AI citation. Over half of all cited pages ranked within Google’s top 20 search results, and pages in the number one spot were cited 3.5 times more often than pages that ranked outside the top 20. This suggests that the authority and relevance signals recognized by search engines continue to carry substantial weight within AI systems.
The findings are based on a substantial dataset, examining over half a million pages retrieved across 15,000 distinct prompts to understand how the AI expands queries and makes its final citation decisions. For content creators and marketers, the implication is clear. Success requires a dual strategy: maintaining strong organic search visibility while also crafting content that is exceptionally well-suited to be selected and synthesized by AI during its complex, multi-step answer generation.
(Source: Search Engine Land)





