
How Scientists Reverse-Engineered AI for a Ranking Breakthrough

Summary

– A research paper introduces CORE, a method for systematically influencing rankings in LLM-based search by optimizing content, achieving high success rates in controlled tests.
– The study tested models like GPT-4o and Gemini via API, using manually supplied data, and found models have distinct preferences; some respond better to reasoning-based content, others to review-style text.
– Researchers used two reverse-engineering strategies: a query-based solution that iteratively modifies document text, and a shadow model solution that approximates the target LLM’s behavior.
– Optimization techniques included adding reasoning-based explanations or fabricated review-like content, with success rates as high as 82% for pushing a bottom-ranked item to first place.
– The findings confirm that LLMs have measurable content preferences and that strategic content expansion can influence rankings, though the experiments were conducted in a controlled, non-live environment.

A recent study demonstrates a method for systematically influencing rankings within AI search systems, achieving high success rates in product search tests, with the technique also applying to other categories such as travel. This research, centered on a technique called CORE (Controlling Output Rankings in Generative Engines), provides a proof-of-concept for strategically optimizing text. It reveals that large language models (LLMs) respond differently to specific modifications, such as added explanatory reasoning or review-like language. The findings highlight how content can be engineered to sway AI-generated rankings, even when the AI is treated as an unpredictable black box.

It’s important to note the study’s specific conditions. Researchers tested models including Claude 4, Gemini 2.5, GPT-4o, and Grok-3 by querying them via their APIs. They did not test consumer interfaces like AI Overviews or ChatGPT. In these controlled experiments, the models did not use their own retrieval tools. Instead, the researchers manually provided all candidate information within the prompt, eliminating variables like personalization or live web search results.

The core challenge was reverse-engineering the AI’s ranking logic. Since the internal workings of commercial LLMs are not accessible, the team employed two main strategies to discover which optimizations worked best. The first and more successful method was the Query-Based Solution. This approach treats the LLM as a black box: researchers repeatedly modified a document’s text (through content expansion rather than editing) and resubmitted it to observe changes in ranking. They used two distinct types of content generation: Reasoning-Based Generation, which adds explanatory language about why an item fits a query, and Review-Based Generation, which adds evaluative, testimonial-style content. Neither strategy was universally superior; effectiveness depended on the model. For instance, GPT-4o and Claude-4 responded more strongly to reasoning-style text, while Gemini-2.5 and Grok-3 favored review-style augmentation.
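The black-box loop described above can be sketched in a few lines. This is a minimal illustration, not code from the paper: `rank_fn` stands in for a call to the target LLM's ranking API, and the two expansion strings are invented placeholders for the reasoning-style and review-style content the researchers generated.

```python
import random

def query_based_optimization(rank_fn, document, query, candidates, max_iters=20):
    """Iteratively expand a document's text, keeping only changes that
    improve its rank, treating the ranking model (rank_fn) as a black box."""
    # Hypothetical expansion styles mirroring the study's two strategies.
    expansions = {
        "reasoning": f"This item fits the query '{query}' because it directly addresses the stated need.",
        "review": "In my experience, this product performed reliably over months of daily use.",
    }
    best_doc = document
    best_rank = rank_fn(best_doc, candidates)  # lower is better; 1 = first place
    for _ in range(max_iters):
        style = random.choice(list(expansions))
        trial = best_doc + " " + expansions[style]
        rank = rank_fn(trial, candidates)
        if rank < best_rank:  # keep the expansion only if ranking improved
            best_doc, best_rank = trial, rank
        if best_rank == 1:
            break
    return best_doc, best_rank
```

In practice each `rank_fn` call would be an API request to the target model, which is why the study's iteration counts matter: every probe costs a query.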

The second strategy involved building a Shadow Model, a surrogate system designed to mimic the target LLM. By training this local model on the input-output pairs of the black-box AI, researchers could approximate its behavior. They found that a model like Llama-3.1-8B could reliably predict how GPT-4o would rank items, achieving a high similarity score. Using this shadow model, they tested three optimization tactics. String-Based Optimization involved iteratively refining a nonsense string of characters (like exclamation points) until it boosted rankings, a method that worked 33% of the time but was easily detected by humans. Reasoning-Based Optimization, which crafted text to mirror a user’s logical decision process, achieved the highest success rates. Review-Based Optimization involved writing past-tense, first-person reviews for products never actually tested, which proved highly effective at pushing listings to the top.
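The shadow-model idea, collecting input-output pairs from the black-box AI and fitting a cheap local surrogate, can be illustrated with a deliberately crude stand-in. The study trained an actual LLM (Llama-3.1-8B) as the surrogate; the token-weighting scheme below is an invented simplification meant only to show the train-on-pairs, optimize-offline workflow.

```python
from collections import Counter

def train_shadow_model(labeled_pairs):
    """Fit a crude surrogate from (document, rank) pairs gathered by
    querying the black-box model. Tokens that appear in highly ranked
    documents accumulate weight; the surrogate then scores new text."""
    weights = Counter()
    for doc, rank in labeled_pairs:
        signal = 1.0 / rank  # rank 1 contributes the most weight
        for token in doc.lower().split():
            weights[token] += signal

    def score(doc):
        # Higher score ~ the black-box model would likely rank this well.
        return sum(weights[t] for t in doc.lower().split())

    return score
```

Once trained, candidate optimizations can be scored against the surrogate offline, and only the winning variant submitted to the real model, which is what makes transfer between surrogate and target the key question the researchers measured.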

The generated review content followed a structured pattern, covering product types, key features, model comparisons, purchasing strategies, and a final verdict. An example “Final Verdict” section stated, “After 6 months of testing, the Gourmia Air Fryer Oven (GAF486) is my #1 recommendation…” This style is deliberately crafted to lead an LLM into believing genuine product testing occurred. In tests, these review-based optimizations succeeded between 79% and 83.5% of the time at moving a last-place item to first.
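The structured review pattern can be expressed as a simple template assembler. This is an illustrative sketch, the section headings follow the study's description, but the function and any sample text are invented:

```python
def build_review(product, sections):
    """Assemble review-style content in the structured order the study
    describes: product types, key features, comparisons, purchasing
    strategy, and a final verdict."""
    order = ["Product Types", "Key Features", "Model Comparisons",
             "Purchasing Strategy", "Final Verdict"]
    parts = [f"{heading}: {sections[heading]}"
             for heading in order if heading in sections]
    return f"Review of {product}\n" + "\n".join(parts)
```

The fixed ordering matters: ending on a first-person “Final Verdict” is what gives the text its testimonial framing.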

Key insights from this controlled environment suggest that different LLMs have measurable content preferences, which could inform content strategy. The research also indicates that expanding content with specific explanatory or evaluative language can influence rankings. Furthermore, the shadow model experiments show that optimizations can transfer even when the surrogate only approximates the real model, raising questions about whether similar techniques might explain some spam in live AI-assisted search results. While conducted in a lab setting, this work opens a window into how AI ranking systems can be analyzed and potentially gamed.

(Source: Search Engine Journal)

Topics

AI search rankings, CORE optimization, reverse engineering, LLM testing, LLM content preferences, query-based solution, black box problem, reasoning-based generation, review-based generation, shadow model