
Schema, LLMs, and the Low Bar for Evidence in GEO

Summary

– An experiment with a fake duck company page showed that LLMs returned an address from deliberately invalid JSON-LD, proving they treat schema as just text rather than parsing it as structured data.
– Schema markup is likely stripped from LLM training data by cleaning pipelines (e.g., FineWeb using trafilatura), which remove HTML and script tags to extract clean prose.
– Even if schema survived training, tokenization dissolves its structure, so the disambiguation schema provides is lost, and facts require many repetitions to be “learned” by the model.
– Claims that LLMs use schema at query time lack evidence; Google’s AI Overviews sometimes ignore the company’s own structured Business Profile data, showing the gap is not yet closed.
– Schema remains useful for disambiguation (e.g., new or challenger brands), but current evidence does not support selling it as a direct driver of LLM citations.

A small, deliberately broken experiment has revealed a significant flaw in how the Generative Engine Optimization (GEO) industry interprets the relationship between schema markup and large language models (LLMs). I created a test page about a fictional duck T-shirt company, placing a fake address inside invalid JSON-LD that referenced no real Schema.org vocabulary. The visible text of the page mentioned no location at all. When I asked ChatGPT and Perplexity for the company’s address, both models returned the fabricated details without hesitation. Perplexity even cited the “embedded structured data” as its source. The catch? The JSON-LD was complete nonsense. The models were not parsing schema as intended. They were simply reading the raw HTML, ignoring the broken structure, and treating the curly-braced text as just another part of the page.
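
To make the difference between parsing and reading concrete, here is a minimal Python sketch. Everything in it is a hypothetical stand-in: the page, the company name, the address, and the helper names are invented for illustration and are not the markup from the actual test. It also assumes the JSON-LD is syntactically broken as well as using made-up vocabulary. A strict parser rejects the block, while a plain substring search over the raw HTML still finds the address.

```python
import json
import re

# Hypothetical page: the address exists only inside a broken JSON-LD block.
# The vocabulary ("DuckShop", "headquarters") is made up, and the JSON has a
# missing comma and a dangling key, so it cannot be parsed.
PAGE_HTML = """
<html><body>
<h1>Quack Tees</h1>
<p>We print ducks on T-shirts.</p>
<script type="application/ld+json">
{ "@type": "DuckShop", "headquarters": "12 Pond Lane, Mallard City" "broken": }
</script>
</body></html>
"""

JSONLD_PATTERN = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def extract_jsonld_blocks(html):
    """Return the raw text of every JSON-LD script block in the page."""
    return [block.strip() for block in JSONLD_PATTERN.findall(html)]


def parses_as_json(block):
    """True only if the block is syntactically valid JSON."""
    try:
        json.loads(block)
        return True
    except json.JSONDecodeError:
        return False


blocks = extract_jsonld_blocks(PAGE_HTML)

# A consumer that parses structured data gets nothing usable from this page.
structured_ok = all(parses_as_json(b) for b in blocks)

# A consumer that treats the page as plain text still "sees" the address.
address_in_text = "12 Pond Lane, Mallard City" in PAGE_HTML

print("JSON-LD parses:", structured_ok)
print("Address visible as raw text:", address_in_text)
```

If a system returns the address from a page like this one, the structured-data path cannot be where it came from, because that path fails before any vocabulary is consulted.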

This experiment, later covered by Search Engine Roundtable, was quickly seized upon by proponents of GEO who claimed it proved LLMs are meticulously following Schema.org protocols. In reality, it proved the opposite. The schema was deliberately invalid, yet the LLMs returned the data anyway because, to a token predictor, JSON-LD is just text garnished with curly braces. This distinction is critical. A growing number of “GEO experts” point to the fact that an LLM returned information found only in schema markup as ironclad proof that the system is using the markup as designed. They are wrong. The models are reading the HTML and shrugging at the structure.

Let’s be clear: I am not arguing that schema markup is worthless. You should absolutely continue to use it. However, the way it is currently sold to clients, as a magical lever for LLM citations, rests on a remarkably thin foundation of evidence.

To understand why, we need a refresher on what schema is actually for. Schema.org structured data is a collaborative vocabulary designed by search engines to embed machine-readable information on web pages. Its purpose is disambiguation. When a page mentions “Apple,” schema tells a machine whether it is a fruit, a company, or a record label. The data is fed into systems like Google’s Knowledge Graph, a curated database of entities and relationships. That is the contract: explicit clues for machine-resolvable identity. LLMs, however, are a fundamentally different animal.
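
For readers who have not looked at the markup itself, the following Python sketch builds a small JSON-LD object of the kind described above. The `@context`, `@type`, `name`, `url`, and `sameAs` keys are real Schema.org and JSON-LD terms; the helper name and the example.com URLs are hypothetical placeholders chosen for illustration.

```python
import json


def organization_jsonld(name, url, same_as):
    """Build a minimal Schema.org Organization description as a dict."""
    return {
        "@context": "https://schema.org",
        "@type": "Organization",  # states that this "Apple" is a company, not a fruit
        "name": name,
        "url": url,
        "sameAs": same_as,  # links to other profiles describing the same entity
    }


# Hypothetical values; a real page would point at the brand's actual profiles.
markup = organization_jsonld(
    name="Apple",
    url="https://example.com/",
    same_as=["https://example.com/apple-company-profile"],
)

# This string is what would sit inside a <script type="application/ld+json"> tag.
serialized = json.dumps(markup, indent=2)
print(serialized)
```

The disambiguation lives in the key-value structure: a system that parses the object can read `@type` directly, whereas a system that only sees the serialized string has to infer the same thing from surrounding characters.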

The debate over how LLMs use schema falls into two camps. The first argues that schema is ingested during model training and “baked in.” This theory has the weakest mechanical case. Pre-training pipelines aggressively strip out HTML, boilerplate, and script tags (where JSON-LD lives) to extract clean prose. Widely used datasets like FineWeb explicitly use libraries designed to discard `