
How Multilingual Regions Shape the Future of AI Search

Summary

– AI search systems decide which sources and realities are surfaced, and multilingual regions like Catalonia reveal how these systems collapse distinct linguistic and jurisdictional signals into a dominant corpus.
– In Catalonia, Catalan and Spanish queries for the same topic produce different retrieval pools, citations, and cultural framings, with the system often misidentifying Catalan as Occitan or suggesting Spanish results.
– Commercial retrieval for Catalan queries shows zero SEM bidding and autocorrection errors, creating a self-reinforcing cycle where the minority language is deprioritized for transactional intent.
– Language identification failures in AI search pipelines are inherited from older infrastructure, and the generation of low-quality AI content in minority languages threatens to degrade training data further.
– The structural flaws exposed in Catalonia (corpus-weight defaults, semantic collapse, and unreliable language identification) generalize to other multilingual regions and may surface in monolingual markets with sub-national jurisdictional differences.

When a user types a query into an AI search engine, the system doesn’t just translate words or pick a regional variant. It makes a fundamental decision about which sources, institutions, and versions of reality are worth surfacing. In multilingual regions, that decision becomes visible in ways that reveal the deep structural assumptions of the entire retrieval pipeline.

Catalonia serves as a stress test for this system. Two languages share the same geography, making retrieval patterns easier to isolate and analyze. When identical queries run in Catalan and Spanish across Google AI Overviews and ChatGPT, the differences extend far beyond vocabulary. They expose a cascade of failures that have implications for any market where language, jurisdiction, and authority don’t align neatly.

Consider a small but telling example. Search for “Tradicions de Sant Jordi” in Catalan, and Google Translate identifies the source language as Occitan. The answer is technically defensible, since Catalan and Occitan share a common Romance ancestry, but it’s statistically odd. Occitan has roughly 200,000 speakers, mostly in southern France. Catalan has about 9 million speakers and is a co-official language of Catalonia, a region Google has operated in for over two decades. From a Barcelona IP, the system chooses the less plausible option, then castilianizes the saint’s name.

This single quirk is anecdotal. What it points at is not. Google’s own Search Liaison account acknowledged the problem in January 2023, responding to complaints about Catalan results being systematically downgraded in favor of Spanish. The company called it a priority and pushed updates that improved visibility in classical SERPs. But the underlying language-identification layer was never structurally repaired. When a Catalan speaker today watches Google’s AI Overview answer a Catalan-language query in Spanish, it isn’t a new bug. It’s an old bug now sitting underneath a synthesis layer that propagates it.

The retrieval pipeline that flattens Catalan into Spanish today is the same pipeline that will, in modified forms, flatten sub-national jurisdictional context in markets where the surface language never changes. Multilingual regions are where the architectural defaults become visible because users can switch languages and watch the system reassign meaning, authority, and sometimes even the answer’s language.

Four distinct patterns emerged from testing paired queries across Google AI Overviews and ChatGPT from a Barcelona IP. The first involves vocabulary and source plurality. When asked about Catalan independence, the Spanish version produced a legalistic frame anchored in the 1978 Constitution and the 2017 referendum’s illegality. The Catalan version foregrounded “dret a decidir” and “autodeterminació,” retaining anti-independence arguments absent from the Spanish version. The citations diverged entirely: Spanish pulled from BBC, Wikipedia ES, and France 24; Catalan added El Punt Avui, VilaWeb, and Wikipedia CA. Same engine, same geography, same question. The language wasn’t labeling the answer; it was filtering the corpus.
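This kind of divergence can be put on a number by comparing the citation sets returned for a paired query. The sketch below computes a Jaccard overlap. It treats the sources named above as the complete citation list for each language, which is an assumption made for illustration, and the function name is hypothetical.

```python
from __future__ import annotations


def citation_overlap(a: set[str], b: set[str]) -> float:
    """Jaccard overlap: shared sources divided by all distinct sources."""
    if not a and not b:
        return 1.0  # two empty pools are trivially identical
    return len(a & b) / len(a | b)


# Assumption: the sources named in the text, treated as the full citation lists.
spanish = {"BBC", "Wikipedia ES", "France 24"}
catalan = {"El Punt Avui", "VilaWeb", "Wikipedia CA"}

overlap = citation_overlap(spanish, catalan)
only_catalan = sorted(catalan - spanish)  # sources reachable only via the Catalan query
print(overlap, only_catalan)
```

An overlap near zero for the same question in the same geography suggests the query language is selecting the retrieval pool rather than just the output language.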

The second pattern involves commercial retrieval. Searching for the best accountants for freelancers in Barcelona produced different recommendations depending on language. ChatGPT surfaced largely the same physical firms but different online providers. Google’s organic SERP showed a more pronounced split, and critically, the system autocorrected the Catalan query to suggest ice cream shops instead. The Spanish results carried paid ads; the Catalan results carried zero. The SEM market treats Catalan as territory without bidders, and the absence of commercial signal teaches the system that the language isn’t commercially serious. The mechanism teaches itself: less bidding produces less visibility, which produces less signal.

The third pattern involves cultural authority reassignment. For Sant Jordi traditions, the Spanish-language AI Overview led with hotel chains as primary citations in one session, then shifted to state tourism portals in another. The Catalan version consistently cited the Ajuntament de Barcelona and the Generalitat de Catalunya, the institutions native to the tradition. The same 600-year-old tradition is described as exotic-from-outside in one language and as tradition-from-inside in the other. The model isn’t lying. It’s producing the most statistically plausible synthesis given a retrieval pool that is constituted differently by language.

The fourth and most fundamental pattern is that language identification was already broken before LLMs touched it. The query “receptes de calçots” (recipes for calçots, a vegetable that exists only in Catalonia) produces a banner suggesting the user filter Catalan results out. No AI Overview is generated. The infrastructure has decided that a recipe search for a Catalan-only vegetable is more usefully answered in Spanish. The behavior is inconsistent across sessions, which is worse than consistently wrong: it is undiagnosable.
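The inconsistency itself can be measured. A minimal sketch, assuming an auditor records the language of the response for each repeated run of the same query; the session log below is hypothetical and is not data from the tests described here, and the function name is illustrative.

```python
from __future__ import annotations

from collections import Counter


def language_consistency(query_lang: str, observed: list[str]) -> dict[str, float]:
    """Summarize how often the response language matched the query language,
    and how stable the outcome was across sessions."""
    if not observed:
        raise ValueError("need at least one observed session")
    counts = Counter(observed)
    total = len(observed)
    return {
        # share of sessions answered in the language of the query
        "match_rate": counts[query_lang] / total,
        # share of sessions that agree with the most common outcome
        "stability": counts.most_common(1)[0][1] / total,
    }


# Hypothetical log: response language seen in five sessions of one Catalan query.
sessions = ["ca", "es", "ca", "es", "es"]
report = language_consistency("ca", sessions)
print(report)
```

A low match rate with high stability points to a consistent routing error that can be reported and tracked. Low stability is the harder case described above, because no single session is representative.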

A second, slower mechanism amplifies these problems over time. LLMs trained on web-scale corpora now generate significant quantities of low-quality content in minority languages. That generated content gets indexed, crawled, and fed back into the next generation of training data. The model that doesn’t understand Catalan well produces the Catalan content that trains the next model. A 2024 Princeton study found that over 5% of newly created English Wikipedia articles showed signs of AI generation, and minority-language editions with thinner editorial oversight absorb a greater proportional impact. MIT Technology Review reported in September 2025 on a linguistic “doom loop” in vulnerable-language Wikipedias, with volunteers estimating that 40% to 60% of articles in four African-language editions were uncorrected machine translations.

The clearest institutional signal came on March 20, when the English Wikipedia community formally voted to prohibit LLM-generated article content across its 7.1 million articles. If a platform with strong volunteer governance and explicit neutrality policies has concluded that AI-generated text damages knowledge integrity, the SEO industry should not assume that retrieval pipelines downstream of Wikipedia will produce better answers.

The mechanism that causes these failures is mechanical, not political. When two languages share one geography, the system can’t default to “the country the language belongs to.” It chooses whichever corpus is larger, more recent, or more commercially tagged. Researchers call this semantic collapse: when retrieval embeddings can’t reliably separate sub-national signals, the system flattens them into the dominant variant. Sub-national governments have noticed. The Aina Project in Catalonia and the Latxa models in the Basque Country are direct attempts to build language-resource sovereignty because standard global LLMs perform measurably worse on Catalan and Basque than on Spanish.

The pattern isn’t unique to Catalonia. Quebec users receive Parisian-French defaults. Belgian users get conflated jurisdictional rules. Swiss users see retrieval flattened toward German or French national defaults. The Catalan case is the easiest to test, but the structural finding generalizes to every region where two or more languages share one geography.

The interesting question is what Catalonia means for everyone else. Multilingual regions are the canary. The architectural flaw exposed when two languages share one geography will show up in other forms as AI search matures and attempts genuinely sub-national answers. In monolingual markets, AI search does have access to localization signals that the Catalan case partly removes: IP geolocation, GPS context, browser locale. But as AI Overviews increasingly answer queries by synthesis rather than by surfacing localized links, the protective effect of IP-based localization weakens. The system has to make a decision about which corpus represents the answer, and the corpus weight tends to win.

Inside monolingual English markets, this is most likely to surface in state-level regulation with significant corpus asymmetry. California’s CCPA and Texas’s data privacy regime are written in the same language but represent different jurisdictional realities. The privacy literature is heavily California-weighted. When an AI Overview synthesizes a generic answer about privacy rights, the defaults tilt toward whichever jurisdiction has more authority signals. Sub-national regulatory granularity in any large country (liquor licensing, contractor licensing, real estate disclosure rules, zoning codes) faces the same dynamic.

The fix starts with treating sub-national jurisdictions as distinct entities. Each variant should canonicalize to itself, not to a national parent page that would invite collapse. Encode jurisdiction explicitly in structured data and copy. Schema.org’s areaServed operates at any geographic granularity; pair it with explicit copy markers like regulator names and state-specific identifiers. Reinforce sub-national grounding through Wikidata, where jurisdiction properties let you encode boundaries at the knowledge-graph level where AI systems pull entity context. Audit for sub-national authority gaps the same way you’d audit for international ones. Watch the secondary signals: if no one bids on Texas-specific terminology, Texas-specific content gets deprioritized in synthesis.
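As a rough illustration of the first two recommendations, the sketch below builds a self-referencing canonical tag and a JSON-LD block that scopes a page to one state through areaServed. The page name, the example.com URL, and the helper names are hypothetical; the choice of the Service and State types and the sameAs link to the Wikidata item for Texas are assumptions about one reasonable way to mark this up, not a prescribed pattern.

```python
import json
from html import escape


def canonical_link(url: str) -> str:
    """Each jurisdiction variant declares itself canonical, not a national parent."""
    return f'<link rel="canonical" href="{escape(url, quote=True)}">'


def jurisdiction_jsonld(name: str, url: str, state_name: str, state_wikidata_url: str) -> dict:
    """JSON-LD that scopes a page to one state via schema.org areaServed."""
    return {
        "@context": "https://schema.org",
        "@type": "Service",
        "name": name,
        "url": url,
        "areaServed": {
            "@type": "State",
            "name": state_name,
            "sameAs": state_wikidata_url,  # ties the area to a knowledge-graph entity
        },
    }


# Hypothetical page for illustration only.
page_url = "https://example.com/privacy/texas/"
doc = jurisdiction_jsonld(
    name="Data privacy compliance guide (Texas)",
    url=page_url,
    state_name="Texas",
    state_wikidata_url="https://www.wikidata.org/wiki/Q1439",
)
markup = json.dumps(doc, indent=2)
print(canonical_link(page_url))
print(markup)
```

The structured data only states the jurisdiction; the visible copy still needs the regulator names and state-specific identifiers mentioned above so that both signals agree.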

This isn’t a new playbook. It’s the cultural SEO framework applied below the country line: market segmentation, transcreation, retrieval constraints, and entity reinforcement, but at sub-national granularity. The Sant Jordi answer didn’t fail because of bad translation. It failed because the language-identification layer beneath the translation has never consistently distinguished Catalan from Occitan, Catalan from Spanish, or Catalan as the language of the query from Catalan as irrelevant noise.

Google said so itself, in Catalan, three years ago. The retrieval pipeline built on top of that layer inherits every one of those decisions. The brands that operate well across Spain and Mexico already know how to fix this for languages. The same techniques are now table stakes for operating well across any pair of jurisdictions, in any language combination. If you operate across multiple jurisdictions, the question to ask isn’t whether your content is localized. It’s whether the model can tell.

(Source: Search Engine Land)
