Google: Attackers Made 100,000+ Attempts to Clone Gemini AI

▼ Summary
– Google reported that actors have attempted to clone its Gemini AI by prompting it extensively, with one session exceeding 100,000 queries to collect training data.
– The company frames this “model extraction” as intellectual property theft in its latest threat assessment report, despite having built its own models from scraped internet data.
– Google itself faced accusations in 2023 of using outputs from OpenAI’s ChatGPT to train its Bard chatbot, leading to a researcher’s resignation.
– Google’s terms forbid such data extraction, and it suspects private companies and researchers worldwide are behind these cloning efforts for a competitive edge.
– The industry often calls this practice “distillation,” where a new, cheaper model is trained using the outputs of an existing, expensive LLM like Gemini.
In a recent disclosure, Google revealed that its Gemini AI chatbot has been the target of extensive efforts to replicate its capabilities. The tech giant reported over 100,000 adversarial prompts across multiple non-English languages in a single session, an operation it attributes to commercially driven actors. According to Google, these attempts aimed to harvest the model’s responses to train a less expensive, competing system. This activity, which the company labels as “model extraction,” is framed as a significant form of intellectual property theft within its latest threat assessment report.
The practice of creating a new AI model by training it on the outputs of an existing one is commonly known in the industry as “distillation.” It offers a potential shortcut for entities that want to develop a sophisticated large language model but lack the immense financial resources and time required for original training from scratch. By querying a powerful model like Gemini thousands of times, attackers can compile a dataset to train their own derivative system, bypassing the need for the foundational work.
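The harvesting step described above can be sketched in a few lines. This is a minimal, hypothetical illustration only: `teacher_respond` is a stand-in for calls to a proprietary model's API (no real service is queried), and the function and variable names are invented for this example. The point is simply that each query yields one (prompt, response) pair, so a session of 100,000 queries yields a 100,000-example training set for a student model.

```python
# Hypothetical sketch of output harvesting for "distillation".
# teacher_respond is a placeholder for an expensive proprietary
# model; a real extraction effort would issue many thousands of
# such queries, like the ~100,000 reported by Google.

def teacher_respond(prompt: str) -> str:
    # Placeholder for the proprietary model's answer.
    return f"answer({prompt})"

def harvest(prompts):
    """Collect (prompt, response) pairs as student training data."""
    return [(p, teacher_respond(p)) for p in prompts]

dataset = harvest(["q1", "q2", "q3"])
print(len(dataset))  # one training example per query
```

A student model would then be fine-tuned on `dataset`, imitating the teacher's behavior without repeating its original training cost — which is precisely what Google's terms of service prohibit.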
Google’s position on this matter is notably firm, with its terms of service explicitly prohibiting such data extraction from its AI models. The company asserts that the perpetrators are primarily private firms and researchers seeking a competitive advantage, with attacks originating globally. However, Google has not publicly identified any specific suspects involved in these cloning attempts.
This stance invites scrutiny, given the broader context of how AI models are developed. Many foundational models, including Google’s own, have been trained on vast amounts of data scraped from the internet, often without explicit permission from the original creators. This reality complicates the narrative of pure victimhood. Furthermore, Google itself has faced allegations related to similar practices. In 2023, reports surfaced that Google’s Bard team had been accused of using outputs from OpenAI’s ChatGPT, shared on a public forum, to assist in training its chatbot. A senior AI researcher reportedly raised concerns that this violated OpenAI’s terms of service before departing the company. Google denied the allegations but is said to have ceased using the data in question.
The report ultimately provides a glimpse into the competitive and sometimes ethically ambiguous tactics emerging in the race to develop artificial intelligence. It highlights the tension between protecting proprietary technology and the industry’s common reliance on publicly available data and knowledge. As AI capabilities become increasingly valuable, conflicts over how that knowledge is shared, used, and protected are likely to intensify.
(Source: Ars Technica)