
Did DeepSeek Train Its AI Using Google’s Gemini?

Summary

– DeepSeek released an updated R1 AI model performing well on math and coding benchmarks, with speculation it was trained partly on Google’s Gemini data.
– Developer Sam Paech claims evidence that DeepSeek’s R1-0528 model shows linguistic preferences similar to Gemini 2.5 Pro, suggesting potential training on Gemini outputs.
– DeepSeek has faced prior accusations of using rival AI data, including OpenAI’s ChatGPT, with OpenAI alleging distillation techniques were used.
– AI companies like OpenAI and Google are tightening security measures, such as ID verification and trace summarization, to prevent unauthorized data use.
– Experts suggest DeepSeek may have used synthetic data from top API models like Gemini due to GPU shortages and available funding.

The recent release of DeepSeek’s upgraded R1 reasoning model has sparked speculation about its training data sources, with some experts suggesting potential ties to Google’s Gemini AI. While the Chinese lab hasn’t disclosed its data origins, independent developers have identified linguistic patterns and behavioral traces resembling Gemini’s outputs in the new model.

Sam Paech, an AI developer specializing in emotional intelligence assessments, pointed out striking similarities between DeepSeek’s R1-0528 and Google’s Gemini 2.5 Pro in word choice and phrasing. Another anonymous researcher behind the SpeechMap tool noted that the model’s reasoning traces closely mirror those produced by Gemini. Though not definitive proof, these observations have fueled debate about whether DeepSeek leveraged competitor data for training.
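The kind of stylistic comparison described above can be approximated crudely by comparing word-frequency profiles of two models' outputs. The sketch below is a minimal illustration of the idea (it is not Paech's actual methodology, and the sample strings are invented): texts that share distinctive phrasing score a higher cosine similarity than unrelated texts.

```python
from collections import Counter
import math

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the word-frequency vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Invented samples: two share distinctive phrasing, one does not.
sample_a = "let us delve into the nuanced interplay of these concepts"
sample_b = "we should delve into the nuanced interplay at hand"
sample_c = "the cat sat quietly on the warm windowsill"

print(cosine_similarity(sample_a, sample_b))  # relatively high
print(cosine_similarity(sample_a, sample_c))  # relatively low
```

Real fingerprinting work uses far richer features (character n-grams, token probabilities, reasoning-trace structure), but the principle of measuring distributional overlap is the same.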

This isn’t the first time DeepSeek has faced such allegations. Late last year, its V3 model occasionally identified itself as OpenAI’s ChatGPT, hinting at possible training on ChatGPT logs. OpenAI reportedly found evidence that DeepSeek had used distillation, a technique in which a smaller model is trained on a larger model’s outputs. Microsoft also detected unusual data transfers from OpenAI accounts believed to be linked to the Chinese lab, though no formal accusations were made.
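At its core, distillation trains a student model to match a teacher's output distribution rather than hard labels. The toy sketch below (pure Python, a single three-class example with made-up numbers, not any lab's actual pipeline) runs gradient descent on the cross-entropy between the teacher's soft probabilities and the student's softmax, whose gradient with respect to the logits is simply `p_student - p_teacher`:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_step(student_logits, teacher_probs, lr=0.5):
    """One gradient step pulling the student's distribution toward the
    teacher's; the cross-entropy gradient w.r.t. logits is p - q."""
    p = softmax(student_logits)
    return [z - lr * (ps - pt)
            for z, ps, pt in zip(student_logits, p, teacher_probs)]

teacher = [0.7, 0.2, 0.1]   # teacher's soft labels for one example (made up)
student = [0.0, 0.0, 0.0]   # student starts out uniform
for _ in range(200):
    student = distill_step(student, teacher)

print([round(p, 2) for p in softmax(student)])  # approaches the teacher's probs
```

In practice this runs over millions of teacher-generated examples and a temperature-scaled softmax, but the mechanism of learning from another model's probability distribution is the same.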

While model misidentification isn’t rare—thanks to the growing volume of AI-generated content online—experts like Nathan Lambert of AI2 suggest DeepSeek may have strategic reasons to use synthetic data from top-tier models. “With limited GPU access but significant funding, distillation could effectively boost their computational efficiency,” Lambert noted.
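Generating synthetic training data from a stronger model's API typically means collecting prompt/response pairs and saving them in a supervised fine-tuning format. The sketch below is hypothetical: `query_teacher` is a stand-in stub, not a real provider SDK call, and the JSONL layout is just one common convention.

```python
import json

def query_teacher(prompt: str) -> str:
    """Stand-in for a commercial model API call (hypothetical stub;
    a real pipeline would invoke a provider's SDK here)."""
    return f"[teacher answer to: {prompt}]"

def build_synthetic_dataset(prompts, path="synthetic.jsonl"):
    """Collect teacher responses as prompt/completion pairs in a JSONL
    file, a format commonly used for supervised fine-tuning."""
    records = [{"prompt": p, "completion": query_teacher(p)} for p in prompts]
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")
    return records

data = build_synthetic_dataset(["Explain binary search.", "What is a mutex?"])
print(len(data))  # 2
```

The economics Lambert alludes to follow directly: API calls to a frontier model are far cheaper than the GPU hours needed to produce comparable training signal from scratch.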

To counter such practices, major AI firms are tightening security. OpenAI now requires ID verification for access to its most advanced models, a process unavailable in unsupported regions such as China. Google and Anthropic have begun summarizing or otherwise obscuring their models’ reasoning traces to hinder distillation attempts, safeguarding their proprietary advantages.

As the industry grapples with data integrity challenges, the line between innovation and imitation remains blurred. Whether DeepSeek’s approach crosses ethical boundaries or simply reflects competitive optimization continues to divide experts. Google has yet to comment on the allegations.

(Source: TechCrunch)

