
EleutherAI Launches Huge Open-Source AI Training Dataset

Summary

– EleutherAI released The Common Pile v0.1, a large dataset of licensed and public-domain text for AI training, developed over two years with partners including Hugging Face and academic institutions.
– The 8TB dataset was used to train two AI models (Comma v0.1-1T and Comma v0.1-2T), which EleutherAI claims perform comparably to models trained on unlicensed data.
– AI companies face lawsuits over using copyrighted data for training, with EleutherAI arguing these lawsuits have reduced transparency in AI research.
– The Common Pile v0.1 includes sources like 300,000 public domain books and transcribed audio, aiming to provide a legal alternative to copyrighted datasets.
– EleutherAI commits to releasing more open datasets in the future, a shift from its past use of copyrighted material in earlier datasets like The Pile.

EleutherAI has unveiled a massive open-source dataset designed to train AI models while avoiding copyright concerns that have plagued the industry. The newly released Common Pile v0.1, developed in partnership with AI firms like Hugging Face and academic institutions, spans 8 terabytes of licensed and public-domain text. This dataset was used to train two new models, Comma v0.1-1T and Comma v0.1-2T, which reportedly match the performance of proprietary models built on unlicensed data.

The move comes as major AI companies face lawsuits over training practices involving copyrighted material scraped from the web. While some firms argue fair use protects them, EleutherAI contends these legal battles have stifled transparency in AI research. Stella Biderman, the organization’s executive director, noted that lawsuits have led companies to withhold critical research, hindering progress in the field.

The Common Pile v0.1 sources its data from legally vetted materials, including 300,000 public-domain books from the Library of Congress and Internet Archive. OpenAI’s Whisper speech-to-text tool was also used to transcribe audio content. EleutherAI asserts that its 7-billion-parameter Comma models, trained on just a portion of the dataset, perform comparably to Meta’s Llama in coding, math, and image comprehension tasks.

Biderman emphasized that the belief that unlicensed text is essential for high-performing AI is misguided. As openly licensed datasets expand, she predicts models trained on them will continue improving. The release also marks a shift for EleutherAI, which previously distributed The Pile, a controversial dataset containing copyrighted material.

Looking ahead, EleutherAI plans to release more open datasets in collaboration with research partners, including the University of Toronto. This initiative aims to foster transparency while providing developers with ethical alternatives for AI training.

Editor’s note: Additional details were added regarding EleutherAI’s collaborative efforts in dataset development.

(Source: TechCrunch)

Topics

Common Pile v0.1 release; AI model training with licensed data; legal concerns over AI training; performance of the Comma models; sources of The Common Pile v0.1; EleutherAI's shift away from copyrighted material; future open dataset releases; collaboration with academic institutions

The Wiz

Wiz Consults, home of the Internet, is led by "the twins", Wajdi & Karim, experienced professionals who are passionate about helping businesses succeed in the digital world. With over 20 years of experience in the industry, they specialize in digital publishing and marketing, and have a proven track record of delivering results for their clients.