EleutherAI Launches Huge Open-Source AI Training Dataset

Summary
– EleutherAI released The Common Pile v0.1, a large openly licensed and public-domain text dataset for AI training, developed over two years with partners including Hugging Face and academic institutions.
– The 8TB dataset was used to train two AI models (Comma v0.1-1T and Comma v0.1-2T), which EleutherAI claims perform comparably to models trained on unlicensed data.
– AI companies face lawsuits over using copyrighted data for training, with EleutherAI arguing these lawsuits have reduced transparency in AI research.
– The Common Pile v0.1 includes sources such as 300,000 public-domain books and transcribed audio, aiming to provide a legal alternative to copyrighted datasets.
– EleutherAI commits to releasing more open datasets going forward, a shift from earlier datasets like The Pile, which contained copyrighted material.

EleutherAI has unveiled a massive open-source dataset designed to train AI models while sidestepping the copyright concerns that have plagued the industry. The newly released Common Pile v0.1, developed over two years in partnership with Hugging Face and several academic institutions, spans 8 terabytes of openly licensed and public-domain text. The dataset was used to train two new models, Comma v0.1-1T and Comma v0.1-2T, which EleutherAI says match the performance of models trained on unlicensed data.
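Because the release was coordinated with Hugging Face, the dataset can presumably be loaded with the standard datasets library. The sketch below assumes a streaming workflow; the repository ID and the "text" field name are illustrative, not confirmed by the article, so check EleutherAI's Hugging Face pages for the exact names.

```python
# Minimal sketch: streaming a Common Pile sample from Hugging Face.
from datasets import load_dataset

ds = load_dataset(
    "common-pile/common_pile_v0.1",  # hypothetical repository ID
    split="train",
    streaming=True,  # stream rather than download all 8 TB
)

for example in ds.take(3):  # peek at a few documents
    print(example["text"][:200])  # "text" field name is an assumption
```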
The move comes as major AI companies face lawsuits over training practices involving copyrighted material scraped from the web. While some firms argue fair use protects them, EleutherAI contends these legal battles have stifled transparency in AI research. Stella Biderman, the organization’s executive director, noted that lawsuits have led companies to withhold critical research, hindering progress in the field.
The Common Pile v0.1 sources its data from legally vetted materials, including 300,000 public-domain books from the Library of Congress and Internet Archive. OpenAI’s Whisper speech-to-text tool was also used to transcribe audio content. EleutherAI asserts that its 7-billion-parameter Comma models, trained on just a portion of the dataset, perform comparably to Meta’s Llama in coding, math, and image comprehension tasks.
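For a sense of what that transcription step involves, the open-source Whisper library turns audio into text in a few lines. This is a generic sketch, not EleutherAI's actual pipeline; the checkpoint choice and file name are placeholders.

```python
# Generic Whisper transcription sketch (pip install openai-whisper).
import whisper

model = whisper.load_model("base")  # small checkpoint; larger ones are more accurate
result = model.transcribe("lecture.mp3")  # placeholder file name
print(result["text"])  # plain-text transcript
```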
Biderman pushed back on the notion that unlicensed text is essential for high-performing AI, calling it misguided. As openly licensed datasets expand, she expects models trained on them to keep improving. The release also marks a shift for EleutherAI, which previously distributed The Pile, a controversial dataset containing copyrighted material.
Looking ahead, EleutherAI plans to release more open datasets in collaboration with research partners, including the University of Toronto. This initiative aims to foster transparency while providing developers with ethical alternatives for AI training.
(Source: TechCrunch)