
EleutherAI releases massive AI training dataset of licensed and open domain text

EleutherAI, an AI research organization, has released what it claims is one of the largest collections of licensed and open-domain text for training AI models.

The dataset, called The Common Pile v0.1, took around two years to complete in collaboration with AI startups Poolside, Hugging Face, and others, along with several academic institutions. Weighing in at 8 terabytes, The Common Pile v0.1 was used to train two new AI models from EleutherAI, Comma v0.1-1T and Comma v0.1-2T, which EleutherAI claims perform on par with models developed using unlicensed, copyrighted data.

AI companies, including OpenAI, are embroiled in lawsuits over their AI training practices, which rely on scraping the web, including copyrighted material like books and research journals, to build model training datasets. While some AI companies have licensing arrangements in place with certain content providers, most maintain that the U.S. legal doctrine of fair use shields them from liability in cases where they trained on copyrighted work without permission.

EleutherAI argues that these lawsuits have “drastically decreased” transparency from AI companies, which the organization says has harmed the broader AI research field by making it more difficult to understand how models work and what their flaws might be.

“[Copyright] lawsuits have not meaningfully changed data sourcing practices in [model] training, but they have drastically decreased the transparency companies engage in,” Stella Biderman, EleutherAI’s executive director, wrote in a blog post on Hugging Face early Friday. “Researchers at some companies we have spoken to have also specifically cited lawsuits as the reason why they’ve been unable to release the research they’re doing in highly data-centric areas.”

The Common Pile v0.1, which can be downloaded from Hugging Face’s AI dev platform and GitHub, was created in consultation with legal experts, and it draws on sources including 300,000 public domain books digitized by the Library of Congress and the Internet Archive. EleutherAI also used Whisper, OpenAI’s open source speech-to-text model, to transcribe audio content.

EleutherAI claims Comma v0.1-1T and Comma v0.1-2T are proof that the Common Pile v0.1 was curated carefully enough to enable developers to build models competitive with proprietary alternatives. According to EleutherAI, the models, both of which are 7 billion parameters in size and were trained on only a fraction of the Common Pile v0.1, rival models like Meta’s first Llama AI model on benchmarks for coding, image understanding, and math.

Parameters, sometimes called weights, are the internal components of an AI model that guide its behavior and answers.

“In general, we think that the common idea that unlicensed text drives performance is unjustified,” Biderman wrote in her post. “As the amount of accessible openly licensed and public domain data grows, we can expect the quality of models trained on openly licensed content to improve.”

The Common Pile v0.1 appears to be in part an effort to right EleutherAI’s historical wrongs. Years ago, the organization released The Pile, an open collection of training text that includes copyrighted material. AI companies have come under fire, and legal pressure, for using The Pile to train models.

EleutherAI is committing to releasing open datasets more frequently going forward, in collaboration with its research and infrastructure partners.
