Announcing Common Corpus: A 2+ trillion token dataset that's fully open and accessible
We released Common Corpus, the largest fully open dataset for LLM training, comprising over 2 trillion tokens.
Pleias is committed to training LLMs in the open. This means not only releasing our models but also being open about every aspect, from the training data to the training code. We define "open" strictly: all data must be both accessible and under permissive licenses.
The open LLM ecosystem particularly lacks transparency around the volume and provenance of training data. Common Corpus addresses this gap.
Inside Common Corpus: A Diverse Dataset for Better Models
Common Corpus contains data from diverse domains and genres, including books, newspapers, scientific articles, government and legal documents, code, and more. Dataset diversity is vital for developing models that generalize well to different contexts.
Common Corpus contains the largest amount of multilingual content of any open dataset: over 40% of it is non-English.
One of the unique contributions of Common Corpus is our cultural heritage data, which consists of books and newspapers. Books in particular are extremely valuable for training language models, as they contribute high-quality, stylistically rich text. Open book data, however, is scarce. Books are also valuable for developing long-context models, which are useful for a wide range of applications. Collecting book data is difficult, mainly because of legal challenges: books are often protected by copyright. By creating a dataset of public domain books, we help democratize access to long-form, culturally rich data without incurring the legal or ethical challenges other developers face.
Beyond Common Crawl: Multilingualism and Multimodality for Data Commons
Not only is there a lack of open data, but open data isn’t sufficiently multilingual, multimodal, diverse, or high-quality. We considered all of these dimensions during the development of Common Corpus.
Increasing open data availability for languages other than English is essential: it broadens access to language technologies, which are quickly becoming an important part of the global digital landscape. Common Corpus primarily contains English and French data, but it also includes a significant amount of data in German, Spanish, Italian, and other European languages.
Our data is also multimodal. Our financial dataset, Finance Commons, contains 1.25 million PDFs covering a wide variety of layouts and formats. Common Corpus will also be enlarged with a large collection of freely licensed scientific publications, comprising more than 12 million PDFs. These datasets enable the development of next-generation open multimodal models, especially for applications related to document processing. This could in turn unlock even more training data that is currently not available in usable formats.
Building Quality: From OCR Correction to Toxicity Filtering
Much of the dataset curation process has focused on creating a high-quality dataset, and we consider quality from many perspectives. As much of our data comes from digitized texts, reducing digitization artifacts and OCR errors was a primary focus. We developed specialized tools for OCR correction, which we have also released. One of these tools is OCRonos, a very small OCR error correction model. We find that it can correct a wide variety of errors, even in texts with high error rates. With only 124 million parameters, OCRonos exemplifies our approach of specialized pretraining and makes it possible to correct large amounts of noisy cultural heritage text at scale. This allowed us to leverage existing text data that was previously not of high enough quality to be useful for training language models.
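As an illustration, here is a minimal sketch of what running a small correction model like OCRonos over a noisy excerpt could look like with the transformers library. The Hub model id and the prompt convention (a text block followed by a correction block) are assumptions for this example, not a documented interface.

```python
# Sketch: correcting a noisy OCR excerpt with a small causal language model.
# The model id and the "### Text ### / ### Correction ###" prompt convention
# below are assumptions, not a confirmed OCRonos interface.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PleIAs/OCRonos-Vintage"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

noisy = "Tbe Parliament assembled on tbe 3rd of Mav, 1621."
prompt = f"### Text ###\n{noisy}\n\n### Correction ###\n"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,  # greedy decoding is sufficient for correction
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```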
Another dimension of quality we addressed is bias and toxicity. Most language model training datasets contain harmful content. Because our dataset is multilingual and contains digitized text with OCR errors, we developed our own toxicity classifier to identify harmful content. We then created a pipeline to remove or synthetically rewrite the flagged content. We documented this process in our recent preprint, “Toxicity of the Commons: Curating Open-Source Pre-Training Data.”
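To make the filter-or-rewrite idea concrete, the sketch below triages documents with a toxicity classifier. The model id is a placeholder (not our actual classifier) and the threshold is arbitrary; the full procedure is described in the preprint.

```python
# Illustrative triage pass with a toxicity classifier.
# The model id is hypothetical and the threshold is arbitrary.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="your-org/toxicity-classifier",  # placeholder, not Pleias's classifier
    truncation=True,
)

def triage(documents, threshold=0.8):
    """Split documents into kept ones and candidates for removal or rewriting."""
    kept, to_rewrite = [], []
    for doc in documents:
        result = classifier(doc)[0]
        is_toxic = result["label"] == "toxic" and result["score"] >= threshold
        (to_rewrite if is_toxic else kept).append(doc)
    return kept, to_rewrite

kept, to_rewrite = triage(["An example passage from a digitized newspaper."])
```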
Common Corpus is now available on Hugging Face for training models. We will release the sub-corpora individually in the coming weeks. We will also release a complete report about the creation of this dataset, which will include full details about the curation and filtering procedures we used and the complete provenance of all of our data.
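For readers who want to explore the data, a minimal sketch of streaming it from the Hugging Face Hub is shown below; the dataset id is assumed here, and streaming avoids downloading the full multi-terabyte corpus.

```python
# Sketch: stream a few examples from Common Corpus.
# The dataset id "PleIAs/common_corpus" is an assumption for this example.
from datasets import load_dataset

corpus = load_dataset("PleIAs/common_corpus", split="train", streaming=True)

for i, example in enumerate(corpus):
    print(example.keys())  # inspect the available fields
    if i >= 2:
        break
```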