RedPajama is a project that aims to build a set of leading, fully open-source models. Its developers have announced the completion of the project's first stage: reproducing the LLaMA training dataset, which contains approximately 1.2 trillion tokens.
“A computer would deserve to be called intelligent if it could deceive a human into believing that it was human.” – Alan Turing
The most cutting-edge AI foundation models are only partially open source and are accessible only through paid APIs, which limits their use and hinders customization and research. RedPajama, a new initiative, intends to develop fully open-source, industry-leading models. The project's first phase, reproducing the LLaMA training dataset, is now complete.
Open-source models have advanced significantly in recent months, and the current moment in AI is often compared to the early days of Linux. Stable Diffusion showed that open-source models can compete with commercial products and foster creativity through community involvement. A similar movement has since taken shape around large language models, with the release of fully open models such as Pythia, OpenChatKit, Open Assistant, and Dolly, alongside semi-open models such as LLaMA, Alpaca, Vicuna, and Koala.
The goal of RedPajama's initial step was to replicate the LLaMA training dataset, which includes over 1.2 trillion tokens; the broader aim is to produce fully open-source language models. Because most strong foundation models today are only partially open source and reachable only through commercial APIs, such as the one behind ChatGPT, RedPajama seeks to change the game with open models that enable research and customization.
The RedPajama dataset can be downloaded from Hugging Face, both as the full 1.2 trillion token corpus and as a smaller random sample. The collection consists of seven data slices: CommonCrawl, C4, GitHub, arXiv, Books, Wikipedia, and StackExchange. Each slice was carefully pre-processed and filtered for quality, with the quality filters tuned to roughly match the token counts Meta AI reported in the LLaMA paper. The CommonCrawl slices were processed with the CCNet pipeline, and a linear classifier was used to select pages that resemble Wikipedia.
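As a minimal sketch, the data can be pulled with the Hugging Face `datasets` library. The repository names and the slice name used below are assumptions based on the announcement and may differ from the published hub entries.

```python
# Minimal sketch: loading RedPajama data with the Hugging Face `datasets` library.
# Repository and config names below are assumptions and may differ on the hub.
from datasets import load_dataset

# Smaller random sample, convenient for quick inspection.
sample = load_dataset("togethercomputer/RedPajama-Data-1T-Sample", split="train")
print(sample[0]["text"][:500])   # peek at one document
print(sample[0]["meta"])         # per-document metadata (source slice, URL, etc.)

# The full corpus is large (~3 TB compressed), so streaming avoids
# downloading everything up front; "arxiv" is one assumed slice name.
full = load_dataset(
    "togethercomputer/RedPajama-Data-1T",
    "arxiv",
    split="train",
    streaming=True,
)
for doc in full:
    print(doc["text"][:200])
    break
```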
The GitHub data was filtered by licence and quality, while the arXiv slice consists of scientific articles with boilerplate removed. The Books data was deduplicated based on content similarity, the Wikipedia subset had its boilerplate stripped, and the StackExchange subset draws on the most popular sites, again with boilerplate removed. The full dataset takes up about 5TB uncompressed on disc and can be downloaded as roughly 3TB compressed.
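To illustrate the kind of quality filtering described above, here is a hedged sketch of a linear classifier that scores how "Wikipedia-like" a page is. This is not the actual CCNet or RedPajama code; the training lists (`wiki_pages`, `random_pages`), the features, and the threshold are all hypothetical stand-ins.

```python
# Illustrative sketch only (not the RedPajama pipeline): a linear classifier that
# scores how "Wikipedia-like" a web page is, in the spirit of the quality filter
# described above. The training data here is hypothetical.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

wiki_pages = ["...pages referenced by Wikipedia..."]   # placeholder positives
random_pages = ["...random CommonCrawl pages..."]      # placeholder negatives

texts = wiki_pages + random_pages
labels = [1] * len(wiki_pages) + [0] * len(random_pages)

# Hashing word n-grams keeps memory bounded at CommonCrawl scale.
classifier = make_pipeline(
    HashingVectorizer(ngram_range=(1, 2), n_features=2**20, alternate_sign=False),
    LogisticRegression(max_iter=1000),
)
classifier.fit(texts, labels)

def passes_quality_filter(page_text: str, threshold: float = 0.5) -> bool:
    """Keep only pages whose predicted probability of being Wikipedia-like
    exceeds a chosen threshold."""
    return classifier.predict_proba([page_text])[0, 1] >= threshold
```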
The RedPajama project is working with the Meerkat project on a dashboard and embeddings for the GitHub subset of the data, intended for interactive analysis; installation and usage instructions are available on GitHub. With a copy of the pre-training data in hand, the next step is to train a strong base model. The Oak Ridge Leadership Computing Facility is supporting the project through the INCITE programme, and a full suite of models is expected to be available soon.
Furthermore, the success of Alpaca with just 50,000 high-quality, diverse instructions has encouraged the team to pursue instruction tuning. Through OpenChatKit, the team has collected hundreds of thousands of natural user instructions, which will be used to release instruction-tuned versions of the RedPajama models.
Image source: Unsplash