Researchers at BigScience AI Introduce Open-Source 'BLOOM': An Autoregressive Multilingual Large Language Model Larger Than GPT-3 and OPT-175B.

BigScience is a project in which hundreds of researchers from all over the world work together. It has unveiled the largest open-science multilingual language model ever trained.

What are language models?

Language models are artificial intelligence (AI) systems whose main uses involve natural (i.e., human) language. They can answer questions, compose sentences, detect how someone feels, and summarize, simplify, or translate text. Most existing models have been trained only on English text, are usually built by big tech companies, and rely on principles and methods that are hard to reproduce fully. This opacity matters: when one of these models answers a question, you cannot tell whether the answer was genuinely generated or was simply memorized from its training data.

Large language models

Large language models (LLMs) are algorithms trained with deep learning on enormous volumes of data, and they are among the most active fields of AI research. Powerful models such as GPT-3 and LaMDA, which generate writing that reads as if a human wrote it, could drastically alter how we consume information online. For example, they can serve as chatbots, search for information, moderate content online, summarize books, and generate new tracts of text in response to prompts.

BigScience's open-source LLM

Hugging Face, a French-American AI startup, launched BigScience in the spring of 2021 to address these problems by training a new model called BLOOM.

BLOOM learns from large collections of text, called corpora, by following a simple rule: it guesses the next word in a sentence and then compares its guess to the actual word. The model's parameters are then adjusted based on how accurate the guess was.
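BLOOM itself is a large transformer, but the training rule described above can be illustrated in a few lines. Below is a minimal sketch in PyTorch; the tiny model, toy vocabulary, and random token ids are illustrative assumptions standing in for real text, and none of the sizes or layers reflect BLOOM's actual configuration.

import torch
import torch.nn as nn

vocab_size, embed_dim = 100, 32          # toy sizes, purely illustrative

model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim), # map token ids to vectors
    nn.Linear(embed_dim, vocab_size),    # score every word in the vocabulary
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (1, 16))   # a stand-in "sentence"
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # guess each next word

logits = model(inputs)                           # shape: (1, 15, vocab_size)
loss = loss_fn(logits.view(-1, vocab_size), targets.reshape(-1))
loss.backward()    # compare the guesses to the actual words...
optimizer.step()   # ...and adjust the parameters accordingly

Repeating this loop over trillions of tokens, with a vastly larger model, is what the real training run amounts to.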

BLOOM has processed more than a trillion words in this way, yielding a model with 176 billion parameters. Training it took several months on hundreds of graphics processing units (GPUs) running in parallel, the equivalent of 5 million hours of computing time.

BLOOM's potential

With 176 billion parameters, more than OpenAI's GPT-3 and Meta AI's OPT-175B, BLOOM can generate text in 46 natural languages and dialects as well as 13 programming languages. It was trained on 1.6 TB of text data, the equivalent of 320 copies of the complete works of Shakespeare. Its 46 languages include French, Vietnamese, Mandarin, Indonesian, Catalan, 13 Indic languages (including Hindi), and 20 African languages; a little more than 30% of the training data was in English.
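As a concrete illustration of how the released model can be used, here is a sketch of generating text with a BLOOM checkpoint via the Hugging Face transformers library. "bigscience/bloom-560m" is a small public variant chosen here so the example runs on modest hardware; swapping in "bigscience/bloom" would load the full 176-billion-parameter model, which requires far more resources. The French prompt is an arbitrary example.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

prompt = "Le langage est"  # BLOOM handles French among its 46 languages
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))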

BLOOM learned these languages from literature, scientific articles, and sports news, alongside code in its 13 programming languages. The training data was deliberately not separated by language because, perhaps surprisingly, models learn better this way: mixing content across languages yields powerful, robust models for all of them, which often outperform models trained on a single language.

Conclusion

BLOOM is unique in the world of large language models, where English dominates. That dominance is a side effect of how LLMs are built: they are trained on data scraped from the internet, where English is the most common language.

Including so many languages could greatly help AI researchers in poorer countries, who often struggle to access natural-language processing because it demands large amounts of expensive computing power. BLOOM lets them skip the costly work of developing and training a model from scratch and focus instead on building applications and fine-tuning the model for tasks in their native languages, as sketched below.
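As a hedged illustration of that fine-tuning path, the sketch below continues training a small BLOOM checkpoint on a few in-language examples. The checkpoint name, the Swahili sample sentences, and the hyperparameters are assumptions for demonstration, not a recommended recipe; a real project would use a proper dataset and evaluation.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# A handful of toy in-language examples stands in for a real dataset.
texts = ["Habari ya asubuhi.", "Ninafurahi kukutana nawe."]

model.train()
for text in texts:
    batch = tokenizer(text, return_tensors="pt")
    # For causal LMs in transformers, passing labels = input_ids makes
    # the model compute the next-word-prediction loss internally.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()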

BLOOM is now available for download and research on the Hugging Face Hub.
