Google researchers have developed techniques that let them train a language model with more than a trillion parameters. Their 1.6-trillion-parameter model, reportedly the largest to date, achieved up to a 4x speedup over the previously developed T5-XXL language model. It is roughly nine times the size of OpenAI's GPT-3, which uses about 175 billion parameters.

Large-scale training remains an effective path towards powerful models: simple architectures backed by large datasets and high parameter counts tend to outperform far more complex algorithms. This insight led the researchers to the Switch Transformer, which builds on the mixture-of-experts paradigm: the model is split into many small "expert" networks, each specialized in a different kind of task, and a gating network chooses which expert handles a given piece of data, as sketched below.
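To make the routing idea concrete, here is a minimal sketch of switch-style top-1 routing in plain NumPy. All dimensions, weights, and the ReLU expert blocks are toy assumptions for illustration; the actual models were trained with far larger experts and a distributed implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes for illustration only (not from the paper).
d_model, d_ff, n_experts, n_tokens = 16, 32, 4, 8

# Gating network: a single linear layer that scores each expert per token.
w_gate = rng.normal(scale=0.1, size=(d_model, n_experts))

# Each "expert" is a small feed-forward block with its own weights.
experts = [
    (rng.normal(scale=0.1, size=(d_model, d_ff)),
     rng.normal(scale=0.1, size=(d_ff, d_model)))
    for _ in range(n_experts)
]

def switch_layer(x):
    """Top-1 ("switch") routing: each token is sent to exactly one expert."""
    logits = x @ w_gate                          # (n_tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)   # softmax over experts
    chosen = probs.argmax(axis=-1)               # index of the single best expert
    out = np.zeros_like(x)
    for e, (w1, w2) in enumerate(experts):
        mask = chosen == e
        if mask.any():
            h = np.maximum(x[mask] @ w1, 0.0)    # expert feed-forward (ReLU)
            # Scale by the gate probability, so routing can stay differentiable
            # in a real implementation.
            out[mask] = (h @ w2) * probs[mask, e][:, None]
    return out

tokens = rng.normal(size=(n_tokens, d_model))
print(switch_layer(tokens).shape)  # (8, 16)
```

Because each token only touches one expert, adding more experts grows the parameter count without growing the compute spent per token, which is what makes trillion-parameter scales tractable.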

In one experiment, the researchers pretrained several different Switch Transformer models using 32 TPU cores on the Colossal Clean Crawled Corpus, a 750GB dataset scraped from Reddit, Wikipedia and other sources. The models were trained to predict missing words in passages where 15% of the words had been masked out (illustrated below), alongside other challenges such as retrieving text and answering difficult questions. The researchers claimed that the 1.6-trillion-parameter model with 2,048 experts exhibited no training instability, in contrast to smaller models with fewer parameters and experts.
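The masked-word objective can be illustrated with a toy example. The sentence, the rounding of the 15% rate, and the `<mask>` placeholder below are illustrative assumptions, not the paper's exact preprocessing pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy passage; the real corpus is the 750GB C4 dataset.
tokens = "the switch transformer routes each token to a single expert".split()
mask_rate = 0.15  # roughly 15% of the words are hidden from the model

n_mask = max(1, int(round(mask_rate * len(tokens))))
masked_positions = rng.choice(len(tokens), size=n_mask, replace=False)

inputs = list(tokens)
targets = {}
for pos in masked_positions:
    targets[int(pos)] = inputs[pos]   # the model must predict this word
    inputs[pos] = "<mask>"            # what the model actually sees

print(" ".join(inputs))
print(targets)
```

The model sees the corrupted passage and is scored on how well it recovers the hidden words at the masked positions.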

“Though this work has focused on extremely large models, we also find that models with as few as two experts improve performance while easily fitting within memory constraints of commonly available GPUs or TPUs,” the researchers wrote in the paper. “We cannot fully preserve the model quality, but compression rates of 10 to 100 times are achievable by distilling our sparse models into dense models while achieving ~30% of the quality gain of the expert model.”
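Distilling a sparse (expert) teacher into a dense student generally means training the student to reproduce the teacher's output distribution. The sketch below shows one common formulation, a temperature-softened cross-entropy between the two distributions; the temperature value and toy logits are assumptions for illustration, not the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, t=1.0):
    """Softmax with an optional temperature t that softens the distribution."""
    z = z / t
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical logits over a tiny vocabulary for a single training example.
vocab_size = 10
teacher_logits = rng.normal(size=vocab_size)   # large sparse (expert) model
student_logits = rng.normal(size=vocab_size)   # small dense model being trained

temperature = 2.0  # assumed value; softens the teacher's predictions
p_teacher = softmax(teacher_logits, temperature)
p_student = softmax(student_logits, temperature)

# Distillation loss: cross-entropy between teacher and student distributions.
# Minimizing it pushes the dense student toward the sparse teacher's behaviour.
distill_loss = -np.sum(p_teacher * np.log(p_student + 1e-12))
print(f"distillation loss: {distill_loss:.4f}")
```

In practice this loss would be minimized over many examples with gradient descent, yielding a dense model that is far smaller than the sparse teacher but retains part of its quality gain.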

In future work, the researchers plan to apply the Switch Transformer to “new and across different modalities,” including image and text. They believe that model sparsity can confer advantages in a range of different media, as well as multimodal models.
