
Large-Scale Multilingual AI Models from Google, Facebook, and Microsoft

Researchers from Google, Facebook, and Microsoft have published their recent work on multilingual AI models. Google and Microsoft have released models that achieve new state-of-the-art performance on NLP tasks measured by the XTREME benchmark, while Facebook has produced a non-English-centric many-to-many translation model.

Teams from Microsoft Research, Google Research, and Facebook AI Research (FAIR) have been working on the problem of single natural-language processing (NLP) models for multiple languages. Microsoft’s Project Turing has developed the second version of the Turing Universal Language Representation (T-ULRv2), a model that can encode text phrases from 94 different languages into the same vector space. T-ULRv2 currently holds the top spot on the XTREME benchmark leaderboard, which ranks model performance on a variety of NLP tasks across 40 languages. Google has developed mT5, a multilingual extension of the T5 model, trained on mC4, a new large-scale multilingual dataset mined from the open Common Crawl repository that contains 6.3T tokens in over 100 languages. Google also claims state-of-the-art results on XTREME, but the model has not been included in the latest leaderboard. Facebook’s translation model, M2M-100, has been trained on CCMatrix, another dataset mined from Common Crawl, covering 100 languages with 7.5B parallel sentences across 2,200 source-destination combinations. M2M-100 outperforms models that are trained on English-centric datasets. According to FAIR researcher Angela Fan:

A single model that supports all languages, dialects, and modalities will help us better serve more people, keep translations up to date, and create new experiences for billions of people equally. This work brings us closer to this goal.

Much of the recent success in using deep learning for NLP is due to transfer learning: fine-tuning large models that have been pre-trained on a large dataset scraped from the web. Because much of that data is English, the resulting models are largely limited to English-only tasks. While models can also be trained on non-English data, many languages are considered “low-resource,” meaning that there is a lack of training data in that language. Experiments have found that pre-training a single NLP model with data from multiple languages can produce a model that performs “surprisingly well” on cross-lingual tasks, possibly by learning universal structures common to several languages. These models are often based on variations of the BERT model, including Multilingual BERT (mBERT) and FAIR’s XLM-R. To evaluate the performance of cross-lingual models, researchers have developed cross-lingual versions of common NLP benchmarks; for example, the XTREME benchmark measures performance on sentence classification, sentence retrieval, structured prediction, and question answering in 40 languages.
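The cross-lingual transfer recipe behind these benchmarks is to fine-tune a multilingual encoder on labeled English data only, then evaluate it directly on other languages. The following minimal sketch illustrates the idea with the publicly available XLM-R base checkpoint and the Hugging Face transformers library; the toy sentences, two-label setup, and single training step are placeholder assumptions standing in for a real dataset such as XNLI and a full training loop.

```python
# Minimal sketch: fine-tune a multilingual encoder (XLM-R) on English
# examples, then run zero-shot inference on another language.
# The toy sentences and labels are placeholders for a real cross-lingual dataset.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2)

# English training examples (label 1 = positive, 0 = negative).
train_texts = ["I really enjoyed this film.", "The service was terrible."]
train_labels = torch.tensor([1, 0])

batch = tokenizer(train_texts, padding=True, truncation=True,
                  return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=train_labels)  # one illustrative training step
outputs.loss.backward()
optimizer.step()

# Zero-shot evaluation on a non-English sentence: no German data was seen
# during fine-tuning, but the shared multilingual representation transfers.
model.eval()
with torch.no_grad():
    test = tokenizer("Der Film war wunderbar.", return_tensors="pt")
    print(model(**test).logits.argmax(dim=-1))
```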

Google researchers applied the concept of training an existing model on multiple languages to their T5 model. T5 set performance records on several NLP benchmarks for language understanding and question answering, including a “near-human score” on the SuperGLUE benchmark. The new model, mT5, was trained on a multilingual version of the Common Crawl dataset, mC4, which contains data in 101 languages scraped from the web. The mT5 model is based on the Transformer architecture, contains 13B parameters, and “matches or exceeds state-of-the-art” on all XTREME tasks. Microsoft’s T-ULRv2 is also based on the Transformer architecture, with 550M parameters, and builds on a model called InfoXLM. Although Google’s paper claims that mT5 outperforms InfoXLM on XTREME, Microsoft’s new T-ULRv2 has the top rank on the public XTREME leaderboard, which was previously held by a model developed by Alibaba, and mT5 is not listed on the leaderboard at all.
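As a concrete illustration, the snippet below loads one of the released mT5 checkpoints through the Hugging Face transformers port (an assumed distribution channel; Google's reference code lives in its own repository) and computes a fine-tuning loss on a single toy text-to-text pair. Because mT5 is pre-trained only with an unsupervised objective, it is meant to be fine-tuned like this before use on a downstream task.

```python
# Minimal sketch: load a released mT5 checkpoint and compute a fine-tuning
# loss on one toy (input, target) pair in the T5 text-to-text style.
# The example pair is illustrative, not from the paper.
from transformers import MT5ForConditionalGeneration, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

inputs = tokenizer(
    "question: Wo liegt Berlin? context: Berlin liegt in Deutschland.",
    return_tensors="pt")
targets = tokenizer("in Deutschland", return_tensors="pt")

outputs = model(**inputs, labels=targets.input_ids)
print(outputs.loss)  # minimize this over a real dataset to fine-tune
```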

While Google’s and Microsoft’s models are designed to be fine-tuned for NLP tasks such as question-answering, Facebook has focused on the problem of neural machine translation (NMT). These models are likewise typically trained on publicly available data consisting of “parallel” texts in two different languages, and here too low-resource languages are a common problem. Most models therefore train on data where one of the languages is English; although the resulting models can perform “zero-shot” translation between two non-English languages, the quality of such translations is often sub-par.
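With English-centric models, translating between two non-English languages in practice often means chaining two bilingual systems and “pivoting” through English, which can compound errors at each hop. The sketch below illustrates pivoting using freely available MarianMT checkpoints from the Hugging Face hub; these specific models are assumptions chosen for illustration and are not part of Facebook's work.

```python
# Minimal sketch of "pivoting" through English with two bilingual models.
# The MarianMT checkpoints are stand-ins to illustrate the idea; they are
# not the models discussed in the article.
from transformers import MarianMTModel, MarianTokenizer

def translate(text, model_name):
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer([text], return_tensors="pt")
    generated = model.generate(**batch)
    return tokenizer.decode(generated[0], skip_special_tokens=True)

french = "La vie est belle."
english = translate(french, "Helsinki-NLP/opus-mt-fr-en")   # hop 1: fr -> en
german = translate(english, "Helsinki-NLP/opus-mt-en-de")   # hop 2: en -> de
print(english, german)
```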

To address this problem, Facebook’s researchers first collected a dataset of parallel texts by mining Common Crawl data for “sentences that could be potential translations”: they mapped sentences into a shared embedding space using an existing deep-learning model called LASER, then found pairs of sentences from different languages with similar embedding values. The team trained a Transformer model of 15.4B parameters on this data. The resulting model can translate between 100 languages without “pivoting” through English, with performance comparable to dedicated bilingual models.
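Conceptually, the mining step is a nearest-neighbor search over sentence embeddings. In the minimal sketch below, `embed` is a hypothetical placeholder for a multilingual encoder such as LASER; FAIR's actual pipeline mines billions of sentences with large-scale approximate search rather than the brute-force cosine comparison shown here.

```python
# Minimal sketch of bitext mining by embedding similarity.  `embed` is a
# hypothetical placeholder for a multilingual sentence encoder such as LASER.
import numpy as np

def embed(sentences, lang):
    """Placeholder: return one L2-normalized vector per sentence."""
    rng = np.random.default_rng(hash(lang) % 2**32)
    vecs = rng.normal(size=(len(sentences), 1024))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

english = ["The cat sat on the mat.", "It is raining today."]
german = ["Heute regnet es.", "Die Katze sass auf der Matte."]

en_vecs = embed(english, "en")
de_vecs = embed(german, "de")

# Cosine similarity between every English/German pair (vectors are normalized).
sim = en_vecs @ de_vecs.T

# Keep the best German match for each English sentence if it clears a
# threshold; the threshold would be tuned on real embeddings.
threshold = 0.8
for i, row in enumerate(sim):
    j = int(row.argmax())
    if row[j] > threshold:
        print(f"candidate pair: {english[i]!r} <-> {german[j]!r} (score={row[j]:.2f})")
```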

The code and models for both Facebook’s M2M-100 and Google’s mT5 are available on GitHub. Facebook’s scripts for downloading and cleaning their multilingual dataset are also available on GitHub, and Google’s mC4 dataset is available as part of the TensorFlow Datasets package. Microsoft’s model is not open-source, but it is available as a private preview. Microsoft’s unified language models (ULM) GitHub project contains a folder for InfoXLM, the technology behind T-ULRv2, but the folder holds only a link to the arXiv paper.
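For readers who want to experiment, a released M2M-100 checkpoint can also be loaded through the Hugging Face transformers port (an assumed distribution channel in addition to the GitHub release mentioned above), as in the sketch below, which performs a direct French-to-German translation with no English pivot.

```python
# Minimal sketch: direct French -> German translation with an M2M-100
# checkpoint, with no English pivot.  Uses the Hugging Face port of a
# released (smaller) checkpoint, which is an assumption for illustration.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "fr"                                  # source language: French
encoded = tokenizer("La vie est belle.", return_tensors="pt")
generated = model.generate(
    **encoded,
    forced_bos_token_id=tokenizer.get_lang_id("de"))       # target language: German
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```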