There are more than 6,900 languages spoken around the world, which poses an enormous challenge for natural language processing researchers. Gathering enough data to train robust models is difficult because most languages have very little data available. Fortunately, many languages share a considerable amount of underlying structure.
For example, “desk” in English and “Tisch” in German both derive from the Latin “discus”. Google has released a natural language processing benchmark called Xtreme, which covers 9 tasks across 40 languages spanning 12 language families.
Researchers at the technology giant say the benchmark can evaluate whether artificial intelligence models learn knowledge that transfers across languages, which would benefit a growing range of natural language applications. The goal of the benchmark is to promote research on multilingual learning.
The languages in Xtreme were chosen to maximize diversity, coverage of existing tasks, and availability of training data. They include some under-studied languages, such as Tamil, spoken in southern India, Sri Lanka, and Singapore; Telugu and Malayalam, spoken mainly in southern India; and the Niger-Congo languages Swahili and Yoruba, spoken in Africa.
Xtreme’s 9 tasks cover a range of basic paradigms, including sentence classification (assigning a sentence to one or more classes), structured prediction (predicting linguistic objects such as entities and parts of speech), and sentence retrieval (finding the sentences in a collection that match a query).
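To make the sentence-retrieval paradigm concrete, the sketch below frames it as nearest-neighbor search over sentence embeddings: given the embedding of a query sentence in one language, return the most similar sentence from a corpus in another language. The embeddings and sentences here are toy values invented for illustration, not output from any real multilingual model, and the function names are hypothetical.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, corpus):
    """Return the corpus sentence whose embedding is closest to the query.

    corpus: list of (sentence, embedding) pairs.
    """
    return max(corpus, key=lambda item: cosine(query_vec, item[1]))[0]

# Toy German corpus with made-up 3-dimensional "embeddings".
corpus = [
    ("Der Tisch ist groß.", [0.9, 0.1, 0.0]),
    ("Die Katze schläft.", [0.1, 0.8, 0.2]),
]

# Made-up embedding standing in for "The desk is large."
query = [0.85, 0.15, 0.05]

print(retrieve(query, corpus))  # → Der Tisch ist groß.
```

In a real cross-lingual retrieval task the embeddings would come from a multilingual encoder trained so that translations land near each other in the same vector space; the search logic, however, is exactly this nearest-neighbor comparison.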