FRMT, a benchmark for Machine Translation into language dialects

February 19, 2023

FRMT, which stands for “Few-Shot Region-Aware Machine Translation”, is a benchmark dataset created by the Google Research team to evaluate how well machine translation (MT) systems can translate into different regional language varieties using only a small number of labeled examples.

Few-shot learning is a machine learning approach in which a model has to adapt to a task from only a small number of labeled examples.

Because vocabulary and grammar can differ substantially between regional varieties of a language, current MT systems that ignore the target region often produce confusing or unnatural translations.

The FRMT benchmark aims to promote the development of MT systems that can accurately handle the different regional language varieties.

*FRMT requires an MT model to adapt its output to each specific region*

The model and test results

The FRMT dataset comprises a training set and a testing set, each containing examples for two languages in two regional varieties each: Brazilian and European Portuguese, and Mainland and Taiwan Mandarin Chinese.

FRMT aims to identify linguistic variations that are specific to certain regions.

To this end, the dataset is divided into three categories (lexical, entity and random), each consisting of human translations of sentences taken from English Wikipedia articles.

Additionally, the dataset includes a series of labeled examples for each regional variety of the language pairs. The labeled examples are available in two sets:

the few-shot set with 100 labeled examples for each regional variety of each language pair
the full-shot set containing all the available labeled examples for each regional variety
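
To make the idea concrete, here is a minimal sketch of how region-labeled examples like these could be turned into a few-shot prompt for a language model. It is only an illustration under stated assumptions: the `Exemplar` class, the `build_few_shot_prompt` function, the region tags and the prompt layout are hypothetical names chosen for this sketch, not the FRMT file format or the method used by the research team, and the sentences were written for this example rather than taken from the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exemplar:
    """One labeled example (hypothetical structure, not the FRMT schema)."""
    source: str  # English source sentence
    target: str  # human translation in the given regional variety
    region: str  # region tag, e.g. "pt-BR" or "pt-PT"


def build_few_shot_prompt(exemplars, region, source, max_shots=5):
    """Build a text prompt from exemplars of one region plus a new sentence."""
    # Keep only the exemplars for the requested region, capped at max_shots.
    shots = [ex for ex in exemplars if ex.region == region][:max_shots]
    lines = []
    for ex in shots:
        lines.append(f"English: {ex.source}")
        lines.append(f"{region}: {ex.target}")
        lines.append("")  # blank line between shots
    # End with the new source sentence and an empty slot for the model to fill.
    lines.append(f"English: {source}")
    lines.append(f"{region}:")
    return "\n".join(lines)


# Illustrative sentences written for this sketch, not taken from FRMT.
EXEMPLARS = [
    Exemplar("The bus is late.", "O ônibus está atrasado.", "pt-BR"),
    Exemplar("The bus is late.", "O autocarro está atrasado.", "pt-PT"),
    Exemplar("I am going to catch the train.", "Vou pegar o trem.", "pt-BR"),
    Exemplar("I am going to catch the train.", "Vou apanhar o comboio.", "pt-PT"),
]

prompt = build_few_shot_prompt(EXEMPLARS, "pt-PT", "Where is the bus stop?")
print(prompt)
```

Switching the `region` argument from `"pt-PT"` to `"pt-BR"` changes only which examples the model sees, which is the kind of lightweight region control the benchmark is meant to measure.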

The research team used FRMT to evaluate a handful of new and existing academic MT models that claim to generate controllable, region-specific translations from only a small amount of training data. While some of them, like PaLM, show impressive few-shot region control, the evaluated models still do not perform as well as humans, especially in Mandarin.

Conclusion, future research

The FRMT dataset addresses a key challenge in MT: the ability to accurately translate text into regional variants of a language, using a small amount of data.

This is particularly important because regional variation can substantially affect the accuracy of MT systems, especially for languages such as Chinese and Portuguese, whose regional varieties differ considerably.

A desirable objective for the near future would be language generation systems, particularly MT systems, that can serve speakers of the many language varieties spoken worldwide.