Facebook's largest open-source parallel corpus: 4.5 billion parallel sentences



According to Lei Feng's AI Technology Review, most current natural language processing methods are data-driven, and most multilingual models (especially neural machine translation systems) require parallel corpora for training. However, most parallel text exists only for a few major languages (such as English and Chinese) and is limited to specific domains.

To address this problem, Facebook released its first such dataset, WikiMatrix, in July last year. It covers all the languages on Wikipedia, including low-resource languages and dialects, and contains roughly 100 million parallel sentences spanning 1,620 language pairs.

According to Lei Feng's AI Technology Review, Facebook has recently developed and open-sourced CCMatrix, the largest parallel-corpus dataset to date, built on new methods and data sources. It contains 4.5 billion parallel sentences (nearly 50 times the size of WikiMatrix), covering 576 language pairs.

Paper: https://arxiv.org/abs/1911.04944

Dataset (open source): https://github.com/facebookresearch/laser/tree/master/tasks/ccmatrix

1. Corpus building

First, the source of the corpus. Several public multilingual parallel corpora already exist, mainly derived from international bodies such as the European Parliament and the United Nations. These are professional human translations, written in rather formal language and restricted to political topics. There are also corpora produced by volunteer translators, such as News Commentary, OpenSubtitles and the TED corpus. In 2019, Facebook's Schwenk et al. mined the text of Wikipedia to build the WikiMatrix dataset.

All of the above are limited in terms of data sources. To make the parallel corpus both large and broad in topic coverage, Facebook chose web data as the source for CCMatrix. Each month they fetch snapshots of web pages (terabytes of data) from randomly sampled URLs, covering a wide range of languages.

Number of sentences per language across ten snapshots (one of the snapshots contains only English)

Then, through preprocessing, up to 70% of duplicate content (template files, navigation menus, cookie notices and so on) is removed, and fastText (a language identifier that recognizes 176 languages) is used to detect the language of each document. Finally, a language model trained on Wikipedia is used to filter out low-quality content, keeping only documents with low perplexity. This process yields the CCNet dataset of 32.7 billion sentences.
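As an illustration of the language-identification step, here is a minimal sketch using fastText's publicly released lid.176.bin model; the file path and example sentence are placeholders, not part of the original pipeline.

```python
# Minimal sketch of language identification with fastText's lid.176.bin model
# (downloadable from the fastText website; the path here is a placeholder).
import fasttext

lid_model = fasttext.load_model("lid.176.bin")

def detect_language(text: str):
    """Return the most likely language code and its probability for one line of text."""
    # fastText's predict() expects a single line without newline characters.
    labels, probs = lid_model.predict(text.replace("\n", " "))
    return labels[0].replace("__label__", ""), float(probs[0])

print(detect_language("Ceci est une phrase en français."))  # e.g. ('fr', 0.99)
```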

The core idea behind the mining method used in this work is to first learn a multilingual semantic embedding, so that sentences with similar meanings end up close to each other in the embedding space, regardless of the language they are written in. Distance in this space can therefore be used as an indicator of whether two sentences are translations of each other.
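As a toy illustration of that idea (not the authors' actual code), the check below treats two sentence embeddings, produced by some shared multilingual encoder such as LASER, as a candidate translation pair when their cosine similarity is high; the 0.9 threshold is purely illustrative.

```python
# Toy illustration: distance in a shared multilingual embedding space as a
# translation signal. The vectors would come from a multilingual encoder
# (e.g. LASER); here they are simply assumed to be precomputed.
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def is_candidate_pair(src_vec: np.ndarray, tgt_vec: np.ndarray, threshold: float = 0.9) -> bool:
    # Sentences that are close in the shared space are likely mutual translations,
    # no matter which languages they are written in.
    return cosine_similarity(src_vec, tgt_vec) >= threshold
```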

A framework for large-scale training of multilingual sentence embeddings

However, an absolute threshold on cosine distance is not consistent across the whole embedding space, so Schwenk et al. instead use a margin criterion:
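A rough sketch of that ratio-margin score, following the description in the paper: the cosine similarity of a candidate pair is divided by the average similarity of each sentence to its k nearest neighbours in the other language. The embeddings, neighbour lists and the value of k below are assumed inputs; they are illustrative, not the authors' code.

```python
# Sketch of the ratio-margin criterion: cosine similarity of a candidate pair,
# normalized by the average similarity of each sentence to its k nearest
# neighbours in the other language. Embeddings and neighbour lists are assumed
# to be precomputed.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def margin_score(x, y, neighbours_of_x, neighbours_of_y, k=4):
    # k=4 is illustrative; the exact value is a detail of the paper.
    avg_x = sum(cosine(x, z) for z in neighbours_of_x[:k]) / (2 * k)
    avg_y = sum(cosine(y, z) for z in neighbours_of_y[:k]) / (2 * k)
    return cosine(x, y) / (avg_x + avg_y)

# A pair (x, y) is kept as a mined parallel sentence when its margin score
# exceeds a threshold (1.06 in the tables below).
```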

2. Corpus analysis

Mining parallel sentences from more than 32 billion sentences is computationally expensive. In the current version of the CCMatrix corpus, the authors therefore restrict themselves to 38 languages.

CCMatrix: amount of monolingual text and number of parallel sentences extracted (in millions) at a margin threshold of 1.06, together with BLEU scores on the TED test set. (Editor's note: these are the November figures, when the dataset contained 3.5 billion parallel sentences; the same applies below.)

CCMatrix: number of parallel sentences per language pair (in millions) at a margin threshold of 1.06. For example, there are 4.7 million Greek/Chinese sentence pairs.

3. Quality evaluation

To evaluate the quality of the dataset, Schwenk et al. also trained neural machine translation systems on it and compared the results against several public test sets.
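The comparisons below are reported as BLEU scores. As a hedged illustration (the article does not say which tool the authors used), corpus-level BLEU can be computed with the sacrebleu library:

```python
# Hedged illustration: computing corpus-level BLEU with sacrebleu.
# The sentences here are placeholders, not actual TED data.
import sacrebleu

hypotheses = ["the cat sat on the mat"]           # system outputs
references = [["the cat is sitting on the mat"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```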

1. Testing on the TED dataset

Schwenk et al. first trained a neural machine translation (NMT) system on CCMatrix and then evaluated it on the TED test set. The results are as follows:

Only 27 of them are shown here. The average BLEU score over all pairs above is 14.3, the average for pairs involving English is 26.7, and the highest BLEU score is 42.9.

Of course, the state of the art on TED is much higher than these numbers; note, however, that the NMT system tested here does not use recent techniques such as the Transformer architecture.

2. Evaluation on WMT'19

Above are the BLEU scores on the newstest18 (NT18) and newstest19 (NT19) test sets. As the results show, training on CCMatrix yields very competitive BLEU scores.

3. Evaluation on WAT'19

The figure above shows the results on the Russian/Japanese translation task of the Workshop on Asian Translation (WAT), using CCMatrix. The model is the same as before, with no Transformer and no LayerDrop. Although slightly worse than SOTA, it is at a comparable level.

4. Summary

CCMatrix enables the NMT research community to make use of far larger bilingual datasets than before, across dozens of languages. This can speed up the creation of more effective NMT models that cover more languages, especially languages with relatively limited corpus resources.

Thanks to its sheer scale and its reliance on large amounts of public text, CCMatrix may well become one of the most commonly used resources for building and evaluating systems in the NMT field.

Of course, the dataset-construction method Facebook proposed while building CCMatrix is even more worth promoting, and may help more people create large-scale datasets.

References:


Facebook's official open-source announcement: https://ai.facebook.com/blog/ccmatrix-a-billion-scale-bitext-data-set-for-training-translation-models/

Paper: https://arxiv.org/abs/1911.04944

Open-source link: https://github.com/facebookresearch/laser/tree/master/tasks/ccmatrix

Reported by Lei Feng.