MT Decider Benchmark: BLEU Differences by Language Pair

In the launch post for the MT Decider Benchmark I noted that machine translation quality, as measured by the BLEU score, can differ by as much as 54%, or more than 9 BLEU points, between the evaluated online MT services: Amazon Translate, DeepL, Google Translate, and Microsoft Translator.

But what are the differences for the individual language pairs/translation directions? Here is the chart of BLEU score differences by language pair sorted from largest to smallest score difference:

Floating bar chart showing BLEU score differences for 48 language combinations

Somewhat unsurprisingly, online MT services differ most in quality for languages that are morphologically complex and/or low-resource. One interesting observation is that for every language pair the difference between the best and the worst online MT service is at least 1.39 BLEU points. A score difference of more than one BLEU point is considered significant in academic research. It is therefore definitely worth checking whether you are using the best online MT service(s) for your language pair(s).
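For readers who want to run the same kind of comparison on their own scores, here is a minimal Python sketch of how a best-versus-worst difference per language pair could be computed and sorted from largest to smallest. The language pair labels, service names, and BLEU values are invented placeholders, not benchmark results. The function names `bleu_spread` and `rank_by_spread` are hypothetical, and expressing the relative difference as a percentage of the worst score is an assumption of this sketch, not necessarily how the figures above were derived.

```python
# Hypothetical BLEU scores per language pair and MT service.
# These values are invented for illustration; they are NOT benchmark results.
scores = {
    "en-xx": {"service_a": 30.0, "service_b": 27.0, "service_c": 24.0},
    "xx-en": {"service_a": 40.0, "service_b": 41.5, "service_c": 40.5},
}


def bleu_spread(by_service):
    """Return (absolute, relative) difference between best and worst score.

    The relative difference is a percentage of the worst score, which is
    an assumption made for this sketch.
    """
    best = max(by_service.values())
    worst = min(by_service.values())
    absolute = best - worst
    relative = absolute / worst * 100
    return absolute, relative


def rank_by_spread(scores_by_pair):
    """Sort language pairs by absolute BLEU difference, largest first."""
    rows = [(pair, *bleu_spread(s)) for pair, s in scores_by_pair.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for pair, absolute, relative in rank_by_spread(scores):
        print(f"{pair}: {absolute:.2f} BLEU points ({relative:.1f}%)")
```

Sorting happens within one list only for display purposes. The absolute differences are still computed per language pair from scores on that pair's own test set.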

In future posts I will explore how machine translation quality differences affect the post-editing usage scenario as well as scenarios where raw machine translation is published directly for low-impact/perishable content.

You should resist the temptation to compare BLEU scores across language pairs. First, the test data sets are completely different. Second, comparing BLEU scores across language pairs is not recommended because of language differences (e.g., morphological variance). See this Google Cloud documentation for guidance on how to interpret BLEU score ranges.

