
DeepL is a unicorn. How good are its machine translations?

With machine translation provider DeepL having closed its latest funding round, it is a good time to see how DeepL's machine translations rank in the MT Decider Benchmark.

We benchmarked four large online MT providers — Amazon, DeepL, Google and Microsoft — with news domain data in 23 language pairs/46 language directions. To keep the comparison apples-to-apples, we left out the MT Decider language pairs English↔Arabic and English↔Korean, which only have transcribed-speech test data.

Using the evaluation metric COMET, DeepL ranks first for 24 of the 46 language directions! Google Translate is a close second, ranking first for 19 of the 46 language directions.

This is an impressive result given the competition. 

DeepL doesn't rank the highest in the Q4/2022 MT Decider Index that reflects which provider is best across the evaluated language pairs. Google Translate does. Why?

  1. We use ranked voting to calculate our index, so it matters which provider ranks 1st, 2nd, 3rd and 4th.
  2. DeepL does not support all 23 language pairs in this comparison. It does not yet support English↔Gujarati, English↔Hindi, English↔Tamil, English↔Pashto, and English↔Kazakh. Language coverage matters for the calculation of the overall MT Decider Index.
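The exact index formula is not spelled out here, but a ranked-voting aggregation can be sketched with a simple Borda-style count: first place in a language direction earns the most points, and a provider absent from a direction earns nothing, so coverage gaps drag down the total. The provider names and rankings below are hypothetical, not actual MT Decider results.

```python
from collections import defaultdict

def ranked_voting_index(rankings, num_providers=4):
    """Aggregate per-language-direction rankings into an overall index
    using a Borda-style count: 1st place earns num_providers points,
    2nd place num_providers - 1, and so on. A provider that does not
    support a direction is absent from that ranking and earns 0."""
    scores = defaultdict(int)
    for ranking in rankings:  # one ranking per language direction
        for place, provider in enumerate(ranking):
            scores[provider] += num_providers - place
    return dict(scores)

# Hypothetical rankings for three language directions. ProviderB is
# missing from the third direction, so its limited coverage lowers
# its overall total even though it ranks first twice.
rankings = [
    ["ProviderB", "ProviderA", "ProviderC", "ProviderD"],
    ["ProviderB", "ProviderC", "ProviderA", "ProviderD"],
    ["ProviderA", "ProviderC", "ProviderD"],
]
print(ranked_voting_index(rankings))
# ProviderA: 9, ProviderB: 8, ProviderC: 8, ProviderD: 4
```

In this toy example ProviderB takes the most first places, yet ProviderA tops the overall count — the same effect that lets Google Translate lead the index while DeepL wins more individual language directions.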

Simply put, DeepL ranks very well for the language pairs it supports. 

Can DeepL continue to deliver this quality for language pairs it will add in the future? The funding will certainly help, and DeepL has a good track record as a data-centric AI company, building on data from the predecessor company Linguee. However, as we in the MT field all know, good-quality, plentiful translation data for training machine translation gets scarcer the further we move from high-resource languages. The jury is still out on which provider can compete best on data and technology.

For a detailed snapshot by language pair and a comparison of translation output, subscribe to the MT Decider Benchmark.

Footnotes:

  • The evaluation data is from the news domain, a broad domain. Therefore the evaluation result reflects how well each MT provider translates a broad range of topics. It might not reflect the translation quality on your specific content, which might be from a specialized domain/topic. We are working on the MT Decider Scorer, which you can use to evaluate your own data. Of course, if the quality differences between providers are large, as they are for many language pairs, then the general evaluation result holds up.
  • The machine translations for the Q4/2022 benchmark were captured in late September/early October 2022.