DeepL is a unicorn. How good are its machine translations?

With machine translation provider DeepL having closed its latest funding round, it is a good time to see how DeepL's machine translations rank in the MT Decider Benchmark. We benchmarked four large online MT providers, Amazon, DeepL, Google and Microsoft, with news-domain data in 23 language pairs/46 language directions (for an apples-to-apples comparison we are leaving out the MT Decider language pairs English↔Arabic and English↔Korean, which only have transcribed-speech test data). Using the evaluation metric COMET, DeepL ranks first for 24 of the 46 language directions! Google Translate is a close second, ranking first for 19 of the 46 language directions. This is an impressive result given the competition. Yet DeepL doesn't rank highest in the Q4/2022 MT Decider Index, which reflects which provider is best across the evaluated language pairs. Google Translate does. Why? We use ranked voting to calculate our index, so it matters which provider ranks 1st, 2nd, 3rd and 4th. D
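The post does not spell out the exact voting scheme, but a Borda-count-style aggregation is one common way to implement ranked voting across language directions. A minimal sketch (the provider orderings below are toy data, not benchmark results), assuming 1st place earns n points, 2nd place n-1, and so on:

```python
from collections import defaultdict

def ranked_voting_index(rankings):
    """Aggregate per-direction provider rankings into one cross-language index.

    rankings: list of lists, each ordering the providers from best
    (index 0) to worst for one language direction.
    Borda-style points: with n providers, 1st place earns n points,
    2nd place n-1, and so on down to 1 point for last place.
    """
    scores = defaultdict(int)
    for order in rankings:
        n = len(order)
        for place, provider in enumerate(order):
            scores[provider] += n - place
    # Sort providers by total points, best first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy data: three language directions, four providers
rankings = [
    ["DeepL", "Google", "Microsoft", "Amazon"],
    ["Google", "Microsoft", "DeepL", "Amazon"],
    ["Google", "DeepL", "Amazon", "Microsoft"],
]
print(ranked_voting_index(rankings))
# → [('Google', 11), ('DeepL', 9), ('Microsoft', 6), ('Amazon', 4)]
```

This illustrates how a provider that wins fewer first places but ranks consistently high can still top the index.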

MT Decider Benchmark Q4/2022 now available

The Q4/2022 edition of the MT Decider Benchmark with the latest comparison of machine translation quality for Amazon Translate, DeepL, Google Translate and Microsoft Translator is out! With the addition of English↔Korean the benchmark now covers 25 language pairs. Quote-handling differences between online MT services significantly distort BLEU score results. For the benchmark we now apply quote normalization before calculating BLEU scores. As a result, the metrics COMET and BLEU agree more often on the best service for a language direction, allowing you to confidently choose the best MT service. We kept the test data fresh by updating to 2021 data where available. We used the latest evaluation libraries, sacreBLEU 2.3.1 and COMET 1.1.3, incorporating the latest innovations and bug fixes from academic research. Instead of naming the benchmark after the quarter when the machine translations were captured, we now name it after the quarter when the benchmark reports are compiled. This is why MT Decider
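The exact normalization applied in the benchmark is not shown here, but the idea can be sketched as a simple character mapping that folds typographic quotation marks into their plain ASCII equivalents before the hypotheses and references are passed to sacreBLEU (the specific set of characters below is an assumption, not the benchmark's actual table):

```python
# Map typographic quotation marks to plain ASCII equivalents so that
# quote-style differences no longer distort BLEU comparisons.
QUOTE_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # English double quotes “ ”
    "\u201e": '"',                  # German low double quote „
    "\u00ab": '"', "\u00bb": '"',   # Guillemets « »
    "\u2018": "'", "\u2019": "'",   # Single quotes ‘ ’
    "\u201a": "'",                  # German low single quote ‚
})

def normalize_quotes(text: str) -> str:
    """Return text with typographic quotes replaced by ASCII quotes."""
    return text.translate(QUOTE_MAP)

print(normalize_quotes("\u201eHallo\u201c, sagte sie."))
# → "Hallo", sagte sie.
```

Applying the same normalization to both hypothesis and reference keeps the comparison fair regardless of which quote style each MT service emits.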

mteval - an automation library for automatic machine translation evaluation

For creating the MT Decider Benchmark, Polyglot needed a library to automate the evaluation of machine translations for many different languages, with many MT services, with multiple datasets, using multiple automatic evaluation metrics. The tools sacreBLEU and COMET provide a great basis for scoring machine translations relative to human reference translations with the most popular metrics - BLEU, TER, chrF, chrF++ and COMET. Their focus, however, is running evaluations from the command line. We needed automation from Python to run evaluations in Jupyter Notebook environments like Google Colaboratory, which offers free-to-use GPUs for COMET evaluation. We also wanted to translate the test sets with major online MT services and persist the test sets, machine translations and evaluation results. The result is the Python library mteval, with the source available under the Apache License 2.0 on GitHub. Feedback is welcome. The plan for December is to publish the MT Decider Scorer Jup
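The automation pattern described above - translate with several systems, score with several metrics, persist everything - can be sketched roughly as below. All names here are hypothetical and purely illustrative; this is not mteval's actual API (for that, see the GitHub repository), and the dummy metric stands in for real sacreBLEU/COMET scoring:

```python
import json

def evaluate(systems, test_set, metrics):
    """Translate a test set with several MT systems and score each with
    several metrics. Hypothetical sketch, not mteval's real interface.

    systems:  dict name -> callable(list[str]) -> list[str]
    test_set: dict with "source" and "reference" sentence lists
    metrics:  dict name -> callable(hyps, refs) -> float
    """
    results = {}
    for sys_name, translate in systems.items():
        hyps = translate(test_set["source"])          # call the MT service
        results[sys_name] = {
            m_name: score(hyps, test_set["reference"])  # score the output
            for m_name, score in metrics.items()
        }
    return results

# Dummy system and metric, just to show the plumbing
identity = lambda sents: sents
exact_match = lambda hyps, refs: sum(h == r for h, r in zip(hyps, refs)) / len(refs)

test_set = {"source": ["a", "b"], "reference": ["a", "c"]}
report = evaluate({"copy": identity}, test_set, {"match": exact_match})
print(json.dumps(report))  # persist results alongside the test set
# → {"copy": {"match": 0.5}}
```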

“May I quote you?” – why quotation marks are difficult for machine translation and problematic for BLEU scores

Today a post about typography - one that nevertheless has a big impact on MT quality evaluation... Why worry about quotation marks? You might wonder why you should worry about the handling of quotation marks in machine translation and its impact on automatic machine translation quality evaluation. Isn't quoting an easy task in one language that can easily be transferred in translation with some simple rules? As we will see, it isn't an easy task - wrong quotation marks can significantly distort BLEU scores and thereby the relative ranking of MT systems. Quotation marks in English Back in the days of ASCII character encoding we had one character for double quotes: " (U+0022: Quotation Mark) and another for single quotes: ' (U+0027: Apostrophe), with the latter doing double duty as an apostrophe and a single quote. With these we can quote a sentence like: "She said: 'It's getting late.'" This is quite ugly typographically, which is
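Part of what makes quoting hard for MT is that typographic conventions differ by target language, so a system cannot simply copy the source quotes. A minimal sketch of three well-known conventions (simplified - French typography, for example, normally also puts narrow spaces inside the guillemets, which is omitted here):

```python
# Opening/closing quotation marks for a few languages. An MT system
# must pick the target-language convention, not copy the source one.
QUOTE_STYLES = {
    "en": ("\u201c", "\u201d"),  # English: “…”
    "de": ("\u201e", "\u201c"),  # German:  „…“
    "fr": ("\u00ab", "\u00bb"),  # French:  «…» (spaces omitted for simplicity)
}

def quote(text: str, lang: str) -> str:
    """Wrap text in the quotation marks conventional for lang."""
    open_q, close_q = QUOTE_STYLES[lang]
    return f"{open_q}{text}{close_q}"

print(quote("Bonjour", "fr"))
# → «Bonjour»
```

Note that German reuses U+201C - the English opening quote - as its closing quote, which is exactly the kind of overlap that trips up both MT systems and naive evaluation.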

MT Decider Benchmark: BLEU Differences by Language Pair

In the launch post for the MT Decider Benchmark I noted that machine translation quality, as measured by the BLEU score, can differ by as much as 54% or more than 9 BLEU points between the evaluated online MT services Amazon Translate, DeepL, Google Translate, and Microsoft Translator. But what are the differences for the individual language pairs/translation directions? Here is the chart of BLEU score differences by language pair, sorted from largest to smallest score difference: Somewhat unsurprisingly, online MT services differ most in quality for languages that are morphologically complex and/or low-resource. One interesting observation is that for all language pairs the difference between the best and worst online MT service is at minimum 1.39 BLEU points. A score difference of over one BLEU point is considered significant in academic research. Therefore it is definitely worth checking whether you are using the best online MT service(s) for your languag
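The absolute and relative spread quoted above can be computed straightforwardly from a set of per-provider BLEU scores. A small sketch with illustrative numbers (not actual benchmark results), taking the relative difference against the worst score:

```python
def bleu_spread(scores):
    """Absolute and relative BLEU difference between best and worst system.

    scores: dict provider -> BLEU score (0-100 scale).
    The relative difference is expressed against the worst score.
    """
    best, worst = max(scores.values()), min(scores.values())
    return best - worst, (best - worst) / worst * 100

# Illustrative numbers only, not real benchmark results
scores = {"A": 26.0, "B": 22.5, "C": 17.0, "D": 20.1}
diff, rel = bleu_spread(scores)
print(f"{diff:.2f} BLEU points, {rel:.0f}% relative")
# → 9.00 BLEU points, 53% relative
```

Since differences above one BLEU point are considered significant, even the smallest observed spread of 1.39 points means the choice of provider matters for every language pair.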

The Best MT Services for Your Language Pairs

Affordable, High-quality Translations with Online Machine Translation Services Compared to a few years ago, we live in fortunate times when we want to translate from one human language into another using machines: there are many affordable online machine translation (MT) services available that deliver high-quality translations. MT Quality Matters For perishable, low-impact content, web publishers can publish machine-translated text directly in the languages they need. When high-quality human-edited translations are needed, translation providers can use machine translations as draft translations for post-editing, for increased speed and efficiency - provided that the machine translations are of sufficient quality. But MT quality varies by as much as 54% or more than 9 BLEU points(!) between different MT services for some language pairs. This is a huge difference! What are the Best MT Services for Your Language Pairs? How then can MT users determine which MT providers offer the best qualit

MT Decider Index Q2/2022

As a machine translation user you want to use the online MT service with the highest translation quality for each language pair you are translating. But evaluating ever-changing MT services across many language pairs is hard! Polyglot Technology solves this challenge by producing the MT Decider Benchmark, a vendor-independent, transparent, and up-to-date evaluation of online MT services every quarter for 24 language pairs. The MT Decider Index is a cross-language ranking distilled from the MT Decider Benchmark. This is the MT Decider Index for the second quarter of 2022: 1. Google Translate 2. Microsoft Translator 3. DeepL 4. Amazon Translate The MT Decider Benchmark Q2/2022 is now available. To learn about TAUS DeMT Evaluate, an evaluation service and report jointly created by TAUS and Polyglot Technology, the MT Decider Index and the MT Decider Benchmark, please attend this Nimdzi Live webcast on August 3rd with Anne-Maj van der Meer (TAUS), Amir Kamran (TAUS), and myself, Achim Ruopp (P