
MT Decider

What is the best machine translation service for your language combination? The MT Decider Benchmark is a transparent, vendor-independent, quarterly evaluation of online MT services.

MT Decider Benchmark plan comparison (Pro / Business / Enterprise)

  • Reports – Pro: English↔Arabic, English↔Czech, English↔Spanish, English↔Estonian, English↔Finnish, English↔French, English↔German, English↔Hungarian, English↔Italian, English↔Lithuanian, English↔Latvian, English↔Polish, English↔Romanian, English↔Gujarati, English↔Hindi, English↔Tamil, English↔Korean, English↔Japanese, English↔Chinese, English↔Kazakh, English↔Turkish, English↔Pashto, English↔Russian, Czech↔German, German↔French; Business: All Language Reports + Summary Report (for Q4/2022 available for free download); Enterprise: All Language Reports + Summary Report
  • Translation directions – Pro: 2; Business: 50; Enterprise: inquire
  • Evaluated MT services – Pro and Business: Amazon Translate, DeepL, Google Translate, Microsoft Translator; Enterprise: inquire
  • Public evaluation sets
  • BLEU scores
  • COMET scores
  • Segment analysis (translation differences)
  • MT service ranking
  • Dataset and evaluation results download (upcoming)
  • Summary report
  • Jupyter notebooks for data analysis
  • Domain-specific evaluation – TAUS DeMT™ Evaluate↗
  • Users – Pro: single; Business: single; Enterprise: multiple

Frequently Asked Questions

What quality gains can I expect by choosing MT services with the MT Decider Benchmark?

Choosing an MT service for a language combination without the information from the MT Decider Benchmark can lead to choosing the MT service with the worst quality. For some language pairs this means a quality loss of 30%-54%, or more than 9 BLEU points, compared to the best service!

On average across all language pairs, we can realize the following BLEU score gains (a short calculation after the list illustrates the arithmetic):

  • Average BLEU score gain over the worst-performing system: 4.12
  • Average BLEU percentage gain over the worst-performing system: 18%
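As a quick illustration of the arithmetic behind these figures, here is a minimal Python sketch with hypothetical BLEU scores rather than actual benchmark results:

    # Hypothetical BLEU scores for one language pair; not actual benchmark results.
    best_bleu = 34.0    # best-performing MT service
    worst_bleu = 25.0   # worst-performing MT service

    absolute_gain = best_bleu - worst_bleu            # gain in BLEU points
    relative_gain = absolute_gain / worst_bleu * 100  # percentage gain over the worst system

    print(f"Absolute BLEU gain: {absolute_gain:.2f} points")
    print(f"Relative BLEU gain: {relative_gain:.0f}%")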
Why does Polyglot Technology benchmark the MT services quarterly? Why is the MT Decider Benchmark a subscription?

Large research & development teams at the MT service suppliers constantly improve the services and add new languages. We monitored the quality of the MT service providers Amazon, Google and Microsoft for 21 language pairs over three quarters: Q4/2021, Q1/2022, and Q2/2022. The result? Over one fifth, or 21.6%, of top rankings by BLEU score changed from quarter to quarter.
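For illustration only, this change rate is the share of quarter-to-quarter transitions in which the top-ranked service for a language pair differs from the previous quarter. The language pairs, service names and rankings in this sketch are made up and are not the measured benchmark results:

    # Hypothetical top-ranked service per language pair for three consecutive quarters;
    # purely illustrative, not the measured benchmark data.
    top_ranked = {
        "English→German":   ["DeepL", "DeepL", "Google Translate"],
        "English→French":   ["Google Translate", "Google Translate", "Google Translate"],
        "English→Japanese": ["Microsoft Translator", "Google Translate", "Google Translate"],
    }

    transitions = 0
    changes = 0
    for quarters in top_ranked.values():
        for previous, current in zip(quarters, quarters[1:]):
            transitions += 1
            if previous != current:
                changes += 1

    print(f"Top ranking changed in {changes / transitions:.1%} of quarter-to-quarter transitions")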

With this rate of change in the top rankings it makes sense to conduct the MT Decider Benchmark quarterly. Subscribers receive the most up-to-date rankings, so that they can take advantage of the current best service for their language pair(s) at any given time.

Can I get the current reports without subscribing? No. However, you can subscribe with the quarterly option, download the current reports and then unsubscribe again.
What does transparent, vendor-independent evaluation mean?

Several machine translation integrators publish machine translation quality reports. They perform these evaluations with proprietary data and largely proprietary methods. For the most part, one can only take advantage of the report results by signing up for the expensive services offered by the integrators.

The MT Decider Benchmark, on the other hand, is evaluated with publicly available data using published methods and metrics. Any tool (CAT, TMS, CMS) that supports the evaluated, low-priced online machine translation services can be configured with the MT Decider Benchmark results.

Why have you chosen automatic evaluation of machine translation over human evaluation?

For determining machine translation quality, human evaluation is best. However, conducting human evaluation across many languages is challenging and expensive. Repeating human evaluation periodically in a consistent way is even more challenging.

Automatic evaluation uses human reference translations to compute numeric machine translation quality scores. The key to meaningful automatic evaluation is high-quality test data: the human reference translations are the human component of automatic evaluation. For the MT Decider Benchmark we use high-quality human reference translations from the Conference on Machine Translation (WMT) and the International Conference on Spoken Language Translation (IWSLT), two premier academic conferences on machine translation.
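As a minimal sketch of what automatic evaluation looks like in practice, a corpus-level BLEU score can be computed with the sacrebleu package (Post, 2018); the hypothesis and reference sentences below are placeholders, not benchmark data:

    # Minimal corpus-level BLEU sketch with the sacrebleu package (Post, 2018).
    # The hypothesis and reference sentences are placeholders, not benchmark data.
    import sacrebleu

    hypotheses = ["The cat sits on the mat.", "He plays football on Sundays."]
    references = [["The cat is sitting on the mat.", "He plays soccer on Sundays."]]

    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    print(f"BLEU = {bleu.score:.2f}")

COMET scores (Rei et al., 2020) are obtained in a similar way with Unbabel's COMET toolkit, a neural metric that additionally takes the source sentences into account.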

Why are you not evaluating with domain-specific test data?

What constitutes a domain is hard to define: texts vary in topic, modality, style, register (politeness level), etc. Evaluating machine translation with domain-specific test sets that do not exactly match the text that you intend to translate yields misleading evaluation results. We offer evaluation with your specific data as part of the Enterprise tier.

What is important to us is to expose the differences in quality that exist between online MT services with generic, high-quality test data (see the question "What quality gains can I expect by choosing MT services with the MT Decider Benchmark?" above).

Choosing the highest-quality engine for a language combination based on generic test data is already 90% of the solution, yet many users cannot implement it because they lack the information. The MT Decider Benchmark delivers this information.

Are you offering a trial? No, we do not offer a trial. We do offer a sample report for the translation direction French→English for the second quarter of 2022.
Can I just choose one MT service that works well for all language combinations? There is no MT service that works best for all language pairs: according to our study, the best MT service varies across language combinations. We did distill our evaluation results into a consensus ranking across language pairs, the MT Decider Index. Still, choosing MT services by language pair with the MT Decider Benchmark leads to much better quality at virtually the same machine translation cost.
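One simple way to form such a consensus ranking is to average each service's per-language-pair rank. The sketch below is purely illustrative, with made-up ranks, and is not necessarily how the MT Decider Index is computed:

    # Purely illustrative consensus ranking by average rank across language pairs;
    # the ranks are made up and this is not necessarily the MT Decider Index method.
    from statistics import mean

    # Hypothetical per-language-pair ranks (1 = best).
    ranks = {
        "DeepL":                {"en-de": 1, "en-fr": 2, "en-ja": 3},
        "Google Translate":     {"en-de": 2, "en-fr": 1, "en-ja": 1},
        "Amazon Translate":     {"en-de": 3, "en-fr": 4, "en-ja": 2},
        "Microsoft Translator": {"en-de": 4, "en-fr": 3, "en-ja": 4},
    }

    consensus = sorted(ranks, key=lambda service: mean(ranks[service].values()))
    for position, service in enumerate(consensus, start=1):
        print(f"{position}. {service} (average rank {mean(ranks[service].values()):.2f})")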

References

Barrault, L., Biesialska, M., Bojar, O., Costa-Jussà, M. R., Federmann, C., Graham, Y., . . . Zampieri, M. (2020). Findings of the 2020 Conference on Machine Translation (WMT20). In Proceedings of the Fifth Conference on Machine Translation (pp. 1-55). Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/2020.wmt-1.1

Barrault, L., Bojar, O., Costa-Jussà, M. R., Federmann, C., Fishel, M., Graham, Y., . . . Zampieri, M. (2019). Findings of the 2019 Conference on Machine Translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1) (pp. 1-61). Florence, Italy: Association for Computational Linguistics. Retrieved from http://www.aclweb.org/anthology/W19-5301

Bojar, O., Buck, C., Callison-Burch, C., Federmann, C., Haddow, B., Koehn, P., . . . Specia, L. (2013). Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation (pp. 1-44). Sofia, Bulgaria: Association for Computational Linguistics. Retrieved from http://www.aclweb.org/anthology/W13-2201

Bojar, O., Buck, C., Federmann, C., Haddow, B., Koehn, P., Leveling, J., . . . Tamchyna, A. (2014). Findings of the 2014 Workshop on Statistical Machine Translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation (pp. 12-58). Baltimore, Maryland, USA: Association for Computational Linguistics. Retrieved from http://www.aclweb.org/anthology/W/W14/W14-3302

Bojar, O., Chatterjee, R., Federmann, C., Graham, Y., Haddow, B., Huang, S., . . . Turchi, M. (2017). Findings of the 2017 Conference on Machine Translation (WMT17). In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers (pp. 169-214). Copenhagen, Denmark: Association for Computational Linguistics. Retrieved from http://www.aclweb.org/anthology/W17-4717

Bojar, O., Chatterjee, R., Federmann, C., Graham, Y., Haddow, B., Huck, M., . . . Zampieri, M. (2016). Findings of the 2016 Conference on Machine Translation. In Proceedings of the First Conference on Machine Translation (pp. 131-198). Berlin, Germany: Association for Computational Linguistics. Retrieved from http://www.aclweb.org/anthology/W/W16/W16-2301

Bojar, O., Chatterjee, R., Federmann, C., Haddow, B., Huck, M., Hokamp, C., . . . Turchi, M. (2015). Findings of the 2015 Workshop on Statistical Machine Translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation (pp. 1-46). Lisbon, Portugal: Association for Computational Linguistics. Retrieved from http://aclweb.org/anthology/W15-3001

Bojar, O., Federmann, C., Fishel, M., Graham, Y., Haddow, B., Huck, M., . . . Monz, C. (2018). Findings of the 2018 Conference on Machine Translation (WMT18). In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers (pp. 272-307). Brussels, Belgium: Association for Computational Linguistics. Retrieved from http://www.aclweb.org/anthology/W18-6401

Callison-Burch, C., Koehn, P., Monz, C., & Schroeder, J. (2009). Findings of the 2009 Workshop on Statistical Machine Translation. In Proceedings of the Fourth Workshop on Statistical Machine Translation (pp. 1-28). Athens, Greece: Association for Computational Linguistics. Retrieved from http://www.aclweb.org/anthology/W/W09/W09-0401

Post, M. (2018). A Call for Clarity in Reporting BLEU Scores. In Proceedings of the Third Conference on Machine Translation: Research Papers (pp. 186-191). Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/W18-6319

Rei, R., Stewart, C., Farinha, A. C., & Lavie, A. (2020). COMET: A Neural Framework for MT Evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 2685-2702). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.213