
mteval - an automation library for automatic machine translation evaluation

To create the MT Decider Benchmark, Polyglot needed a library that automates the evaluation of machine translations across many languages, many MT services, multiple datasets and multiple automatic evaluation metrics.

The tools sacreBLEU and COMET provide a great basis for scoring machine translations relative to human reference translations with the most popular metrics - BLEU, TER, chrF, chrF++ and COMET. Running evaluations from the command line is their focus. We needed automation from Python to run evaluations in Jupyter Notebook environments like Google Colaboratory, which offers free-to-use GPUs for COMET evaluation. We also wanted to translate the test sets with major online MT services and persist test sets, machine translations and the evaluation results.
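To make the automation idea concrete, here is a minimal sketch of such an evaluation grid: every combination of dataset, language pair, MT service and metric gets a score, and the results are serialized so they can be persisted. This is not mteval's API. The names `evaluate_grid` and `char_ngram_f1` and the sample data are hypothetical, and the toy character n-gram metric, averaged per sentence, only stands in for the real sacreBLEU and COMET metrics so that the example runs with the standard library alone.

```python
import json
from collections import Counter


def char_ngram_f1(hypothesis: str, reference: str, n: int = 3) -> float:
    """Toy character n-gram F1 score (a stand-in, NOT sacreBLEU's chrF)."""
    def ngrams(text: str) -> Counter:
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    hyp, ref = ngrams(hypothesis), ngrams(reference)
    overlap = sum((hyp & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def evaluate_grid(references, translations, metrics):
    """Score every (dataset, language pair, service, metric) combination.

    references:   {(dataset, lang_pair): [reference sentences]}
    translations: {(dataset, lang_pair, service): [machine translations]}
    metrics:      {metric name: function(hypothesis, reference) -> float}
    """
    results = []
    for (dataset, lang_pair, service), hyps in sorted(translations.items()):
        refs = references[(dataset, lang_pair)]
        if not hyps or len(hyps) != len(refs):
            raise ValueError(f"Bad segment count for {dataset}/{lang_pair}/{service}")
        for name, metric in metrics.items():
            # Simplification: average of sentence-level scores, not a corpus-level score.
            scores = [metric(h, r) for h, r in zip(hyps, refs)]
            results.append({
                "dataset": dataset,
                "lang_pair": lang_pair,
                "service": service,
                "metric": name,
                "score": sum(scores) / len(scores),
            })
    return results


# Hypothetical sample data: one test set, one language pair, two MT services.
references = {("demo-set", "en-de"): ["Das ist ein Test."]}
translations = {
    ("demo-set", "en-de", "service_a"): ["Das ist ein Test."],
    ("demo-set", "en-de", "service_b"): ["Dies ist ein Versuch."],
}

results = evaluate_grid(references, translations, {"char3-f1": char_ngram_f1})
print(json.dumps(results, indent=2))  # JSON text that could be written to disk
```

In a real setup the metric functions would wrap sacreBLEU and COMET, and the translations would come from the MT services' APIs instead of a hard-coded dictionary.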

The result is the Python library mteval, with the source available on GitHub under the Apache License 2.0. Feedback is welcome.

The plan for December is to publish the MT Decider Scorer Jupyter Notebook, which uses mteval and lets you score your own datasets with privacy preserved: on your own server, in Google Colaboratory or in any other Jupyter Notebook environment with GPU support.