
Bilingual Evaluation Understudy? A Practical Guide to MT Quality Evaluation with BLEU


Automatic Metrics for Machine Translation 

Whether you publish machine translations directly or use them as post-editing input, evaluating their quality is essential. In this blog post we evaluate the quality of the translations in isolation against available human reference translations, rather than in a larger context such as business metrics.
Judging the quality of machine translations is best done by humans. However, this is slow, expensive and not easily repeatable each time an MT system is updated. Automatic metrics provide a good way to judge the quality of MT output repeatedly. BLEU (Bilingual Evaluation Understudy) has been the prevalent automatic metric for close to two decades now and will likely remain so, at least until document-level evaluation metrics become established (see Läubli, Sennrich and Volk: Has Machine Translation Achieved Human Parity? A Case for Document-level Evaluation).
If you have been reading about machine translation evaluation, you might have heard that BLEU is inadequate and misleading. A lot of this can be attributed to the metric not being well understood or not being used as intended. To understand the metric better we could go back and read the original paper BLEU: a Method for Automatic Evaluation of Machine Translation by Papineni, Roukos, Ward and Zhu, but Kaggle's Rachael Tatman has recently written a great introduction to BLEU and its limitations. In conclusion she writes:

I would urge you to use BLEU if and only if:
  1. You’re doing machine translation AND
  2. You’re evaluating across an entire corpus AND
  3. You know the limitations of the metric and you’re prepared to accept them.
And hey, this is exactly what we want to do. We intend to calculate BLEU on a well-defined and well-designed test set for a single language pair with various MT systems to determine the one with the best output quality for our purpose.
For post-editing, BLEU is often supplemented with metrics like TER, edit distance and the number of zero-edit segments. See this presentation for a broader discussion of MT evaluation.

Evaluation Environment

Before we go into details, a word on the evaluation setup: the tools used and the command line examples assume a Linux environment, specifically Ubuntu 18.04 LTS. We assume you, or someone you work with, know how to set up the tools in this environment. If you are a dedicated Windows user, there are certainly ways to get the evaluation tool chain working there too, but the easiest way might just be to run Ubuntu in the Windows Subsystem for Linux. Mac users can set up and use the tools in their Unix-based shell.

Getting the Source in Plain Text

Exporting Text from TMX Files

For our test set we start off with translation memories in the form of TMX files. To feed the source side of the segments into MT systems for translation, we need to convert the TMX files into UTF-8 encoded plain text. Some GUI tools like the Heartsome TMX Editor support this, as do command line tools like tmxt and tmx2txt.pl. Make sure to check the file character encoding after export and change it if necessary.
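If you would rather not install a dedicated TMX tool, a generic XML processor can do the extraction too. Here is a minimal sketch using xmlstarlet; it assumes a TMX file whose translation unit variants carry xml:lang="en" and xml:lang="de" attributes (older TMX versions use a plain lang attribute) and a hypothetical file name, so adjust both to match your data:
# Extract English/German segment pairs into tab-separated plain text,
# one pair per line (assumes xml:lang attributes and en/de language codes)
xmlstarlet sel -t -m "//tu" \
  -v "tuv[@xml:lang='en']/seg" -o $'\t' \
  -v "tuv[@xml:lang='de']/seg" -n \
  translation_memory.tmx > source_and_target_separated_by_tab.txt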
From our extraction tool we get either a single plain text file with source and target separated by a tab, or two separate plain text files: one for the source and one for the target. Here is how we can convert one format into the other:
cut -f1 source_and_target_separated_by_tab.txt > source.txt
cut -f2 source_and_target_separated_by_tab.txt > target.txt

paste source.txt target.txt > source_and_target_separated_by_tab.txt

If we extract multiple TMX files we can concatenate the separate text files using cat:
cat source1.txt source2.txt source3.txt > source.txt
cat target1.txt target2.txt target3.txt > target.txt

Random Sampling 

If we are using our entire TMX data for evaluation (1000-2000 segments, as in the aligned Zoellick speeches), there is no need to do random sampling. If, on the other hand, we have a larger corpus, we might want to sample at random to create a smaller test set. Using the tab-separated version of the large corpus, we can use the Python tool/library subsample to sample 1500 segments at random:
subsample -n 1500 large_source_and_target_separated_by_tab.txt > source_and_target_separated_by_tab.txt
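If installing subsample is not an option, shuf from GNU coreutils can do the same job (pass --random-source with a fixed file if you need the sample to be reproducible). A quick sketch:
# Randomly sample 1500 segment pairs from the large corpus
shuf -n 1500 large_source_and_target_separated_by_tab.txt > source_and_target_separated_by_tab.txt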

Getting Machine Translations 

We now need to get the translations of the plain text source sentences; the key for evaluation is that the translations are also output line by line. Virtually all MT providers offer such text translation, and we will use two major ones: Bing Microsoft Translator and Google Translate. As their web pages show, the maximum length of text we can translate at once is 5000 characters. We can use the script split_test.py to split up larger test files on line boundaries.
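If split_test.py is not at hand, GNU split can approximate it. A sketch assuming the 5000-character limit; note that split -C counts bytes rather than characters, so for text with many non-ASCII characters a lower limit is the safer choice:
# Split source.txt into numbered chunks of at most 5000 bytes each,
# always breaking at line boundaries
split -C 5000 -d source.txt source.part.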

But copying and pasting portions of test files soon gets tedious, and also impractical if we want to run the evaluation over and over. For Microsoft Translator there is the Microsoft Document Translator tool, which works great for translating plain text files. For Google Translate the options are less clear: I tried using the translate-shell tool, but had some trouble translating a typically sized test set. One workaround is to upload the test file to Google Translator Toolkit, but of course this isn't an easily automatable command line tool.
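To automate this fully we can skip the GUI tools and call the Translator Text API directly. The sketch below sends a single request against the v3 API and assumes an Azure subscription key in the hypothetical TRANSLATOR_KEY environment variable (some subscriptions additionally require an Ocp-Apim-Subscription-Region header); translating a whole test file then comes down to looping over or batching its lines and pulling the translations out of the JSON responses, e.g. with jq:
# Translate one sentence into German via the Microsoft Translator Text API v3
curl -s -X POST "https://api.cognitive.microsofttranslator.com/translate?api-version=3.0&to=de" \
  -H "Ocp-Apim-Subscription-Key: $TRANSLATOR_KEY" \
  -H "Content-Type: application/json" \
  -d '[{"Text": "The global economy is at a turning point."}]'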

Calculating the BLEU Score 

After all this preparation it is quite easy to finally calculate the BLEU score on our Zoellick speeches test set using the multi-bleu-detok.perl script from the Moses MT toolkit:
perl multi-bleu-detok.perl target.txt < mt_microsoft.de.txt
BLEU = 28.56, 59.7/34.6/22.6/15.3 (BP=0.982, ratio=0.982, hyp_len=25937, ref_len=26407)

perl multi-bleu-detok.perl target.txt < mt_google.de.txt
BLEU = 28.53, 60.7/35.5/23.4/15.9 (BP=0.954, ratio=0.955, hyp_len=25225, ref_len=26407)


The first number in the output, the most important one, is the overall BLEU score, followed by the individual precisions for 1-, 2-, 3- and 4-word n-grams. The numbers in parentheses are the brevity penalty (see the articles/papers on BLEU linked above) and length statistics comparing the hypothesis (the MT output) with the reference (the human translation).
On our test data Microsoft Translator performs slightly better than Google Translate. What is curious is that even though Google Translate scores higher on all 1-, 2-, 3- and 4-word n-gram precisions, it gets dinged overall by the brevity penalty because its output is shorter relative to the reference. Only comparing the translations directly would allow us to determine what exactly is going on there.
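As a sanity check, the overall score can be reconstructed from the other numbers in the output: BLEU is the geometric mean of the four n-gram precisions multiplied by the brevity penalty, which in turn is exp(1 - ref_len/hyp_len) whenever the hypothesis is shorter than the reference. A quick awk one-liner gets us back to the reported Google Translate score, up to the rounding of the printed precisions:
# Brevity penalty times the geometric mean of the n-gram precisions;
# prints roughly 28.5 for the Google Translate output above
awk 'BEGIN { bp = exp(1 - 26407/25225);
             print bp * exp((log(60.7)+log(35.5)+log(23.4)+log(15.9))/4) }'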

Upper-/Lowercasing

By default multi-bleu-detok.perl calculates BLEU on the text upper-/lowercased as it is. For most use cases and for most languages this is what we want for a real-world evaluation. By specifying the -lc option, both reference and hypothesis are lowercased before calculating BLEU. This is useful, for example, to track down issues like vocabulary coverage in MT systems.
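Running the lowercased variant only requires adding the flag:
# Lowercase reference and MT output before scoring
perl multi-bleu-detok.perl -lc target.txt < mt_microsoft.de.txt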

Tokenization

When reporting BLEU scores publicly, particularly on commonly used test sets, consistent tokenization (separating words from each other and punctuation from words) is very important for the comparability of the scores, as Matt Post demonstrates in his paper A Call for Clarity in Reporting BLEU Scores. To encourage the publishing of comparable BLEU scores, Matt made the sacreBLEU tool available. Thanks Matt! Compared to multi-bleu-detok.perl, sacreBLEU can also download commonly used test sets automatically, and it supports some additional languages like Chinese. With our test set the scores are identical with both tools. Yay!
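For reference, here is a sketch of the equivalent sacreBLEU call with a plain text reference file, assuming the hypothesis is piped in on stdin (the exact output format depends on the installed version):
# Install sacreBLEU and score the Microsoft output against our reference
pip install sacrebleu
sacrebleu target.txt < mt_microsoft.de.txt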
When evaluating languages that are not supported by these tools, particularly non-European languages and languages like Japanese or Thai, where words aren't separated by spaces, we need to do the tokenization ourselves and then use multi-bleu.perl to calculate BLEU on the tokenized text.
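For Japanese, for example, one possible pipeline is sketched below, assuming MeCab is installed and using hypothetical file names for the Japanese reference and MT output; any consistent word segmenter will do, as long as reference and hypothesis are tokenized the same way:
# Segment reference and MT output into space-separated tokens with MeCab,
# then score the tokenized text with multi-bleu.perl
mecab -Owakati < target.ja.txt > target.ja.tok.txt
mecab -Owakati < mt_google.ja.txt > mt_google.ja.tok.txt
perl multi-bleu.perl target.ja.tok.txt < mt_google.ja.tok.txt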

Conclusion

BLEU evaluation provides the basis to consistently and repeatedly:
  1. Compare the output quality of different MT systems to each other
  2. Track the trajectory of output quality of a single MT system over time
... for our specific test set. Depending on the use case, we may still have to supplement this with additional automatic metrics, business metrics and human evaluation, but BLEU provides the transparency to use MT confidently.