
“May I quote you?” – why quotation marks are difficult for machine translation and problematic for BLEU scores

Today, a post about typography that nevertheless has a big impact on MT quality evaluation ...

Why worry about quotation marks? 

You might wonder why you should worry about the handling of quotation marks in machine translation and its impact on automatic machine translation quality evaluation. Isn't quoting an easy task in one language that can be transferred in translation with a few simple rules? As we will see, it isn't: wrong quotation marks can significantly distort BLEU scores and thereby the relative ranking of MT systems.

Quotation marks in English

Back in the days of character encoding with ASCII we had one character for double quotes: " (U+0022: Quotation Mark) and another character for single quotes: ' (U+0027: Apostrophe), with the latter doing double duty as an apostrophe and single quote. With these we can quote a sentence like:

"She said: 'It's getting late.'"

This is quite ugly looking typographically, which is why the characters “ (U+201C, Left Double Quotation Mark), ” (U+201D, Right Double Quotation Mark), ‘ (U+2018, Left Single Quotation Mark) and ’ (U+2019, Right Single Quotation Mark) were introduced:

“She said: ‘It’s getting late.’” 

This looks better. Here the Right Single Quotation Mark does double duty as an apostrophe. Text editing programs like Microsoft Word automatically replace the simpler, typed ASCII quotation characters with the typographically correct quotation characters.

Changing quotation marks in translation 

Let's translate this quote into French. French uses something called guillemets for quotation marks and we follow the rules here to apply them:

« Elle a dit : ‹ Il se fait tard. › »

Following the general rules for French, the sentence includes non-breaking spaces between the guillemets and the first/last character of the enclosed sentence. Some publications, like Le Figaro, one of the main newspapers in France, omit the spaces. There the sentence would be «Elle a dit: ‹Il se fait tard.›».

Because we don't have the contraction “it’s” in French, we have no apostrophe in the French translation.

We can start to see how quotation marks are a little bit complex for machine translation to handle.

Quotation marks in machine translation

So, let's translate the English sentence “She said: ‘It’s getting late.’” with different online MT systems:

Amazon Translate: « Elle m'a dit : 'Il se fait tard. '»

DeepL:  "Elle a dit : 'Il se fait tard'."

Google Translate: "Elle a dit: 'Il se fait tard.'"

Microsoft Translator: « Elle a dit : 'Il se fait tard'. »

Textually the translations of the different MT services are virtually identical, with the small addition of a reflexive pronoun by Amazon Translate (more equivalent to the English sentence “She told me: ‘It’s getting late.’”).

But the “translation” of quotation marks varies a lot: Amazon Translate and Microsoft make a good effort to use guillemets for the double quotes surrounding the sentence, but fail to use single guillemets for the embedded quotation. DeepL and Google Translate don't even try to use the language conventions and normalize the Right/Left-versions of quotation marks to " (U+0022: Quotation Mark) and ' (U+0027: Apostrophe).

For understanding the meaning of the sentence, inappropriate quotation marks have little impact. But for publication quality a human has to correct the errors, which means a lot of tedious editing work!

French and English here are just examples - many languages, particularly European ones, have their own quirky rules for quotation marks.

The problem with quotation marks in BLEU evaluation

The quasi-standard metric for machine translation evaluation is BLEU, measuring translation similarity by calculating the degree of overlap between machine translations and human reference translations. The quasi-standard tool for measuring BLEU is sacreBLEU (thanks to Matt Post and other contributors for making this available!). It uses a tokenizer to separate quotation marks and punctuation from the regular words. To handle language-specific quotation marks like French guillemets we have to use the non-default intl tokenizer.
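To see why the tokenizer matters, here is a toy sketch in Python (our illustration, not sacreBLEU's actual tokenizer): whitespace-only splitting leaves guillemets glued to the adjacent words, so even correctly translated words fail to match the reference, while a punctuation-aware tokenizer recovers the word matches.

```python
import re

def naive_tokenize(text):
    # Whitespace-only splitting: quotation marks stay glued to words.
    return text.split()

def punct_tokenize(text):
    # Separate punctuation (including guillemets) from words,
    # loosely mimicking what a punctuation-aware tokenizer does.
    return re.findall(r"\w+|[^\w\s]", text)

hyp = "\"Elle a dit: 'Il se fait tard.'\""  # Google-style ASCII quotes
ref = "«Elle a dit: ‹Il se fait tard.›»"    # reference with guillemets

# Whitespace splitting: '"Elle' and '«Elle' never match, although
# the word itself is translated correctly.
print(sorted(set(naive_tokenize(hyp)) & set(naive_tokenize(ref))))

# Punctuation-aware splitting: every content word matches; only the
# quote characters themselves differ.
print(sorted(set(punct_tokenize(hyp)) & set(punct_tokenize(ref))))
```

With the naive tokenizer only four tokens overlap; with punctuation split off, all nine words and punctuation marks match and only the quote characters count against the hypothesis.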

Provided our human reference translations in the evaluation data contain the correct quotation marks, we get an accurate score representing how well the MT translates words, quotation marks and punctuation. But do we really want such a mixed score? If an MT system, such as Google or DeepL in the example above, gets quotation marks wrong and the data contains many quotes, a bad BLEU score mainly indicates that the system is bad at handling quotation marks, not whether the translations are linguistically good.

It gets worse when quotation marks in the evaluation data are inconsistent: the English↔French test data published for WMT15 contains 74 pairs of guillemets («[...]»), 32 pairs of double quotation marks ("[...]") and some instances of doubled apostrophes marking quotations (''[...]''). BLEU scores become even harder to interpret.

Normalizing quotation marks

The most straightforward solution to evaluate linguistic translation quality alone is to normalize quotation marks in the machine translations and reference translations before evaluation:

  1. Replace the double quote characters „ “ ” « » with "
  2. Replace the single quote characters ‚ ‘ ’ ‹ › with '

This eliminates differences in quotation mark handling and inconsistent marking in the evaluation data, so the BLEU score more accurately reflects the linguistic quality of the content word translations.
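The two replacement steps above can be sketched as a small Python helper (an illustrative sketch, applied to both MT output and references before scoring):

```python
# Step 1: typographic double quotes and guillemets map to " (U+0022).
# Step 2: typographic single quotes and single guillemets map to ' (U+0027).
QUOTE_TABLE = str.maketrans({
    "„": '"', "“": '"', "”": '"', "«": '"', "»": '"',
    "‚": "'", "‘": "'", "’": "'", "‹": "'", "›": "'",
})

def normalize_quotes(text):
    return text.translate(QUOTE_TABLE)

# MT output and reference collapse to the same quoting style:
print(normalize_quotes("«Elle a dit: ‹Il se fait tard.›»"))
# → "Elle a dit: 'Il se fait tard.'"
```

Note that this also maps the Right Single Quotation Mark used as an apostrophe (“It’s”) to ', which is what we want: both sides of the comparison end up with the same apostrophe character.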

Language-specific quotation mark handling does not get evaluated anymore. The score only reflects whether quotes are present or not. We need a separate metric measuring the quality of the language-specific quotation mark handling - this is a topic for a future blog post.

Impact across language pairs

The MT Decider Benchmark uses publicly available evaluation data from the Conference on Machine Translation (WMT) and the International Conference on Spoken Language Translation (IWSLT). We evaluated what impact quotation mark normalization has across 48 language pairs (evaluation data from Q2/2022) and four online MT services:

  • 20% of MT service rankings changed, i.e. rankings were distorted before quotation mark normalization
  • 32% of all scores differed by more than one BLEU point
  • BLEU scores changed by up to 4.81 points

In order for the MT Decider Benchmark to reflect linguistic translation quality of content words rather than variations in quotation marks we will use quotation mark normalization for the upcoming quarterly evaluations.

Impact of quotation mark handling on COMET score

We were also curious how differences in quotation mark handling and quotation variations in the evaluation data affect our other evaluation metric, COMET. It turns out they affect it much less than BLEU: only 3% of MT system ranks measured with COMET change with quotation mark normalization!

Why is this the case? We suggest that this is because:

  • COMET uses SentencePiece tokenization, which does not special-case language-specific quotation marks
  • COMET evaluates the semantics of tokens rather than their surface forms, e.g. for quotation marks the machine learning model learned that “ in English text is semantically equivalent to « in French text and that “ and " are semantically equivalent to some degree

Summary

The BLEU metric proves to be very sensitive to formatting artifacts in quotations. This might provide additional motivation to phase it out in favor of more robust metrics like COMET.

As a next step in this area we will pursue a quality metric for quotation marks separate from translation quality metrics, most likely a simple precision/recall metric. It would also be good if MT providers documented whether quotation marks for a language (as a source language, target language or both) are supported in their services. As this isn't the case today, we are planning to document this in the MT Decider Benchmark.
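As a hypothetical sketch of what such a precision/recall metric could look like (the function name and design are our assumption, not an established metric), one could count quotation-mark characters on each side and match them irrespective of position:

```python
from collections import Counter

QUOTE_CHARS = set("\"'„“”«»‚‘’‹›")

def quote_precision_recall(hypothesis, reference):
    # Count each quotation-mark character on both sides and match the
    # counts irrespective of position (a deliberately simple sketch).
    hyp = Counter(c for c in hypothesis if c in QUOTE_CHARS)
    ref = Counter(c for c in reference if c in QUOTE_CHARS)
    matched = sum((hyp & ref).values())
    precision = matched / max(sum(hyp.values()), 1)
    recall = matched / max(sum(ref.values()), 1)
    return precision, recall

# DeepL-style ASCII quotes vs. a reference with proper guillemets:
print(quote_precision_recall(
    "\"Elle a dit : 'Il se fait tard'.\"",
    "« Elle a dit : ‹ Il se fait tard. › »",
))  # → (0.0, 0.0): the words are right, the quotation marks are not
```

A real metric would need to handle apostrophes inside words (as in "m'a"), which this sketch would wrongly count as quotation marks.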

With MT Decider Benchmark we make the most accurate, comprehensive, open and actionable quality information about MT services available, so that you can confidently choose which MT to use for your content and use case.
