
Translating Zoellick - Aligning PDF Files to Create Evaluation Data for MT

Imagine we are a translation provider for an organization like the World Bank. We have heard about the new technology of neural machine translation and would like to find out how well it works for the materials we have to translate. We do have access to some translated PDF files in English and German from years past, but unfortunately no access to a translation memory.
To evaluate machine translation objectively with automated metrics like BLEU, we need about 1000 to 2000 aligned, high-quality translated sentences that are representative of the material we intend to translate.
In this blog post we create such evaluation data from the PDFs by extracting the text and manually aligning the sentences. In the next blog post we use this evaluation data to evaluate the translation quality of different MT systems using automated metrics.

Downloading World Bank Open Knowledge Repository PDF Files

Most of the World Bank Open Knowledge Repository is generously licensed under Creative Commons licenses, so we can download, transform, use and publish content as long as we comply with the licenses. From the OKR category "German PDFs Available" we download the German PDFs and, where available, their English source documents.
A first look with a PDF reader like Adobe Acrobat Reader DC shows that some of these PDFs are scanned paper documents, like a 1980 address by former World Bank president Robert McNamara (whose life is chronicled in the documentary The Fog of War).

Only some of the scanned documents contain text extracted with OCR and quite a bit of the OCRed text contains errors.
We have more luck with the formatting of the more recent PDFs from this century and can extract parallel data from them with the help of alignment tools. As long as we don't have to translate historical documents, the more recent documents are more relevant for our evaluation anyway.

Aligning PDF Files

There are quite a few options for aligning translated pairs of PDF files. In no particular order: commercial tools like Terminotix Document Alignment Tools, SDL Trados Studio and Stingray Document Aligner, and open source tools like OmegaT, LF Aligner and bitext2tmx.
To manually align the data without spending too much time we need three things from the alignment tool:
  1. High-quality extraction of text and text flow from the PDF
  2. High-quality automated sentence pre-alignment to reduce the manual alignment work
  3. A user interface that allows us to efficiently review and correct the automatic pre-alignment
Each of these three factors has a big impact on the work required. Even if an alignment tool is great in all three, manual alignment is most likely only cost-efficient for small data sets, such as the MT evaluation data we are creating. For the alignment of the larger corpora needed for MT training we need fully automated sentence alignment, which we'll cover in a future post. Fortunately we can use the data set we produce here to evaluate automated sentence alignment!
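As a rough illustration of how a manually aligned data set could serve as a gold standard for automated alignment, here is a minimal Python sketch. It assumes a representation that is not part of any of the tools above: each alignment link is a pair of tuples holding sentence indices on the source and target side. The function name `alignment_scores` and the sample links are hypothetical, and exact-match scoring of links is only one of several possible ways to score an aligner.

```python
def alignment_scores(gold, predicted):
    """Exact-match precision, recall and F1 of predicted links against gold links.

    Each link is a pair (source_indices, target_indices), where both parts
    are tuples of sentence positions. This representation is an assumption
    made for this sketch.
    """
    gold_set, pred_set = set(gold), set(predicted)
    correct = len(gold_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold links from manual alignment: 1:1, 1:2 and 2:1.
gold = [((0,), (0,)), ((1,), (1, 2)), ((2, 3), (3,))]
# Hypothetical aligner output that misses the second half of the 1:2 link.
predicted = [((0,), (0,)), ((1,), (1,)), ((2, 3), (3,))]

precision, recall, f1 = alignment_scores(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is strict: a link that is only partly right counts as wrong, which fits the goal of measuring how much manual correction an aligner would still leave us.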

LF Aligner

We use LF Aligner, which is quite good at automated pre-alignment and has a good alignment UI. It falters a bit in the text extraction of our PDF files. To compensate for this, we first convert the PDF files to plain text using Microsoft Word and then align the plain text files using LF Aligner 4.1.

For manual alignment we need to be familiar with both languages. Not to the degree of a translator, but we need to be able to identify translations to correct the pre-alignment by merging, splitting, moving around and deleting sentences.
In most cases we will align one sentence in the source language with another sentence in the target language. Sometimes translators decide to translate one sentence with several sentences or even merge multiple sentences into one sentence in the target language - so it is possible to get 1:M and N:1 alignments. With current MT systems it is quite unlikely that an MT system would do the same, but our evaluation set should reflect what the MT system should do, not what we expect from current technology. We'll discuss the impact of this in the upcoming blog post on evaluation.
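To make the 1:1, 1:M and N:1 cases concrete, here is a small Python sketch of how aligned segments could be represented and counted by type. The class `AlignedSegment`, the function `alignment_profile` and the sentence pairs are all invented for illustration; they are not taken from the speeches or from the output format of any alignment tool.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class AlignedSegment:
    """One aligned row: source sentences paired with their translation."""
    source: List[str]
    target: List[str]

    @property
    def kind(self) -> str:
        # "1:1" for the usual case, "1:2" or "2:1" when sentences
        # were split or merged by the translator.
        return f"{len(self.source)}:{len(self.target)}"


def alignment_profile(segments):
    """Count how often each alignment type occurs."""
    return Counter(seg.kind for seg in segments)


# Invented example sentences, one per alignment type.
segments = [
    AlignedSegment(["Thank you."], ["Vielen Dank."]),
    AlignedSegment(
        ["We must act now, and we must act together."],
        ["Wir müssen jetzt handeln.", "Und wir müssen gemeinsam handeln."],
    ),
    AlignedSegment(
        ["Growth matters.", "So does inclusion."],
        ["Wachstum ist wichtig, Inklusion ebenso."],
    ),
]

profile = alignment_profile(segments)
for kind, count in sorted(profile.items()):
    print(kind, count)
```

A profile like this gives a quick view of how many segments in the evaluation set deviate from plain 1:1 alignment.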

We align four different speeches (documents 29633, 29634, 29639 and 29758) by former World Bank president Robert Zoellick.

OmegaT Aligner

We align the most recently added speech by Robert Zoellick, document 31126, using OmegaT 4.1.5_04_Beta.

OmegaT does very well on the PDF text extraction and pre-alignment. Its alignment UI is very similar to that of LF Aligner and very functional. The only drawback of the pre-alignment is that the alignment time increases exponentially with the document size. Luckily, our documents aren't too long.

Statistics and TMX Files

It took us a little over two hours to align the 1330 segments, which is the size we are looking for in an MT evaluation set. An experienced aligner can certainly work much faster. The output of the alignment is a set of TMX files that can be downloaded here.
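For readers who want to work with the TMX files programmatically, the following Python sketch reads segment pairs from TMX content using only the standard library. The sample document, the function name `read_tmx_pairs` and the language codes are assumptions for illustration; the actual files may use different language codes (for example EN-US or DE-DE) or header values, so the sketch compares only the primary language part.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# A tiny invented TMX document; real files would be loaded with ET.parse(path).
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Thank you.</seg></tuv>
      <tuv xml:lang="de"><seg>Vielen Dank.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="EN-US"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="DE-DE"><seg>Guten Morgen.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx_pairs(root, src_lang="en", tgt_lang="de"):
    """Return (source, target) text pairs for every translation unit."""
    pairs = []
    for tu in root.iter("tu"):
        texts = {}
        for tuv in tu.findall("tuv"):
            # Older TMX versions use a plain "lang" attribute instead.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Keep only the primary language subtag, e.g. "en" from "en-us".
                texts[lang.split("-")[0]] = "".join(seg.itertext()).strip()
        if src_lang in texts and tgt_lang in texts:
            pairs.append((texts[src_lang], texts[tgt_lang]))
    return pairs


pairs = read_tmx_pairs(ET.fromstring(SAMPLE_TMX))
print(len(pairs), "segment pairs")
```

Calling `len(pairs)` on each downloaded file and summing the results is a simple way to check the segment count of the whole evaluation set.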


The data set we created has many uses:
  • evaluation set for MT engines - we'll use it for this in the next blog post
  • evaluation set to judge quality of automated sentence alignment - topic of a future blog post
  • translation memory - as such quite small
  • terminology extraction
  • ... 
So even though we have to put a bit of work into aligning documents, we can gain a lot of learning and efficiency from it over time. We just need to make sure to select and align the documents most relevant in the short and long term.