Parallel Corpora of U.S. State Department Press Releases

The United States State Department translates its press releases, depending on the content, into Arabic, Bengali, Spanish, French, Hausa, Hindi, Indonesian, Khmer, Lao, Malay, Persian, Portuguese, Russian, Swahili, Telugu, Thai, Tagalog, Turkish, Urdu, Vietnamese and Chinese. We commend the department for providing the translations in the public domain and believe that having these translations available as parallel corpora can greatly benefit machine translation, thereby furthering the understanding between nations. For the period from February 2017 to October 2020, we collected the translations for languages that have a significant number of translated press releases. We automatically sentence-aligned the texts, cleaned and deduplicated the sentences, and randomized the sentence order. We now offer the resulting parallel corpora on the newly launched TAUS Data Marketplace. The following chart shows the English word counts for the available language pairs: The corpora with larg
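As a rough illustration of the cleaning, deduplication and randomization steps mentioned above, here is a minimal Python sketch. It assumes the sentence alignment has already produced `(source, target)` string pairs; the function name `clean_dedupe_shuffle` and the cleaning rules (whitespace normalization, dropping pairs with an empty side) are hypothetical and not necessarily what was used for these corpora.

```python
import random


def clean_dedupe_shuffle(pairs, seed=0):
    """Clean, deduplicate and shuffle aligned (source, target) sentence pairs.

    Assumption: 'cleaning' here only means normalizing whitespace and
    dropping pairs where either side is empty.
    """
    seen = set()
    result = []
    for src, tgt in pairs:
        # Collapse runs of whitespace and trim both sides.
        src = " ".join(src.split())
        tgt = " ".join(tgt.split())
        if not src or not tgt:
            continue  # skip pairs with an empty side
        key = (src, tgt)
        if key in seen:
            continue  # skip exact duplicates, keeping the first occurrence
        seen.add(key)
        result.append(key)
    # A seeded generator makes the randomized order reproducible.
    random.Random(seed).shuffle(result)
    return result


pairs = [
    ("Hello  world.", "Hola mundo."),
    ("Hello world.", "Hola mundo."),   # duplicate after whitespace cleanup
    ("", "Sin fuente."),               # empty source side
    ("Good morning.", "Buenos días."),
]
print(clean_dedupe_shuffle(pairs))
```

Shuffling at the sentence level means the original documents can no longer be reconstructed from the corpus, which is a common reason for randomizing the order.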

COVID-19 Translation Data from the CDC and Lessons for Custom MT

Most articles on the coronavirus are illustrated with some picture of the virus or people wearing masks, but I thought people are weary enough as it is. Instead, I'm starting out with a nice picture of a nearby State Park that I took this summer. Nevertheless, this post is about COVID-19 and what we in the language industry/machine translation community can do to address this pandemic. I believe there is still time to make a difference.

COVID-19 Translation Challenges

Back in May, Gretchen McCulloch wrote an article titled "Covid-19 Is History’s Biggest Translation Challenge" for WIRED magazine. She described how communicating health information in all languages is key to addressing this crisis, and the translation challenges around this. Medical translation usually relies on human translation to achieve the needed accuracy. Post-edited machine translation can help translators meet the accuracy and speed needed for this crisis. McCulloch mentions that peopl

An MT Journey in 11 Easy Steps

When I talk to people who are new to machine translation (MT), I am often asked how they can determine whether MT can really help them with their translation needs, be it MT for post-editing or raw MT. This got me thinking about which steps are essential for choosing an MT solution that satisfies these translation needs from a linguistic quality and business perspective. I came up with this workflow that can serve as a guide through your MT journey. I will describe each of the steps in detail below. The blue steps are required, while the green ones are optional, depending on the quality goals and use case.

1. Choosing an MT Project

This first step in the MT Journey is less defined than the ones after. I believe that at the beginning of the journey it helps to broaden the perspective to clearly identify the destination of the journey. Deep learning is the foundational technology behind what is called artificial intelligence (AI) these days. Deep learning is what powers neural ma

Healthcare MT with Google AutoML Translation

To help people make the right choices for their healthcare, the U.S. Centers for Medicare & Medicaid Services provide the site as an information hub. The Centers try to reach many language communities, which is especially important for an aging population. With Spanish being the native language of roughly 13% of the US population, the most effort is put into a Spanish version of the site. Reading information on a website is only the first step to getting health insurance: there are navigators, assisters, partners, agents and brokers who assist in signing up for insurance. Wouldn't it be great if these people had a customized MT system available to communicate with people who need insurance? Such an MT system could also provide initial translations for English content that is not (yet) translated, including post-editing drafts for translators translating healthcare/health insurance information for this site or oth

Bilingual Evaluation Understudy? A Practical Guide to MT Quality Evaluation with BLEU

Automatic Metrics for Machine Translation

Whether you publish machine translations directly or use them as post-editing input, evaluating their quality is essential. In this blog post we evaluate the quality of the translations in isolation against available human reference translations, rather than evaluating them in a larger context, e.g. in the context of business metrics. Judging the quality of machine translations is best done by humans. However, this is slow, expensive and not easily repeatable each time an MT system is updated. Automatic metrics provide a good way to repeatedly judge the quality of MT output. BLEU (Bilingual Evaluation Understudy) has been the prevalent automatic metric for close to two decades and will likely remain so, at least until document-level evaluation metrics get established (see Läubli, Sennrich and Volk: Has Machine Translation Achieved Human Parity? A Case for Document-level Evaluation). If you have been reading about machine translation ev
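To make the idea of scoring MT output against human reference translations concrete, here is a simplified Python sketch of corpus-level BLEU. It is an illustration under stated assumptions, not the evaluation setup of the post: it uses plain whitespace tokenization, a single reference per sentence, n-grams up to length 4, and no smoothing. The function names `ngrams` and `corpus_bleu` are chosen for this example. Scores from this sketch are not comparable to scores from standardized tools, which apply their own tokenization.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (0-100) with one reference per sentence."""
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes output that is shorter than the reference.
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_precision)


refs = ["the cat sat on the mat"]
print(corpus_bleu(["the cat sat on the mat"], refs))        # identical output
print(corpus_bleu(["the cat sat on the mat today"], refs))  # one extra word
```

Counts are accumulated over the whole test set before the precisions are computed, which is why BLEU is usually reported at the corpus level rather than averaged over sentences.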

Translating Zoellick - Aligning PDF Files to Create Evaluation Data for MT

Imagine we are a translation provider for an organization like the World Bank. We have heard about the new technology of neural machine translation and would like to try out how well it works for the materials we have to translate. We do have access to some translated PDF files in English and German from years past, but unfortunately no access to a translation memory. To evaluate machine translation objectively with automated metrics like BLEU, we need about 1000 to 2000 aligned, high-quality translated sentences that are representative of the material we intend to translate. In this blog post we create such evaluation data from the PDFs by extracting the text and manually aligning the sentences. In the next blog post we use this evaluation data to evaluate the translation quality of different MT systems using automated metrics.

Downloading World Bank Open Knowledge Repository PDF Files

Most of the World Bank Open Knowledge Repository is generously licensed under Creative C