COVID-19 Translation Data from the CDC and Lessons for Custom MT


Most articles on the coronavirus are illustrated with a picture of the virus or of people wearing masks, but I thought people were weary enough as it is. Instead, I'm starting out with a nice picture of a nearby State Park that I took this summer.

Nevertheless, this post is about COVID-19 and what we in the language industry and machine translation community can do to address this pandemic. I believe there is still time to make a difference.

COVID-19 Translation Challenges

Back in May, Gretchen McCulloch wrote an article titled "Covid-19 Is History’s Biggest Translation Challenge" for WIRED magazine. She described how communicating health information in every language is key to addressing this crisis, and the translation challenges this raises. Medical translation usually relies on human translation to achieve the needed accuracy. Post-edited machine translation can help translators meet the accuracy and speed this crisis demands.

McCulloch mentions that people seek out information beyond what their health authorities publish, which motivates the need for high quality raw, customized machine translation (MT) to gist information in other languages.

She points out some unaddressed issues around MT:

  • Even the largest online MT systems support a little over one hundred languages. This leaves out many languages including ones that have millions of speakers, e.g. Javanese in Indonesia where it isn't the official language. And then there are hundreds of smaller languages with fewer speakers that also have the need to get COVID-19 information in their language.
  • While high-resource languages are supported by MT, there are some issues around terminology and register with this type of content. McCulloch reports that one online MT system translated "Wash your hands." to Japanese in a way that a parent would talk to a child. This might discourage people from washing their hands rather than encourage them. Customized MT could help.

The right language resources are the basis to address these challenges. While there have been great advancements in using monolingual data for MT training and transfer learning across language pairs, parallel data is still very much needed to reach a sufficient level of quality.

COVID-19 Parallel Data

The translation community united in initiatives to provide parallel data - notably the TAUS Corona Crisis Corpora and the Translation Initiative for COVID-19 (aka TICO-19). We found yet another source for COVID-19 parallel data - the translations of the COVID-19 information from the United States Centers for Disease Control and Prevention (CDC).

CDC COVID-19 health information for the general public is translated from English to Spanish, Vietnamese, Korean and Chinese. We crawled these translations and make them available for free download. We checked version 7 of ParaCrawl, a broader web crawl for parallel data, and it does not contain any of these translations, so focused web crawling was necessary to capture this high quality set of relevant data.

The data set is not as large as the TAUS Corona Crisis Corpus, which was selected from existing translation data using a COVID-19 specific query corpus. It is larger than the TICO-19 translation benchmark, which was translated from scratch by human translators. We make COVID-19 specific parallel data available for the first time for Korean and Vietnamese and expand the data available for Spanish and Chinese.

The data, available in TMX and TSV (tab-separated text) formats, can be used straight away as a translation memory in computer-aided translation (CAT) tools to aid human translators. It is also useful for bilingual terminology extraction and as evaluation and customization data for machine translation. The TMX file contains the segments in original document order, which is important for developing novel MT systems that use document context and for evaluating translations in document context.
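For readers who want to work with the files outside a CAT tool, here is a minimal sketch of reading segment pairs from a TMX file in document order and writing them out as tab-separated text. It uses only the Python standard library. The sample TMX content, the function names read_tmx_pairs and pairs_to_tsv, and the language codes are illustrative assumptions; the sample sentences are made up and not taken from the CDC data, and the actual files may carry region-specific language codes or inline markup that needs more care.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this namespaced key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up two-segment TMX for illustration; not taken from the CDC data.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Wash your hands often.</seg></tuv>
      <tuv xml:lang="es"><seg>Lávese las manos con frecuencia.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Stay home if you are sick.</seg></tuv>
      <tuv xml:lang="es"><seg>Quédese en casa si está enfermo.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx_pairs(tmx_text, src_lang="en", tgt_lang="es"):
    """Return (source, target) segment pairs in the order they appear."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            # Older TMX files use a plain "lang" attribute instead of xml:lang.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Drop a region suffix such as "es-419"; keep text inside inline tags.
                segs[lang.split("-")[0]] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


def pairs_to_tsv(pairs):
    """Serialize pairs as one 'source<TAB>target' line per segment."""
    def clean(text):
        # Collapse tabs and newlines so each pair stays on one TSV line.
        return " ".join(text.split())
    return "\n".join(f"{clean(s)}\t{clean(t)}" for s, t in pairs)


if __name__ == "__main__":
    print(pairs_to_tsv(read_tmx_pairs(SAMPLE_TMX)))
```

Because the pairs come back in file order, the same list can feed document-level experiments as well as sentence-level training.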

COVID-19 Custom MT

For the NMT customization use case we wanted to evaluate how useful the dataset is for creating custom COVID-19 MT models. So we used the data to train Google AutoML Translation custom MT systems for English→Spanish and English→Vietnamese (holding out 2000 segments for evaluation and development).
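A hold-out split of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the exact procedure we ran: the function name split_holdout, the fixed seed, the random (rather than document-level) sampling, and the removal of exact duplicate pairs before splitting are all choices made for the example. The default sizes assume the 2000 held-out segments are divided evenly into 1000 for development and 1000 for evaluation.

```python
import random


def split_holdout(pairs, n_dev=1000, n_test=1000, seed=13):
    """Split (source, target) pairs into train, dev and test lists.

    Exact duplicate pairs are removed first so that a held-out segment
    cannot also appear verbatim in the training data.
    """
    unique = list(dict.fromkeys(pairs))  # keeps first occurrence, preserves order
    if len(unique) <= n_dev + n_test:
        raise ValueError("not enough unique segments to hold out dev and test sets")

    shuffled = list(unique)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split

    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test


if __name__ == "__main__":
    # Hypothetical placeholder data standing in for the crawled segment pairs.
    data = [(f"source {i}", f"target {i}") for i in range(5000)]
    train, dev, test = split_holdout(data)
    print(len(train), len(dev), len(test))
```

Random sampling at the segment level is the simplest option; splitting by whole documents instead would reduce overlap between near-identical sentences in training and evaluation data.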

Using the held-out evaluation data (1000 segments) as a test set, the results are very encouraging: for English→Spanish we obtained a BLEU score gain of +11.27 and for English→Vietnamese a BLEU gain of +4.1. These are substantial increases over the Google NMT baseline, and the systems will be beneficial immediately for post-editing translations of CDC COVID-19 content and for gisting new or revised English CDC COVID-19 content.

TICO-19's translation benchmark is available in English→Spanish (LatAm), so we had the opportunity to evaluate our custom engine with this data. Unfortunately, the custom engine performed slightly worse on this data than the Google NMT baseline. It performed worse on all subsets of the TICO-19 benchmark: medical research content, medical conversations and Wikimedia publications on COVID-19.

We believe the reason is that the medical domain for COVID-19 is very wide, as is already evident in the composition of the TICO-19 benchmark, whereas we used only a narrow set of customization data. Further, individual translators and translation teams vary in how they translate terminology and which style they use. This is usually defined in a translation toolkit available as guidance to the translators in a project.

Lessons for MT Customization

This points to a larger lesson: at the current state of neural machine translation technology, it is difficult to beat well-trained, optimized, general-domain baseline systems, particularly for resource-rich languages like Spanish. Transformer-based models already resolve translation ambiguities at the sentence level very well and are robust to input from different domains.

Custom MT systems fill a specific need where the input is from a relatively small, well-defined domain and the desired translations are also well-defined from a terminology and style perspective. This is the case with many translation projects, but the cost of the customization has to be weighed against the expected use and benefit. We are not the only ones who have observed this: the MT broker Intento conducted an extensive study of both standard and custom MT for COVID-19 content (based on the TAUS Corona Crisis Corpora and a different evaluation methodology) and came to very similar conclusions.

The suitability of a project for customization and the needed customization data amounts are highly project-specific, so MT suppliers that offer customization should not leave expensive experimentation up to the user without more guidance. Expensive, unsuccessful experimentation for one project can lead to frustration and the conclusion that custom MT is not useful for any project. 

It would be very useful if the MT supplier could, based on a development set supplied by the user, provide an upfront analysis of whether the project is suitable for customization and how much customization data would be needed. This would make costs more predictable. As an added benefit, creating a development set makes it possible to involve human translators and other stakeholders in the MT customization early on, likely improving acceptance of the MT solution.

We hope that our data will prove useful for the dissemination of COVID-19 related information and that the lessons we learned in MT customization help MT projects beyond this crisis.