
Parallel Corpora of U.S. State Department Press Releases

Depending on the content, the United States State Department translates its press releases into Arabic, Bengali, Spanish, French, Hausa, Hindi, Indonesian, Khmer, Lao, Malay, Persian, Portuguese, Russian, Swahili, Telugu, Thai, Tagalog, Turkish, Urdu, Vietnamese and Chinese.

We commend the department for providing the translations in the public domain and believe that having these translations available as parallel corpora can greatly benefit machine translation, thereby furthering the understanding between nations. 

For the period from February 2017 to October 2020, we collected the translations for languages that have a significant number of translated press releases. We automatically sentence-aligned the texts, cleaned and deduplicated the sentences, and randomized the sentence order. We now offer the resulting parallel corpora on the newly launched TAUS Data Marketplace. The following chart shows the English word counts for the available language pairs:

The corpora with larger word counts can be used for MT training and customization, while the ones with smaller counts are useful as evaluation sets. The State Department press releases are written diplomatic statements and declarations in response to current events, as well as transcriptions of news conferences. These corpora are therefore likely to be most useful for the news translation domain.

Unlike many publicly available corpora, which are compiled from European translations, the language variants here are from the Americas: US English, Latin American Spanish and Brazilian Portuguese. Also remarkable are the sizable corpora for Arabic, Russian, Hindi and Urdu.

So isn't this data already available in broader web crawls for parallel data like ParaCrawl? We checked and the English-Spanish ParaCrawl v7 data set contains none of the translations.
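The post does not describe how the ParaCrawl check was carried out. As a rough illustration only, an overlap check of this kind could be sketched as an exact-match comparison of sentence pairs after light normalization. The function names `normalize` and `overlap_fraction` are hypothetical, and the normalization choices (NFKC, lowercasing, whitespace collapsing) are assumptions rather than the method actually used.

```python
import unicodedata


def normalize(segment: str) -> str:
    """Apply NFKC normalization, lowercase, and collapse whitespace (assumed scheme)."""
    text = unicodedata.normalize("NFKC", segment).lower()
    return " ".join(text.split())


def overlap_fraction(candidate_pairs, reference_pairs) -> float:
    """Return the share of candidate (source, target) pairs found in the reference corpus."""
    reference = {(normalize(s), normalize(t)) for s, t in reference_pairs}
    candidates = [(normalize(s), normalize(t)) for s, t in candidate_pairs]
    if not candidates:
        return 0.0
    hits = sum(1 for pair in candidates if pair in reference)
    return hits / len(candidates)
```

A result of 0.0 for the press-release pairs against a web-crawled corpus would correspond to the "contains none of the translations" finding. Exact matching is a conservative choice: it misses near-duplicates that differ in punctuation or tokenization.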

While publication on the TAUS Data Marketplace requires deduplication and segment randomization, we can also provide the corpora with preserved document context for experiments and evaluations of novel document-level MT. Please email if you are interested.
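For readers who want to apply the same kind of preparation to their own aligned data, the cleaning, deduplication and randomization steps mentioned above can be sketched as follows. This is a minimal illustration, not the pipeline used for these corpora: the function name `clean_dedupe_shuffle`, the whitespace-only cleaning rule and the fixed seed are all assumptions.

```python
import random


def clean_dedupe_shuffle(pairs, seed=0):
    """Clean, deduplicate and shuffle aligned (source, target) sentence pairs.

    Assumed cleaning rule: collapse whitespace and drop pairs with an empty side.
    """
    seen = set()
    unique = []
    for src, tgt in pairs:
        src = " ".join(src.split())
        tgt = " ".join(tgt.split())
        if not src or not tgt:
            continue  # drop pairs where either side is empty
        key = (src, tgt)
        if key in seen:
            continue  # drop exact duplicate pairs
        seen.add(key)
        unique.append(key)
    # Shuffle with a seeded generator so the order is reproducible
    random.Random(seed).shuffle(unique)
    return unique
```

Shuffling removes the document context, which is why the document-level variant of the corpora has to be provided separately.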