An MT Journey in 11 Easy Steps

When I talk to people who are new to machine translation (MT), I often get the question of how they can determine whether MT can really help them with their translation needs, be it MT for post-editing or raw MT. This got me thinking about which steps are essential to choosing an MT solution that satisfies those translation needs from both a linguistic quality and a business perspective.

I came up with this workflow that can serve as a guide through your MT journey. I will describe each of the steps in detail below. The blue steps are required, while the green ones are optional, depending on the quality goals and use case.

1. Choosing an MT Project

This first step in the MT Journey is less defined than the ones that follow. I believe that at the beginning of the journey it helps to broaden the perspective in order to clearly identify the journey's destination.

Deep learning is the foundational technology behind what is called artificial intelligence (AI) these days, and it is what powers neural machine translation. While applying deep learning is technically complex, its economics, according to the authors of the book "Prediction Machines", are quite simple. By virtue of being at the University of Toronto, one of the birthplaces of the recent deep learning revolution, the authors Ajay Agrawal, Joshua Gans and Avi Goldfarb had a front-row seat to research the economics of this new technology from its infancy.

Economically speaking the effect of applying deep learning is to lower the cost and increase the accuracy of prediction. Prediction is part of virtually any business process and lowering its cost increases productivity. Lower priced, more accurate prediction also opens up new business opportunities.

Is the Biggest Opportunity Project Automation?

What does this mean for the language industry? The language industry is well known for its complex production processes and there is the ever-present demand to cut costs. Lower priced prediction in the form of AI can optimize and reshape these processes with great effect. So the most efficient application of AI might not be neural machine translation at all, but the redesign of business processes.

This blog post, however, is about how to apply AI to the central task of the language business - translation. Where and how should we apply neural machine translation (NMT)?

Machine Translation Post-Editing (MTPE)

If we are planning to use MTPE in a human translation process, the "Prediction Machines" authors suggest a straightforward approach: we identify where in our processes we translate and imagine how each process would change in the presence of high-quality, human-level or near-human-level machine translation. Weighting this by language pair difficulty, difficulty of the source material and available training data allows us to identify the process steps and projects with the highest return on investment (ROI) for using MTPE.

Before going all in on NMT we should choose a pilot project from the list of projects promising the highest ROIs. We certainly would not want to introduce NMT in the highest-risk projects right away. Most important for a pilot is that we maximize learning - we want to gain experience in using, evaluating and integrating MT in a project that is representative of the translation projects the organization usually works on: representative in size, source material, language pair(s), style, etc. If your organization works on diverse projects, you might want to choose more than one pilot project. Again, maximizing learning is the key goal.

Business Opportunities with Raw MT

Conducting an NMT pilot within existing translation processes will give us information on how to optimize those processes using this technology, but it won't allow us to explore new strategic opportunities.

For this we need to explore process innovation and redesign, which is specific to the industry and context in which the translations are used. Increased accuracy often makes raw machine translation usable in scenarios where human translation was cost-prohibitive - and this frequently drives additional demand for human translation. Here are a few use cases that can serve as inspiration:
  • Use of raw machine translation for customer support – either in published support content or support conversations with customers
  • Use of raw machine translation for user generated content (e.g. hotel reviews)
  • Use of raw machine translation for publishing fast-changing content; high-demand content can later be post-edited by professional post-editors or the crowd
  • Use of raw machine translations to discover content of interest, e.g. in discovery for lawsuits. The most relevant documents will be translated by professional translators – demand for human translation that only materialized through machine translation
  • Use of raw machine translation in internal communication
  • Use of raw machine translation for cross-border eCommerce (eBay)

2. Choosing Candidate MT Solutions

Choosing which MT solutions to evaluate should be driven mainly by the use case - not just the pilot project's use case, but the MT use cases envisioned for the organization on a certain time horizon. Whether that time horizon is near or far is an important decision the organization needs to make.

The main reason to select candidate systems based on the use case is that it can already narrow the list down to a few candidates. Recently, MT solutions have also emerged that are specific to use cases like eDiscovery or customer support.

The large online MT solutions, in their generic, non-customized form should always be part of the candidate list. First, because they are easily accessible and cheap. More specialized systems can be benchmarked against them. Second, and more importantly, there is still a lot of innovation going on in the NMT area and therefore the online systems, backed by large research teams, evolve quickly and might be the best choice.

Other criteria for choosing candidate MT solutions are:
  • Feature set
  • Language support
  • MT customization options (more on this later)
  • On-site/dedicated cloud vs shared cloud solutions
  • Cost calculated on projected volumes
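The last criterion can be roughed out early with a back-of-the-envelope calculation. Below is a minimal sketch assuming a hypothetical pay-per-character pricing model; the function name, volumes and price are illustrative and not taken from any particular vendor:

```python
def projected_cost(chars_per_month: int, price_per_million_chars: float,
                   months: int = 12) -> float:
    """Rough spend for a pay-per-character MT API over a planning horizon."""
    return chars_per_month * months * price_per_million_chars / 1_000_000

# e.g. 5M characters/month at a hypothetical $20 per million characters
annual = projected_cost(5_000_000, 20.0)
```

Even a rough number like this makes it easier to compare a metered cloud API against a flat-rate on-site license.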

3. Defining Metrics

Automatic Metrics

Once you have chosen the pilot project, it is important to decide on project-relevant metrics - measure what matters. Pretty much a given in any project involving machine translation is measuring the BLEU score of the MT systems involved. This score allows us to compare different MT systems to each other quickly and to track the progress of quality improvement efforts for the project automatically. See this previous blog post about the BLEU score. We will discuss how to create or choose a test set to measure BLEU and other metrics in the next section.
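To make the metric concrete, here is a minimal, stdlib-only sketch of the corpus-level BLEU computation (uniform n-gram weights, a single reference per sentence, naive whitespace tokenization). For real evaluations you would use a standard implementation such as sacreBLEU, which also standardizes tokenization so that scores are comparable across papers and vendors:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU: clipped n-gram precision plus brevity penalty."""
    clipped = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            clipped[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if hyp_len == 0 or min(clipped) == 0:
        return 0.0   # no match at some n-gram order
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)
```

A perfect match scores 100; translations shorter than the reference are penalized by the brevity penalty.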

Post-Editing Automatic Metrics

For machine translation post-editing (MTPE) there are some additional automated metrics that people often use:
  • TER - translation edit rate: A metric that reflects the required post-editing effort better than BLEU.
  • Edit-distance: Edit-distance directly reflects the amount of character edits performed during post-editing. In most cases the character-based Levenshtein distance is used.
  • Zero-edit segments: What percentage of machine-translated segments can be used as is, without any editing at all? Of course, post-editors still need to review these segments, similar to 100% translation memory matches that need to be reviewed. What is different from 100% TM matches is that, unless we also use MT quality estimation, there is no indication of which segments could be zero-edit segments.
One thing to keep in mind with the post-editing metrics is that one needs to choose the expected post-editing quality level ("fit for purpose" or "high-quality human translation and revision").
Once post-editing metrics are set, publishing them to the post-editors is a double-edged sword. On the one hand, it can help them improve and adapt their editing to the desired outcome; on the other hand, the metrics could also be gamed, especially if achieving certain metrics means monetary rewards. E.g. if one pays by the number of edits, there will likely be more edits - and likely unnecessary ones.
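The character-based metrics from the list above are straightforward to compute. The sketch below implements the Levenshtein distance and derives an average edit distance and a zero-edit rate from it; the function names are my own:

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution/match
        prev = curr
    return prev[-1]

def post_editing_stats(mt_segments, post_edited):
    """Average character edit distance and zero-edit rate over a job."""
    distances = [levenshtein(mt, pe) for mt, pe in zip(mt_segments, post_edited)]
    return {
        "avg_edit_distance": sum(distances) / len(distances),
        "zero_edit_rate": sum(d == 0 for d in distances) / len(distances),
    }
```

TER works on word-level edits and additionally allows block shifts, so it is usually computed with a dedicated tool rather than a snippet like this.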

Human Evaluation Metrics

If the project allows for it, automated metrics should always be supplemented with human evaluation. Traditional human evaluation metrics are accuracy and fluency, as well as ranking the output of different MT systems. Lately direct assessment has started to replace ranking.
More detailed information can be obtained by annotating errors in the MT output using the MQM/DQF Error Annotation Standard.
It is important to ensure that the results of human evaluation provide actionable metrics for MT implementers. If they aren't actionable, we will only be able to tell whether an MT system is good or bad, but not how to improve it.

Business Success Metrics

For using raw machine translation the metrics are specific to the business use case. Here are some examples:
  • When using raw MT for customer support articles, we can survey customers on whether the translated articles resolved their problem. This can be supplemented with support-article popularity metrics, to determine which articles should be human-translated, and with other customer satisfaction metrics.
  • When using raw MT for eCommerce the additional revenue generated can be tracked. Last year the National Bureau of Economic Research published research showing that machine translation can positively impact cross-border trade.
The business metrics that should be used for raw machine translation are as diverse as the use cases. There is a whole genre of business literature on the topic of business metrics. I can recommend the book "The Lean Startup" by Eric Ries for a good introduction.

4. Creating Test Data

To be able to compute automatic evaluation metrics and conduct human evaluation and error annotation, we need a test set. A test set is made up of about 1,000 to 2,000 aligned, high-quality translated sentences that are representative of the material we want to translate.
The importance of creating an appropriate test set for the pilot project is often underappreciated. The test set not only defines the expected translation quality, it is also a way to communicate style, register and terminology to all people involved in an unambiguous way.
We might be lucky and already have translated data that represents what we want to translate. Or we might decide to hold out test data from the training data. If we do this, we need to ensure that the training data is fairly homogeneous, that we truly sample the test data randomly, and that there are no duplicates between the test and training data.
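A minimal sketch of such a hold-out split, deduplicating on the source side before sampling so that no test sentence also appears in the training data (function name and split size are illustrative):

```python
import random

def hold_out_test_set(pairs, test_size=2000, seed=42):
    """Split aligned (source, target) pairs into training and test data.

    Deduplicates on the source sentence first, so no test sentence
    also appears in the training data, then samples randomly.
    """
    seen, unique = set(), []
    for src, tgt in pairs:
        if src not in seen:
            seen.add(src)
            unique.append((src, tgt))
    random.Random(seed).shuffle(unique)   # reproducible random sample
    return unique[test_size:], unique[:test_size]   # (train, test)
```

Fixing the seed keeps the split reproducible, which matters when several MT vendors are evaluated against the same test set.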
Often the best way is to translate a test set from scratch. There is always the danger that we lull ourselves into believing that an inadequate test set is appropriate and, by using it in the pilot project, choose the wrong MT solution, creating huge costs downstream.
The test set should be created and reviewed by a group of professional translators and agreed on by all organization stakeholders. Assigning the task of creating the test set to just a single team member - or worse, the MT supplier - is not advisable.
If the translation needs of the organization are homogeneous in topic, terminology and style, a single test set is sufficient. If there are significant variations in topic, terminology and style, we can create a larger test set with multiple references, or multiple test sets. Which option is best depends on the magnitude of the variations and how they are organized in existing translation projects. Test sets should be revised periodically.

5. Baseline MT

We obtain baseline machine translations by translating the source side of our test data with the candidate MT systems/solutions.

If we plan to use our system with post-editing, we still have some more work to do before evaluation.

6. Post-Editing (optional)

To use machine translation with human editing, we need to integrate machine translation both into the process and into the tools that the editors use. This will require some process changes and some degree of integration of the MT systems or their translations into the CAT tool. For the pilot we need to integrate MT at least to a degree where we can compute accurate, meaningful post-editing metrics that can inform the use of MTPE for production. We might also consider the use of innovative CAT tools with tighter MT integration, such as interactive translation prediction (e.g. Lilt).

To monitor the entire translation process, compare post-editing to the traditional human translation process, and compare both to statistics from other organizations, we can instrument our translation process using the TAUS DQF Framework.

Modifying the process and tool integration is only the first step. We also need to provide training and instructions to the post-editors. Post-editing studies have shown that the level of post-editor training and experience is a very important factor in determining post-editing metrics. For the pilot we might want to choose post-editors who are already experienced in this mode of translation.

7. Evaluation

Automatic Evaluation

With the test set translations coming out of our MT system candidates, we can calculate our automatic metrics right away. This already gives us a much better picture for our use case than any publicly available benchmark - our test data represents what we would like to machine translate much better than any public test set does.

It can also be interesting to check the translations with a translation QA tool like LexiQA or Okapi Checkmate to determine the number of pattern, spelling and punctuation mistakes the MT systems make.

Human Evaluation

Particularly if we are using machine translations in their raw form, we should not just rely on automatic metrics, but also conduct human evaluation. We already discussed earlier which human evaluation metrics are typically used. We can evaluate the translations in tools like TAUS DQF Tools (a standalone tool separate from the DQF Framework described above), the Appraise tool or another direct assessment tool. Clear and unambiguous instructions for the evaluators are crucial for evaluation success.

Error Annotation (optional)

If we want and need more detailed annotation of errors in the translations using the MQM/DQF error hierarchy, we can use DQF, ContentQuo or linguistic quality review features built into some CAT tools.

8. Customization (optional)

Now that we have a complete picture of the quality of our candidate MT solutions with automatic and human evaluation metrics as well as error annotation, we can decide on the next step in our MT journey.
If one or more of our candidate systems already meets the quality expectations, we can skip customization. This is a good thing, because customization always comes with opportunity costs: we have to create, maintain and update the customization while the underlying MT systems evolve, customized systems likely only work for a specific set of projects, and switching costs increase.
If we do need to customize, the metrics - particularly human evaluation and error annotation - identify the problem areas we should address in the customization. E.g. if terminology or named entities are a problem, it might be enough to provide a bilingual dictionary to the MT solution, a feature that many MT solutions offer. Other issues might require more in-depth customization.

Data in the form of terminology databases and translation memories is essential for MT customization. Ideally, we already have such resources, cleaned and well organized, from previous translation projects. If they aren't available, we can create them by aligning translated documents or web pages and by extracting terminology. Even if we have well-organized resources, their topic and style might not fit the project, so we might have to sub-select data and/or merge data from different resources.

Custom MT Training

Once we have translation memory and terminology databases organized, customizing MT systems is fortunately quite straightforward and we can re-evaluate the MT output of the customized system using our test data and evaluation methods.

Based on the re-evaluation we can determine whether the customized MT meets our quality goals. If not, we can gather more data or tweak the system in other ways. How many times we have to go through this customization/re-evaluation cycle depends on the quality goals, but also on the context in which MT is expected to be used later in production. If we can reach our quality goal for our small pilot project only with difficulty and great effort (i.e. cost), and in production we anticipate translating many small, low-revenue projects from different domains/sources, MT use might not be cost-efficient. This reinforces the importance of business metrics in addition to common MT quality metrics.

9. MT Selection

Based on our metrics and evaluation results we can now decide on the best MT solution to put into production. It might be worth taking a step back to check whether the assumptions we made at the start of the pilot project still hold: Is there anything we learned during the pilot that makes us re-evaluate some of our metrics, or at least re-weight some of them? Has the business environment changed in the meantime? Have new and improved MT solutions become available to bring into the mix?

10. MT Integration

While we verified during the pilot that we can integrate the MT solution, and we might have done some shallow integration into our workflows (e.g. for post-editing), now comes the hard work of actual integration. Most likely we need to do some software development or configuration/setup for the MT connector. We also want to make sure that license keys and automatic payments (including cost monitoring) are in place so that we have uninterrupted access to our chosen MT solution. And we want to do all this without disrupting our existing translation workflows.
Based on the pilot metrics we should also decide on a set of production metrics to monitor and ensure that these can be tracked. We might even consider creating a metrics dashboard allowing all interested stakeholders to view the production metrics.

11. MT Rollout

With the integration work we have laid the groundwork for the technical rollout. But we still need to prepare the organization. Stakeholders that have not yet been involved need to be informed about the pilot, its results and the planned production rollout. If we are using MT with post-editing, additional post-editors need to be trained. A successful pilot program provides valuable organization-relevant information and internal advocates for MT, making the rollout easier.

After rollout we can monitor our production metrics, the acceptance of MT in the organization and among the organization's customers. Over time we will gather metrics data, feedback on translations and corrected translations that will allow us to further improve and optimize MT quality and use.