
Artificial Intelligence and Machine Learning


This article explains how machine translation and artificial intelligence are used at Memsource and how they relate to data privacy and processing.

Data Privacy and Security

Data in Machine Learning Models

Data uploaded by Memsource clients (including metadata) can potentially be used for training machine learning models. This data is not shared with other users, nor is it possible to extract it from the models as they do not produce any content.

All content is treated as if it contained personal information, so all data used for machine learning is handled in accordance with the rules imposed by GDPR. Client data (or anyone's data) is not resold for profit. This data cannot be reconstructed or reverse engineered from the models. None of the AI features generate any textual content; they only label content with metadata (e.g. MT quality category, non-translatable, etc.).

When training models, all relevant data is aggregated from Memsource and models are constructed from it. After no more than 90 days (as required by GDPR), all data is deleted and only the models remain. These models do not contain customer data, as they do not store sentences. The neural network model is a complex mathematical formula that calculates a quality score based on the source sentence and its translation. Training the model involves adjusting the parameters of the formula until it provides desired results.
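The idea that a trained model is only a parameterized formula, not a store of sentences, can be sketched in a few lines. Everything below is an illustrative assumption: the feature names, the linear formula, and the training loop are toy stand-ins for the real (much more complex) neural network.

```python
# Minimal sketch: the "model" is just numeric parameters (weights);
# training adjusts them, and no sentence text is stored afterwards.

def featurize(source: str, translation: str) -> list[float]:
    # Toy numeric representation of a sentence pair (hypothetical features):
    # character-length ratio and word-count ratio. Real systems use
    # learned embeddings instead.
    len_ratio = len(translation) / max(len(source), 1)
    word_ratio = len(translation.split()) / max(len(source.split()), 1)
    return [len_ratio, word_ratio]

def score(weights: list[float], features: list[float]) -> float:
    # The model is "a mathematical formula"; here, simply a linear one.
    return sum(w * f for w, f in zip(weights, features))

def train_step(weights, features, target, lr=0.1):
    # Adjust the parameters toward the desired result (one gradient step
    # on squared error). After training, only `weights` remain; the
    # training sentences themselves can be deleted.
    error = score(weights, features) - target
    return [w - lr * error * f for w, f in zip(weights, features)]

weights = [0.0, 0.0]
feats = featurize("the cat sleeps", "il gatto dorme")
for _ in range(100):
    weights = train_step(weights, feats, target=1.0)
```

After the loop, `weights` has been fitted so that `score(weights, feats)` is close to the target quality of 1.0, while the sentence pair itself is no longer needed.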

Data Anonymization

Training data is processed by the machine learning algorithm to create (train) a model. This model is used in the feature to predict non-translatables, MT quality (MTQE) or to recommend an optimal MT engine.

While the training data may contain personal data, the resulting model does not. Any personal data is anonymized during the training process.

Machine Learning Models and New Content

When new content is processed with MT, a numerical representation is created, fed to the formula, and a score is calculated. If the content is identical to sentences seen during training, the formula is already optimized to score it correctly.

The formula is designed to learn and identify patterns in the data, that is, to generalize. It will, for example, learn that when "cat" appears in the English source, the MT output should contain "gatto" in Italian to be considered good; MT output that does not contain it is considered bad. These learned patterns are then applied to any newly submitted sentences and MT output.

Machine Learning

For general information on the engines supported by Memsource, their performance, and factors to consider when getting started with machine translation, see these resources:

Data Security and Privacy with MT Providers

When submitting source content to machine translation providers, Memsource encrypts the data in transit. When processed by the MT engine, the data is subject to the MT provider's terms of service and privacy policies.


MTQE Calculation

  • Memsource collects training data from already-processed segments (source, MT output, postedit).

  • Based on the similarity between MT output and post-edit, a quality category is calculated (100, 99, 75, 0).

  • The similarity is calculated using an in-house metric, partially based on chrF3 (a popular MT evaluation metric).

  • The (source, MT output, score) tuples are fed into a deep neural network, which is taught to predict the quality category. One neural net is used per language pair.

  • In production, the neural net gets a source sentence and the MT output as input, and it predicts the most probable quality category.
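The data-preparation steps above can be sketched as a small pipeline. Note the similarity thresholds and the shape of the helper functions are illustrative assumptions, not Memsource's actual values.

```python
def quality_category(similarity: float) -> int:
    # Map a 0..1 similarity between MT output and its post-edit to one of
    # the four MTQE categories. The thresholds here are assumed for
    # illustration only.
    if similarity >= 0.999:
        return 100
    if similarity >= 0.95:
        return 99
    if similarity >= 0.6:
        return 75
    return 0

def build_training_tuples(segments, similarity_fn):
    # segments: iterable of (source, mt_output, postedit) triples collected
    # from already-processed jobs. Returns the (source, MT output, score)
    # tuples that the neural net is trained on.
    return [(src, mt, quality_category(similarity_fn(mt, pe)))
            for src, mt, pe in segments]

# Tiny usage example with a trivial exact-match similarity function:
segs = [("the cat sleeps", "il gatto dorme", "il gatto dorme")]
tuples = build_training_tuples(
    segs, similarity_fn=lambda a, b: 1.0 if a == b else 0.0)
```

Here the MT output equals the post-edit, so the segment gets category 100 (no editing was needed).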

chrF3 vs. BLEU or TER

Better results have been observed with chrF3, and it is more reliable when scoring individual sentences. It also handles various language types better, such as morphologically rich languages and CJK languages.
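For readers unfamiliar with chrF3: it is a character n-gram F-score with recall weighted three times as heavily as precision (beta = 3). A simplified single-sentence implementation, assuming character n-grams only (production implementations such as sacreBLEU add word n-grams and smoothing), looks like this:

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    # Multiset of all character n-grams in the text.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis: str, reference: str, max_n: int = 6,
         beta: float = 3.0) -> float:
    # Character n-gram F-score. beta=3 weights recall 3x precision,
    # giving chrF3. Simplified sketch for a single sentence.
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Because the score is built from character n-grams rather than whole words, a partially correct inflected form still earns partial credit, which is why the metric behaves well for morphologically rich languages and for CJK text, where word boundaries are unclear.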

No MTQE scores

If no score is provided, the segment is likely not worth post-editing, but it may also mean that the model is not confident enough to answer. Distinguishing between these two situations is under development.

Memsource Translate

Defining and Identifying Domains

Memsource collects training data from existing segments. An algorithm is applied that identifies different domains making up the training data. Although obtained automatically, the detected domains correspond well to standardized domains (e.g. medical, travel/hospitality, software, etc).

For each language pair and domain, the performance of machine translation engines is monitored. When a new document is uploaded for translation, the model detects which of these domains are present. The most relevant domain is selected (e.g. if a document is 60% legal and 40% medical, it is categorized as legal) and the current best engine for the legal domain and corresponding language pair is recommended.
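The selection step above can be sketched as a lookup keyed by language pair and dominant domain. The engine names, domain shares, and score table below are hypothetical placeholders, not actual Memsource data.

```python
# Hypothetical table: (language_pair, domain) -> currently best engine,
# kept up to date by the ongoing performance monitoring described above.
BEST_ENGINE = {
    ("en-it", "legal"): "Engine A",
    ("en-it", "medical"): "Engine B",
}

def recommend_engine(language_pair: str, domain_shares: dict) -> str:
    # domain_shares: detected domain proportions for the uploaded document,
    # e.g. {"legal": 0.6, "medical": 0.4}. Pick the most relevant domain,
    # then look up the best engine for it and the language pair.
    top_domain = max(domain_shares, key=domain_shares.get)
    return BEST_ENGINE[(language_pair, top_domain)]

recommend_engine("en-it", {"legal": 0.6, "medical": 0.4})  # "Engine A"
```

In this example the document is 60% legal, so the legal domain wins and the engine currently performing best on legal en-it content is recommended.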
