This article explains how machine translation and artificial intelligence are used at Memsource and how they relate to data privacy and processing.
Data uploaded by Memsource clients (including metadata) can potentially be used for training machine learning models. This data is not shared with other users, nor can it be extracted from the models, as the models do not produce any content.
All content is treated as if it contained personal information, so all data used for ML is handled in accordance with the rules imposed by the GDPR. Client data (or anyone else's data) is not resold for profit, and it cannot be reconstructed or reverse engineered. None of the AI features generate any textual content; they only label content with metadata (e.g. MT quality category, non-translatable, etc.).
When training models, all relevant data is aggregated from Memsource and models are constructed from it. After no more than 90 days (as required by the GDPR), all data is deleted and only the models remain. These models do not contain customer data, as they do not store sentences. The neural network model is a complex mathematical formula that calculates a quality score based on the source sentence and its translation. Training the model means adjusting the parameters of the formula until it produces the desired results.
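The idea of "adjusting parameters of a formula until it produces the desired results" can be sketched with a toy one-feature model. Everything below (the overlap feature, the training pairs, the learning rate) is invented for illustration and is not Memsource's actual model:

```python
# Toy illustration: a "model" is just a parametric formula whose
# parameters are adjusted (here via gradient descent) until it
# reproduces the desired scores. Features and data are made up.

def score(weight, bias, overlap):
    """Quality score as a simple linear formula of one feature."""
    return weight * overlap + bias

# Hypothetical training pairs: (source/MT overlap feature, desired score)
training_data = [(0.9, 0.95), (0.5, 0.55), (0.1, 0.15)]

weight, bias = 0.0, 0.0
learning_rate = 0.1
for _ in range(1000):
    for overlap, target in training_data:
        error = score(weight, bias, overlap) - target
        # Nudge each parameter in the direction that reduces the error
        weight -= learning_rate * error * overlap
        bias -= learning_rate * error

print(round(score(weight, bias, 0.9), 2))
```

After training, only `weight` and `bias` remain; none of the training pairs are stored in the model, which is the point the article makes about customer data.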
Training data is processed by the machine learning algorithm to create (train) a model. This model is used in the feature to predict non-translatables, MT quality (MTQE) or to recommend an optimal MT engine.
While the training data may contain personal data, the resulting model does not. Any personal data is anonymized during the training process.
When new content is processed with MT, a numerical representation is created, fed to the formula, and a score is calculated. If the new content is identical to sentences seen in training, the formula is already optimized to produce the correct score for it.
The formula is designed to learn and identify patterns in the data, i.e. to generalize. It will, for example, learn that when it sees "cat" in the English source, the MT output should contain "gatto" in Italian to be considered good; MT output that does not contain it is bad. These learned patterns are then applied to any newly submitted sentences and MT output.
For general information on the engines supported by Memsource, their performance, and factors to consider when getting started with machine translation, see these resources:
When submitting source content to machine translation providers, Memsource encrypts the data in transit. When processed by the MT engine, the data is subject to the MT provider's terms of service and privacy policy.
Memsource collects training data from already-processed segments (source, MT output, post-edit).
Based on the similarity between the MT output and the post-edit, a quality category is calculated (100, 99, 75, or 0).
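The mapping from similarity to category can be sketched as a simple threshold function. The cut-off values below are illustrative assumptions, not Memsource's actual thresholds:

```python
def quality_category(similarity):
    """Map a 0-1 similarity between MT output and its post-edit to
    one of the four quality categories. Thresholds are illustrative."""
    if similarity == 1.0:
        return 100   # MT matched the post-edit exactly
    if similarity >= 0.99:
        return 99    # near-perfect match
    if similarity >= 0.75:
        return 75    # useful but needs editing
    return 0         # not worth post-editing

print(quality_category(0.8))  # 75
```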
The similarity is calculated using an in-house metric that is partially based on chrF3, a popular MT evaluation metric.
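chrF3 compares two strings by their overlapping character n-grams, combining average precision and recall into an F-score with recall weighted three times as heavily (beta = 3). A simplified sketch (the real metric also handles whitespace and smoothing differently, and Memsource's in-house metric is only partially based on it):

```python
from collections import Counter

def char_ngrams(text, n):
    """Multiset of character n-grams of a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=3.0):
    """Simplified chrF: average character n-gram precision/recall,
    combined into an F-score with beta=3 (recall weighted 3x)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())  # multiset intersection
        if hyp and ref:
            precisions.append(overlap / sum(hyp.values()))
            recalls.append(overlap / sum(ref.values()))
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(round(chrf("il gatto nero", "il gatto nero"), 2))  # 1.0
```

Because it operates on characters rather than words, a metric like this copes better with morphologically rich languages and with CJK text, which the article notes below.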
The (source, MT output, score) tuples are fed into a deep neural network, which is taught to predict the quality category. One neural net is used per language pair.
In production, the neural net gets a source sentence and the MT output as input, and it predicts the most probable quality category.
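Predicting "the most probable quality category" typically means the network outputs one score (logit) per category and the highest-probability one is chosen. A minimal sketch, with made-up logit values standing in for the real network's output:

```python
import math

CATEGORIES = [100, 99, 75, 0]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def most_probable_category(logits):
    """Pick the category with the highest predicted probability.
    In production the logits would come from the per-language-pair
    neural net; here they are invented for illustration."""
    probs = softmax(logits)
    return CATEGORIES[probs.index(max(probs))]

print(most_probable_category([0.2, 1.5, 3.1, -0.4]))  # 75
```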
Better results have been observed with chrF3, and it is more reliable when scoring individual sentences. It also handles various language types well, such as morphologically rich languages and CJK languages.
If a score is not provided, the segment is likely not worth post-editing, but it may also mean that the model is not confident enough to answer. The ability to separate these two situations is in development.
Memsource collects training data from existing segments. An algorithm is applied that identifies the different domains making up the training data. Although obtained automatically, the detected domains correspond well to standardized domains (e.g. medical, travel/hospitality, software, etc.).
For each language pair and domain, the performance of machine translation engines is monitored. When a new document is uploaded for translation, the model detects which of these domains are present. The most relevant domain is selected (e.g. if a document is 60% legal and 40% medical, it is categorized as legal) and the current best engine for the legal domain and corresponding language pair is recommended.
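The selection step described above can be sketched as follows. The domain proportions, engine names, and the ranking table are all hypothetical placeholders for the monitored performance data:

```python
# Sketch of the recommendation step: given detected domain proportions
# for a document and a per-domain "best engine" table (which in reality
# would also be keyed by language pair), pick the dominant domain's
# best engine. All names and numbers are invented for illustration.

def recommend_engine(domain_mix, best_engine_by_domain):
    """Select the engine for the domain with the largest share."""
    dominant = max(domain_mix, key=domain_mix.get)
    return best_engine_by_domain[dominant]

domain_mix = {"legal": 0.6, "medical": 0.4}   # detected proportions
best_engine_by_domain = {                      # hypothetical rankings
    "legal": "engine-A",
    "medical": "engine-B",
}
print(recommend_engine(domain_mix, best_engine_by_domain))  # engine-A
```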