What is segmentation?
Segmentation splits the original text into smaller parts, such as sentences or titles, so that previously translated text is easier to retrieve from the translation memory (TM). The default segmentation rules in Memsource Cloud reflect the specifics of each supported language and can be customized if needed.
Importing jobs with poor segmentation (e.g. poorly formatted Word files) or applying customized segmentation can lower the match value retrieved from the TM. An example is shown in the picture below: the sentence in the second and third segments was manually broken into two lines. The CAT pane shows only a 63% match because the second half of the sentence is missing from segment no. 2.
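The effect of a broken sentence on the match value can be sketched in a few lines of Python. This is only a rough illustration that uses the standard difflib module as a stand-in similarity measure; it is not the matching algorithm Memsource uses, and the sample sentence and the similarity helper are made up for the example.

```python
from difflib import SequenceMatcher


def similarity(tm_source: str, segment: str) -> float:
    """Return a 0..1 similarity score (a stand-in for a TM match value)."""
    return SequenceMatcher(None, tm_source, segment).ratio()


# Sentence as it is stored in the translation memory (hypothetical).
tm_source = "The default rules can be customized if needed."

# The same sentence imported as one segment, and as the first half only
# (as happens when a line break splits the sentence into two segments).
whole_segment = "The default rules can be customized if needed."
first_half = "The default rules can be"

print(f"whole sentence: {similarity(tm_source, whole_segment):.0%}")
print(f"first half only: {similarity(tm_source, first_half):.0%}")
```

The whole sentence scores a full match, while the truncated segment scores lower even though every word in it appears in the stored sentence.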
Customizing the segmentation rules
There are two types of segmentation rules: abbreviation lists in an XLSX file and regular expression rules in an SRX file.
To edit the segmentation rules, download the default rules to an external file, edit the file, upload it as a new set of rules, and select it when importing a new job.
Download the default segmentation rules
In the main Setup > Project Settings > Segmentation, download the default segmentation rules for the specific language (either the abbreviation list in an XLSX file or the regular expression rules in an SRX file).
Edit the downloaded files
Edit Abbreviations in XLSX file
This option lets you specify, for an individual language, the abbreviations after which a new segment should not be created. The XLSX file must have two columns with no header row:
- The first column in the XLSX file specifies an abbreviation.
- The second column specifies the segmentation behavior:
  - ABBR_UPPER_NUM means that a new segment will not be created if the abbreviation is followed by whitespace and then by a number, a word starting with an upper-case letter, or a symbol (math symbols, currency signs, dingbats, box-drawing characters, etc.).
  - ABBR_NUM means that a new segment will not be created if the abbreviation is followed by whitespace and then by a number.
- Save the edited XLSX file.
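The two behaviors in the second column can be modeled as a small Python check. This is a sketch of the description above, not Memsource's implementation: the function name suppresses_break is hypothetical, and treating every Unicode "symbol" category as a symbol is an assumption about what counts as one.

```python
import unicodedata


def suppresses_break(mode: str, text_after_abbreviation: str) -> bool:
    """Return True if, per the rule type, no new segment starts after the abbreviation."""
    # Both rule types require whitespace directly after the abbreviation.
    if not text_after_abbreviation[:1].isspace():
        return False
    rest = text_after_abbreviation.lstrip()
    if not rest:
        return False
    first = rest[0]
    if mode == "ABBR_NUM":
        return first.isdigit()
    if mode == "ABBR_UPPER_NUM":
        # Assumption: Unicode categories starting with "S" (Sm, Sc, So, Sk)
        # stand in for math symbols, currency signs, dingbats, and so on.
        is_symbol = unicodedata.category(first).startswith("S")
        return first.isdigit() or first.isupper() or is_symbol
    raise ValueError(f"unknown rule type: {mode}")


# "No. 5" stays in one segment under either rule type.
print(suppresses_break("ABBR_NUM", " 5 is missing."))
# "Dr. Smith" stays in one segment only under ABBR_UPPER_NUM.
print(suppresses_break("ABBR_NUM", " Smith arrived."))
print(suppresses_break("ABBR_UPPER_NUM", " Smith arrived."))
```

A result of False means only that the abbreviation entry does not prevent a break; whether a break is actually created still depends on the other segmentation rules.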
Edit regular expression in SRX file
Editing SRX files is a complex process suitable only for users experienced with regular expressions.
What rules can be changed in the SRX file? For example:
- Import text from Excel without segmentation, so that one cell = one segment
- Import text containing a new line as one segment instead of two
- Don't use a semicolon (or any other character) as a segment separator
- Use a colon (or any other character) as a segment separator
- Remove an abbreviation from the list (the text will then be segmented after it)
The rules are character-based, which means that only a single character can be used as a segment separator (a group of characters, for example <p>, cannot be used as a segment separator).
Edit the SRX file:
- Open it in a text editor (for example Notepad++, which is free to download).
- Edit it using regular expressions or remove the inner segmentation completely (see this example).
- <rule break="no"> marks the rules where a segment will not be broken, i.e. the list of abbreviations.
- <rule> <beforebreak> holds the regular expression for the characters before the break (for example the end of a sentence: ". ? ! :"). If, for example, you don't want text segmented after a colon, simply delete : from every <rule> <beforebreak> element.
- <rule> <afterbreak> holds the regular expression for the characters after the break (for example the start of a new sentence: a space and a capital letter).
- Save the modified SRX file.
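To see how the <beforebreak> and <afterbreak> expressions interact, the following Python sketch applies a stripped-down, hypothetical rule fragment to a sample sentence. It assumes the usual SRX behavior that, at each position, the first rule whose two expressions both match decides whether to break. Real SRX files also contain a header, a namespace declaration and language maps, and the default Memsource rules are far more extensive; the helper names load_rules and segment are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical minimal fragment, not the Memsource default rules.
SRX_FRAGMENT = r"""
<languagerule languagerulename="Example">
  <rule break="no">
    <beforebreak>\bDr\.</beforebreak>
    <afterbreak>\s</afterbreak>
  </rule>
  <rule break="yes">
    <beforebreak>[.?!:]</beforebreak>
    <afterbreak>\s</afterbreak>
  </rule>
</languagerule>
"""


def load_rules(xml_text):
    """Read (is_break, beforebreak, afterbreak) tuples in document order."""
    rules = []
    for rule in ET.fromstring(xml_text.strip()).iter("rule"):
        is_break = rule.get("break", "yes") == "yes"
        before = rule.findtext("beforebreak", default="")
        after = rule.findtext("afterbreak", default="")
        rules.append((is_break, before, after))
    return rules


def segment(text, rules):
    """Split text at every position where a break rule is the first rule to match."""
    compiled = [
        (is_break, re.compile("(?:" + before + r")\Z"), re.compile(after))
        for is_break, before, after in rules
    ]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            # beforebreak must match the text ending at pos,
            # afterbreak must match the text starting at pos.
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # the first matching rule decides; skip the rest
    segments.append(text[start:])
    return [s.strip() for s in segments if s.strip()]


TEXT = "Dr. Smith arrived. Note: He was late."

# With the colon in <beforebreak>, "Note:" becomes its own segment.
print(segment(TEXT, load_rules(SRX_FRAGMENT)))

# After deleting ":" from <beforebreak>, the text after the colon stays attached.
no_colon = SRX_FRAGMENT.replace("[.?!:]", "[.?!]")
print(segment(TEXT, load_rules(no_colon)))
```

In this sketch the break="no" rule is listed first, so "Dr." never triggers a break even though it ends with a period; moving that rule below the break rule would change the result.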
Upload the new segmentation rules to Memsource
- Go to Setup - Segmentation and hit the New button
- Select the Language, enter a Name (e.g. "New Segment after Semicolon") and choose the modified SRX file. Check the Primary check box only if you want to make the custom segmentation your primary segmentation for the language. Hit the Create button.
- If everything goes well, the message "The segmentation file has been uploaded successfully." will appear and the new file will be listed on the Segmentation page.
Use new segmentation rules for job import
- Now go to your project and hit the "New" button to create a new job.
- In File Import Settings expand Segmentation and select your custom segmentation rule.
- Hit the Create button to add the job(s) to your project, segmented with your custom segmentation rules.
Setting the Custom segmentation as default
If you set the new segmentation rule as Primary (in Setup), it will be selected by default for all new jobs imported for the source language in question.
For details, see the Memsource Cloud User Manual - File Import Settings.
You can also set specific segmentation rules for specific projects by creating project templates; see the Memsource Cloud User Manual - Project Templates.