Segmentation Rules Overview

Segmentation

Segmentation is the process that splits original texts into smaller parts. This improves the retrieval of previously translated text from a translation memory.

Default segmentation rules in Memsource are tailored to the specifics of each supported language and can be customized.

Badly segmented jobs can affect TM match values. Bad segmentation typically results from poorly formatted source files (for example, Word files) or from inappropriate segmentation customization.

Example:

Good Segmentation:

  • Translation memories with multilingual target languages are supported and can be used bidirectionally.

    Match value of 100%.

Poor Segmentation:

  • Translation memories with multilingual target languages are supported.

    Match value of 100%.

  • and can be used bidirectionally.

    Match value of 63%.

Customize Segmentation Rules

Customized segmentation rules can be applied to jobs and project templates. When set as primary, they are applied to all new jobs imported for that source language.

There are two types of segmentation rules:

  • Abbreviation lists in an XLSX file

  • Regular expression rules in an SRX file

To use customized rules, download the default rules, modify them, upload the modified file, and apply it to the specified jobs.

Download Default Segmentation Rules

To download the default segmentation rules, follow these steps:

  1. From the Setup page, scroll down to the Project Settings section and click Segmentation.

    The Segmentation page opens.

  2. Select the language to be customized and click Export XLSX/SRX.

    The Export XLSX/SRX window opens.

  3. Select format:

    • XLSX provides an abbreviation list.

    • SRX provides regular expression rules.

  4. Select a language from the dropdown list.

  5. Click Download.

    The file is downloaded to your system.

Edit Abbreviations in an XLSX File

For each language, you can specify abbreviations after which a new segment should not be created.

To edit abbreviations, follow these steps:

  1. Open the downloaded XLSX file in an editor.

  2. Change the contents with the following formatting:

    The XLSX file must have two columns with no headings.

    • Column 1: Abbreviation to be specified

    • Column 2: Specification of segmentation behavior

      • ABBR_UPPER_NUM

        A new segment will not be created if the abbreviation is followed by white space and then by a number, a symbol (math, currency signs, dingbats, etc.), or a word with the first letter in upper case.

      • ABBR_NUM

        A new segment will not be created if the abbreviation is followed by white space and then by a number.

  3. Save the edited XLSX file.
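
The two behaviors can be read as a small decision rule. The Python sketch below is only an illustration of the descriptions above, not Memsource's implementation: the function name suppresses_break is hypothetical, and using Unicode symbol categories to recognize "a symbol" is an assumption.

```python
import unicodedata


def suppresses_break(behavior: str, text_after_abbreviation: str) -> bool:
    """Return True if no new segment should start after the abbreviation.

    Hypothetical helper that models ABBR_NUM and ABBR_UPPER_NUM as described.
    """
    # Both behaviors require white space directly after the abbreviation.
    if not text_after_abbreviation[:1].isspace():
        return False
    rest = text_after_abbreviation.lstrip()
    if not rest:
        return False
    first = rest[0]
    if first.isdigit():
        return True  # numbers are covered by both behaviors
    if behavior == "ABBR_UPPER_NUM":
        # Assumption: Unicode categories starting with "S" stand in for
        # "symbol" (math, currency, other symbols).
        return first.isupper() or unicodedata.category(first).startswith("S")
    return False
```

With this sketch, an abbreviation marked ABBR_NUM keeps "No. 5" in one segment but still allows a break before "Smith", while ABBR_UPPER_NUM also keeps "Dr. Smith" together.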

Edit Regular Expressions in an SRX File

Editing SRX files is a complex process suitable only for users experienced with regular expressions.

Editing an SRX file makes several changes in behavior possible:

  • Import text from an XLSX file without segmentation; one cell is equal to one segment.

  • Import text with a new line in order to split one segment into two.

  • Use a colon (or any other character) as a segment separator.

  • Forbid the use of a semicolon (or any other character) as a segment separator.

  • Remove an abbreviation from the list (text will then be segmented after that abbreviation).

These rules are character-based; only a single character can be used as a segment separator. A group of characters (for example, <p>) cannot be used as a segment separator.

To edit the SRX file, follow these steps:

  1. Open the file in a text editor such as Notepad++.

  2. Edit using regular expressions or remove the inner segmentation completely.

    Example:

    • <rule break="no">

      The list of rules where the segment will not be broken, that is, a list of abbreviations.

    • <rule> <beforebreak>

      A regular expression for the character before a break (for example, the end-of-sentence characters . ? ! :). To stop text from being segmented after a colon, for instance, delete the : from every <rule> <beforebreak> expression.

    • <rule> <afterbreak>

      A regular expression for the characters after a break (for example, a space and a capital letter at the start of a new sentence).

  3. Save the modified SRX file.
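
To make the interplay of <rule break="no">, <beforebreak>, and <afterbreak> more concrete, here is a minimal Python sketch of how such rules could be evaluated. It is an illustration under assumptions, not Memsource's segmentation engine: the rule patterns are invented, Python's re module stands in for the regular expression flavor used in SRX files, and the sketch assumes that rules are checked in order and the first matching rule decides.

```python
import re

# Hypothetical rules in file order: (break?, beforebreak regex, afterbreak regex).
RULES = [
    (False, r"\betc\.", r"\s"),      # like <rule break="no">: an abbreviation
    (True, r"[.?!:]", r"\s+[A-Z]"),  # like <rule>: sentence end, then space and capital
]


def segment(text, rules):
    """Split text wherever the first matching rule at a position says break."""
    compiled = [
        (is_break, re.compile("(?:" + before + r")\Z"), re.compile(after))
        for is_break, before, after in rules
    ]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            # beforebreak must match text ending at pos,
            # afterbreak must match text starting at pos.
            if before.search(text, 0, pos) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # first matching rule wins
    segments.append(text[start:])
    return [s.strip() for s in segments]


TEXT = "Bring pens, paper, etc. Then start. Note: This works."
with_colon = segment(TEXT, RULES)
# Same rules with ":" deleted from the beforebreak expression.
without_colon = segment(TEXT, [RULES[0], (True, r"[.?!]", r"\s+[A-Z]")])
```

In this sketch, with_colon splits after "Note:", while without_colon keeps "Note: This works." as one segment; in both cases the no-break rule prevents a split after "etc.".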

Upload New Segmentation Rules

To upload modified or new segmentation rules, follow these steps:

  1. From the Setup page, scroll down to the Project Settings section and click Segmentation.

    The Segmentation page opens.

  2. Click New.

    The Upload Custom XLSX or SRX Segmentation File page opens.

  3. Select a Language from the dropdown list.

  4. Provide a Name for the rule.

  5. Click Choose File.

    A file selection window opens.

  6. Select the modified rules file for upload.

  7. Select Primary if the custom segmentation rules should be the primary segmentation rules for the selected language.

  8. Click Create.

    The Segmentation page opens and the rule has been added to the list.

Use Custom Segmentation Rules on Job Import

To use custom rules on a job import, follow these steps:

  1. At step 8 of Creating a Job, click Segmentation and Segment Length from the File Import Settings.

    The Segmentation and Segment Length options dropdown opens.

  2. Select the modified rules from the Source segmentation rules dropdown list.

  3. Click Create.

    The job is created using the specified segmentation rules and added to the list.

Changing Segmentation Example (1 Cell 1 Segment)

This example shows how to remove all inner segmentation rules from an SRX file so that only the basic segmentation by whole paragraph, element, or cell is applied. This segmentation rule can be applied to every file type (MS Word, XML, HTML, Excel, etc.).

Example:

        A               B
  1     Peter! Wait!
  2     Hello.
  3

Imported with default segmentation, this XLSX example has three segments: "Peter!", "Wait!", and "Hello."

If all inner segmentation is removed, leaving only the basic segmentation based on the cell, there are only two segments: "Peter! Wait!" and "Hello."
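
The difference between the two counts can be reproduced with a short Python sketch. The sentence-splitting regular expression is only a stand-in for the default rules (an assumption, not the actual Memsource rule set), and the variable names are hypothetical.

```python
import re

cells = ["Peter! Wait!", "Hello."]


def sentence_split(cell):
    # Stand-in for default rules: break after . ! or ? when white space follows.
    return re.split(r"(?<=[.!?])\s+", cell)


# Default-style segmentation: inner rules split each cell into sentences.
default_segments = [seg for cell in cells for seg in sentence_split(cell)]

# No inner rules: each cell is exactly one segment.
cell_segments = list(cells)
```

Here default_segments holds three items and cell_segments holds two, matching the counts described above.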

To remove all the default segmentation, edit the SRX file and delete all the code between <!-- break rules --> and </language rules>.
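
For files with many rules, the same cleanup could be scripted. The Python sketch below is one possible approach under assumptions: it assumes the exported file follows the usual SRX layout in which <rule> elements sit inside <languagerule> elements, the sample content is invented, and the function names are hypothetical. Note that ElementTree drops XML comments and may rewrite namespace prefixes, so compare the result with the original before uploading.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop an XML namespace prefix such as '{uri}rule' -> 'rule'."""
    return tag.rsplit("}", 1)[-1]


def strip_break_rules(srx_text):
    """Return the SRX text with every <rule> removed from each <languagerule>."""
    root = ET.fromstring(srx_text)
    # Collect elements first so removal does not disturb the traversal.
    for parent in list(root.iter()):
        if local_name(parent.tag) == "languagerule":
            for child in list(parent):
                if local_name(child.tag) == "rule":
                    parent.remove(child)
    return ET.tostring(root, encoding="unicode")


# Invented sample in the assumed layout; a real export will be much longer.
SAMPLE = """<srx version="2.0"><body><languagerules>
<languagerule languagerulename="English">
<rule break="no"><beforebreak>Mr[.]</beforebreak><afterbreak> </afterbreak></rule>
<rule break="yes"><beforebreak>[.?!]</beforebreak><afterbreak> </afterbreak></rule>
</languagerule></languagerules></body></srx>"""

stripped = strip_break_rules(SAMPLE)
```

To apply this to a downloaded file, read its contents into a string, pass it to strip_break_rules, and write the result to a new file for upload.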
