Project Management

Segmentation

What is Segmentation?

The segmentation process splits the original text into smaller parts, such as sentences or titles, to make it easier to retrieve previously translated text from a translation memory. The default segmentation rules in Memsource reflect the specifics of each supported language and can be customized if needed.
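
To make the idea concrete, here is a minimal Python sketch of rule-based segmentation. It is only an illustration and not Memsource's actual implementation: the abbreviation list and the `segment` function are hypothetical, and the real default rules are language-specific and more detailed.

```python
import re

# Hypothetical abbreviation list; the real defaults differ per language.
ABBREVIATIONS = {"e.g.", "Dr.", "No."}

def segment(text):
    """Split after '.', '?' or '!' followed by whitespace,
    unless the punctuation ends a known abbreviation."""
    segments, start = [], 0
    for match in re.finditer(r"[.?!]\s+", text):
        candidate = text[start:match.end()].rstrip()
        last_word = candidate.split()[-1]
        if last_word in ABBREVIATIONS:
            continue  # no new segment after an abbreviation
        segments.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments

text = "Dr. Smith arrived late. Was the meeting over? No. 5 was still open."
segments = segment(text)
for s in segments:
    print(s)
```

Each printed line corresponds to one segment that could be looked up in a translation memory on its own.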

Importing jobs with poor segmentation (e.g. poorly formatted Word files) or applying customized segmentation can affect the TM match values retrieved. The picture below shows an example: the sentence in the second and third segments was manually broken into two lines. As a result, the CAT Pane shows only a 63% match, because the second half of the sentence is missing from the second segment.

segmentation_match.jpg
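
The effect can be approximated with a rough similarity measure. The sketch below uses Python's `difflib.SequenceMatcher` as a stand-in; it is an assumption for illustration only and is not the algorithm Memsource uses to compute match values. The example sentences are invented.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a rough 0-100 similarity score between two strings."""
    return round(SequenceMatcher(None, a, b).ratio() * 100)

# Sentence stored in the translation memory (invented example).
tm_source = "The device must be switched off before the cover is removed."

# The same sentence imported as one segment, and as a segment cut by a line break.
well_segmented = "The device must be switched off before the cover is removed."
broken_segment = "The device must be switched off"

full = similarity(tm_source, well_segmented)
partial = similarity(tm_source, broken_segment)
print(full, partial)
```

The intact segment scores as a full match, while the truncated one scores noticeably lower, which mirrors the drop in match value described above.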

Customizing the Segmentation Rules

There are two types of segmentation rules: abbreviation lists in XLSX files and regular expression rules in SRX files.

To edit the segmentation rules, you need to download the default rules into an external file, edit it, and upload the file as new rules. You can then apply these new rules to new jobs.

Download the Default Segmentation Rules

In the main Setup, go to the Project Settings section and select Segmentation. Download the default segmentation rules for the specific language (either as an abbreviation list in XLSX or as regular expression rules in an SRX file).

segmentation_setup.png

Edit the Downloaded Files

Once you have downloaded the file, you can edit it. The procedure differs depending on whether you downloaded an XLSX file or an SRX file.

Editing Abbreviations in an XLSX File

This option lets you specify, for individual languages, abbreviations after which a new segment should not be created. The XLSX file must have two columns and no header row:

  • The first column in the XLSX file specifies an abbreviation.
  • The second column further specifies the segmentation behavior:
    • ABBR_UPPER_NUM means that a new segment will not be created if the abbreviation is followed by whitespace and then by a number, a word whose first letter is in upper case, or a symbol (math symbols, currency signs, dingbats, box-drawing characters, etc.).
    • ABBR_NUM means that a new segment will not be created if the abbreviation is followed by whitespace and then by a number.
  • Once you've made the desired changes, save the edited XLSX.

Segmentation_abbrv.png
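
The two values in the second column can be read as the following check. This is a hedged sketch of the behavior described above, not Memsource's code: the function name `suppresses_break` is hypothetical, and mapping "symbol" to the Unicode symbol categories (Sm, Sc, Sk, So) is an assumption.

```python
import re
import unicodedata

def suppresses_break(rule_type, following_text):
    """Return True if rule_type forbids a new segment before following_text.

    following_text is the text right after the abbreviation and is
    expected to start with whitespace.
    """
    match = re.match(r"\s+(\S)", following_text)
    if not match:
        return False
    ch = match.group(1)
    if ch.isdigit():
        return True  # both ABBR_NUM and ABBR_UPPER_NUM cover numbers
    if rule_type == "ABBR_UPPER_NUM":
        # Upper-case first letter, or a symbol (assumed: Unicode categories S*).
        return ch.isupper() or unicodedata.category(ch).startswith("S")
    return False

# Hypothetical rows, laid out like the two columns of the XLSX file.
rows = [("No.", "ABBR_NUM"), ("Dr.", "ABBR_UPPER_NUM")]

print(suppresses_break("ABBR_NUM", " 5 items"))      # number follows
print(suppresses_break("ABBR_NUM", " Smith"))        # upper-case word follows
print(suppresses_break("ABBR_UPPER_NUM", " Smith"))  # upper-case word follows
```

With `ABBR_NUM`, "No. 5" stays in one segment but "No. Smith" may still be split; with `ABBR_UPPER_NUM`, "Dr. Smith" also stays together.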

Editing Regular Expressions in an SRX File

Editing SRX files is a complex process suitable only for users experienced in using Regular Expressions. There are several rules that can be changed in an SRX file. You can, for example:

  • Import text from Excel without segmentation. Here, one cell is equal to one segment.
  • Import text with a new line in order to split one segment into two.
  • Use a colon (or any other character) as a segment separator.
  • Forbid the use of a semicolon (or any other character) as a segment separator.
  • Remove an abbreviation from the list (the text will then be segmented after it).

Note that these rules are "character-based", meaning that only a single character can be used as a segment separator. That is, a group of characters (for example <p>) cannot be used as a segment separator.

Edit the SRX file:

  1. Open the file in a text editor (for example Notepad++, which is free to download).
  2. Edit using regular expressions or remove the inner segmentation completely (see this example).
    • <rule break="no"> marks the rules under which the segment will not be broken, i.e. the list of abbreviations.
    • <rule> <beforebreak> is a regular expression for the character before a break (for example, the characters ". ? ! :" at the end of a sentence). If, for example, you don't want to segment text after a colon, simply delete : from every <rule> <beforebreak> element.
    • <rule> <afterbreak> is a regular expression for the characters after a break (for example, a space and a capital letter at the start of a new sentence).
  3. Save the modified SRX file.
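
The interplay of <rule break="no">, <beforebreak> and <afterbreak> can be sketched in Python as follows. This is a simplified illustration under several assumptions: the XML fragment is a hypothetical, namespace-free excerpt and not the Memsource default rules (real SRX files wrap rules in further elements and usually declare a namespace), and Python's `re` module is used in place of the regular expression flavor that SRX processors normally use.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical SRX-style fragment for illustration only.
SRX_RULES = r"""
<languagerule languagerulename="Example">
  <rule break="no">
    <beforebreak>\b(Dr|No)\.</beforebreak>
    <afterbreak>\s</afterbreak>
  </rule>
  <rule break="yes">
    <beforebreak>[.?!:]</beforebreak>
    <afterbreak>\s+[A-Z]</afterbreak>
  </rule>
</languagerule>
"""

def load_rules(xml_text):
    """Return (is_break, before_regex, after_regex) tuples in file order."""
    rules = []
    for rule in ET.fromstring(xml_text).iter("rule"):
        before = rule.findtext("beforebreak") or ""
        after = rule.findtext("afterbreak") or ""
        rules.append((
            rule.get("break", "yes") == "yes",
            re.compile("(?:" + before + r")\Z"),  # must end at the position
            re.compile(after),                    # must start at the position
        ))
    return rules

def segment(text, rules):
    """At each position, the first rule whose two patterns match decides."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in rules:
            if before.search(text, 0, pos) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # first matching rule wins
    segments.append(text[start:].strip())
    return segments

text = "Dr. Smith said: Wait. No. 5 is open."
with_colon = segment(text, load_rules(SRX_RULES))

# Deleting ":" from <beforebreak> stops segmentation after a colon.
without_colon = segment(text, load_rules(SRX_RULES.replace("[.?!:]", "[.?!]")))
print(with_colon)
print(without_colon)
```

Because the break="no" rule is listed first, "Dr." and "No." never end a segment, and removing the colon from the break rule merges "Dr. Smith said:" with the sentence that follows it.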

Upload the New Segmentation Rules to Memsource

  1. Go to Setup and click Segmentation in the Project Settings section. Then, hit the New button.
  2. Select the Language, and assign a Name (e.g. "New Segment after Semicolon"). Then, choose the modified SRX file. Check the Primary checkbox only if you want to make the custom segmentation your primary segmentation for the language. Hit the Create button.
  3. If the upload succeeds, the message "The segmentation file has been uploaded successfully" will appear, and the new file will be listed on the Segmentation page.

Use New Segmentation Rules for Job Import

In order to use your new segmentation rules, go to your project and select the New button to create a new job. In File Import Settings, expand Segmentation and select your custom segmentation rule. Hit the Create button to add job(s) to your project, segmented with your custom segmentation rules.

 

Setting the Custom Segmentation as Default

If you set the new segmentation rule as Primary (in Setup), it will be selected by default for all new jobs imported for that source language. See File Import Settings for more information. You can also set specific segmentation rules for specific projects by creating Project Templates.