File Import settings

Regexp

A regular expression (abbreviated regex or regexp) is a sequence of characters that forms a search pattern, used mainly for pattern matching in strings. It works in a similar way to "find and replace" operations. You can find more details in the Wikipedia article on regular expressions.

Regexps can be used in the Search and Replace fields in the Memsource Editor. They can also be used on the general Setup page of Memsource: scroll down to the Project Settings section and click File Import Settings. Finally, they can be used to customize the segmentation rules (see the sections below).

Basic Special Characters and General Examples:

Character Description
\ Escapes a metacharacter listed in this table so that it is matched literally instead of with its special meaning. For example, \[ searches for the [ bracket.
. Matches any single character.
[ ] A bracket expression; matches a single character that is contained within the brackets.
[^ ] Matches a single character that is not contained within the brackets.
* Matches the preceding element zero or more times.
? Matches the preceding element zero or one time.
+ Matches the preceding element one or more times.
| The choice operator OR. Matches either the expression before or the expression after the operator, and is used to combine several regexps. For example, <[^>]+>|\{[^\}]+\} matches all <html> code and all {variables}.
& The AND operator matches the expression before and the expression after the operator.
{ } Range quantifiers.
( ) Grouping of expressions.
< > Anchors that specify a left or right word boundary.
- Range in a character class (for example [A-Z])
$ End of a line.

 

The following examples can be used to convert text into tags when importing files, and in the Search and Replace functions of the Memsource Editor:

Example Description
<[^>]+> represents <html_tag>
\{[^\}]+\} represents {variable}
\[[^\]]+\] represents [variable]
\[\[.+?\]\] represents [[aa[11]bb]]
\$[^\$]+\$ represents $operator_Name1$
\d+ represents numbers (also [0-9]+)
[A-Za-z0-9] represents any alphanumeric character.
.+\@.+\..+ an email address such as name@domain.com
\d{4}[-]\d{2}[-]\d{2} a date such as 2018-08-01
\s$ a whitespace at the end of the segment
^\s a whitespace at the beginning of the segment
\s\s a double whitespace
^\d a digit at the beginning of the segment
\w+\s\s\w+ a double whitespace between words
\s\n a newline preceded by any whitespace character
\S\n a newline preceded by any non-whitespace character
<[^>]+>|\$[^=]+= converts php variables and html code ($svariable['name'] =)
^\s*\'[^:]+: converts a JavaScript field key preceded by whitespace at the beginning of the line ( 'key' :)
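
Several of the patterns above can be tried out locally before they are entered in Memsource. The following is a minimal sketch that uses Python's re module as a stand-in for Java regexp (the patterns shown here are written the same way in both). The helper name matches and the sample strings are illustrative assumptions, not part of Memsource.

```python
import re

def matches(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return [m.group(0) for m in re.finditer(pattern, text)]

# Sample strings are made up for illustration only.
print(matches(r"<[^>]+>|\{[^\}]+\}", "Hello <b>{name}</b>"))      # HTML tags and {variables}
print(matches(r"\[\[.+?\]\]", "see [[aa[11]bb]] now"))            # double square brackets
print(matches(r"\$[^\$]+\$", "value $operator_Name1$ end"))       # $variable$
print(matches(r"\d{4}[-]\d{2}[-]\d{2}", "Released 2018-08-01."))  # date
print(matches(r"\w+\s\s\w+", "a double  space here"))             # double whitespace between words
print(matches(r"\s$", "trailing space "))                         # whitespace at the end
```

Each call prints the list of substrings that the pattern would pick up, which makes it easy to spot a pattern that matches too much or too little.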

See more details and examples on Wikipedia.

 

Important: Memsource supports Java regexp, but rejects complex regular expressions to protect our system from overloading. Complex regexps are those with quantifiers (except possessives) on groups that contain other quantifiers (except possessives). For example, (\d+)* places a quantifier on a group that itself contains a quantifier.

Convert to Memsource tags

This feature is available in many File Import settings (MS Word, TXT, etc.). It converts special text, for example {variable}, into non-translatable tags. Please note that only "stand-alone" tags can be created. Paired tags are not available.
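
To illustrate what "stand-alone" means, the sketch below replaces every match with its own numbered placeholder. It is only an approximation written in Python: the function name to_tags and the {1}, {2} placeholder notation are assumptions made for this example and do not describe how Memsource stores tags internally.

```python
import re

# Pattern from the examples table: HTML tags or {variables}.
TAG_PATTERN = re.compile(r"<[^>]+>|\{[^\}]+\}")

def to_tags(text):
    """Replace each match with a numbered stand-alone placeholder.

    Returns the text with placeholders and the list of original contents.
    """
    contents = []

    def replace(match):
        contents.append(match.group(0))
        return "{" + str(len(contents)) + "}"  # hypothetical placeholder format

    return TAG_PATTERN.sub(replace, text), contents

tagged, contents = to_tags("Hello <b>{name}</b>!")
print(tagged)
print(contents)
```

Note that the opening <b> and the closing </b> end up as two unrelated placeholders, which mirrors the limitation that paired tags are not available.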

TXT Import

Examples of importing only specific text, using error messages as sample content:

  1. ## ErrorMessage ##1## The number must be higher than 0. ##Z##
    To import text between ##1## and ##Z## use regexp: (?<=##1## ).*(?= ##Z##)

  2. ErrorMessage ("The number must be higher than 0.")
    To import text between (" and ") use regexp: (?<=\(").*(?="\))

  3. 'errorMessage' = 'The number must be higher than 0.'
    To import text after the = sign and between ' and ' use regexp: (?<=\= ').*(?=')

  4. msgstr "The number must be higher than 0."
    To import msgstr strings in so-called monolingual PO files using a TXT filter, use regexp: (?<=msgstr ").*(?=")

  5. # Note: This is a note
    To exclude lines starting with # use regexp: (^[^#].*)

  6. values '126', 'DCeT', 'Text (en)'
    To import only text in quotes and with (en), such as 'Text (en)', use regexp: (?<=')[^']*\(en\)(?=')
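
The lookbehind (?<=...) and lookahead (?=...) patterns above can be checked against sample lines before they are used in the import settings. The following is a minimal sketch in Python, whose re module handles these particular patterns the same way as Java regexp. The helper name extract is an assumption for this example, and the msgstr line is written in PO style without parentheses.

```python
import re

def extract(pattern, line):
    """Return the part of line selected by pattern, or None if nothing matches."""
    match = re.search(pattern, line)
    return match.group(0) if match else None

message = "The number must be higher than 0."

# Each pair is (regexp, sample line).
cases = [
    (r"(?<=##1## ).*(?= ##Z##)", "## ErrorMessage ##1## " + message + " ##Z##"),
    (r'(?<=\(").*(?="\))', 'ErrorMessage ("' + message + '")'),
    (r"(?<=\= ').*(?=')", "'errorMessage' = '" + message + "'"),
    (r'(?<=msgstr ").*(?=")', 'msgstr "' + message + '"'),
    (r"(^[^#].*)", "# Note: This is a note"),
    (r"(?<=')[^']*\(en\)(?=')", "values '126', 'DCeT', 'Text (en)'"),
]

for pattern, line in cases:
    print(pattern, "->", extract(pattern, line))
```

The fifth case prints None because the line starts with # and is therefore excluded from import.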

JSON Import

If the JSON structure is:

{
  "list": {
    "id": "1",
    "value": "text 1 for translation."
  },
  "text": {
    "id": "2",
    "value": "text 2 for translation."
  },
  "menu": {
    "id": "3",
    "value": "text 3 for translation."
  }
}

  • To import every value regardless of the level, use: (^|.*/)value
  • To import only the value from list, use: list/value
  • To import the value from list and/or menu, use the | (OR) operator: list/value|menu/value
  • To import only the first instance of a value from menu, use: menu\[1\]/value
  • To import the content of a JSON array following a certain key, use: (^|.*/)key\[.*\]
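
One way to reason about these expressions is to picture every value in the file as having a path, and the regexp as being matched against that path. The Python sketch below works under assumptions inferred from the examples above and not confirmed by Memsource: keys are joined with /, array items are written as key[n] counting from 1, and the regexp must match the whole path. The names flatten and select_values are hypothetical.

```python
import json
import re

def flatten(node, path=""):
    """Yield (path, value) pairs for every scalar value in a parsed JSON document."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, path + "/" + key if path else key)
    elif isinstance(node, list):
        for index, child in enumerate(node, start=1):  # assumed 1-based indexing
            yield from flatten(child, path + "[" + str(index) + "]")
    else:
        yield path, node

def select_values(document, pattern):
    """Return the values whose full path matches pattern."""
    compiled = re.compile(pattern)
    return [value for path, value in flatten(document) if compiled.fullmatch(path)]

document = json.loads("""
{
  "list": {"id": "1", "value": "text 1 for translation."},
  "text": {"id": "2", "value": "text 2 for translation."},
  "menu": {"id": "3", "value": "text 3 for translation."}
}
""")

print(select_values(document, r"(^|.*/)value"))
print(select_values(document, r"list/value|menu/value"))
```

Under these assumptions the id values are never selected, because none of their paths ends in value.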

YAML Import

YAML file example:

title: A
text: translate A
categories:
  title: B
  text: translate B
categories:
  title: C
  text: translate C
categories:
  content:
      title: D
      text: translate D

Regexp for importing:

  • only 'translate A': text
  • only 'translate C': categories\[2\]/text
  • only 'translate D': categories\[\d+\]/content\[1\]/text
  • all text: text|categories\[\d+\]/text|categories\[\d+\]/content\[\d+\]/text
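
The Python sketch below shows how such expressions would select entries from the example, with every [ and ] around an index escaped as \[ and \]. The path assigned to each value is an assumption inferred from the expressions themselves (for example, that 'translate C' lives at categories[2]/text); it is not produced by a YAML parser and is not confirmed by Memsource. The names PATHS and select_yaml are hypothetical.

```python
import re

# Assumed path of each value in the YAML example (hand-written, not parsed).
PATHS = {
    "title": "A",
    "text": "translate A",
    "categories[1]/title": "B",
    "categories[1]/text": "translate B",
    "categories[2]/title": "C",
    "categories[2]/text": "translate C",
    "categories[3]/content[1]/title": "D",
    "categories[3]/content[1]/text": "translate D",
}

def select_yaml(pattern):
    """Return the values whose full path matches pattern."""
    compiled = re.compile(pattern)
    return [value for path, value in PATHS.items() if compiled.fullmatch(path)]

print(select_yaml(r"text"))
print(select_yaml(r"categories\[2\]/text"))
print(select_yaml(r"categories\[\d+\]/content\[1\]/text"))
print(select_yaml(r"text|categories\[\d+\]/text|categories\[\d+\]/content\[\d+\]/text"))
```

An unescaped [ would start a character class instead of matching a literal bracket, so the index would no longer be matched as intended.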

Segmentation Rules

For segmentation rules in SRX files, we use Okapi, Java, and Unicode regexp.

Working with regexp in an SRX file is complex and requires at least basic knowledge of regular expressions.

The SRX file contains Nobreak rules (abbreviations, etc.) and Break rules (end of a sentence with a dot, etc.).

Example Description
[\p{C}] Invisible control character.
[\p{Z}] Whitespace
[\p{Lu}] An uppercase letter that has a lowercase variant.
[\p{N}] Any kind of numeric character.
\Q ... \E Start and end of a literal quotation, for example \QApprox.\E. This is used for abbreviations.
\t Tabulator
\n Newline
\u2029 Paragraph separator
\u200B Zero-width space
\u3002 Ideographic full stop
\ufe52 Small full stop
\uff0e Fullwidth full stop
\uff61 Halfwidth ideographic full stop
\ufe56 Small question mark
\uff1f Fullwidth question mark
\u203c Double exclamation mark
\u2048 Question exclamation mark
\u2762 Heavy exclamation mark ornament
\u2763 Heavy heart exclamation mark ornament
\ufe57 Small exclamation mark
\uff01 Fullwidth exclamation mark
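
The interplay of Nobreak and Break rules can be sketched in a few lines of Python. This is a simplified model and not the Okapi or Memsource implementation. It assumes that each rule has a pattern for the text before a position and a pattern for the text after it, and that the first matching rule decides. Python's re module has no \Q...\E or \p{...}, so re.escape and \u escapes are used in their place. The names RULES and segment are hypothetical.

```python
import re

# (is_break, pattern before the position, pattern after the position)
RULES = [
    (False, re.escape("Approx."), r"\s"),     # Nobreak: abbreviation, like \QApprox.\E
    (True, r"[.?!]", r"\s"),                  # Break: sentence end followed by whitespace
    (True, r"[\u3002\uff01\uff1f]", r""),     # Break: ideographic or fullwidth punctuation
]

def segment(text):
    """Split text at every position where a Break rule is the first rule to match."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in RULES:
            before_ok = re.search("(?:" + before + r")\Z", text[:pos])
            after_ok = re.match(after, text[pos:])
            if before_ok and after_ok:
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # the first matching rule decides
    segments.append(text[start:])
    return segments

print(segment("Approx. 5 items left. Buy now!"))
```

Because the Nobreak rule for the abbreviation is listed first, the dot in "Approx." does not end the segment, while the dot after "left" does.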