Sometimes, when you upload an MS Word document, you'll notice that the source is bloated with numerous inline tags that you cannot interpret (that is, when you look at the text in MS Word, you do not see any corresponding formatting changes). You may even get an error message such as "TOO_LARGE_INPUT: Too large tag metadata of source text of segment no. ..."
In most cases, this problem is the result of working with converted PDF files.
Generally, tags are used to mark parts of a text that have different formatting. These formatting changes may be obvious, like different fonts, font sizes, or colors. However, sometimes they are minor or even invisible. Style changes are one example: they are not easy to spot, but they are still there. Here are some suggestions on how to deal with these tags:
- When you convert a PDF file using an OCR program, always use the option to save plain (unformatted) text, and apply any extra formatting (such as fonts, font sizes, colors, columns, and spacing) manually in MS Word.
- There are some tools available on the market that you can use to clean your Word files before importing them into Memsource. We recommend using CodeZapper by David Turner (http://asap-traduction.com/CodeZapper).
- You can use the "Minimize number of tags" option when importing your Word document. It can be found in the MS Word section under File Import Settings when creating a new job in Memsource. This option tries to remove all tags that we consider unnecessary from the source text imported for translation. However, always bear in mind that this is done by an automated algorithm, which may sometimes also remove formatting that should be preserved. Therefore, if you use this option, it is vital that you compare the completed file against the original document to make sure that no important formatting has been lost.
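
To check whether a Word file is likely to produce excessive tags before you import it, you can count how many formatting runs each paragraph contains. The following is a minimal Python sketch, not a Memsource feature. It assumes a .docx file (a ZIP archive containing word/document.xml) and treats a high number of runs (w:r elements) per paragraph as a rough indicator of fragmented formatting. The function names and the threshold of 10 are illustrative assumptions, and the run count only approximates the number of tags Memsource will actually create.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def runs_per_paragraph(docx_file):
    """Return a list of (run_count, text) tuples, one per non-empty paragraph.

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    results = []
    for para in root.iter(f"{{{W_NS}}}p"):
        runs = para.findall(".//w:r", NS)
        text = "".join(t.text or "" for t in para.findall(".//w:t", NS))
        if text.strip():
            results.append((len(runs), text))
    return results


def report_fragmented(docx_file, threshold=10):
    """Return paragraphs with at least `threshold` runs, most fragmented first.

    The threshold is an arbitrary illustrative value, not a Memsource limit.
    """
    flagged = [item for item in runs_per_paragraph(docx_file) if item[0] >= threshold]
    return sorted(flagged, key=lambda item: item[0], reverse=True)
```

For example, calling report_fragmented("source.docx") on a file of your own (the file name here is hypothetical) lists the paragraphs most likely to come from a converted PDF, which are good candidates for cleaning before import.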