- Sometimes you will upload an MS Word document and find that the source is bloated with many inline tags that you cannot interpret (that is, when you look at the text in MS Word, you do not see any corresponding formatting changes).
- You may even get an error message saying "TOO_LARGE_INPUT: Too large tag metadata of source text of segment no. ..."
In most cases, this problem is the result of working with converted PDF files.
Generally, tags are used to mark parts of the text that have different formatting. The formatting changes may be obvious, such as different fonts, font sizes, or colours, but sometimes they are minor or even invisible, such as style changes, which you cannot easily see but which are still there.
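If you want to check whether a Word file is likely to be affected before you upload it, the following is a minimal Python sketch, not a Memsource tool. It relies on the fact that a .docx file is a ZIP archive whose main text is stored in word/document.xml, where each stretch of uniformly formatted text is a "run" (a w:r element). It assumes that paragraphs split into many runs are the ones that will show many inline tags after import; how Memsource actually derives its tags is not documented here. The function names and the threshold of 10 runs are hypothetical choices made for illustration.

```python
import sys
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements inside word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def runs_per_paragraph(docx):
    """Return a list of (run_count, text), one entry per paragraph.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    results = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # Each w:r is one run of text sharing the same formatting.
        runs = paragraph.findall(".//w:r", NS)
        text = "".join(t.text or "" for t in paragraph.iterfind(".//w:t", NS))
        results.append((len(runs), text))
    return results


def report(docx, threshold=10):
    """Print paragraphs that have at least `threshold` runs (arbitrary cut-off)."""
    for count, text in runs_per_paragraph(docx):
        if count >= threshold:
            print(f"{count:4d} runs: {text[:60]}")


if __name__ == "__main__":
    if len(sys.argv) > 1:
        report(sys.argv[1])
```

Saved under a name of your choice, for example count_runs.py, the script can be run as `python count_runs.py document.docx`. A short sentence that is reported with dozens of runs is a good candidate for cleaning before import.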
Here are some suggestions on how to deal with this:
- When you convert a PDF file using an OCR programme, always use the option of saving plain or unformatted text, and then apply all formatting (fonts, font sizes, colours, columns, spacing, ...) manually in MS Word.
- There are tools available that you can use to clean your Word files before importing them into Memsource. We especially recommend CodeZapper by David Turner (http://asap-traduction.com/CodeZapper).
- You can use the "Minimize number of tags" option, which you will find under File Import Settings - MS Word when creating a new job in Memsource, i.e. when you select the file or files for translation. This option tries to remove all tags that we consider unnecessary from the source text imported for translation. However, bear in mind that this is done by an automated algorithm, which may sometimes also remove formatting that should be preserved. Therefore, if you use this option, it is vital that you compare the completed file against the original document to make sure that no important formatting has been lost.