Document to Structure History of Changes

    See also the history of changes for Name to Structure, since those changes also affect Document to Structure.

    Document to Structure 23.5

    Improvements

    • Standardization has been improved for the structures in the naming dictionary, e.g. consistent numbering for R-groups, and unified representation for common functional groups like "nitro" and for attachment points of groups.
    • The naming dictionary coverage has been greatly extended, and existing entries have been revised and improved. The dictionary now contains 266K name-structure pairs instead of 213K.
    • Spelling and lower/upper case handling are now unified for names in the naming dictionary.

    Document to Structure 22.1

    Removal

    • Chinese Naming and Japanese Naming are retired as standalone products.
    • The Chinese and Japanese Name to Structure conversion function now requires a valid Name to Structure or Document to Structure license.

    Document to Structure 21.4

    Deprecation and removal

    • The DocumentExtractor class (which has been deprecated for many years) has been removed. We have a detailed guide to help you in migrating to the MolImporter API.
    • The DocumentToStructure class has been moved to a new package (from chemaxon.naming to chemaxon.naming.document). It is still available in its old package, but it is deprecated there and will be removed in the future.

    Document to Structure 21.3

    Improvements

    • Major dependency updates, improved document processing

    Document to Structure 20.15

    Improvements

    • Common names followed by ' and " are now recognized.

    Document to Structure 20.13

    Improvements

    • When the text of a document contains an invalid InChI string (such as InChI=<INCHI_KEY>), it is now logged as a simple WARNING event, instead of one WARNING and one ERROR, each with long stacktraces.

    Document to Structure 20.9

    Bug fixes

    • Deuterium and Tritium atoms that were previously detected by Optical Structure Recognition tools but represented by simple aliases are now represented as chemically meaningful atoms.

    Document to Structure 20.6

    Improvements

    • Experimental support has been added for the Molvec Optical Structure Recognition tool. It can be activated using the "d2s:+molvec" format option. This support is expected to work on individual chemical images (e.g. a structure image in a text PDF), but not on mixed images (e.g. a full page scanned PDF).

    Bug fixes

    • Annotation was failing on some documents when using OSRA for image recognition because of incorrect coordinates returned by OSRA.

    Document to Structure 20.5

    Bug fixes

    • OSRA execution was failing on Windows.

    Document to Structure 19.25

    Bug fixes

    • PDF documents preprocessed by ABBYY OCR appeared blurry.

    Document to Structure 19.14

    Bug fixes

    • Some CDX and MOL files linked from patent XML were ignored.

    • When annotating patent XML, the size of some images had an invalid unit.

    • Some annotated patent XML contained incorrect HTML markup for chemical structures.

    Document to Structure 19.12

    Bug fixes

    • When TextToStructure reached a timeout, processing would sometimes continue in the background, using CPU and memory resources longer than necessary.
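
    The general pattern behind this kind of fix can be sketched in plain Java. This is not the TextToStructure implementation, only an illustration of a timeout that also stops the worker; the class TimeoutSketch and the method callWithTimeout are hypothetical names.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch: run a task with a timeout and interrupt it when the timeout expires. */
public class TimeoutSketch {

    /** Returns the task result, or an empty Optional if the timeout was reached. */
    public static <T> Optional<T> callWithTimeout(Callable<T> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(task);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // Without this call the worker would keep running in the background.
            future.cancel(true);
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            throw new IllegalStateException("Interrupted while waiting for the task", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Task failed", e.getCause());
        } finally {
            // Release the worker thread in every case.
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        Optional<String> fast = callWithTimeout(() -> "done", 1000);
        Optional<String> slow = callWithTimeout(() -> {
            Thread.sleep(5000); // stands in for long-running processing
            return "too late";
        }, 100);
        System.out.println("fast present: " + fast.isPresent());
        System.out.println("slow present: " + slow.isPresent());
    }
}
```

    Note that cancel(true) only interrupts the worker thread; the task itself has to react to the interruption (as Thread.sleep does here) for the resources to be released promptly.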

    Document to Structure 19.9

    Improvements

    • OCR is no longer performed when text processing is disabled in the Document to Structure options.

    Bug fixes

    • Some OSR hits were displayed on the wrong page of PDF documents in version 19.8.

    Document to Structure 19.7

    Improvements

    • Optical Structure Recognition is now supported on PDF documents with structures drawn using vector graphics.

    Bug fixes

    • Invalid HTML was generated in rare cases.

    Document to Structure 18.30

    Bug fixes

    • Document annotator could abort in some rare situations.

    Document to Structure 18.26

    Improvements

    • Names can now be recognized inside double brackets.

    Bug fixes

    • Names followed by multiple ending characters had an incorrect "end" position property.

    Document to Structure 18.23

    Bug fixes

    • Temporary files were sometimes left over when using the Imago OSR tool.

    • Some images were not OCR'ed in PDF documents that had already been partially OCR'ed.

    Document to Structure 18.20

    Bug fixes

    • OCR was not performed on some scanned PDF documents.

    Document to Structure 18.1

    Improvements

    • The processing of images in the JBIG2 format in PDF documents for OCR and OSR has been re-enabled.

    Bug fixes

    • Asian names with Unicode superscripts and Zero Width Space characters were not recognized.

    Document to Structure 17.25

    Bug fixes

    • The timeout option was ignored when using the chemaxon.naming.document.TextToStructure API.

    Document to Structure 17.24

    Bug fixes

    • Processing was getting stuck in the presence of some names with a very large number of brackets.

    Document to Structure 17.23

    Bug fixes

    • The processing of images in the JBIG2 format in PDF documents for OCR and OSR has been temporarily disabled. It will be re-enabled by default once we require Java 8, which we expect to do in a few months' time. In the meantime, if this feature is desired, a Java ImageIO JBIG2 plugin can be added to the classpath to enable it. Please contact us if you have any issues with this.

    Document to Structure 17.16

    Bug fixes

    • Some incorrect SMILES strings were detected.

    Document to Structure 17.14

    Improvements

    • Complex IUPAC names are now detected even inside a bracketed subsentence.

    Document to Structure 17.13

    Improvements

    • The new version of method TextToStructure.extract accepts a DocumentToStructureOptions instead of an option string, allowing higher-level usage and faster processing.

    Document to Structure 17.12

    Improvements

    • Processing of short text using chemaxon.naming.document.TextToStructure is about 10% faster.

    • Class chemaxon.naming.document.D2S.Options is renamed to chemaxon.naming.document.DocumentToStructureOptions.

    Document to Structure 17.11

    Improvements

    • When processing short XML fragments, passing the option content-type=text/xml now leads to higher performance by skipping format detection entirely.

    Bug fixes

    • The content-type option was ignored in chemaxon.naming.document.TextToStructure.

    Document to Structure 17.03.13

    New features

    • The "Preparation N" identifier is now automatically detected in patents, in addition to exemplified compounds.

    Document to Structure 17.02.20

    Improvements

    • PDF documents OCRed by PDF-XChange are not OCRed a second time, avoiding duplicate results and resulting in much faster processing.

    Document to Structure 17.01.30

    Bug fixes

    • The 'end' character position property was missing or incorrect in names delimited by ":" characters.

    Document to Structure 16.09.26

    Improvements

    • Structures mentioned in plural, for instance "pyrimidines", are now tagged with the "generic" type.

    • The format of HTML documents can now be detected from the file contents only.

    Document to Structure 16.08.29

    Bug fixes

    • The -groups option did not work for common names.

    Document to Structure 16.08.08

    Improvements

    • The format of Microsoft Office documents can now be detected from the file contents only.

    Document to Structure 16.08.01

    Bug fixes

    • A trailing name between brackets, often a synonym of the previous name, for instance "retinoic acid (vitamin A)", was sometimes misrecognized as a part of the previous name.

    Document to Structure 16.07.25

    Improvements

    • Example numbers are now detected in Chinese language patents.

    Bug fixes

    • 水 (water) is now recognized as a vernacular term in Chinese and Japanese.

    Document to Structure 16.07.11

    Improvements

    • More syntaxes for example numbers are detected in patents.

    Bug fixes

    • A spurious warning was logged when recognizing an InChI string in a document.

    Document to Structure 16.05.02

    Bug fixes

    • Some Japanese names were not recognized without the Chinese Name to Structure license.

    Document to Structure 16.04.11

    Bug fixes

    • Some Asian names were wrongly starting with a closing bracket.

    Document to Structure 16.03.07

    Improvements

    • The detection of Asian names in documents is improved.

    Bug fixes

    • Some invalid structures returned by an OSR tool could interrupt Document to Structure processing. They are now simply logged and processing continues.

    Document to Structure 16.01.18

    Bug fixes

    • Asian names followed by some number formats were not recognized.

    Document to Structure 16.01.04

    Improvements

    • Some names with spurious whitespace are better detected in documents.

    Document to Structure 15.12.14

    Improvements

    • When detecting exemplified compound numbers in patents, more syntaxes are now detected.

    Document to Structure 15.10.26

    Bug fixes

    • When processing Office documents with Optical Structure Recognition enabled, the following log message was issued: WARNING: Skipping non-existing image.

    Document to Structure 15.09.07

    Bug fixes

    • The warning "JBIG2ReadParam not specified. Default will be used.", which appeared in the logs, has been fixed.

    Document to Structure 15.07.20

    Improvements

    • The detection of Asian names is improved in some documents.

    Document to Structure 15.06.29

    Bug fixes

    • Processing of scanned PDFs (using OCR - Optical Character Recognition) was failing on Mac OS X.

    Document to Structure 15.06.01

    Improvements

    • Names that contain extra spaces (for instance because of OCR or new lines without a - character before the break) are now better supported.

    • Patent PDFs provided by LexisNexis's Univentio can now be processed using the built-in OCR text instead of running OCR again, leading to about 10 times faster processing. This was already working for older Univentio PDFs but is now also supported for recent ones.

    Document to Structure 15.05.25

    Improvements

    • Improvements in OCR error correction.

    Document to Structure 15.05.18

    Improvements

    • Significant improvements are included in OCR error correction, especially for patents based on scanned images.

    Document to Structure 15.05.04

    New features

    • Exemplified compound numbers are now extracted automatically when they are mentioned directly before the IUPAC name.

    Improvements

    • Improved detection of Asian names in documents.

    Document Annotator 15.04.20

    Bug fixes

    • Some TIFF images failed to be processed.

    Document to Structure 15.03.09

    Bug fixes

    • Certain "custom" image types in PDFs failed to be processed by OSR tools.

    Document to Structure 15.02.23

    New features

    • The location of OSR tools can now be specified by using Java system properties: chemaxon.naming.clide.path, chemaxon.naming.osra.path and chemaxon.naming.imago.path. These take precedence over the environment variables (CLIDE, OSRA and IMAGO), which are also supported.
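
    The precedence described above can be sketched in plain Java. This is only an illustration, not the product's own lookup code: the class OsrToolPath and its resolve method are hypothetical names, and treating blank values as "not set" is an assumption. Only the property and environment variable names come from the entry above.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Properties;

/** Hypothetical helper illustrating "system property first, environment variable second". */
public class OsrToolPath {

    /**
     * Returns the configured tool location, preferring the Java system property
     * over the environment variable. Blank values count as "not set" (assumption).
     */
    public static Optional<String> resolve(String propertyName, String envName,
                                           Properties props, Map<String, String> env) {
        String fromProperty = props.getProperty(propertyName);
        if (fromProperty != null && !fromProperty.trim().isEmpty()) {
            return Optional.of(fromProperty);
        }
        String fromEnv = env.get(envName);
        if (fromEnv != null && !fromEnv.trim().isEmpty()) {
            return Optional.of(fromEnv);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // A real lookup reads the JVM's own properties and process environment.
        Optional<String> osra = resolve("chemaxon.naming.osra.path", "OSRA",
                System.getProperties(), System.getenv());
        System.out.println("OSRA location: " + osra.orElse("(not configured)"));
    }
}
```

    A system property of this kind is typically passed on the command line, for example java -Dchemaxon.naming.osra.path=/path/to/osra ..., where /path/to/osra is a placeholder.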

    Improvements

    • The detection of Japanese names in documents has been improved.

    Document to Structure 15.01.26

    Improvements

    • In the output of Optical Structure Recognizers, some non-standard labels such as X2 are now interpreted as R-groups. Those structures also become representable as SMILES.

    • Some aliases detected by Optical Structure Recognition programs are expanded to the corresponding chemical group.

    Document to Structure 15.01.19

    Improvements

    • Encrypted documents are now reported with a clearer message and a specific single exception: chemaxon.naming.document.EncryptedDocumentException

    Bug fixes

    • Failure to start OCR could lead to multiple entries in the log.

    Document to Structure 15.01.12

    Bug fixes

    • A failure in writing temporary images to disk led to duplicate entries in the log.

    Document to Structure 15.01.05

    Improvements

    • Unknown format options now lead to a failure instead of being ignored.

    • Numbers in subscript and superscript HTML and XML tags (<sub> and <sup>) are now interpreted.

    Bug fixes

    • Japanese and Chinese names were not recognized in documents when followed by a fullwidth number inside brackets.

    Document to Structure 14.12.15

    Bug fixes

    • Assignee metadata was not extracted from some recent USPTO XML patents.

    Document to Structure 14.12.01

    Improvements

    • Numeric character entities in metadata of XML documents are now decoded.
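
    For readers unfamiliar with numeric character entities, the decoding step can be sketched as follows. This is a minimal illustration in plain Java, not the code used by Document to Structure; the class NumericEntities and its decode method are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: decode numeric character references such as &#233; or &#xE9;. */
public class NumericEntities {

    // Group 1 captures hexadecimal digits, group 2 captures decimal digits.
    private static final Pattern ENTITY =
            Pattern.compile("&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));");

    public static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = m.group(); // keep the entity as-is if it cannot be decoded
            try {
                int codePoint = (m.group(1) != null)
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2), 10);
                if (Character.isValidCodePoint(codePoint)) {
                    replacement = new String(Character.toChars(codePoint));
                }
            } catch (NumberFormatException e) {
                // The number does not fit in an int: leave the entity untouched.
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up metadata value used only for illustration.
        System.out.println(decode("M&#252;ller &#x26; Co."));
    }
}
```

    Named entities such as &amp; are deliberately left alone here, since the entry above only concerns numeric ones.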

    Document to Structure 14.11.24

    Bug fixes

    • Some SMILES-like words containing "Br" or "Cl" were wrongly detected as SMILES.

    Document to Structure 14.11.17

    New features

    • The structures recognized by OSR tools (CLiDE and OSRA) are now represented in a uniform way, for instance using real R-group atoms instead of just aliases.

    Bug fixes

    • Some OLE structures embedded by JChem for Office were extracted as empty structures.

    Document to Structure 14.09.29

    Bug fixes

    • A failure to process some PowerPoint documents was fixed.

    Document to Structure 14.09.22

    Improvements

    • Logging messages are improved to facilitate investigating problematic documents.

    Document to Structure 14.09.15

    Improvements

    • Detection of Chinese chemical names in documents was improved.

    • Names (especially Chinese) are better detected in XML patent files from IFI Claims.

    Document to Structure 14.09.08

    Bug fixes

    • The molecule name extracted from Chinese and Japanese documents was in a few cases not the actual name but another close-by word.

    • Some invalid structures were generated from some Chinese and Japanese documents with unusual character combinations.

    Document to Structure 14.09.01

    New Features and Improvements

    • The embedded OCR information in PDFs from the TotalPatent system (by LexisNexis) is now detected and used automatically. This leads to massively faster processing for such PDFs, from 10 to 30 times faster!

    Document to Structure 14.8.25

    New Features and Improvements

    • Names split between two paragraphs can now be detected and converted to structures. While ideally this situation should not occur in proper semantic documents, it does occur in automatically generated ones, for instance in patent documents distributed by patent providers.

    • The context field can now include text from a previous paragraph. This is useful in particular in patents, where this sometimes includes the example number.

    Bug fixes

    • Google patent PDFs were only processed up to page 20.

    Document to Structure 14.8.18

    Bug fixes

    • Processing an HTML file with CLiDE enabled failed when an image contained no structure.

    Document to Structure 14.8.4

    New Features and Improvements

    • Added the osrResolution=N option to specify the resolution in DPI of images processed by OSRA. This is only needed for single image files, not for images inside PDF, Office or HTML documents.

    Bug fixes

    • The character encoding of HTML5 documents was not detected automatically.

    Document to Structure 14.7.7

    New Features and Improvements

    • Chinese and Japanese names are better detected in documents.

    • Optical Structure Recognition is now performed on images referenced in HTML documents. This is supported when the image is accessible as a file (relative to the HTML document file), and when the image is represented by a data URI.

    • The BMP image format is now supported for calling Optical Structure Recognition tools using the Document to Structure API.

    • The ending position of the names in text documents (HTML, XML, TXT) is now also included in the properties of returned structures.
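
    As background for the data URI case mentioned above, an image embedded in HTML as a base64 data URI can be turned back into raw image bytes roughly as follows. This is a generic sketch, not the Document to Structure implementation; the class DataUriImage and its members are hypothetical names, and only base64-encoded URIs are handled.

```java
import java.util.Base64;

/** Hypothetical sketch: extract the media type and bytes from a base64 data URI. */
public class DataUriImage {

    public final String mediaType;
    public final byte[] bytes;

    private DataUriImage(String mediaType, byte[] bytes) {
        this.mediaType = mediaType;
        this.bytes = bytes;
    }

    public static DataUriImage decode(String uri) {
        if (!uri.startsWith("data:")) {
            throw new IllegalArgumentException("Not a data URI");
        }
        int comma = uri.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("Missing ',' between header and payload");
        }
        String header = uri.substring("data:".length(), comma); // e.g. "image/png;base64"
        String payload = uri.substring(comma + 1);
        String suffix = ";base64";
        if (!header.endsWith(suffix)) {
            throw new IllegalArgumentException("Only base64 data URIs are handled in this sketch");
        }
        String mediaType = header.substring(0, header.length() - suffix.length());
        return new DataUriImage(mediaType, Base64.getDecoder().decode(payload));
    }

    public static void main(String[] args) {
        // The payload is just the 8-byte PNG signature, not a complete image.
        DataUriImage image = decode("data:image/png;base64,iVBORw0KGgo=");
        System.out.println(image.mediaType + ", " + image.bytes.length + " bytes");
    }
}
```

    The decoded bytes could then be written to a temporary file or passed to an image reader before any structure recognition step.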

    Bug fixes

    • Some embedded structures in PowerPoint documents were not extracted.

    Document to Structure 6.3.0

    New Features and Improvements

    • OSRA 2.0 is supported.

    • A format option has been introduced that lets the user select an optical structure recognition tool (CLiDE, OSRA or Imago) when more than one is installed on the computer. For instance, d2s:osra will request OSRA to be used.

    Bug fixes

    • The encoding of HTML files was not always detected from the META tag.

    • When the extraction of OLE embedded structures from Office documents was disabled using the d2s:-ole format option, the optical structure recognition of images was disabled as well.

    • The character position field of structures was sometimes higher by a few characters when extracted from HTML documents with CRLF (\r\n) line endings.

    • The CLiDE optical structure recognition tool was not automatically detected on 64-bit versions of Windows.

    • When using OSRA, some structures with implicit hydrogens were wrongly filtered out.

    • After processing only a part of an HTML document, the processing of the next HTML document could give some incorrect results.
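
    The CRLF entry above concerns character positions that were off by a few characters. The entry does not state the cause, but one common way such a shift arises is that each \r\n pair counts as two characters in the raw file and as one character after line endings are normalized. The sketch below illustrates that arithmetic only; the class CrlfOffsets and its method are hypothetical names.

```java
/** Hypothetical sketch: map an offset in CRLF text to the offset after CRLF -> LF normalization. */
public class CrlfOffsets {

    /**
     * Returns the position the same character has once every "\r\n" located
     * entirely before rawOffset has been collapsed to a single "\n".
     */
    public static int toNormalizedOffset(String raw, int rawOffset) {
        if (rawOffset < 0 || rawOffset > raw.length()) {
            throw new IndexOutOfBoundsException("offset " + rawOffset);
        }
        int crlfBefore = 0;
        for (int i = 0; i + 1 < rawOffset; i++) {
            if (raw.charAt(i) == '\r' && raw.charAt(i + 1) == '\n') {
                crlfBefore++;
            }
        }
        return rawOffset - crlfBefore;
    }

    public static void main(String[] args) {
        String raw = "<p>\r\nwater\r\nethanol</p>";
        int rawOffset = raw.indexOf("ethanol");
        // Two CRLF pairs precede the name, so the two offsets differ by 2.
        System.out.println(rawOffset + " -> " + toNormalizedOffset(raw, rawOffset));
    }
}
```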