Document to Structure is a toolkit for extracting chemical structures from text, HTML, and PDF documents. It currently recognizes names, SMILES, and InChI. Its API class is chemaxon.naming.DocumentExtractor. The real-life use cases and code examples listed here show the various ways to use it:
Uses the processPlainText() method to process a string.
Downloads a live webpage and processes it using a DocumentExtractor instance.
Uses a DocumentExtractor instance that reads the text from the PDF document.
Finds the recognized names in the HTML code and wraps them with a special element for highlighting.
Saves the results and related information into a multi-molecule file for use in chemical editors.
Sets up a database connection and stores the hits in a chemical structure database for searching.
Uses multithreading and breaks HTML pages into fragments.
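As an illustration of the highlighting use case, the following is a minimal sketch of wrapping recognized names in a marker element. It deliberately does not call the ChemAxon API: the Hit class, the highlight() helper, and the chem-hit class name are hypothetical stand-ins, and the sketch assumes each hit is available as a start offset plus the matched text. In a real application those values would come from DocumentExtractor.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class HitHighlighter {

    // Hypothetical stand-in for an extraction hit (not the ChemAxon class):
    // the start offset in the source and the matched text.
    static final class Hit {
        final int position;
        final String text;

        Hit(int position, String text) {
            this.position = position;
            this.text = text;
        }
    }

    // Wraps every hit in a <span> so a stylesheet can highlight it.
    static String highlight(String source, List<Hit> hits) {
        // Process hits from left to right, whatever order they arrive in.
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingInt((Hit h) -> h.position));

        StringBuilder out = new StringBuilder();
        int cursor = 0; // index of the first character not yet copied
        for (Hit hit : sorted) {
            int end = hit.position + hit.text.length();
            // Skip hits that overlap an earlier one or fall outside the source.
            if (hit.position < cursor || end > source.length()) {
                continue;
            }
            out.append(source, cursor, hit.position);
            out.append("<span class=\"chem-hit\">")
               .append(source, hit.position, end)
               .append("</span>");
            cursor = end;
        }
        out.append(source.substring(cursor));
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>Take aspirin with caffeine.</p>";
        List<Hit> hits = List.of(new Hit(8, "aspirin"), new Hit(21, "caffeine"));
        System.out.println(highlight(html, hits));
    }
}
```

Building the output in a single left-to-right pass keeps the original offsets valid, because nothing is inserted into the source string itself. The sketch does not check whether a hit lies inside a tag or an attribute value, which a production version would need to handle.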