Document to Structure processes PDF, HTML, XML, and text files, as well as office file formats: DOC, DOCX, PPT, PPTX, XLS, XLSX, ODT. It recognizes the chemical names (IUPAC, CAS, common and drug names), SMILES strings, and InChI identifiers found in the document and converts them into chemical structures.
Document to Structure conversion uses the Name to Structure converter. For the supported names and current limitations, see the Name to Structure documentation. You can extend the Document to Structure conversion by creating a custom dictionary file.
Document to Structure can be used via the API, the command-line application (MolConverter), or MarvinView. Text mining can also be automated by using Document to Structure integrated into KNIME or Pipeline Pilot.
Chemaxon's Document to Structure toolkit is able to correct several simple OCR and syntax errors. For instance, given the incorrect name "3-rnethyl-l-me-thoxynaphthalene", it automatically corrects the name to "3-methyl-1-methoxynaphthalene" and generates the corresponding structure.
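To make the kind of correction described above concrete, here is a minimal toy sketch in Python. It is not Chemaxon's algorithm: the fragment vocabulary `KNOWN_FRAGMENTS`, the function name `correct_name`, and the three rules (a lone "l" used as a locant becomes "1", "rn" misread for "m", stray hyphens inside a known fragment) are assumptions chosen only to reproduce the example name.

```python
import re

# Hypothetical mini-vocabulary of name fragments (an assumption for this sketch);
# a real tool would rely on a far larger lexicon and grammar.
KNOWN_FRAGMENTS = ("methyl", "methoxy", "naphthalene")


def correct_name(name: str) -> str:
    """Apply three toy OCR/syntax repairs to a chemical name."""
    # 1. A lone "l" that is not preceded by a letter and is followed by a hyphen
    #    sits where a locant is expected, so read it as the digit 1.
    name = re.sub(r"(?<![A-Za-z])l(?=-)", "1", name)

    for fragment in KNOWN_FRAGMENTS:
        # 2. OCR often reads "m" as "rn"; undo that only where it yields a known fragment.
        misread = fragment.replace("m", "rn")
        if misread != fragment:
            name = name.replace(misread, fragment)

        # 3. Drop stray hyphens that split a known fragment (e.g. "me-thoxy").
        split_pattern = "-?".join(re.escape(ch) for ch in fragment)
        name = re.sub(split_pattern, fragment, name)

    return name


if __name__ == "__main__":
    print(correct_name("3-rnethyl-l-me-thoxynaphthalene"))
    # prints: 3-methyl-1-methoxynaphthalene
```

Restricting each repair to a known fragment keeps the sketch from rewriting legitimate text, for example a genuine "rn" in an unrelated word or a hyphen that separates a locant from a substituent.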
In MarvinView, open a supported file containing chemical names. MarvinView displays all the structures corresponding to the recognized names. The structures can then be saved, copied and pasted, or opened in the MarvinSketch editor.
MolConverter can be used as a command-line tool for Document to Structure conversion. Example:
Additional formatting options can be found on the Document to Structure Format Options page.
In most cases, our tools correctly auto-detect the format of an input file. For a PDF file, for example, the format is certainly D2S. A TXT file, however, can contain names, SMILES, CAS numbers, or other content, so in such cases the input format should be specified explicitly as "d2s" by using the -f option.
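The decision described above can be sketched as a small Python helper. This is only an illustration: the function name `input_format_hint` is hypothetical, and the sketch encodes nothing beyond the two cases the text names (PDF is auto-detected, TXT is ambiguous). It does not call or model any Chemaxon tool.

```python
from pathlib import Path
from typing import Optional

# Assumption: only the ambiguous case named in the text is listed here.
AMBIGUOUS_EXTENSIONS = {".txt"}  # may hold names, SMILES, CAS numbers, ...


def input_format_hint(filename: str) -> Optional[str]:
    """Return "d2s" when the format should be given explicitly, else None."""
    suffix = Path(filename).suffix.lower()
    if suffix in AMBIGUOUS_EXTENSIONS:
        return "d2s"  # plain text is ambiguous, so state the format
    return None       # e.g. a PDF: rely on auto-detection


if __name__ == "__main__":
    for name in ("patent.pdf", "notes.txt"):
        print(name, "->", input_format_hint(name))
```

A caller would pass the returned value to the input-format option only when it is not `None`.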
Document to Structure also converts chemical structures from OLE objects embedded in office documents. Such objects can be created by various chemical sketchers, such as Marvin, ChemDraw, ISIS/DRAW, SYMYX DRAW, and Accelrys Draw.
For structures represented as images in PDF or Office documents, Document to Structure can make use of several Image to Structure tools (also called Optical Structure Recognition or Chemical OCR). When such a tool is installed and successfully recognizes an image, the chemical structure becomes part of the output of Document to Structure; it can be visualized, edited, indexed and searched just like any other structure.
Currently, the supported Image to Structure tools are:
See the configuration instructions on how to make those tools recognized by Document to Structure.
Note that structures present as vector graphics rather than bitmaps are not converted unless the osraRendered format option is used.
A "Document to Structure" license is required.