Document to Structure Developer Guide

    Introduction

    The Document to Structure product finds chemical structures in documents. It supports chemical names in the text of a document, structures embedded in Office documents, and images of structure drawings (see the user documentation for more details). The structures can then be exported to any supported molecule format, or manipulated in memory.

    Basic API usage

    Document to Structure plugs into the generic IO API of ChemAxon. This means that documents can be used exactly like other molecular formats (SDF, ...) as a source for importing structures.

    Example usage:

    // We have a document to process
    File document = new File("document.pdf");

    try (MolImporter importer = new MolImporter(document, "d2s")) {
        // Iterate through the hits
        for (Molecule m : importer) {
            // Export the structure found for this hit
            String smiles = MolExporter.exportToFormat(m, "smiles");

            // Cleaned name of the hit
            String name = m.getName();

            // Name exactly as it appears in the source document
            String sourceText = MPropHandler.convertToString(m.properties(), DocumentToStructure.SOURCE_TEXT);

            // If the type of the property is known
            Integer page = (Integer) m.getPropertyObject(DocumentToStructure.PAGE);

            //...
        }
    }

    The same code can be used to import an XML file, a Microsoft Office document, or any other supported document type. The format is detected automatically.

    The list of all available properties can be found in the API. Which properties are available depends on the format. For instance, in text formats such as XML, HTML and TXT, the number of characters since the beginning of the file is available as DocumentToStructure.CHARACTER, whereas this property has no value in a binary format.

    Note that SOURCE_TEXT contains the name as it appears in the source document. A cleaned version, with possible OCR errors, typos and similar defects corrected, can be retrieved with m.getName().
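    Because the set of available properties depends on the format, a direct cast of a property value can meet a missing value. The following is a minimal sketch of a defensive conversion. PropertyValues and asInteger are hypothetical names, not part of the ChemAxon API, and the sketch assumes that an unavailable property reaches the helper as null.

```java
import java.util.Optional;

/** Hypothetical helper for illustration; not part of the ChemAxon API. */
final class PropertyValues {

    private PropertyValues() {
    }

    /**
     * Converts a raw property value to an Integer when possible.
     * Assumption: an unavailable property is passed in as null.
     */
    static Optional<Integer> asInteger(Object value) {
        if (value instanceof Integer) {
            return Optional.of((Integer) value);
        }
        if (value instanceof Number) {
            // Other numeric types are narrowed to int
            return Optional.of(((Number) value).intValue());
        }
        // null or a non-numeric value: report the property as absent
        return Optional.empty();
    }
}
```

    With such a helper, the page lookup from the example could be written as PropertyValues.asInteger(m.getPropertyObject(DocumentToStructure.PAGE)), leaving the caller to decide what to do when no page is known.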

    Processing text directly

    When the text to convert is given as a String object, either in plain text or in HTML, the MolImporter object can be constructed with:

    String text = ...;

    try (MolImporter importer = DocumentToStructure.process(text)) {
        // ...
    }

    Or, since no external resource is opened here, simply:

    String text = ...;

    MolImporter importer = DocumentToStructure.process(text);

    // ...
    

    Configuring behavior

    Document to Structure accepts options to configure how it behaves. All Name to Structure format options can be used with Document to Structure as well, to configure which name conversions are attempted. For instance, elements and ions are not converted by default when using d2s, because they occur often in documents and are not always useful. Their conversion can be enabled with:

    try (MolImporter importer = new MolImporter(document, "d2s:+elements,+ions")) {
        // ...
    }
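    When the options are chosen at run time, the format string can be assembled rather than hard-coded. The sketch below only reproduces the "format:option,option" shape seen in the example above. FormatOptions and withOptions are hypothetical names, not part of the ChemAxon API.

```java
import java.util.List;

/** Hypothetical helper for illustration; not part of the ChemAxon API. */
final class FormatOptions {

    private FormatOptions() {
    }

    /**
     * Joins a format name and its options as "format:opt1,opt2".
     * Returns the bare format name when no options are given.
     */
    static String withOptions(String format, List<String> options) {
        if (options.isEmpty()) {
            return format;
        }
        return format + ":" + String.join(",", options);
    }
}
```

    The resulting string would then be passed as the second argument of the MolImporter constructor, in place of the literal used above.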

    Document to Structure has specific format options as well.

    Monitoring progress

    To estimate the progress of a document conversion, use the standard method MolImporter.estimateNumRecords().
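    One possible way to turn that estimate into a percentage is sketched below. ProgressEstimate and percent are hypothetical names, not part of the ChemAxon API. The sketch assumes the caller counts the structures read so far and passes the value returned by estimateNumRecords() as the estimated total.

```java
/** Hypothetical helper for illustration; not part of the ChemAxon API. */
final class ProgressEstimate {

    private ProgressEstimate() {
    }

    /**
     * Returns the progress as a whole percentage between 0 and 100.
     * A non-positive estimate is treated as unknown and reported as 0.
     */
    static int percent(long recordsRead, long estimatedTotal) {
        if (estimatedTotal <= 0 || recordsRead <= 0) {
            return 0;
        }
        // The total is only an estimate, so the count read may exceed it
        long clamped = Math.min(recordsRead, estimatedTotal);
        return (int) (clamped * 100 / estimatedTotal);
    }
}
```

    Clamping keeps the reported value at 100 if more structures are found than were estimated.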

    Command line usage

    Document to Structure can be used like any other import file format. For instance, it can be used from the command line by running MolConverter on a file in a format supported by Document to Structure:

     molconvert sdf document.doc -o structures.sdf

    See also