Document to Structure Developer's Guide - Document Extractor

    Introduction

    The DocumentExtractor class extracts chemical names from text documents and converts them to chemical structures.

    Basic API usage

    Example usage:

    // We have a document to process
    java.io.Reader document = ...;

    DocumentExtractor x = new DocumentExtractor();
    x.processHTML(document); // or processPlainText(document) for input in plain text format

    // Iterate through the hits
    for (Hit hit : x.getHits()) {
        System.out.println(hit.position + ": " + hit.text + ": " + hit.structure.toFormat("smiles"));
    }

    The field hit.position contains the position of the first character of the name in the document.
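    As an illustration of how the position and the text of a hit can be used together, the following sketch marks each name inside the original text. It does not use the library: SimpleHit is a hypothetical stand-in that carries only a position and a text, and HitHighlighter is a name chosen for this example. It assumes that the position is a 0-based character offset into the same string that was processed, and that hits arrive sorted by position without overlapping; these assumptions should be checked against the real output, especially for HTML input.

```java
import java.util.Arrays;
import java.util.List;

class HitHighlighter {

    // Hypothetical stand-in for the two Hit fields used in this sketch.
    // It is not the library's Hit class.
    static final class SimpleHit {
        final int position;
        final String text;

        SimpleHit(int position, String text) {
            this.position = position;
            this.text = text;
        }
    }

    // Wraps every hit in square brackets.
    // Assumption: position is a 0-based offset into `document`, and the
    // hits are sorted by position and do not overlap.
    static String highlight(String document, List<SimpleHit> hits) {
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (SimpleHit hit : hits) {
            int start = hit.position;
            int end = start + hit.text.length();
            out.append(document, cursor, start);   // text before the name
            out.append('[').append(document, start, end).append(']');
            cursor = end;
        }
        out.append(document, cursor, document.length()); // remaining text
        return out.toString();
    }

    public static void main(String[] args) {
        String document = "Dissolve aspirin in ethanol.";
        List<SimpleHit> hits = Arrays.asList(
                new SimpleHit(9, "aspirin"),
                new SimpleHit(20, "ethanol"));
        System.out.println(highlight(document, hits));
    }
}
```

    With the real API, the same loop could read hit.position and hit.text from the objects returned by getHits() instead of the stand-in class.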

    Note that hit.text contains the name exactly as it appears in the source document. A cleaned version, with possible OCR errors and typos corrected, can be retrieved with hit.structure.getName().

    This class can also be run from the command line. It expects the name of a plain text file as its first argument; when no argument is given, it reads the document from the standard input. The list of hits is printed to the standard output.
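    The input convention described above (a file name as the first argument, otherwise the standard input) can be reproduced in a custom wrapper. The sketch below is not the library's own command-line code; InputSelection, openInput and readAll are names invented for this example, and UTF-8 is an assumed encoding. The resulting Reader is the kind of object that could be handed to processPlainText(document).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class InputSelection {

    // Opens the file named by the first argument, or falls back to the
    // given stream (normally System.in) when no argument is present.
    // Assumption: the input is UTF-8 encoded plain text.
    static Reader openInput(String[] args, InputStream stdin) {
        try {
            if (args.length > 0) {
                return Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8);
            }
            return new BufferedReader(new InputStreamReader(stdin, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the whole document into a string; used here only to show that
    // the selected input is readable.
    static String readAll(Reader reader) {
        try (Reader r = reader) {
            StringWriter out = new StringWriter();
            r.transferTo(out);
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Reader document = openInput(args, System.in);
        // In a real wrapper, `document` would be passed to the extractor
        // instead of being read here.
        System.out.println(readAll(document).length() + " characters read");
    }
}
```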

    See also