Document to Structure Migration Guide

    21.4: DocumentExtractor has been removed

    In version 21.4, the chemaxon.naming.DocumentExtractor class has been removed. The following guide helps in migrating to its alternatives, MolImporter and DocumentToStructure.

    Creating an instance

    Instead of calling the constructors of DocumentExtractor or the readPDF method:

    • If the input is stored in a String, call the process method of DocumentToStructure to receive a MolImporter instance.
    String text = ...; // Input text.
    String params = ...; // D2S options. Optional parameter.
    try (MolImporter importer = DocumentToStructure.process(text, params)) {
       // ...
    }
    • If the input is an external file, pass a String (file name), File or InputStream object to the constructor of MolImporter.
    File file = ...;
    String format = ...; // D2S format and options. Optional parameter.
    String encoding = ...; // Character encoding. Optional parameter.
    try (MolImporter importer = new MolImporter(file, format, encoding)) {
       // ...
    }

    When a format is passed to the constructor of MolImporter, it must be d2s, or d2s: followed by the required format options. If the format is omitted entirely, it is detected automatically from the type of the file.

    The constructors of DocumentExtractor that took a URL or URLConnection parameter have no counterpart on MolImporter or DocumentToStructure. In these cases, the input must be converted to one of the applicable input types.
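
    One possible conversion is to open the URL or URLConnection as an InputStream using only the standard library. The sketch below is a minimal illustration; the UrlInput class and its open methods are hypothetical helper names, not part of the ChemAxon API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper (not part of the ChemAxon API): turns the input types
// formerly accepted by DocumentExtractor into an InputStream.
final class UrlInput {
    private UrlInput() {}

    // Opens a connection to the URL and returns its content as a stream.
    static InputStream open(URL url) throws IOException {
        return url.openStream();
    }

    // Returns the content stream of an already configured connection.
    static InputStream open(URLConnection connection) throws IOException {
        return connection.getInputStream();
    }
}
```

    The returned stream can then be passed to the constructor of MolImporter in place of the File shown in the example above, inside the same try-with-resources statement so that it is closed after processing.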

    Options

    Instead of the configuration methods of DocumentExtractor, MolImporter has format options that can be passed at creation time, separated by commas.

    • setCasNumberLookup(boolean value): +cas or -cas
    • acceptElements(boolean on): +elements or -elements
    • acceptIons(boolean on): +ions or -ions
    • acceptGroups(boolean on): +groups or -groups
    • acceptGenericNames(boolean on): +vernacular or -vernacular
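
    Code that used to call several of these configuration methods can assemble the equivalent format string in one place. The sketch below assumes the layout described in this guide (d2s, then a colon, then comma-separated options); the D2sFormat class is a hypothetical helper, not part of the ChemAxon API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the ChemAxon API): collects boolean
// settings and renders them as a D2S format string such as "d2s:+cas,-ions".
final class D2sFormat {
    private final List<String> options = new ArrayList<>();

    // Adds "+name" when enabled and "-name" when disabled.
    D2sFormat set(String name, boolean enabled) {
        options.add((enabled ? "+" : "-") + name);
        return this;
    }

    // Returns "d2s" alone when no option was set, otherwise "d2s:" plus the options.
    String build() {
        return options.isEmpty() ? "d2s" : "d2s:" + String.join(",", options);
    }
}
```

    For example, new D2sFormat().set("cas", true).set("ions", false).build() yields "d2s:+cas,-ions", which replaces the former calls setCasNumberLookup(true) and acceptIons(false) and can be passed as the format argument of the MolImporter constructor.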

    Processing

    The processPlainText() and processHTML() methods of DocumentExtractor have no direct counterpart on MolImporter, as the results of MolImporter can be read immediately, and the content type is automatically detected.

    The ProgressListener support of DocumentExtractor has been removed and has no alternative in MolImporter.

    Reading results

    To collect the results in a list, similarly to getHits():

    try (MolImporter importer = new MolImporter(file)) {
        List<Molecule> molecules = importer.getMolStream()
                .collect(Collectors.toList());
    
        // ...
    }

    The returned Molecules are the same objects that were previously stored in the structure field of the returned Hits. The information from the other fields of Hits is stored as properties in the Molecules:

    • hit.text: (String) mol.getPropertyObject(DocumentToStructure.SOURCE_TEXT)
    • hit.position: (Integer) mol.getPropertyObject(DocumentToStructure.CHARACTER)
    • hit.getPageNumber(): (Integer) mol.getPropertyObject(DocumentToStructure.PAGE)
    • hit.getAllPositions(): no alternative
    • hit.getPositionsString(): no alternative

    Note that all properties can be null if the information is not provided for the current input type.
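
    Because any of these properties can be null, a direct cast followed by unboxing (for example to int) can fail with a NullPointerException. One way to guard against this is to wrap the value returned by getPropertyObject in an Optional. The sketch below is a minimal illustration; the HitProperties class and its typed method are hypothetical names, not part of the ChemAxon API.

```java
import java.util.Optional;

// Hypothetical helper (not part of the ChemAxon API): converts a possibly
// null property value into an Optional of the expected type.
final class HitProperties {
    private HitProperties() {}

    // Returns an empty Optional when the value is null or of another type.
    static <T> Optional<T> typed(Object value, Class<T> type) {
        return type.isInstance(value) ? Optional.of(type.cast(value)) : Optional.empty();
    }
}
```

    A call such as HitProperties.typed(mol.getPropertyObject(DocumentToStructure.PAGE), Integer.class) then gives an Optional<Integer> that is empty when the page number is not provided for the current input type.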

    Main method

    The main method of DocumentExtractor has no direct alternative, but its results can be reproduced with MolImporter and DocumentToStructure.

    See also