A typical system will consist of three server computers:
The document repository is assumed to be already installed and working. It is not needed when indexing documents from a filesystem.
The database server should also already be installed and running.
This documentation covers the installation and setup of the crawler server. It is recommended to dedicate a Linux machine to this task.
cp -a conf.sample conf
You are now ready to start the configuration.
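Note that running the copy a second time when conf already exists would nest the sample directory inside it rather than replace it. A minimal sketch of a guard follows, assuming the conf.sample and conf directory names shown above; the init_conf helper is hypothetical and not part of d2db.

```shell
#!/bin/sh
# init_conf DIR: copy DIR/conf.sample to DIR/conf unless DIR/conf already exists.
# Hypothetical helper, not shipped with d2db.
init_conf() {
    dir=${1:-.}
    if [ -e "$dir/conf" ]; then
        # Never touch an existing configuration.
        echo "conf already exists, leaving it untouched" >&2
        return 0
    fi
    # -a preserves permissions and timestamps, as in the original command.
    cp -a "$dir/conf.sample" "$dir/conf"
}
```

From the installation directory, this would be called as init_conf . before editing the configuration files.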
The conf directory contains all configuration files. You need to edit at least the d2db.conf file, which contains an example configuration and comments for each option.
If you are using Documentum as a document repository, you also need to edit documentum/dfc.properties to configure access to the Documentum server (host, username and password).
Document to Database command-line actions all have the following form:
./d2db <command> <parameters...>
At this point you should be ready to run the first d2db command to initialize the database. This will create the necessary tables. Note that if you created the database schema yourself and only want d2db to populate it, you should have set the d2db.fixedSchema = true option (in the schema.conf configuration file) and can skip this section.
To create the database tables, run this command once:
./d2db create
At any time after running d2db create, you can use the stats command to query some basic statistics about the number of documents, chemical structures and hits in the d2db database. This is also a good way to check that the database has been properly created and is accessible.
For instance, running it just after create should give this output:
$ ./d2db stats
Documents : 0
Unique structures : 0
Hits : 0
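Because stats prints one "name : value" line per counter, the output can be checked from a script. The following is a minimal sketch that assumes exactly the output format shown above; the d2db_stat helper is hypothetical and not part of d2db.

```shell
#!/bin/sh
# Hypothetical helper (not part of d2db): extract one counter from
# "./d2db stats"-style output read on standard input.
# Assumes lines of the form "Name : value", as in the sample above.
# Returns a non-zero status if the counter is not found.
d2db_stat() {
    awk -F':' -v key="$1" '
        {
            name = $1; value = $2
            # Trim surrounding spaces and tabs from both fields.
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            if (name == key) { print value; found = 1 }
        }
        END { exit (found ? 0 : 1) }
    '
}

# Sample output copied from this documentation; in practice, pipe
# the output of "./d2db stats" into the helper instead.
sample='Documents : 0
Unique structures : 0
Hits : 0'

printf '%s\n' "$sample" | d2db_stat "Documents"
printf '%s\n' "$sample" | d2db_stat "Unique structures"
```

On a live system the same helper would be used as ./d2db stats | d2db_stat "Documents", for example to confirm that the document count grows after an indexing run.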
The index command tells d2db which documents to index. To index a folder in the Documentum repository, use:
./d2db index documentum:<folder>
For indexing a directory on a local or shared filesystem, use:
./d2db index <folder>
Note that d2db automatically detects documents that were already indexed in a previous run and have not been modified since, and skips over them quickly. This means that the index command can be used both for the initial indexing of a set of documents and later to update the index (add new documents, remove deleted documents, refresh modified documents). You can use the reindex command to force reindexing of all documents even when they have not changed.
Once indexing has been done successfully, you might want to set up a cron job to run the index command regularly.
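As an illustration of such a cron job, here is a minimal sketch of a wrapper script that skips a run when the previous one is still active. The script name, the paths, the schedule and the lock directory are all assumptions chosen for the example; none of them come from d2db itself.

```shell
#!/bin/sh
# Hypothetical cron wrapper (not shipped with d2db).
# Assumed crontab entry, running every night at 02:00 (placeholder paths):
#   0 2 * * * /opt/d2db/cron-index.sh /opt/d2db /path/to/documents >> /var/log/d2db-index.log 2>&1

# run_index INSTALL_DIR TARGET: run "./d2db index TARGET" from INSTALL_DIR,
# unless a previous run still holds the lock.
run_index() {
    install_dir=$1
    target=$2
    lock_dir="$install_dir/.d2db-index.lock"

    # mkdir is atomic, so it doubles as a simple lock.
    # If the script is killed, the lock directory must be removed by hand.
    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "previous index run still active, skipping" >&2
        return 0
    fi

    status=0
    (cd "$install_dir" && ./d2db index "$target") || status=$?
    rmdir "$lock_dir"
    return "$status"
}

# Only run when called with both arguments, e.g. from cron.
if [ "$#" -eq 2 ]; then
    run_index "$1" "$2"
fi
```

Since unchanged documents are skipped quickly, a frequent schedule should be cheap; the lock only matters when one run takes longer than the interval between runs.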