Pears Frequently Asked Questions
1. Why do I get the java.lang.OutOfMemoryError error when I try to build/update my Pears database?
The problem is probably with your command-line arguments to java. Java uses a very small default memory size, so you need to increase it. I use this to run Bartlett:
java -Xmx800m -Xms700m ORG.oclc.pears.Bartlett.Bartlett
This gives java 700MB of memory to start with and allows it to grow to 800MB. The maximum can be increased if you continue to run out of memory; you could raise it to a significant percentage of the 2GB available. I would not increase the smaller number; there is no point in requiring memory up front that you might not need.
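As a small illustration of the two settings, here is a sketch of a shell helper that prints the heap flags for a given starting and maximum size in MB. The function name heap_flags is made up for this example. It refuses a starting size larger than the maximum, since the JVM will not start with that combination.

```shell
# Print JVM heap flags for a starting and a maximum heap size, both in MB.
# heap_flags is a hypothetical helper name, not part of Pears or Java.
heap_flags() {
    initial_mb=$1
    max_mb=$2
    # The JVM refuses to start if the initial heap exceeds the maximum heap.
    if [ "$initial_mb" -gt "$max_mb" ]; then
        echo "initial heap (${initial_mb}m) exceeds maximum (${max_mb}m)" >&2
        return 1
    fi
    printf '%s\n' "-Xmx${max_mb}m -Xms${initial_mb}m"
}

heap_flags 700 800    # prints: -Xmx800m -Xms700m
```

To try a larger ceiling, raise only the second argument, for example heap_flags 700 1200.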
Building the database in several passes is very slow. If a significant number of records are added to the database, the index for the database is pretty much completely rebuilt. If it takes ten passes to load the database, then the time spent building the index in the first nine passes was wasted. Bartlett allows you to specify multiple input files on its command line. You should list all your input files and build the database in one pass. Simply repeat the -i parameter for every file. The files will be loaded in the order specified.
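As a rough sketch of the single-pass approach, the shell function below assembles one Bartlett command line with one -i per input file, in the order given. Several things here are assumptions: the function name and the file names are placeholders, the file name is attached directly to -i (by analogy with the -wterms form used for -w), and any other arguments your database needs are left out. It only prints the command so you can check it before running it.

```shell
# Build one Bartlett command line that names every input file.
# ASSUMPTION: the file name is attached directly to -i (as in -wterms);
# adjust if your Bartlett expects a space between -i and the file name.
# Other arguments your setup may require are omitted.
# File names containing spaces are not handled by this simple version.
build_bartlett_cmd() {
    cmd="java -Xmx800m -Xms700m ORG.oclc.pears.Bartlett.Bartlett"
    for f in "$@"; do
        cmd="$cmd -i$f"    # one -i per file, in the order given
    done
    printf '%s\n' "$cmd"
}

# Placeholder file names; prints the command instead of running it.
build_bartlett_cmd records1.dat records2.dat records3.dat
```

Once the printed command looks right, you can paste it into your shell or wrap the call in eval.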
Bartlett collects extracted index terms into an internal buffer. The size of that buffer is controlled by the -m parameter. It defaults to 500K terms, but I usually increase that to at least 1M. As you increase the size of the buffer, you increase your memory requirements and may need to increase the -Xmx parameter.
When the internal index term buffer fills, Bartlett must then process those terms. The default behavior is to add them to the index by running Bosc. For small database updates, this is fine. But when a significant number of records are being added to the database, these repeated index rebuilds are wasted effort.
Instead, you can have Bartlett write the contents of the buffer to a file each time it fills, then merge the files after all the records have been added and build the index in a single pass. This is VERY efficient. You make this happen by adding a -w parameter to the Bartlett command line and following that switch with a root name for the files. Bartlett will add a sequence number to that root name. So, if you add the parameter -wterms to the Bartlett command line, you will see files named terms0, terms1, etc. written to your file system. Bartlett does not automatically delete those files, so you'll want to do that after the database is built. It doesn't hurt to leave them; Bartlett will write over them if you add the same -w parameter to a subsequent run.
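If you do want to clean up afterwards, something like the following sketch would work. It assumes the files are named by appending a sequence number to the root you passed to -w and that they sit in a directory you name; the function name cleanup_term_files is made up for this example. It removes only files whose name is the root followed by digits and nothing else.

```shell
# Remove intermediate term files left behind by a Bartlett -w run.
# ASSUMPTION: files are named <root><sequence number>, e.g. terms0, terms1.
# cleanup_term_files is a hypothetical helper name.
# Usage: cleanup_term_files ROOT [DIRECTORY]
cleanup_term_files() {
    root=$1
    dir=${2:-.}
    for f in "$dir/$root"[0-9]*; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$f" ] || continue
        base=${f##*/}
        suffix=${base#"$root"}
        # Keep anything whose suffix is not purely digits.
        case $suffix in
            *[!0-9]*) continue ;;
        esac
        rm -f -- "$f"
    done
}

# Example: cleanup_term_files terms /path/to/build/directory
```

Run it only after the build has finished, since Bartlett still needs the files for the merge.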