Pears Frequently Asked Questions

 

1. Why do I get a java.lang.OutOfMemoryError when I try to build/update my Pears database?

 

2. I'm building a database from multiple input files. The first file went in fine, but the build runs slower with every file I add. What's going on?

 

3. When I build a large database, I see lots of Bosc runs. They seem to run slower every time. What's going on?

1. Why do I get a java.lang.OutOfMemoryError when I try to build/update my Pears database?

 

The problem is probably with your command line arguments to java. Java uses a very small default memory size. You need to increase it. I use this to run Bartlett:

 

java -Xmx800m -Xms700m ORG.oclc.pears.Bartlett.Bartlett

 

This gives java 700MB of memory to start with and allows it to go up to 800MB. Those numbers can be increased if you continue to run out of memory. You could run the maximum up to a significant percentage of the 2GB available. I would not increase the smaller number; there is no point in requiring excessive memory that you might not need.
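For example, if 800MB is not enough, a run with a larger maximum heap (the 1600m value here is just illustrative; pick something that fits your machine) might look like this:

java -Xmx1600m -Xms700m ORG.oclc.pears.Bartlett.Bartlett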

 

2. I'm building a database from multiple input files. The first file went in fine, but the build runs slower with every file I add. What's going on?

 

Building the database in several passes is very slow. If a significant number of records are added to the database, the index for the database is pretty much completely rebuilt. If it takes ten passes to load the database, then the time spent building the index in the first nine passes was wasted. Bartlett allows you to specify multiple input files on its command line. You should list all your input files and build the database in one pass. Simply repeat the -i parameter for every file. They will be loaded in the order specified.
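For example, a single-pass build over three input files might look like this (the file names are just placeholders; keep whatever other parameters your build normally uses):

java -Xmx800m -Xms700m ORG.oclc.pears.Bartlett.Bartlett -i file1.records -i file2.records -i file3.records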

 

3. When I build a large database, I see lots of Bosc runs. They seem to run slower every time. What's going on?

 

Bartlett collects extracted index terms into an internal buffer. The size of that buffer is controlled by the -m parameter. That defaults to 500K terms, but I usually increase that to at least 1M. As you increase the size of the buffer, you increase your memory requirements and may need to increase the -Xmx parameter.
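For example, assuming the term count is appended directly to the switch (in the same style as the -w parameter described below), a run with a one-million-term buffer and a correspondingly larger heap might look like this:

java -Xmx1200m -Xms700m ORG.oclc.pears.Bartlett.Bartlett -m1000000 -i file1.records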

 

When the internal index term buffer fills, Bartlett must then process those terms. The default behavior is to add them to the index by running Bosc. For small database updates, this is fine. But, when a significant number of records are being added to the database, these repeated index rebuilds are wasted effort. Instead, you can have Bartlett write the contents of the buffer to a file, have it merge the files after all the records have been added, and build the index in a single pass. This is VERY efficient. You make this happen by adding a -w parameter to Bartlett and following that switch with a root name for the files. Bartlett will add a sequence number to that root name. So, if you add the parameter -wterms to the Bartlett command line, you will see files named terms0, terms1, etc. written to your file system. Bartlett does not automatically delete those files, so you'll want to do that after the database is built. It doesn't hurt to leave them; Bartlett will write over them if you add the same -w parameter to a subsequent run.
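Putting it together, a single-pass build of a large database with deferred index merging might look like this (again, the input file names are only placeholders):

java -Xmx800m -Xms700m ORG.oclc.pears.Bartlett.Bartlett -i file1.records -i file2.records -wterms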