Sequence Database Setup: Top Tips
- Always test a new database configuration
- Check the statistics file after a new database has been compressed
- Be selective when locking databases into memory
- Move sequence databases to an empty drive
- Include a date or version stamp in the database filename
- Choose accession strings carefully
- Don't forget the taxonomy files
- Always test a new database configuration
Use the test facility in the database maintenance utility before applying the changes
to mascot.dat. This checks the parse rules against the first five and last five entries in the database.
It will also pick up problems with paths, illegal characters in names, etc.
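The idea of checking a parse rule against the first five and last five entries can be illustrated with a minimal sketch. This is not how the maintenance utility is implemented; the function names are hypothetical, the rule is written as a Python regular expression (which may differ from the parse rule syntax Mascot accepts), and the headers are made up.

```python
import re

def sample_headers(fasta_lines, n=5):
    """Return the first n and last n FASTA title lines (all of them if there are few)."""
    # Holds every header in memory, which is acceptable for a sketch only.
    headers = [line.rstrip("\n") for line in fasta_lines if line.startswith(">")]
    if len(headers) <= 2 * n:
        return headers
    return headers[:n] + headers[-n:]

def failing_headers(headers, rule):
    """Return the headers that the regular expression `rule` does not match."""
    regex = re.compile(rule)
    return [h for h in headers if regex.search(h) is None]

# Made-up database of 20 entries with NCBI-style "gi|" identifiers.
lines = []
for i in range(1, 21):
    lines.append(f">gi|{i} protein {i}\n")
    lines.append("MKTAYIAKQR\n")

sample = sample_headers(lines)
problems = failing_headers(sample, r">(gi\|\d+)")
```

An empty `problems` list only shows that the sampled entries parse; entries in the middle of the file are not checked by this kind of test.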
- Check the statistics file after a new database has been compressed
In particular, verify that the number of entries is reasonable and that no entries are
reported as "too long". (If any are, you need to increase the value of MaxSequenceLen in
mascot.dat.) If taxonomy is defined, look at the fraction of entries with no taxonomy. It is rare
to have 100% success with taxonomy, but a failure rate greater than 1% is a cause for concern.
One possible cause is that the taxonomy indexes are out of date.
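The 1% rule of thumb is simple arithmetic, sketched below. The function names are hypothetical, and the two counts are assumed to be read by hand from the statistics file, because the layout of that file is not described here.

```python
def taxonomy_failure_rate(total_entries: int, entries_without_taxonomy: int) -> float:
    """Return the fraction of entries for which no taxonomy was found."""
    if total_entries <= 0:
        raise ValueError("total_entries must be positive")
    if not 0 <= entries_without_taxonomy <= total_entries:
        raise ValueError("entries_without_taxonomy must be between 0 and total_entries")
    return entries_without_taxonomy / total_entries

def taxonomy_needs_attention(total_entries: int,
                             entries_without_taxonomy: int,
                             threshold: float = 0.01) -> bool:
    """True if the failure rate is above the threshold (1% by default)."""
    return taxonomy_failure_rate(total_entries, entries_without_taxonomy) > threshold

# Made-up counts: 3,000 of 200,000 entries without taxonomy is 1.5%.
worrying = taxonomy_needs_attention(200_000, 3_000)
acceptable = taxonomy_needs_attention(200_000, 1_000)
```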
- Be selective when locking databases into memory
All databases should be memory mapped, because this makes access much faster. But, unless you have bucket loads of RAM,
only the smaller databases that are searched regularly should be locked in memory. If you try to lock a
database in memory and there isn't enough room, the operation fails and no harm is done. The real problem is
when there is just enough RAM to lock the database, but very little left over for Mascot searches and other
applications. Searches will be very slow, the disk will thrash, and eventually the system
is likely to crash or hang.
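A rough way to reason about this is to compare the total size of the databases you intend to lock with the installed RAM, keeping a reserve for searches and other applications. The sketch below is only an illustration: the function name is hypothetical, and the default reserve of half the RAM is an arbitrary assumption, not a Mascot recommendation.

```python
def safe_to_lock(database_sizes_bytes, total_ram_bytes, reserve_fraction=0.5):
    """True if locking all the listed databases still leaves the reserve free.

    reserve_fraction is the share of RAM to keep free for searches and
    other applications (an assumed value; tune it for your own server).
    """
    if total_ram_bytes <= 0:
        raise ValueError("total_ram_bytes must be positive")
    if not 0 <= reserve_fraction < 1:
        raise ValueError("reserve_fraction must be in the range [0, 1)")
    lockable = total_ram_bytes * (1 - reserve_fraction)
    return sum(database_sizes_bytes) <= lockable

GIB = 1024 ** 3
# Made-up sizes on a 16 GiB server.
small_ok = safe_to_lock([1 * GIB, 2 * GIB], 16 * GIB)
large_ok = safe_to_lock([12 * GIB], 16 * GIB)
```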
- Move sequence databases to an empty drive
Running out of disk space can be a problem when you have several large databases plus a growing collection of search result files.
The databases don't have to be in subdirectories of mascot/sequence; new ones can easily be placed on other drives. If space is
running low, and you want to move your existing database files, the general procedure is:
- Stop the Mascot Monitor service / daemon
- Move the mascot/sequence directory or selected database subdirectories to the new drive
- Using the database maintenance utility, update the affected database paths
- Start the Mascot Monitor service / daemon
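Before moving anything, it may be worth confirming that the target drive has room for the files. The sketch below is an optional pre-check, not part of the procedure above; the function names and the 20% safety margin are assumptions, and the directory paths are whatever applies to your installation.

```python
import os
import shutil

def directory_size(path: str) -> int:
    """Total size in bytes of all regular files below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total

def fits_on_target(source_dir: str, target_dir: str, margin: float = 1.2) -> bool:
    """True if target_dir's drive has room for source_dir plus a safety margin."""
    needed = directory_size(source_dir) * margin
    return shutil.disk_usage(target_dir).free >= needed
```

The check says nothing about future growth, so an empty drive is still the better choice when one is available.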
- Include a date or version stamp in the database filename
Most sequence databases are growing rapidly, and it can be useful to have a record of which database version
a particular search was run against. Another reason is that Mascot requires the old and new copies of a
database to have different filenames if it is going to perform an automatic update without interrupting
ongoing searches.
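One way to build such a filename is sketched below. The `base_YYYYMMDD.fasta` pattern and the function name are assumptions chosen for illustration; any scheme that gives each release a distinct name serves the same purpose.

```python
from datetime import date

def stamped_name(base: str, release: date, extension: str = ".fasta") -> str:
    """Return a filename such as 'SwissProt_20240115.fasta'."""
    return f"{base}_{release:%Y%m%d}{extension}"

# Made-up release dates: the old and new copies get different names.
old_name = stamped_name("SwissProt", date(2024, 1, 15))
new_name = stamped_name("SwissProt", date(2024, 3, 27))
```

A year-month-day stamp has the side benefit that filenames sort in release order.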
- Choose accession strings carefully
Some database accession strings have a constant component. For example, NCBI unique identifiers look like
"gi|123456" and IPI accessions look like "IPI:IPI00140098.1". It's not a good idea to include
unnecessary constant components, such as the "IPI:" prefix, because they contain no useful
information. They just make the Mascot files a bit larger and the reports a bit longer. (In Mascot 2.0 and earlier, whenever
you set up a new database with long accession strings, remember to check that the worst-case length is less than the value
of MaxAccessionLen in mascot.dat. If you have to increase MaxAccessionLen, you must then rebuild the
compressed files for all databases.)
On the other hand, it can be risky to remove prefixes altogether. If you plan to reduce
"IPI:IPI00140098.1" to "00140098.1" or "140098", you must make sure that this is still a unique
identifier within the database. Also, whenever you merge two databases together, having a purely numeric
accession greatly increases the chance of having duplicates.
Another consideration is linking to external sources for full text reports. In the IPI case,
you need to choose "IPI00140098" if you plan to link to the EBI SRS server.
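The uniqueness check described above can be sketched as follows. The function name is hypothetical, the patterns are Python regular expressions rather than Mascot parse rules, and the accession list is made up (including the ".2" version) purely to show a collision.

```python
import re
from collections import Counter

def duplicated_accessions(accessions, pattern):
    """Shorten each accession with `pattern` (one capture group) and
    return, sorted, the shortened strings that occur more than once."""
    regex = re.compile(pattern)
    shortened = []
    for accession in accessions:
        match = regex.search(accession)
        if match is None:
            raise ValueError(f"pattern does not match {accession!r}")
        shortened.append(match.group(1))
    counts = Counter(shortened)
    return sorted(s for s, n in counts.items() if n > 1)

# Made-up accessions for illustration only.
accessions = ["IPI:IPI00140098.1", "IPI:IPI00140098.2", "IPI:IPI00000001.1"]

keep_version = duplicated_accessions(accessions, r"IPI:(IPI\d+\.\d+)")
drop_version = duplicated_accessions(accessions, r"IPI:(IPI\d+)")
```

In this made-up list, dropping the version suffix produces a duplicate while keeping it does not, which is the kind of difference worth checking before committing to a shortened accession.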
- Don't forget the taxonomy files
If database entries contain taxonomy information, Mascot can use this as a filter during a search. Many of the
most popular databases, such as Swiss-Prot and NCBI nr, include taxonomy. To determine taxonomy accurately,
Mascot requires database-specific supporting files. Details of these can be found in the help pages
for the individual databases. Note that these supporting files have to be downloaded into the taxonomy
directory, not into the sequence database directory. Also, some files need to be unpacked (using tar) as well as
uncompressed.
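Unpacking a downloaded archive into the taxonomy directory can be done with the tar utility or, as a minimal sketch, with Python's standard tarfile module. The function name is hypothetical, and the archive and directory paths are placeholders for whatever the help page of the individual database specifies.

```python
import tarfile
from pathlib import Path

def unpack_into(archive: str, taxonomy_dir: str) -> list:
    """Uncompress and unpack a tar archive into taxonomy_dir.

    Mode "r:*" lets tarfile detect gzip, bzip2 or xz compression itself.
    Returns the names of the archive members.
    """
    dest = Path(taxonomy_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:*") as tar:
        names = tar.getnames()
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: refuse unsafe members such as absolute paths.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return names
```

A file that is only compressed, not archived, needs uncompressing alone and is outside the scope of this sketch.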