Matrix Science
 
 

Sequence Database Setup: Top Tips

  1. Always test a new database configuration
  2. Check the statistics file after a new database has been compressed
  3. Be selective when locking databases into memory
  4. Move sequence databases to an empty drive
  5. Include a date or version stamp in the database filename
  6. Choose accession strings carefully
  7. Don't forget the taxonomy files

  1. Always test a new database configuration

    Use the test facility in the database maintenance utility before applying the changes to mascot.dat. This checks the parse rules against the first five and last five entries in the database. It will also pick up problems with paths, illegal characters in names, etc.
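
    The same idea, checking a rule against the first five and last five entries, can be sketched outside Mascot. The Python below is only an illustration under stated assumptions, not the maintenance utility's implementation: the function names and the sample rule are hypothetical, and the rule is a Python regular expression rather than a Mascot parse rule.

```python
import re

def title_lines(fasta_text):
    """Return the title lines (those starting with '>') of FASTA text."""
    return [line for line in fasta_text.splitlines() if line.startswith(">")]

def check_parse_rule(fasta_text, pattern, n=5):
    """Apply a regex to the first n and last n titles; return the titles that fail."""
    titles = title_lines(fasta_text)
    # For a small file, test every title rather than overlapping head and tail
    sample = titles if len(titles) <= 2 * n else titles[:n] + titles[-n:]
    rule = re.compile(pattern)
    return [title for title in sample if rule.search(title) is None]

# Hypothetical rule: the accession is everything after '>' up to the first space
ACCESSION_RULE = r"^>(\S+)"
```

    An empty result means every sampled title matched the rule; any title returned is one to inspect by hand.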


  2. Check the statistics file after a new database has been compressed

    In particular, verify that the number of entries is reasonable and that no entries are reported as "too long". If any are, increase the value of MaxSequenceLen in mascot.dat. If taxonomy is defined, look at the fraction of entries with no taxonomy. It is rare to have 100% success with taxonomy, but a failure rate greater than 1% is a cause for concern. One possible cause is that the taxonomy indexes are out of date.
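
    The 1% guideline is simple arithmetic. As a minimal sketch, assuming you have read the total entry count and the count of entries without taxonomy from the statistics file by hand (the function names are hypothetical, and nothing here parses the file itself):

```python
def taxonomy_failure_rate(total_entries, entries_without_taxonomy):
    """Fraction of database entries for which no taxonomy was found."""
    if total_entries <= 0:
        raise ValueError("total_entries must be positive")
    return entries_without_taxonomy / total_entries

def taxonomy_needs_attention(total_entries, entries_without_taxonomy, threshold=0.01):
    """True when the failure rate exceeds the threshold (1% by default)."""
    return taxonomy_failure_rate(total_entries, entries_without_taxonomy) > threshold
```

    For example, 3,000 entries without taxonomy out of 200,000 is a 1.5% failure rate, which is above the guideline.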


  3. Be selective when locking databases into memory

    All databases should be memory mapped, because this makes access much faster. But, unless you have bucket loads of RAM, only the smaller databases that are searched regularly should be locked in memory. If you try to lock a database in memory and there isn't enough room, the operation simply fails and no harm is done. The real problem is when there is just enough RAM to lock the database, but very little left over for Mascot searches and other applications. Searches will be very slow, the disk will thrash, and eventually the system is likely to crash or hang.
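
    One way to reason about this is to set aside a reserve of RAM for searches and other applications, then lock only regularly searched databases, smallest first, while they fit in what remains. The sketch below is a planning aid only. The function name, the tuple layout and the size of the reserve are assumptions, and it does not read or write any Mascot setting.

```python
def choose_databases_to_lock(databases, total_ram, reserve):
    """Pick databases to lock in memory without eating into the reserve.

    databases: list of (name, size_in_bytes, searched_regularly) tuples.
    total_ram and reserve are in bytes; reserve is kept free for searches.
    """
    budget = total_ram - reserve
    chosen = []
    # Smallest first, so several small databases are preferred over one large one
    for name, size, searched_regularly in sorted(databases, key=lambda d: d[1]):
        if searched_regularly and size <= budget:
            chosen.append(name)
            budget -= size
    return chosen
```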


  4. Move sequence databases to an empty drive

    Running out of disk space can be a problem when you have several large databases plus a growing collection of search result files. The databases don't have to be in subdirectories of mascot/sequence; new ones can easily be placed on other drives. If space is running low, and you want to move your existing database files, the general procedure is:

    - Stop the Mascot Monitor service / daemon
    - Move the mascot/sequence directory or selected database subdirectories to the new drive
    - Using the database maintenance utility, update the affected database paths
    - Start the Mascot Monitor service / daemon
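
    The move itself (the second step) can be scripted. The sketch below, with hypothetical function names, checks that the target drive has enough free space before moving a database directory. Stopping and starting Mascot Monitor and updating the paths in the database maintenance utility are not covered and still have to be done as listed above.

```python
import os
import shutil

def directory_size(path):
    """Total size in bytes of all files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

def move_database_dir(src, dest_parent):
    """Move src into the existing directory dest_parent if there is room for it."""
    needed = directory_size(src)
    free = shutil.disk_usage(dest_parent).free
    if needed > free:
        raise OSError(f"need {needed} bytes but only {free} free on target")
    # Because dest_parent already exists, src is moved inside it
    return shutil.move(src, dest_parent)
```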

  5. Include a date or version stamp in the database filename

    Most sequence databases are growing rapidly, and it can be useful to have a record of which database version a particular search was run against. Another reason is that Mascot requires the old and new copies of a database to have different filenames if it is going to perform an automatic update without interrupting ongoing searches.
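
    If databases are downloaded by script, the stamp can be added at the same time. A minimal sketch, assuming a base name, an underscore separator and a .fasta extension (all illustrative choices, not a Mascot requirement):

```python
from datetime import date

def stamped_filename(base, release_date, extension=".fasta"):
    """Build a filename such as 'SwissProt_20070115.fasta' from a release date."""
    return f"{base}_{release_date:%Y%m%d}{extension}"
```

    Writing the date as year, month, day means the filenames sort chronologically, so the old and new copies are easy to tell apart.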


  6. Choose accession strings carefully

    Some database accession strings have a constant component. For example, NCBI unique identifiers look like "gi|123456" and IPI accessions look like "IPI:IPI00140098.1". It's not a good idea to include unnecessary constant components, such as the "IPI:" prefix, because they contain no useful information. They just make the Mascot files a bit larger and the reports a bit longer. (In Mascot 2.0 and earlier, whenever you set up a new database with long accession strings, remember to check that the worst-case length is less than the value of MaxAccessionLen in mascot.dat. If you have to increase MaxAccessionLen, you must then rebuild the compressed files for all databases.)

    On the other hand, it can be risky to remove prefixes altogether. If you plan to reduce "IPI:IPI00140098.1" to "00140098.1" or "140098", you must make sure that this is still a unique identifier within the database. Also, whenever you merge two databases together, having a purely numeric accession greatly increases the chance of having duplicates.
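
    Uniqueness is easy to check before committing to a shortened form. The sketch below uses Python regular expressions, not Mascot parse rules, and its function names and patterns are hypothetical. It applies a candidate pattern to a list of full accession strings and reports any shortened accessions that occur more than once.

```python
import re
from collections import Counter

def shorten(accession, pattern):
    """Return the first capture group of pattern applied to an accession."""
    match = re.search(pattern, accession)
    if match is None:
        raise ValueError(f"pattern does not match {accession!r}")
    return match.group(1)

def duplicate_accessions(accessions, pattern):
    """Shortened accessions that appear more than once, in sorted order."""
    counts = Counter(shorten(a, pattern) for a in accessions)
    return sorted(acc for acc, count in counts.items() if count > 1)
```

    An empty list means the shortened form is still unique within the entries tested. When merging two databases, run the check on the combined list of accessions.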

    Another consideration is linking to external sources for full text reports. In the IPI case, you need to choose "IPI00140098" if you plan to link to the EBI SRS server.


  7. Don't forget the taxonomy files

    If database entries contain taxonomy information, Mascot can use this as a filter during a search. Many of the most popular databases, such as Swiss-Prot and NCBI nr, include taxonomy. To determine taxonomy accurately, Mascot requires database-specific supporting files. Details of these can be found in the help pages for the individual databases. Note that these supporting files have to be downloaded into the taxonomy directory, not into the sequence database directory. Also, some files need to be unpacked (using tar) as well as uncompressed.
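
    Where the tar command is not available, the unpacking step can be done with Python's standard tarfile module, which uncompresses and unpacks in one pass. This is a minimal sketch: the function name is hypothetical, the paths are supplied by the caller, and it should only be used on archives from a trusted source.

```python
import tarfile

def unpack_into_taxonomy_dir(archive_path, taxonomy_dir):
    """Unpack a tar archive (compressed or not) into the taxonomy directory.

    Returns the names of the archive members that were extracted.
    """
    # Mode "r:*" lets tarfile detect gzip, bzip2 or no compression by itself
    with tarfile.open(archive_path, "r:*") as archive:
        archive.extractall(taxonomy_dir)
        return archive.getnames()
```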

 
 
Copyright © 2007 Matrix Science Ltd. All Rights Reserved.