Matrix Science

Sequence Database Setup: Database Maintenance

The Database Maintenance utility provides an interface to the main configuration file, mascot.dat. Note that no changes are made to mascot.dat until you explicitly choose to 'Apply' the new definitions.

Whenever changes are applied, the Database Maintenance script will automatically make backup copies of mascot.dat. These are numbered sequentially from mascot.dat.1 onwards. If you have problems after making changes, just revert to the most recent backup.
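As an illustration only, the following Python sketch shows one way to locate the most recent backup by hand. It is not part of Mascot. The function names are hypothetical, and the sketch assumes the backups sit in the same directory as mascot.dat. Because the text above only says that backups are numbered sequentially, the sketch picks the newest file by modification time rather than assuming which number is the latest.

```python
import re
from pathlib import Path
from typing import List, Optional, Tuple

# Matches mascot.dat.1, mascot.dat.2, ... but not mascot.dat itself.
BACKUP_PATTERN = re.compile(r"^mascot\.dat\.(\d+)$")


def list_backups(config_dir: Path) -> List[Tuple[int, Path]]:
    """Return (number, path) pairs for mascot.dat.N files, sorted by N."""
    backups = []
    for entry in config_dir.iterdir():
        match = BACKUP_PATTERN.match(entry.name)
        if match and entry.is_file():
            backups.append((int(match.group(1)), entry))
    return sorted(backups)


def most_recent_backup(config_dir: Path) -> Optional[Path]:
    """Return the backup with the newest modification time, or None."""
    paths = [path for _, path in list_backups(config_dir)]
    if not paths:
        return None
    return max(paths, key=lambda p: p.stat().st_mtime)
```

To revert, you would copy the file returned by most_recent_backup() over mascot.dat, after first saving a copy of the current file.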

You can also edit mascot.dat using any text editor, but first familiarise yourself with Chapter 6 in the Setup and Administration manual, Configuration and Log Files, which contains a comprehensive description of the configuration parameters.

Database definitions form

This form can be used to modify existing database definitions or add new ones. When Mascot is first installed, the only active database is MSDB, but several other databases are pre-defined. Individual database configurations can be displayed by selecting them from the drop-down list near the top of the form.

[Screenshot: Mascot database maintenance utility]

Name: Each database must have a unique name. Ideally, the name should be short and descriptive. Note that these names are case sensitive, and much confusion can be caused by creating (say) Sprot and SPROT.

Active / Inactive: A database that is no longer required can be marked as inactive so as to retain the definition in mascot.dat. If you are certain that you will never need the definition again, it can be deleted using the button at the bottom of the definition block.

Path: The location of the FASTA file is defined in the Path field. This must be the fully qualified path to the FASTA file, with a wildcard in the filename (not in the extension). The delimiters between directories must always be forward slashes, even if Mascot is running on a Windows system. The filename does not have to be based on the database name, but doing so can make administration easier.
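The requirements for the Path field can be checked before the value is entered in the form. The Python sketch below is a rough illustration, not Mascot's own validation. The function names and the example paths are hypothetical, and the test for "fully qualified" (a leading slash or a drive letter) is an assumption.

```python
import glob
import re
from pathlib import PurePosixPath
from typing import List


def check_fasta_path(path: str) -> List[str]:
    """Return a list of problems found in a Path field value (empty if none)."""
    problems = []
    if "\\" in path:
        problems.append("use forward slashes, not backslashes")
    # Assumption: fully qualified means a leading '/' or a drive letter like 'C:/'.
    if not re.match(r"^(/|[A-Za-z]:/)", path):
        problems.append("path is not fully qualified")
    filename = PurePosixPath(path).name
    stem, dot, extension = filename.rpartition(".")
    if not dot:
        # No extension at all: treat the whole filename as the stem.
        stem, extension = filename, ""
    if "*" not in stem:
        problems.append("filename has no wildcard")
    if "*" in extension:
        problems.append("wildcard must not be in the extension")
    return problems


def matching_files(path: str) -> List[str]:
    """List the files on disk that the wildcard path currently matches."""
    return sorted(glob.glob(path))
```

For example, check_fasta_path("C:/mascot/sequence/Example/current/Example_*.fasta") returns an empty list, whereas a path written with backslashes, or with the wildcard in the extension, is reported as a problem.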

AA / NA: A pair of radio buttons is used to specify whether the database is amino acid (protein) or nucleic acid.

Mem map: Database files should always be memory mapped. Unlike memory locking, this does not consume physical RAM.

Mem lock: Memory mapped files can be locked in memory, but only if the computer has sufficient RAM. Having a database locked in memory means that it can never be swapped out to disk, ensuring maximum possible search speed. Trying to lock databases into RAM when there isn't room is not a major problem: the locking will fail, an error message will be generated, and Mascot will carry on regardless. A more serious problem arises when there is just sufficient RAM to lock the databases, but none left over for searches or other applications. In this case, the whole system will slow down and the hard disk will start "thrashing". Eventually, the system is likely to hang or crash.

Threads: A Mascot search can use multiple threads. If you are running in cluster mode, 'Threads' must be set to 1. Otherwise, specify the same number of threads as the number of processors in your license.

Local ref file: The FASTA database file must be available locally. For certain databases (e.g. MSDB, IPI, Trembl, SwissProt), it is also possible to have a local reference file, from which full text information can be taken for a 'Protein View' report. If you have a local copy of the reference file, check the 'Local ref file' box. Otherwise, clear it.

Taxonomy source: Taxonomy rules have been pre-defined for many popular databases. If you are adding a new database, and appropriate taxonomy rules are not defined, refer to Chapter 9 of the Mascot Setup and Administration manual for further information. If you don't want to use taxonomy, select '--- None ---'. This will reduce the time taken to bring the database on-line.

Parse Rules: Apart from starting with a 'greater than' symbol, the precise syntax of the FASTA title line varies from database to database. For this reason, Mascot uses Basic Regular Expressions to define how the accession string and the description text should be parsed from the FASTA title line.

Two or three fields are used to select the regular expressions that parse information from the database files. The first rule defines how to parse an accession string from the FASTA file title line. The second rule defines how to parse a description string from the FASTA file title line. The third rule defines how to parse an accession string from a local reference file. If there is no local reference file, the third rule should be set to '--- no local reference file ---'.

Basic regular expressions are described in Appendix A of the Mascot Setup and Administration manual.
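To illustrate the idea of separate accession and description rules, here is a minimal Python sketch. It rests on several assumptions: the title line and both rules are hypothetical, and Python's re module uses its own syntax rather than the Basic Regular Expressions that Mascot expects, so these patterns should not be pasted into mascot.dat as they stand.

```python
import re
from typing import Optional

# Hypothetical rules for a title line of the form ">ACCESSION description".
# The first parenthesised group captures the text to be extracted.
ACCESSION_RULE = re.compile(r"^>(\S+)")
DESCRIPTION_RULE = re.compile(r"^>\S+\s+(.*)")


def apply_rule(rule: "re.Pattern[str]", title_line: str) -> Optional[str]:
    """Return the captured text, or None if the rule does not match."""
    match = rule.search(title_line)
    return match.group(1) if match else None


title = ">ABC123 Hypothetical example protein"
accession = apply_rule(ACCESSION_RULE, title)
description = apply_rule(DESCRIPTION_RULE, title)
print(accession)     # ABC123
print(description)   # Hypothetical example protein
```

A rule that returns None for a real title line is the kind of failure that the 'Test this definition' step is designed to expose.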

Source and parse rules for full text report:
This section defines how Mascot can retrieve a full text report (references, feature table, etc.) for use in 'Protein View'. If full text is not required or not available, leave all fields blank and choose '--- no full text report ---' in the drop-down list.

Host: The report can come from a local reference file, or from a remote resource such as the NCBI Entrez server. In the case of a local file, specify localhost. Otherwise, specify the remote host (webserver) from which a report can be obtained by HTTP.

Port: Always specify 80 unless a remote host (webserver) is known to operate on a non-standard port.

Path: For a remote host, this path completes the URL required to retrieve a report. For localhost, specify the path to ms-getseq.exe followed by the arguments required to retrieve a report. #ACCESSION# is a placeholder for an accession string. Folders must be delimited by forward slashes, even on a Windows system.

Parse rule: This parse rule is used to select the text that will appear in 'Protein View'. If the host is a remote host, you'll probably want to remove the HTML tags.
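For a remote host, the Host, Port and Path fields together make up a URL in which #ACCESSION# is replaced by the accession string. The Python sketch below shows how such a URL might be assembled and how HTML tags could be removed from the returned text. It is an illustration under stated assumptions, not Mascot's implementation: the function names, the example host and the example path are all hypothetical, and the localhost case (ms-getseq.exe with arguments) is not covered.

```python
from html.parser import HTMLParser
from urllib.parse import quote


def build_report_url(host: str, port: int, path_template: str, accession: str) -> str:
    """Substitute the accession into the path and assemble an HTTP URL."""
    path = path_template.replace("#ACCESSION#", quote(accession, safe=""))
    if not path.startswith("/"):
        path = "/" + path
    # Port 80 is the HTTP default, so it can be left out of the URL.
    netloc = host if port == 80 else f"{host}:{port}"
    return f"http://{netloc}{path}"


class _TextExtractor(HTMLParser):
    """Collect only the text between tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def strip_html_tags(html: str) -> str:
    """Return the text content of an HTML fragment with the tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)
```

For example, build_report_url("www.example.org", 80, "/report?id=#ACCESSION#", "ABC123") gives http://www.example.org/report?id=ABC123.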

Testing the definition

When the database definition appears to be complete, press the 'Test this definition' button. This will run a number of tests on the consistency and syntax of the various settings. The script will also read the first five and last five entries from the FASTA file (and local reference file if defined) and test the parse rules. The test report for SwissProt should look something like this:

[Screenshot: Mascot database maintenance utility]

If a remote source was specified for the full text report, this will also be tested for the first entry in the database. If any problems are encountered, there will either be explicit error messages, or the failure will be apparent from the test results.
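The idea of sampling the first five and last five entries can be reproduced outside the utility when a parse rule needs debugging. The Python sketch below is an approximation of that check, not the script Mascot runs. The function names are hypothetical, and any rule passed in would be written in Python's regular expression syntax, not the Basic Regular Expressions used in mascot.dat.

```python
import re
from collections import deque
from typing import Iterable, List, Tuple


def first_and_last_titles(lines: Iterable[str], n: int = 5) -> Tuple[List[str], List[str]]:
    """Return the first n and last n FASTA title lines in a single pass."""
    first: List[str] = []
    last: deque = deque(maxlen=n)  # keeps only the n most recent titles
    for line in lines:
        if line.startswith(">"):
            title = line.rstrip("\r\n")
            if len(first) < n:
                first.append(title)
            last.append(title)
    return first, list(last)


def failing_titles(titles: Iterable[str], rule: "re.Pattern[str]") -> List[str]:
    """Return the title lines that the given parse rule does not match."""
    return [title for title in titles if rule.search(title) is None]
```

Because the file is read line by line, the sketch can be given an open file handle directly, for example first, last = first_and_last_titles(open(fasta_path)), where fasta_path is whatever FASTA file you want to check.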

Saving the changes

Use the button at the bottom of the test report to return to the database definitions form and choose 'Apply'. All being well, you can then follow the link to the Mascot database status page and monitor progress as the new database is brought on-line.
 
 
Copyright © 2007 Matrix Science Ltd. All Rights Reserved.