Sequence Database Setup: Database Update
Overview
The purpose of the database update script, db_update.pl, is to download database updates for Mascot,
so that the whole process can be automated using Unix Cron or Windows Scheduled Tasks.
Achieving this involves some complexity. The functionality includes:
- downloading a fasta file plus optional associated files such as a full text reference file or
release notes
- optionally downloading one or more taxonomy indexes
- handling variable filenames via wild cards
- uncompressing, unpacking, renaming and moving the files
- time or version stamping
- downloading a file only if a new one is available; resuming an interrupted download
- passive FTP through a firewall; HTTP proxy server authentication
- special processing, such as splice variant expansion of Swiss-Prot using varsplic
Usage
db_update.pl DB
where DB is a keyword predefined in the script header. For example:
db_update.pl NCBInr_from_NCBI
Installation
db_update.pl should be placed in the Mascot bin directory. The following utilities are required:
wget, tar and gzip.
Of these, tar and gzip are likely to be present on any Unix system. All three utilities should be installed
into a directory on the system search path, so they can be executed from any directory without having to
provide path information.
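A quick way to confirm the search path requirement is to ask the shell where it finds each utility. The following is only a sketch: the function name check_tools is made up for illustration, and the list passed to it should be extended with any other utility named in your copy of the script header.

```shell
# Report whether each named utility can be found on the search path.
# check_tools is a hypothetical helper, not part of db_update.pl.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# tar and gzip are the utilities named above; add others as required.
check_tools tar gzip || echo "Install the missing utilities before running db_update.pl"
```

Run the check as the user who will own the scheduled job, since the search path can differ between accounts.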
Configuration
General
Open db_update.pl in a text editor and read through the header. You may need to modify the path to the
Mascot root directory. Note that several definitions are specified independently for Unix and Windows,
to minimise the need for editing. You only need to change the definitions for the platform you are using.
If there is an HTTP proxy server between the Mascot server and the internet, and this proxy server
requires authentication, you should uncomment and modify the $wget_options definitions for --proxy-user
and --proxy-passwd. On Unix systems, the proxy server URL will normally be found from the environment.
On Windows systems, you may need to specify it with additional $wget_options.
Detailed configuration information can be found in the Mascot Setup & Installation manual.
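On Unix, one way to see what proxy settings will be picked up from the environment is to print the variables that wget consults. This is a minimal sketch: the function name show_proxy_env is made up for illustration, and the values on your system may contain credentials, so take care where the output is displayed.

```shell
# Print the proxy-related environment variables that wget reads.
# show_proxy_env is a hypothetical helper, not part of db_update.pl.
show_proxy_env() {
    for name in http_proxy https_proxy ftp_proxy no_proxy; do
        # Look up the variable whose name is held in $name.
        eval "value=\${$name:-}"
        if [ -n "$value" ]; then
            echo "$name=$value"
        else
            echo "$name is not set"
        fi
    done
}

show_proxy_env
```

If none of these variables is set for the account that runs the script, the proxy will probably need to be specified explicitly.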
Database Update Definition Blocks
Several common database update definition blocks are pre-configured, and you may not need to add or
change anything before using the script. A particular definition block is chosen by means of a keyword
argument when the script is executed.
If the name or location of a download file changes, you will need to update the corresponding definition
block. If you want to add a new database, the easiest way is to make a copy of a similar looking definition
block and then modify it.
You can download files from HTTP servers, but this should only be done if no FTP server is available.
Downloads from HTTP servers do not allow for resumption of failed downloads, do not allow wild cards to
be used in the filename, and will always proceed, even if the file has been downloaded on a previous occasion.
Testing a Database Definition
Before adding a new db_update.pl entry to Unix Cron or Windows Scheduled Tasks, it is essential to test it.
Unix
To properly test the functionality, you should execute the script at a shell prompt when logged on as the
proposed owner of the Cron job, and from a directory other than the one in which the script is located.
This will ensure that directory permissions are correct and the paths can be resolved. For security and
safety, the owner of the Cron job should not be root.
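The test procedure above might be wrapped up as follows. This is a sketch under stated assumptions: the function name run_update_test is made up, /usr/local/mascot/bin is only a guess at the Mascot bin directory, and perl is assumed to be on the search path.

```shell
# Run db_update.pl from a directory other than its own.
# run_update_test is a hypothetical helper; bin_dir must be an absolute path.
run_update_test() {
    bin_dir=$1
    keyword=$2
    if [ "$(id -u)" -eq 0 ]; then
        echo "warning: running as root; test as the intended Cron owner" >&2
    fi
    if [ ! -f "$bin_dir/db_update.pl" ]; then
        echo "db_update.pl not found in $bin_dir" >&2
        return 2
    fi
    # Use a subshell so the caller's working directory is left unchanged.
    (
        cd "${TMPDIR:-/tmp}" || exit 1
        perl "$bin_dir/db_update.pl" "$keyword"
    )
}

# Assumed install location; replace with your own Mascot bin directory.
run_update_test /usr/local/mascot/bin NCBInr_from_NCBI ||
    echo "test run failed with status $?"
```

A failure at this stage, such as a permissions error or an unresolved path, is much easier to diagnose at a shell prompt than from a Cron job.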
Windows
To properly test the functionality, you should execute the script from a command prompt and from a
directory other than the one in which the script is located. This will ensure that directory permissions
are correct and the paths can be resolved.
Automation
Unix
Once the script has been found to function correctly for a particular definition block, an entry can be
added to Cron. As a rule, you should stagger database updates through Mascot server quiet periods.
Trying to update all the databases simultaneously will prolong the download times and may slow down
any Mascot searches currently in progress.
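As an illustration of staggering, the sketch below writes two Cron entries to a file for review. Everything specific in it is an assumption: the install path /usr/local/mascot/bin, the choice of early Saturday and Sunday mornings as quiet periods, and Second_DB_keyword, which stands in for a real keyword from the script header.

```shell
# Write two staggered Cron entries to a file for review before installing.
# The path, the times and Second_DB_keyword are placeholders, not defaults.
cron_file=mascot_db_update.cron
cat > "$cron_file" <<'EOF'
# min hour day-of-month month day-of-week command
30 1 * * 6 perl /usr/local/mascot/bin/db_update.pl NCBInr_from_NCBI
30 1 * * 0 perl /usr/local/mascot/bin/db_update.pl Second_DB_keyword
EOF
cat "$cron_file"
# After review, add these lines to the job owner's table using: crontab -e
```

The first entry runs at 01:30 on Saturdays and the second at 01:30 on Sundays, so the two downloads never compete for bandwidth.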
Windows 2000 / XP
Windows Scheduled Tasks provides a mechanism for executing db_update.pl at a predetermined time.
You can locate the Scheduled Tasks folder from the Start menu; Programs; Accessories; System Tools.
There is a wizard to add new tasks, but it is easier to add a new task from the file menu, then edit
the properties.
It's a good idea to use the database keyword for the task name. Right click the new task, and
choose properties. In the Run field, enter perl followed by the script name followed by the database
keyword, for example perl db_update.pl NCBInr_from_NCBI. Set the Start in directory to the location
of the script.
The schedule can be whatever you wish.
Press OK to save the task. You will be asked for the password of the user who owns the task.
The same process can be repeated to update other databases. As a rule, you should stagger database
updates during Mascot server quiet periods. Trying to update all the databases simultaneously will
prolong the download times and may slow down any Mascot searches currently in progress.
Windows NT
Windows NT doesn't have Scheduled Tasks. A mechanism for executing jobs automatically is provided
by the Cron section of mascot.dat, which emulates Unix Cron. For further details, refer to Chapter
6 of the Mascot Setup and Administration Manual.
Miscellaneous
A single log file is maintained for all instances of db_update.pl. The location is defined by
$local_log_file in the script header.
Each file downloaded by FTP is listed in a file called .history, located in the corresponding
incoming directory. This is used to prevent a given file being downloaded more than once. If you
want to defeat this mechanism, simply delete or edit the .history file.
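If only one file needs to be fetched again, editing the history can be scripted. The sketch below assumes that .history holds one entry per line and that each entry contains the file name; that layout is not confirmed here, so inspect your own .history first. The function name forget_download and the file names in the demonstration are made up.

```shell
# Remove every line mentioning a given file name from a history file,
# keeping a backup copy. forget_download is a hypothetical helper.
forget_download() {
    history_file=$1
    file_name=$2
    [ -f "$history_file" ] || { echo "no history file: $history_file" >&2; return 1; }
    cp "$history_file" "$history_file.bak"
    # -F treats the name as a fixed string; any line containing it is dropped.
    # grep -v exits with status 1 when no lines remain, which is acceptable here.
    grep -v -F -- "$file_name" "$history_file.bak" > "$history_file" || true
}

# Demonstration on a scratch copy with invented file names.
demo_dir=$(mktemp -d)
printf '%s\n' 'example_a.fasta.gz' 'example_b.fasta.gz' > "$demo_dir/.history"
forget_download "$demo_dir/.history" 'example_a.fasta.gz'
cat "$demo_dir/.history"
```

Because the match is on a substring, a short name can remove more entries than intended. The backup copy ending in .bak allows the original history to be restored.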