Matrix Science

Sequence Database Setup: Database Update

Overview

The database update script, db_update.pl, downloads database updates for Mascot so that the whole process can be automated using Unix Cron or Windows Scheduled Tasks. Achieving this involves some complexity, and the script's functionality includes:
  • downloading a fasta file plus optional associated files such as a full text reference file or release notes
  • optionally downloading one or more taxonomy indexes
  • handling variable filenames via wild cards
  • uncompressing, unpacking, renaming and moving the files
  • time or version stamping
  • downloading a file only if a new one is available; resuming an interrupted download
  • passive FTP through a firewall; HTTP proxy server authentication
  • special processing, such as splice variant expansion of Swiss-Prot using varsplic

Usage

db_update.pl DB

where DB is a keyword predefined in the script header. For example:

db_update.pl NCBInr_from_NCBI

Installation

db_update.pl should be placed in the Mascot bin directory. The following utilities are required:
  • wget
  • tar
  • gzip

tar and gzip are likely to be present on any Unix system. All three utilities should be installed in a directory on the system search path, so that they can be executed from any directory without providing path information.
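
A quick way to confirm that the utilities can be found is to look each one up on the search path. The sketch below assumes the three utilities are wget, tar and gzip (wget is inferred from the $wget_options setting described under Configuration). missing_tools is a hypothetical helper name and is not part of db_update.pl.

```shell
# Print the name of every tool that cannot be found on the search path.
# missing_tools is a hypothetical helper, not part of db_update.pl.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumption: wget, tar and gzip are the three required utilities.
missing=$(missing_tools wget tar gzip)
if [ -n "$missing" ]; then
    echo "Not found on the search path:"
    echo "$missing"
else
    echo "All required utilities were found."
fi
```

Run the check as the user who will own the scheduled job, since the search path can differ between accounts.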

Configuration

General

Open db_update.pl in a text editor and read through the header. You may need to modify the path to the Mascot root directory. Note that several definitions are specified independently for Unix and Windows, to minimise the need for editing. You only need to change the definitions for the platform you are using.

If there is an HTTP proxy server between the Mascot server and the internet, and this proxy server requires authentication, you should uncomment and modify the $wget_options definitions for --proxy-user and --proxy-passwd. On Unix systems, the proxy server URL will normally be found from the environment. On Windows systems, you may need to specify it with additional $wget_options.
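
On Unix, wget reads the proxy server URL from the http_proxy environment variable. The following is a minimal sketch, assuming a hypothetical proxy at proxy.example.com on port 8080. The authentication options themselves are still set through $wget_options in the script header.

```shell
# Assumption: hypothetical proxy address; replace with your site's proxy.
http_proxy="http://proxy.example.com:8080/"
export http_proxy

# Show what wget will pick up from the environment.
echo "http_proxy is set to: $http_proxy"
```

Cron typically runs jobs with a minimal environment, so the variable may need to be set in the crontab or in a wrapper script rather than in a login profile.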

Detailed configuration information can be found in the Mascot Setup & Installation manual.

Database Update Definition Blocks

Several common database update definition blocks are pre-configured, and you may not need to add or change anything before using the script. A particular definition block is chosen by means of a keyword argument when the script is executed.

If the name or location of a download file changes, you will need to update the corresponding definition block. If you want to add a new database, the easiest way is to make a copy of a similar looking definition block and then modify it.

You can download files from HTTP servers, but this should only be done if no FTP server is available. Downloads from HTTP servers do not allow for resumption of failed downloads, do not allow wild cards to be used in the filename, and will always proceed, even if the file has been downloaded on a previous occasion.

Testing a Database Definition

Before adding a new db_update.pl entry to Unix Cron or Windows Scheduled Tasks, it is essential to test it.

Unix

To test the functionality properly, execute the script at a shell prompt while logged on as the proposed owner of the Cron job, and from a directory other than the one in which the script is located. This confirms that directory permissions are correct and that the paths can be resolved. For security and safety, the owner of the Cron job should not be root.
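
The conditions above can be checked before a test run. This is only a sketch: the install path /usr/local/mascot/bin is an assumption, and preflight and try_update are hypothetical helper names that are not part of db_update.pl.

```shell
# Assumption: hypothetical install location; adjust to your Mascot bin directory.
MASCOT_BIN=${MASCOT_BIN:-/usr/local/mascot/bin}

# Succeed only if the user is not root and the working directory
# differs from the directory that holds the script.
# Arguments: numeric user id, current directory, script directory.
preflight() {
    uid=$1
    cwd=$2
    script_dir=$3
    if [ "$uid" -eq 0 ]; then
        echo "refusing to run as root" >&2
        return 1
    fi
    if [ "$cwd" = "$script_dir" ]; then
        echo "change to a directory other than $script_dir first" >&2
        return 1
    fi
    return 0
}

# Run one definition block, chosen by keyword, if the checks pass.
try_update() {
    preflight "$(id -u)" "$(pwd -P)" "$MASCOT_BIN" &&
        perl "$MASCOT_BIN/db_update.pl" "$1"
}

# Example (not executed here): try_update NCBInr_from_NCBI
```

After the run, check the log file and the incoming directory to confirm that the files arrived as expected.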

Windows

To test the functionality properly, execute the script from a command prompt and from a directory other than the one in which the script is located. This confirms that directory permissions are correct and that the paths can be resolved.

Automation

Unix

Once the script has been found to function correctly for a particular definition block, an entry can be added to Cron. As a rule, you should stagger database updates through Mascot server quiet periods. Trying to update all the databases simultaneously will prolong the download times and may slow down any Mascot searches currently in progress.
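
One way to stagger the updates is to generate the crontab lines with increasing start times. The sketch below makes several assumptions: the install path /usr/local/mascot/bin, a weekly run on Sunday, a one-hour gap between databases, and a second keyword Example_DB that is purely hypothetical. stagger_cron is a hypothetical helper name.

```shell
# Assumption: hypothetical install location; adjust to your Mascot bin directory.
MASCOT_BIN=${MASCOT_BIN:-/usr/local/mascot/bin}

# Print one crontab line per database keyword. The first job starts at
# the given hour on Sunday (day-of-week 0) and each later job starts
# one hour after the previous one. Widen the gap if downloads take longer.
stagger_cron() {
    hour=$1
    shift
    for db in "$@"; do
        printf '0 %d * * 0 perl %s/db_update.pl %s\n' "$hour" "$MASCOT_BIN" "$db"
        hour=$(( (hour + 1) % 24 ))
    done
}

# NCBInr_from_NCBI is the keyword used earlier; Example_DB is hypothetical.
stagger_cron 1 NCBInr_from_NCBI Example_DB
```

The printed lines can be reviewed and then added with crontab -e while logged on as the non-root owner of the job.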

Windows 2000 / XP

Windows Scheduled Tasks provides a mechanism for executing db_update.pl at a predetermined time. You can locate the Scheduled Tasks folder from the Start menu under Programs, Accessories, System Tools. There is a wizard to add new tasks, but it is easier to add a new task from the File menu and then edit its properties.

It's a good idea to use the database keyword for the task name. Right-click the new task and choose Properties. In the Run field, enter perl followed by the script name followed by the database keyword, for example perl db_update.pl NCBInr_from_NCBI. Set the Start in directory to the location of the script:

[Screenshot: Scheduled Tasks]

The schedule can be whatever you wish. For example:

[Screenshot: Scheduled Tasks]

Press OK to save the task. You will be asked for the password of the user who owns the task. The same process can be repeated to update other databases. As a rule, you should stagger database updates during Mascot server quiet periods. Trying to update all the databases simultaneously will prolong the download times and may slow down any Mascot searches currently in progress.

Windows NT

Windows NT doesn't have Scheduled Tasks. A mechanism for executing jobs automatically is provided by the Cron section of mascot.dat, which emulates Unix Cron. For further details, refer to Chapter 6 of the Mascot Setup and Administration Manual.

Miscellaneous

A single log file is maintained for all instances of db_update.pl. The location is defined by $local_log_file in the script header.

Each file downloaded by FTP is listed in a file called .history, located in the corresponding incoming directory. This is used to prevent a given file being downloaded more than once. If you want to defeat this mechanism, simply delete or edit the .history file.
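
Editing the .history file can be scripted when only one file needs to be fetched again. The sketch below assumes that .history holds one downloaded filename per line, which is not documented here and should be checked against a real file first. forget_download is a hypothetical helper name, and the incoming directory is passed in as an argument.

```shell
# Remove every line that mentions the given filename from .history in
# the given incoming directory, keeping a backup copy as .history.bak.
# Assumption: .history lists one downloaded filename per line.
forget_download() {
    history_file=$1/.history
    name=$2
    if [ ! -f "$history_file" ]; then
        echo "no .history file in $1" >&2
        return 1
    fi
    cp "$history_file" "$history_file.bak"
    # grep -v exits with status 1 when nothing is left; that is not an error here.
    grep -v -F -- "$name" "$history_file.bak" > "$history_file" || true
}

# Example (not executed here): forget_download /path/to/incoming somefile.gz
```

Because the match is a plain substring match, a short name can remove more entries than intended; compare .history with .history.bak afterwards.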

 
 
Copyright © 2007 Matrix Science Ltd. All Rights Reserved.