Matrix Science

Sequence Database Setup: Utilities

Miscellaneous

  • WinZip (Windows only)
    Besides supporting zip files, recent versions of WinZip can expand files compressed by Unix compress (.Z) and gzip (.gz), and can unpack tar archives. The main limitation of WinZip is that it has no command line interface, so it cannot easily be scripted.
  • gzip
    gzip (GNU zip) is a compression / decompression utility that creates .gz files. It can also decompress Unix compress files (.Z). A native Win32 port of gzip is available for download.
  • bzip2
    bzip2 can achieve very high compression ratios, and is standard on Linux. The default extension is .bz2. A native Win32 port of bzip2 is available for download.
  • tar
    tar creates archives that allow multiple files, optionally including a directory structure, to be stored in a single file. Usually, the resulting tar archive is compressed for distribution. A native Win32 port of tar is available for download.
  • ftp
    All operating systems include an ftp client. However, the command line client supplied with Windows is very limited. In particular, it does not support passive mode transfers, which are usually necessary if you are behind a firewall. The same is true of the Solaris FTP client. For command line ftp on all platforms, we recommend wget (below). For interactive ftp under Windows, there are many excellent shareware options, such as SmartFTP. WS_FTP is inexpensive and has a good reputation.
  • wget
    wget is a powerful, non-interactive command line tool for retrieving files using HTTP, HTTPS and FTP. A native Win32 port of wget is available for download.
  • grep
    grep searches one or more input files for lines containing a match to a specified pattern. By default, grep prints the matching lines. This is a powerful way to extract information from database files that are too large to open in a text editor. A native Win32 port of grep is available for download.
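
Where gzip and grep are not installed, the same kind of search can be scripted. The following is a minimal Python sketch, not part of Mascot, that scans a gzip-compressed text file line by line and returns the lines matching a pattern. The function name grep_gz and the latin-1 encoding are assumptions made for illustration.

```python
import gzip
import re
from typing import Iterator


def grep_gz(path: str, pattern: str) -> Iterator[str]:
    """Yield the lines of a gzip-compressed text file that match a pattern."""
    regex = re.compile(pattern)
    # Text mode decompresses on the fly, one line at a time, so the
    # whole database file never has to fit in memory.
    # latin-1 is an assumption; it decodes any byte without raising an error.
    with gzip.open(path, "rt", encoding="latin-1") as handle:
        for line in handle:
            if regex.search(line):
                yield line.rstrip("\n")
```

For example, list(grep_gz("sprot.dat.gz", r"^AC   ")) would collect every AC line, where sprot.dat.gz is a hypothetical compressed copy of the Swiss-Prot flat file.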

varsplic

varsplic is an EBI utility that generates additional sequences representing the splice isoforms, variants and conflicts described in Swiss-Prot annotations. varsplic is a Perl script, and requires the supporting module, Swissknife. Ensure that you have the latest version of varsplic.pl: version 2.0, file date 26 March 2003, or later. Download: varsplic, Swissknife.

The standard varsplic_sprot.fas file, available for download from EBI or Expasy, corresponds to:

varsplic.pl -input sprot.dat -fasta varsplic.fas -dbcode
This contains just splice variants, and doesn't include the original Swiss-Prot entries. To produce a database suitable for Mascot searches, complete with an equivalent dat file, db_update.pl executes varsplic.pl with the following parameters:
varsplic.pl -input sprot.dat -pseudo varsplic.dat -fasta varsplic.fas
    -which full -uniqids 1 -count -varsplic
    -variant -conflict -showdesc
This produces a database with the original entries plus all splice isoforms, variants and conflicts expanded. The Fasta title line for an unmodified entry looks like this:
>P80438 (12KD_MYCSM) 12 kDa protein (Fragment) 12 kDa protein (Fragment)
The title line for a new, expanded entry looks like this:
>P15455-00-00-00 (12S1_ARATH) Splice isoform Displayed; Variant Displayed; Conflict Displayed; from P15455 12S seed storage protein precursor
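
The two title line layouts above can be told apart programmatically. The following is a minimal Python sketch, assuming the layout ">ACCESSION (ENTRY_NAME) description" and assuming that an expanded accession always ends in three two-digit counters, as in the single example shown. The names FastaTitle and parse_title are illustrative and not part of varsplic or Mascot.

```python
import re
from typing import NamedTuple, Optional


class FastaTitle(NamedTuple):
    accession: str         # e.g. "P80438" or "P15455-00-00-00"
    entry_name: str        # e.g. "12KD_MYCSM"
    description: str       # remainder of the title line
    parent: Optional[str]  # parent accession for expanded entries, else None


# ">ACCESSION (ENTRY_NAME) description"
TITLE_RE = re.compile(r"^>(\S+) \(([^)]+)\) (.*)$")
# Assumption: expanded accessions are the parent accession followed by
# three hyphen-separated two-digit counters.
EXPANDED_RE = re.compile(r"^(.+)-\d{2}-\d{2}-\d{2}$")


def parse_title(line: str) -> FastaTitle:
    """Split a Fasta title line into accession, entry name and description."""
    match = TITLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised title line: {line!r}")
    accession, entry_name, description = match.groups()
    expanded = EXPANDED_RE.match(accession)
    parent = expanded.group(1) if expanded else None
    return FastaTitle(accession, entry_name, description, parent)
```

An entry whose parent field is None is an unmodified Swiss-Prot entry; any other entry was generated by varsplic from the parent accession.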
To use varsplic, the location of the Perl script must be defined in the db_update.pl header. The Swissknife package is also required, and must be placed on the system $PERLLIB path (e.g. C:\Perl\site\lib on a Windows system).
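
For testing outside db_update.pl, the same invocation can be assembled by hand. The following is a minimal Python sketch, assuming that perl is on the PATH; the function names and all file paths passed in are hypothetical. The options are copied from the db_update.pl command shown above.

```python
import os
import subprocess
from typing import Dict, List, Optional


def varsplic_command(script: str, sprot_dat: str,
                     out_dat: str, out_fasta: str) -> List[str]:
    """Build the argument list matching the db_update.pl invocation."""
    return [
        "perl", script,
        "-input", sprot_dat,
        "-pseudo", out_dat,
        "-fasta", out_fasta,
        "-which", "full",
        "-uniqids", "1",
        "-count", "-varsplic",
        "-variant", "-conflict", "-showdesc",
    ]


def run_varsplic(command: List[str],
                 swissknife_dir: Optional[str] = None) -> None:
    """Run varsplic.pl, optionally adding a Swissknife directory to PERLLIB."""
    env: Dict[str, str] = dict(os.environ)
    if swissknife_dir:
        # Perl ignores PERLLIB when PERL5LIB is also defined.
        existing = env.get("PERLLIB")
        env["PERLLIB"] = swissknife_dir + (os.pathsep + existing if existing else "")
    # check=True raises CalledProcessError if the script exits with an error.
    subprocess.run(command, env=env, check=True)
```

Passing the arguments as a list avoids shell quoting problems with Windows paths that contain spaces.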

To configure the expanded database in Mascot, it is necessary to use AC, rather than ID, as the unique accession string. The rule to parse the accession string from the local reference file should be changed to:

"^AC   \([^ ;]*\)"
The parse rule for the full text report should be similar to:
"\*.*\(AC   [-A-Z0-9_]*;.*\)"
 
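The accession rule above can be checked against sample records before it is entered in Mascot. The following is a minimal Python sketch, assuming that the escaped parentheses \( and \) in the rule mark the captured group, so that they become plain parentheses in Python's regular expression syntax. The name first_accession is illustrative.

```python
import re
from typing import Optional

# Mascot rule:  "^AC   \([^ ;]*\)"
# Assumed Python equivalent: \( \) become ( ), and MULTILINE lets ^
# match at the start of any line in a multi-line record.
ACCESSION_RULE = re.compile(r"^AC   ([^ ;]*)", re.MULTILINE)


def first_accession(record: str) -> Optional[str]:
    """Return the first accession on the first AC line, or None if absent."""
    match = ACCESSION_RULE.search(record)
    return match.group(1) if match else None
```

Because the captured group stops at the first space or semicolon, only the primary accession is returned when an AC line lists several.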
 
Copyright © 2007 Matrix Science Ltd. All Rights Reserved.