Sequence Database Setup: Utilities
Miscellaneous
- WinZip (Windows only)
Besides supporting zip files, recent versions of WinZip can expand files compressed
by Unix compress (.Z) and gzip (.gz). It can also unpack tar archives. The main limitation
of WinZip is that it has no command line interface, so it cannot easily be scripted.
- gzip
gzip (GNU zip) is a compression / decompression utility that creates .gz files. It can also
decompress Unix compress files (.Z). A native Win32 port of gzip is available for download.
- bzip2
bzip2 can achieve very high compression ratios and is standard on Linux.
The default extension is .bz2. A native Win32 port of bzip2 is available for download.
- tar
tar creates archives that allow multiple files, optionally including a directory
structure, to be stored in a single file. The resulting tar archive is usually compressed
for distribution. A native Win32 port of tar is available for download.
- ftp
All operating systems include an ftp client. However, the command line client supplied with
Windows is very limited. In particular, it does not support passive mode transfers,
which are usually necessary if you are behind a firewall. The same is true of the Solaris
ftp client. For command line ftp on all platforms, we recommend wget (below).
For interactive ftp under Windows, there are many excellent shareware options,
such as SmartFTP. WS_FTP is also inexpensive and has a good reputation.
- wget
wget is a powerful, non-interactive command line tool for retrieving files using HTTP, HTTPS
and FTP. A native Win32 port of wget is available for download.
- grep
grep searches one or more input files for lines containing a match to a specified pattern.
By default, grep prints the matching lines. This is a powerful way to extract
information from database files that are too large to open in a text editor. A native Win32
port of grep is available for download.
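The gzip and grep entries above can be combined when a database file is too large to open in an editor. As a minimal sketch in Python (standard library only), the following streams a plain or gzip-compressed text file line by line and yields the lines that match a pattern, much as grep prints matching lines by default. The function name grep_lines is an illustrative assumption, not part of any of the tools listed.

```python
import gzip
import re
from typing import Iterator


def grep_lines(path: str, pattern: str) -> Iterator[str]:
    """Yield each line of a text file that contains a match for pattern.

    Files whose names end in .gz are decompressed on the fly, so the
    database never has to be unpacked on disk or held in memory.
    """
    regex = re.compile(pattern)
    # gzip.open in "rt" mode returns decoded text, like the built-in open().
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="latin-1") as handle:
        for line in handle:
            if regex.search(line):
                yield line.rstrip("\n")
```

For example, list(grep_lines("sprot.dat.gz", r"^AC")) would collect the accession lines of a compressed Swiss-Prot file; the compressed file name is hypothetical.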
varsplic
varsplic is an EBI utility that generates additional sequences representing the splice isoforms,
variants and conflicts described in Swiss-Prot annotations. varsplic is a perl script and requires
the supporting module, Swissknife. Ensure that you have the latest version of varsplic.pl
(version 2.0, file date 26 March 2003, or later). Download: varsplic, Swissknife.
The standard varsplic_sprot.fas file, available for download from EBI or Expasy, corresponds to:
varsplic.pl -input sprot.dat -fasta varsplic.fas -dbcode
This file contains only the splice variants; it does not include the original Swiss-Prot entries.
To produce a database suitable for Mascot searches, complete with an equivalent dat file, db_update.pl
executes varsplic.pl with the following parameters:
varsplic.pl -input sprot.dat -pseudo varsplic.dat -fasta varsplic.fas
-which full -uniqids 1 -count -varsplic
-variant -conflict -showdesc
This produces a database with the original entries plus all splice isoforms, variants and conflicts expanded.
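As a rough illustration of the invocation above, the following Python sketch assembles the same argument list that db_update.pl is described as passing to varsplic.pl. The function name build_varsplic_command, the interpreter name perl and the script location are assumptions made for the example; the flags themselves are copied from the command shown above. The function only builds the list and does not run anything.

```python
from typing import List


def build_varsplic_command(
    input_dat: str = "sprot.dat",
    pseudo_dat: str = "varsplic.dat",
    fasta_out: str = "varsplic.fas",
    script: str = "varsplic.pl",  # assumed location of the perl script
    perl: str = "perl",           # assumed name of the perl interpreter
) -> List[str]:
    """Return the varsplic.pl command line as a list of arguments."""
    return [
        perl, script,
        "-input", input_dat,      # Swiss-Prot flat file to expand
        "-pseudo", pseudo_dat,    # equivalent dat file for the expanded entries
        "-fasta", fasta_out,      # Fasta output
        "-which", "full",
        "-uniqids", "1",
        "-count",
        "-varsplic",
        "-variant",
        "-conflict",
        "-showdesc",
    ]
```

Passing the returned list to subprocess.run(cmd, check=True) would execute the script, provided perl, varsplic.pl and Swissknife are installed. Keeping the arguments as a list avoids shell quoting problems with paths that contain spaces.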
The Fasta title line for an unmodified entry looks like this:
>P80438 (12KD_MYCSM) 12 kDa protein (Fragment)
12 kDa protein (Fragment)
The title line for a new, expanded entry looks like this:
>P15455-00-00-00 (12S1_ARATH) Splice isoform Displayed; Variant Displayed;
Conflict Displayed; from P15455 12S seed storage protein precursor
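To make the difference between the two title formats concrete, here is a small Python sketch that splits a title line into accession, entry name and description, and flags expanded entries. It assumes, based only on the two examples above, that an expanded accession is the parent accession followed by three hyphen-separated numeric fields. The names parse_title and FastaTitle are illustrative and are not part of varsplic or Mascot.

```python
import re
from typing import NamedTuple, Optional


class FastaTitle(NamedTuple):
    accession: str         # e.g. "P80438" or "P15455-00-00-00"
    entry_name: str        # e.g. "12KD_MYCSM"
    description: str
    parent: Optional[str]  # parent accession for expanded entries, else None


# Layout taken from the examples: ">ACCESSION (ENTRY_NAME) description"
_TITLE = re.compile(r"^>(\S+) \(([^)]+)\) ?(.*)$")
# Assumed pattern for expanded accessions: parent plus three numeric fields.
_EXPANDED = re.compile(r"^(.+)-\d+-\d+-\d+$")


def parse_title(line: str) -> FastaTitle:
    """Parse one Fasta title line written as a single line of text."""
    match = _TITLE.match(line.strip())
    if match is None:
        raise ValueError(f"not a recognised title line: {line!r}")
    accession, entry_name, description = match.groups()
    expanded = _EXPANDED.match(accession)
    parent = expanded.group(1) if expanded else None
    return FastaTitle(accession, entry_name, description, parent)
```

With this sketch, an unmodified entry such as P80438 has parent set to None, while P15455-00-00-00 reports P15455 as its parent.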
To use varsplic, the location of the perl script must be defined in the db_update.pl header.
The Swissknife package is also required, and must be placed on the system $PERLLIB path
(e.g. C:\Perl\site\lib on a Windows system).
To configure the expanded database in Mascot, it is necessary to use AC, rather than ID,
as the unique accession string. The rule to parse the accession string from the local
reference file should be changed to:
"^AC \([^ ;]*\)"
The parse rule for the full text report should be similar to:
"\*.*\(AC [-A-Z0-9_]*;.*\)"
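The accession rule can be tried out on sample lines before it is entered in Mascot. The sketch below is a Python adaptation, not the rule syntax that Mascot itself uses: the escaped parentheses of the rule become a plain capturing group, and "AC +" is used so that the three spaces that normally follow the line code in Swiss-Prot format files are accepted as well as a single space. The sample record and the name first_accession are invented for illustration.

```python
import re
from typing import Optional

# Python counterpart of the accession rule: capture everything after the
# AC line code up to the first space or semicolon.
AC_RULE = re.compile(r"^AC +([^ ;]*)", re.MULTILINE)


def first_accession(record: str) -> Optional[str]:
    """Return the first accession found on an AC line, or None."""
    match = AC_RULE.search(record)
    return match.group(1) if match else None


# Invented sample record in Swiss-Prot style, for illustration only.
SAMPLE = (
    "ID   12S1_ARATH\n"
    "AC   P15455-00-00-00;\n"
    "DE   12S seed storage protein precursor\n"
)
```

Calling first_accession(SAMPLE) picks out the expanded accession rather than the entry name on the ID line, which is the behaviour the AC-based rule is meant to give.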