Matrix Science

Sequence Database Setup: UniRef

Overview

UniRef, also known as UniProt NREF, is a set of comprehensive protein databases curated by the Universal Protein Resource consortium. There are three versions of UniRef: UniRef100, UniRef90, and UniRef50. UniRef100 is non-identical, that is, identical sequences are merged into a single entry, while UniRef90 and UniRef50 are non-redundant at sequence similarity levels of 90% and 50% respectively. Searching with mass spectrometry data requires the exact sequence to be present in the database, so UniRef100 is the version to choose.

Download

PIR: ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref100/
EBI: ftp://ftp.ebi.ac.uk/pub/databases/uniprot/uniref/uniref100/
Expasy: ftp://ftp.expasy.org/databases/uniprot/current_release/uniref/uniref100/

The files are:

  • Version info: uniref100.release_note
  • Fasta file: uniref100.fasta.gz

Note that the XML file, uniref100.xml.gz, contains essentially the same information as the Fasta file, so it cannot serve as a full text reference file.

To download updates automatically, use the UniRef100_Fasta_from_EBI definition block in db_update.pl.

Taxonomy

If you have Mascot 2.0 or earlier, add the following taxonomy definition to mascot.dat, changing the taxonomy block number so that it follows consecutively from the existing blocks. If you have Mascot 2.1, you may need to update this taxonomy block, because the database curators recently made changes to the Fasta title syntax.

# TAXONOMY FOR UniRef
Taxonomy_12
Identifier UniRef
Enabled 1 # 0 to disable it
FromRefFile 0
ErrorLevel 0
SpeciesFiles NCBI:names.dmp
NodesFiles NCBI:nodes.dmp
DefaultRule NCBI, CHOP:W ".*- \([^(]*\)" # anything between the last - and (
end
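
As a rough illustration of what the DefaultRule pattern captures, the sketch below applies an equivalent Python regular expression to a sample UniRef title line. The translation of Mascot's \( \) grouping into Python's ( ) is an assumption, the name species_from_title is hypothetical, and the NCBI lookup and CHOP:W handling that Mascot performs on the captured text are not modelled.

```python
import re

# Python equivalent of the DefaultRule pattern ".*- \([^(]*\)".
# Assumption: Mascot's \( \) delimit a capture group, written ( ) in Python.
SPECIES_RULE = re.compile(r".*- ([^(]*)")


def species_from_title(title):
    """Return the text after the last "- " up to the next "(", or None."""
    match = SPECIES_RULE.match(title)
    return match.group(1).strip() if match else None


# Sample UniRef100 Fasta title line.
sample_title = (
    ">UniRef100_Q4U9M9 Cluster: 104 kDa microneme-rhoptry antigen precursor; "
    "n=1; Theileria annulata|Rep: 104 kDa microneme-rhoptry antigen precursor "
    "- Theileria annulata"
)

print(species_from_title(sample_title))
```

For the sample title this prints Theileria annulata. The hyphen inside "microneme-rhoptry" is ignored because the pattern requires a space after the hyphen.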

The following taxonomy file is required:
ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz

Remember that the taxonomy files go into the taxonomy directory, not into the sequence database directory. Also, these files need to be unpacked (using tar) as well as uncompressed.
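
Where tar is not to hand, the unpacking step can be sketched in Python with the standard tarfile module. The function name unpack_taxonomy and the paths in the usage note are hypothetical, and the sketch assumes names.dmp and nodes.dmp sit at the top level of the archive.

```python
import shutil
import tarfile
from pathlib import Path

# The two files named by SpeciesFiles and NodesFiles in the taxonomy block.
WANTED = ("names.dmp", "nodes.dmp")


def unpack_taxonomy(archive, taxonomy_dir, wanted=WANTED):
    """Copy the wanted members of a .tar.gz archive into taxonomy_dir."""
    taxonomy_dir = Path(taxonomy_dir)
    taxonomy_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    # Mode "r:gz" uncompresses and unpacks in a single pass.
    with tarfile.open(archive, "r:gz") as tar:
        for name in wanted:
            source = tar.extractfile(name)  # raises KeyError if absent
            if source is None:
                continue  # member exists but is not a regular file
            with source, open(taxonomy_dir / name, "wb") as target:
                shutil.copyfileobj(source, target)
            extracted.append(name)
    return extracted
```

A call such as unpack_taxonomy("taxdump.tar.gz", "taxonomy") would then leave the two .dmp files in the taxonomy directory. Substitute the real location of your Mascot taxonomy directory.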

Parse Rules

A typical UniRef Fasta title line is:

>UniRef100_Q4U9M9 Cluster: 104 kDa microneme-rhoptry antigen precursor; n=1; Theileria annulata|Rep: 104 kDa microneme-rhoptry antigen precursor - Theileria annulata

The literal text, UniRef100_, should be dropped from the accession string to make linking easier, so the accession parsed from the title above is Q4U9M9.

Accession from Fasta title: ">UniRef100_\([^ ]*\)"
Description from Fasta title: ">[^ ]* \(.*\)"
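
To check the two rules before entering them, they can be tried out on the sample title with ordinary regular expressions. The sketch below is a Python approximation, not Mascot's own parser: it assumes that \( \) in the Mascot rules marks the captured group, and the name parse_title is hypothetical.

```python
import re

# Python translations of the two Mascot parse rules.
# Assumption: Mascot's \( \) mark the captured group, written ( ) in Python.
ACCESSION_RULE = re.compile(r">UniRef100_([^ ]*)")
DESCRIPTION_RULE = re.compile(r">[^ ]* (.*)")


def parse_title(title):
    """Return (accession, description) from a UniRef100 Fasta title line."""
    acc = ACCESSION_RULE.match(title)
    desc = DESCRIPTION_RULE.match(title)
    if acc is None or desc is None:
        raise ValueError("not a UniRef100 title line: %r" % title)
    return acc.group(1), desc.group(1)


title = (
    ">UniRef100_Q4U9M9 Cluster: 104 kDa microneme-rhoptry antigen precursor; "
    "n=1; Theileria annulata|Rep: 104 kDa microneme-rhoptry antigen precursor "
    "- Theileria annulata"
)

accession, description = parse_title(title)
print(accession)
print(description)
```

The accession comes out as Q4U9M9, without the UniRef100_ prefix, and the description is everything after the first space in the title.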

Configuration

For this example, the Fasta file was downloaded to C:\Inetpub\MASCOT\sequence\uniref100\current, decompressed using gzip, and renamed to uniref100_9.6.fasta. Note that the rule numbers in your copy of mascot.dat may differ from those in the screen shot.

[Screen shot: Mascot database maintenance utility]

There isn't a downloadable reference file for UniRef, but full text for individual entries can be retrieved across the web from the EBI SRS server. For an SRS7 server, the syntax for the Path field is:

HTML: /srsbin/cgi-bin/wgetz?-e+[uniprot-AccNumber:#ACCESSION#]
Plain text: /srsbin/cgi-bin/wgetz?-e+[uniprot-AccNumber:#ACCESSION#]+-vn+2
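
The #ACCESSION# token in these paths appears to be a placeholder that is replaced by the accession parsed from the Fasta title. The following sketch shows that substitution under this assumption. The name fill_path is hypothetical, and the host name is deliberately left out because it depends on the SRS server you use.

```python
# Path templates for an SRS7 server, as given above.
HTML_PATH = "/srsbin/cgi-bin/wgetz?-e+[uniprot-AccNumber:#ACCESSION#]"
TEXT_PATH = "/srsbin/cgi-bin/wgetz?-e+[uniprot-AccNumber:#ACCESSION#]+-vn+2"


def fill_path(template, accession):
    """Substitute a parsed accession for the #ACCESSION# placeholder."""
    # Assumption: a plain text substitution, with no extra URL escaping.
    return template.replace("#ACCESSION#", accession)


print(fill_path(HTML_PATH, "Q4U9M9"))
print(fill_path(TEXT_PATH, "Q4U9M9"))
```

Because the accession rule drops the UniRef100_ prefix, the value substituted here is a bare UniProt accession such as Q4U9M9, which is the form the uniprot-AccNumber query expects.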

If you don't require full text in a Mascot Protein View report, simply leave the Host, Port, and Path fields blank and choose
--- no full text report ---
in the drop down list.

Always test a new definition before applying the changes to mascot.dat.

 
 
Copyright © 2007 Matrix Science Ltd. All Rights Reserved.