Matrix Science

Sequence Database Setup: nr

Overview

The nr database is compiled by the NCBI (National Center for Biotechnology Information) as a protein database for Blast searches. It contains non-identical sequences from GenBank CDS translations, PDB, Swiss-Prot, PIR, and PRF.

One of the main advantages of nr is that it is updated very frequently.

Download

The current release is available from ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz

To download updates automatically, use the NCBInr_from_NCBI definition block in db_update.pl.

Taxonomy

Taxonomy for nr is predefined in mascot.dat; choose "NCBI nr FASTA using GI2TAXID". The following taxonomy files are required:

ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz
ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz

Note that the taxonomy files go into the taxonomy directory, not into the sequence database directory. Also, some files need to be unpacked (using tar) as well as uncompressed.
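As a rough illustration of the unpacking step, the sketch below uncompresses gi_taxid_prot.dmp.gz and extracts taxdump.tar.gz in place. The function name unpack_taxonomy and the idea of passing the taxonomy directory as an argument are assumptions made for this example; they are not part of Mascot, and the same result can be obtained with gzip and tar on the command line.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def unpack_taxonomy(taxonomy_dir):
    """Uncompress and unpack the two taxonomy downloads found in taxonomy_dir.

    Hypothetical helper for illustration; returns the names of the files written.
    """
    taxonomy_dir = Path(taxonomy_dir)
    written = []

    # gi_taxid_prot.dmp.gz only needs to be uncompressed.
    gz_file = taxonomy_dir / "gi_taxid_prot.dmp.gz"
    if gz_file.exists():
        target = gz_file.with_suffix("")  # drops ".gz" -> gi_taxid_prot.dmp
        with gzip.open(gz_file, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(target.name)

    # taxdump.tar.gz must be uncompressed and unpacked.
    tar_file = taxonomy_dir / "taxdump.tar.gz"
    if tar_file.exists():
        with tarfile.open(tar_file, "r:gz") as archive:
            archive.extractall(taxonomy_dir)
            written.extend(archive.getnames())

    return written
```

Only archives from a trusted source should be extracted this way, since extractall writes whatever paths the archive contains.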

Parse Rules

A typical Fasta title line is:

>gi|21305377|gb|AAM45611.1|AF384285_1 (AF384285) envelope protein [Human immunodeficiency virus type 1]

The gi number is the most reliable identifier. Suitable parse rules are:

Accession from Fasta title: ">\(gi|[0-9]*\)"
Description from Fasta title: ">[^ ]* \(.*\)"
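To see what these two rules extract, the sketch below restates them in Python regular expression syntax and applies them to the title line shown above. This is only an approximation for checking the patterns by hand: Mascot applies its own parser, the function name parse_title is invented for this example, and in Python the capture groups are written with plain parentheses while the literal "|" has to be escaped.

```python
import re

# Python restatements of the two parse rules. These are translations for
# illustration, not the strings to enter in the Mascot configuration.
ACCESSION_RE = re.compile(r">(gi\|[0-9]*)")
DESCRIPTION_RE = re.compile(r">[^ ]* (.*)")


def parse_title(title):
    """Return (accession, description); either item is None if its rule fails."""
    acc = ACCESSION_RE.match(title)
    desc = DESCRIPTION_RE.match(title)
    return (acc.group(1) if acc else None,
            desc.group(1) if desc else None)


title = (">gi|21305377|gb|AAM45611.1|AF384285_1 (AF384285) envelope protein "
         "[Human immunodeficiency virus type 1]")
accession, description = parse_title(title)
print(accession)    # gi|21305377
print(description)  # (AF384285) envelope protein [Human immunodeficiency virus type 1]
```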

If an entry in nr represents multiple source database entries, the Fasta title lines are concatenated together with CTRL+A as the delimiter.
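A minimal sketch of separating such a concatenated title line follows. CTRL+A is ASCII code 1, written "\x01" in Python. The merged title used here is invented purely for illustration and does not correspond to a real nr entry, and the helper name split_titles is an assumption of this example.

```python
CTRL_A = "\x01"  # ASCII code 1, the delimiter between concatenated titles


def split_titles(title_line):
    """Split a concatenated title line into one string per source entry.

    A leading ">" is removed from each part so that all parts look alike.
    """
    return [part.lstrip(">") for part in title_line.split(CTRL_A)]


# Hypothetical merged entry, made up for this example only.
merged = (">gi|111|gb|AAA00001.1| example protein [Species one]"
          + CTRL_A
          + "gi|222|gb|AAA00002.1| example protein [Species two]")

parts = split_titles(merged)
for part in parts:
    print(part)
```

With the accession rule given above, only the first gi number in the line is matched, because the pattern is anchored at the leading ">".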

Configuration

For this example, nr.gz was downloaded to a folder named C:\Inetpub\MASCOT\sequence\NCBInr\current. The file was decompressed using gzip, and renamed to NCBInr_20020601.fasta.
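The decompress-and-rename step can also be scripted. The sketch below is one possible way to do it in Python rather than with the gzip program; the function name decompress_release and its release_date argument are assumptions made for this example.

```python
import gzip
import shutil
from pathlib import Path


def decompress_release(gz_path, release_date):
    """Decompress nr.gz next to itself as NCBInr_<release_date>.fasta.

    Hypothetical helper: release_date is a string such as "20020601".
    Returns the path of the file that was written.
    """
    gz_path = Path(gz_path)
    target = gz_path.with_name(f"NCBInr_{release_date}.fasta")
    # Stream the data so the whole database is never held in memory.
    with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


# Example call for the folder used in this section (not executed here):
# decompress_release(r"C:\Inetpub\MASCOT\sequence\NCBInr\current\nr.gz", "20020601")
```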

[Figure: Mascot database maintenance utility]

There is no downloadable full text file for nr, but full text for individual entries can be retrieved across the web from the NCBI Entrez server. The syntax for the Path field is:

/entrez/eutils/efetch.fcgi?rettype=gp&retmode=text&db=protein&tool=mascot&id=#ACCESSION#
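The #ACCESSION# token is a placeholder that is filled in for each entry. The sketch below shows the substitution as a plain string replacement; it is an illustration only, the function name full_text_path is invented here, and how Mascot itself formats or encodes the substituted value is not covered by this page.

```python
PATH_TEMPLATE = ("/entrez/eutils/efetch.fcgi?rettype=gp&retmode=text"
                 "&db=protein&tool=mascot&id=#ACCESSION#")


def full_text_path(accession, template=PATH_TEMPLATE):
    """Return the Path value with the #ACCESSION# placeholder filled in.

    The accession string is inserted unchanged (an assumption of this sketch).
    """
    return template.replace("#ACCESSION#", accession)


print(full_text_path("gi|21305377"))
```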

If you don't require full text in a Mascot Protein View report, simply leave the Host, Port, and Path fields blank and choose "--- no full text report ---" in the drop-down list.

Always test a new definition before applying the changes to mascot.dat.

 
 
Copyright © 2007 Matrix Science Ltd. All Rights Reserved.