Sequence Databases

Information on relevant sequence databases can be found by following the links below. Additionally, the first issue of Nucleic Acids Research each year contains status reports from the curators of the major databases.

dbEST

dbEST is the division of GenBank that contains "single-pass" cDNA sequences, or Expressed Sequence Tags, from a number of organisms.

DDBJ

Entries from the DNA Databank of Japan (DDBJ) are wholly incorporated into GenBank.

EMBL

The EMBL Nucleotide Sequence Database is a comprehensive database of DNA and RNA sequences collected from the scientific literature and patent applications, and submitted directly by researchers and sequencing groups. Data collection is done in collaboration with GenBank (USA) and the DNA Databank of Japan (DDBJ).

GenBank

GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. As of June 1998, it held approximately 1,622,000,000 bases in 2,356,000 sequence records. The complete release notes for the current version of GenBank are available by FTP. A new release is made every two months. GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA Databank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI. These three organizations exchange data on a daily basis.

MSDB

MSDB is a non-identical protein sequence database maintained by the Proteomics Department at the Hammersmith Campus of Imperial College London. MSDB is designed specifically for mass spectrometry applications.

NCBInr

NCBI maintains composite, non-identical protein and nucleic acid databases for its search tools BLAST and Entrez. The entries in the protein database, nr, have been compiled from GenBank CDS translations, PIR, SWISS-PROT, PRF, and PDB. NCBI has made strong efforts to cross-reference the sequences in these databases in order to avoid duplication.
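The idea of a non-identical composite can be illustrated with a minimal sketch. It assumes that two entries count as duplicates only when their sequences match exactly, and it keeps every source accession as a cross-reference. The function name, accessions, and sequences below are hypothetical and chosen for illustration; this is not a description of how NCBI actually builds nr.

```python
from collections import OrderedDict


def merge_non_identical(records):
    """Collapse (source, accession, sequence) records to one entry per sequence.

    Assumption: "identical" means exact string equality after removing
    whitespace and upper-casing. Real composites may apply other rules.
    """
    merged = OrderedDict()
    for source, accession, sequence in records:
        key = "".join(sequence.split()).upper()
        # Keep a cross-reference for every source that holds this sequence.
        merged.setdefault(key, []).append((source, accession))
    return merged


# Hypothetical entries: the first two share a sequence, the third differs
# by one residue and is therefore kept as a separate entry.
records = [
    ("SWISS-PROT", "HYPO_1", "MKTAYIAKQR"),
    ("PIR", "HYPO_A", "mktayiakqr"),
    ("PDB", "HYPO_X", "MKTAYIAKQL"),
]

composite = merge_non_identical(records)
for sequence, xrefs in composite.items():
    print(sequence, xrefs)
```

Keying on the sequence itself is what makes the composite non-identical: a protein present in several source databases appears once, with its accessions gathered alongside it.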

OWL

OWL is a non-identical composite of four publicly available protein databases: SWISS-PROT, PIR (1-3), GenBank (translation) and NRL-3D. OWL has not been updated since May 1999 and should be considered obsolete.

PDB

The Brookhaven Protein Data Bank (PDB) is a database of three-dimensional structures. This means that entries are invariably well characterised, with reliable sequence data which can also be found in the other databases. Entries which are unique to PDB tend to be variant proteins, with distorted structures, which were used to refine a structural determination.

PIR

The PIR (Protein Information Resource) database was initiated at the NBRF in the early 1960s by the late Margaret O. Dayhoff as a collection of sequences for the study of evolutionary relationships among proteins. The database is now an international collaboration of three data centers: the NBRF, the Munich Information Center for Protein Sequences (MIPS), and the Japan International Protein Information Database (JIPID). The three centers cooperate to produce and distribute a single database of "wild-type" protein sequences.

PRF

The Protein Research Foundation of Japan database contains protein sequences abstracted from scientific publications.

Swiss-Prot

Swiss-Prot is a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy, and a high level of integration with other databases. It was established in 1986 and has been maintained collaboratively, since 1987, by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library (now the EMBL Outstation of The European Bioinformatics Institute - EBI).