Sequence Identification Numbers
Overview
Biological sequence data is being created at an exponential rate, and access is not only publicly available but often very friendly. It was not always so. Early technical decisions (and indecisions) have left a legacy of multi-formed record identifiers. After reading this article, you should be able to use any record ID to identify the type and origin of any public sequence data.

The National Center for Biotechnology Information (NCBI, http://www.ncbi.nlm.nih.gov/) provides the Entrez search engine for life sciences, through which any user can search across NCBI's sequence, literature, and related databases.

GenBank is the official NIH (USA) database of all publicly available genetic sequences, and is available from NCBI. GenBank has been collaborating with both the DDBJ (Japan) and EMBL (Europe) since the late 1980s through the INSDC (http://www.insdc.org/), so identical data can usually be found through any of the three interfaces.

GI Numbers
The GI (GenInfo Identifier) number is simply a series of digits, assigned consecutively to each sequence record processed by NCBI. It is the lowest level of record identification and carries no implied meaning beyond the general chronological relationship that can be inferred between any two GI numbers: the larger number was assigned later. It is the row index for all physical records.

Only GenBank and EMBL use GI numbers, and each assigns them independently; neither uses the other's numbering. GenBank originally tracked these identifiers separately as NID (nucleotides) and PID (proteins), but combined them into the single GI field in 1999.

Accession Numbers
An accession number represents and uniquely identifies one logical sequence. Accession numbers do not change, even if information in the record is changed at the author's request. However, an original accession number might become secondary to a newer accession number if the authors make a new submission that combines previous sequences, or if for some reason a new submission is selected to supersede an earlier record.

Two differing versions of the same logical sequence (i.e., an identical chromosomal locus) may have the same accession number but different GI numbers (and different Version numbers too, as described below). The accession number is the row index for all sequence records.

The phrase "accession number" comes from library and museum science, where it is the unique ID given to every collected work. Accession numbers are expected to be consistent across all databases at NCBI, EMBL, and DDBJ.

GenBank Accession Numbers
A GenBank accession applies to the complete record, and is usually a combination of letters and digits, such as a single letter followed by five digits (e.g., U12345) or two letters followed by six digits (e.g., AF123456). Accession numbers originating in other databases are maintained in GenBank.

Accession No. | Molecule | Note
X99999 | Nucleotide | 1 alpha plus 5 numeric: Oldest GenBank-originated nucleotide records with the original accession numbering scheme.
XX999999 | Nucleotide | 2 alpha plus 6 numeric: GenBank-originated nucleotide records entered since the change in February 1999.
XXX99999 | Protein | 3 alpha plus 5 numeric: GenBank-originated protein translations of nucleotide records.
A_B (e.g., PSBP_CUCSA) | Protein | Up to 5 alpha plus underscore plus up to 5 alpha: Not strictly an accession, but rather a SwissProt-originated protein "entry name", where A is a mnemonic code for the protein name and B is a mnemonic code for the species. NCBI Entrez will return results for this ID.
A_B (e.g., O95417_HUMAN) | Protein | Up to 12 upper case alphanumeric with underscore separator: Not strictly an accession, but rather a UniProt/TrEMBL-originated protein "entry name", where A is the accession number and B is a mnemonic code for the species. NOTE: NCBI Entrez will not return results for this format. You must remove the '_B' portion (underscore and species name) and use only the accession.
XXXXX | Protein | Up to 5 alpha: A shortened version of the above, originating with one of the protein databases SwissProt/UniProt/PIR. Usually also listed as the locus.
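
The formats in the table above can be checked mechanically. The following Python sketch is only an illustration: the patterns are transcribed from the table (with "up to 12" read as at most 10 characters before the underscore and 5 after), the function names `classify` and `entrez_query_term` are hypothetical, and real identifiers come in more variants than are covered here.

```python
import re

# Patterns transcribed from the table above, tried in order.
# Assumption: entry names containing digits are treated as TrEMBL-style,
# following the table's "alpha" vs. "alphanumeric" wording.
PATTERNS = [
    ("GenBank nucleotide (1 alpha + 5 numeric)", re.compile(r"[A-Z]\d{5}")),
    ("GenBank nucleotide (2 alpha + 6 numeric)", re.compile(r"[A-Z]{2}\d{6}")),
    ("GenBank protein (3 alpha + 5 numeric)", re.compile(r"[A-Z]{3}\d{5}")),
    ("SwissProt entry name", re.compile(r"[A-Z]{1,5}_[A-Z]{1,5}")),
    ("UniProt/TrEMBL entry name", re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}")),
    ("Short protein name / locus", re.compile(r"[A-Z]{1,5}")),
]


def classify(identifier: str) -> str:
    """Return the label of the first table pattern the identifier matches."""
    ident = identifier.strip().upper()
    for label, pattern in PATTERNS:
        if pattern.fullmatch(ident):
            return label
    return "unrecognized"


def entrez_query_term(identifier: str) -> str:
    """Drop the '_SPECIES' suffix from a TrEMBL-style entry name.

    Other identifiers are returned unchanged (apart from whitespace).
    """
    ident = identifier.strip()
    if classify(ident) == "UniProt/TrEMBL entry name":
        return ident.upper().split("_", 1)[0]
    return ident


if __name__ == "__main__":
    for example in ["U12345", "AF123456", "PSBP_CUCSA", "O95417_HUMAN"]:
        print(example, "->", classify(example), "| query:", entrez_query_term(example))
```

Checking the SwissProt pattern before the TrEMBL pattern matters, because every all-letter entry name would also satisfy the looser alphanumeric pattern.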


Version Numbers
Where an accession number represents and uniquely identifies one logical sequence, the version number does the same for one specific sequence. The version number is tracked in addition to accession numbers, and is incremented every time the sequence is changed in any way (e.g. corrected by the author). The version number uses the accession.version format (implemented by GenBank/EMBL/DDBJ in February 1999), making it always possible to identify the accession, given the version.

The accession.version system of sequence identifiers runs parallel to the GI number system, i.e., when any change is made to a sequence, it receives a new GI number and an increase to its version number. The accession.version identifier is the row index for all sequence revision records.
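
Because the version is appended to the accession after a period, the two parts can be separated with simple string handling. The sketch below is a minimal illustration in Python; the names `VersionedAccession`, `parse_accession_version`, and `is_later_revision` are hypothetical and not part of any NCBI tool.

```python
from typing import NamedTuple


class VersionedAccession(NamedTuple):
    accession: str
    version: int


def parse_accession_version(text: str) -> VersionedAccession:
    """Split an 'accession.version' string such as 'AF123456.2'."""
    accession, sep, version = text.strip().rpartition(".")
    if not sep or not accession or not version.isdecimal():
        raise ValueError(f"not in accession.version form: {text!r}")
    return VersionedAccession(accession, int(version))


def is_later_revision(older: str, newer: str) -> bool:
    """True if both identifiers share an accession and 'newer' has the higher version."""
    first = parse_accession_version(older)
    second = parse_accession_version(newer)
    return first.accession == second.accession and second.version > first.version


if __name__ == "__main__":
    print(parse_accession_version("AF123456.2"))
    print(is_later_revision("U12345.1", "U12345.3"))
```

Comparing versions only makes sense within one accession, which is why `is_later_revision` returns False when the accessions differ.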

RefSeq Accession Numbers
RefSeq accession numbers can be distinguished from GenBank accessions by their distinct prefix format of two alpha characters followed by an underscore. Further information about the sequence can be inferred from the two characters used:

Accession No. | Molecule | Method* | Note
AC_123456 | Genomic | Mixed | Alternate complete genomic molecule. This prefix is used for records that are provided to reflect an alternate assembly or annotation. Primarily used for viral, prokaryotic records.
AP_123456 | Protein | Mixed | Protein products; alternate protein record. This prefix is used for records that are provided to reflect an alternate assembly or annotation. The AP_ prefix was originally designated for bacterial proteins but this usage was changed.
NC_123456 | Genomic | Mixed | Complete genomic molecules including genomes, chromosomes, organelles, plasmids.
NG_123456 | Genomic | Mixed | Incomplete genomic region; supplied to support the NCBI Genome Annotation pipeline. Represents either non-transcribed pseudogenes, or larger regions representing a gene cluster that is difficult to annotate via automatic methods.
NM_123456 | mRNA | Mixed | Transcript products; Mature RNA (mRNA) protein-coding transcripts.
NM_123456789 | mRNA | Mixed | Transcript products; 9-digit expansion of accession series.
NP_123456 | Protein | Mixed | Protein products; primarily full-length precursor products but may include some partial proteins and mature peptide products.
NP_123456789 | Protein | Curation | Protein products; 9-digit expansion of accession series.
NR_123456 | RNA | Mixed | Non-coding transcripts including structural RNAs, transcribed pseudogenes, and others.
NT_123456 | Genomic | Automated | Intermediate genomic assemblies of BAC and/or Whole Genome Shotgun sequence data.
NW_123456 | Genomic | Automated | Intermediate genomic assemblies of BAC or Whole Genome Shotgun sequence data.
NZ_ABCD12345678 | Genomic | Automated | A collection of whole genome shotgun sequence data for a project. Accessions are not tracked between releases. The first four characters following the underscore (e.g. 'ABCD') identify a genome project.
XM_123456 | mRNA | Automated | Transcript products; model mRNA provided by the Genome Annotation process; sequence corresponds to the genomic contig.
XP_123456 | Protein | Automated | Protein products; model proteins provided by the Genome Annotation process; sequence corresponds to the genomic contig.
XR_123456 | RNA | Automated | Transcript products; model non-coding transcripts provided by the Genome Annotation process; sequence corresponds to the genomic contig.
YP_123456 | Protein | Automated and Curated | Protein products; no corresponding transcript record provided. Primarily used for bacterial, viral, and mitochondrial records.
ZP_12345678 | Protein | Automated | Protein products; annotated on NZ_ accessions (often via computational methods).

*Method
Mixed: indicates the process flow includes both automated processing and expert review for some of the records; curation analysis may be provided either by NCBI staff or collaborators.
Automated: indicates records that are not individually reviewed; updates are released in bulk for a genome.

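
Since the two-letter prefix carries the molecule type and curation method, a lookup table is enough to interpret a RefSeq accession. The Python sketch below copies its values from the table above; the names `REFSEQ_PREFIXES` and `describe_refseq` are hypothetical, and the accepted pattern (two letters, an underscore, then letters or digits with an optional ".version" suffix) is an assumption made for illustration.

```python
import re
from typing import Optional, Tuple

# (molecule, method) per prefix, copied from the table above.
# Note: for NP_ the table lists "Mixed" for the 6-digit series and
# "Curation" for the 9-digit series; this sketch records the 6-digit entry.
REFSEQ_PREFIXES = {
    "AC": ("Genomic", "Mixed"),
    "AP": ("Protein", "Mixed"),
    "NC": ("Genomic", "Mixed"),
    "NG": ("Genomic", "Mixed"),
    "NM": ("mRNA", "Mixed"),
    "NP": ("Protein", "Mixed"),
    "NR": ("RNA", "Mixed"),
    "NT": ("Genomic", "Automated"),
    "NW": ("Genomic", "Automated"),
    "NZ": ("Genomic", "Automated"),
    "XM": ("mRNA", "Automated"),
    "XP": ("Protein", "Automated"),
    "XR": ("RNA", "Automated"),
    "YP": ("Protein", "Automated and Curated"),
    "ZP": ("Protein", "Automated"),
}

# Assumed shape: two letters, underscore, alphanumeric body, optional version.
_REFSEQ_RE = re.compile(r"([A-Z]{2})_[A-Z0-9]+(?:\.\d+)?")


def describe_refseq(accession: str) -> Optional[Tuple[str, str]]:
    """Return (molecule, method) for a RefSeq-style accession, else None."""
    match = _REFSEQ_RE.fullmatch(accession.strip().upper())
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


if __name__ == "__main__":
    for example in ["NM_123456", "XP_123456.1", "NZ_ABCD12345678", "U12345"]:
        print(example, "->", describe_refseq(example))
```
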
Locus Numbers
The locus name is an ID type originally designed to help group entries with similar sequences: the first three characters usually designated the organism; the fourth and fifth characters were used to show other group designations, such as gene product; for segmented entries, the last character was one of a series of sequential integers.

However, the 10 characters in the locus name are no longer sufficient to represent the amount of information originally intended to be contained in the locus name. The only rule now applied in assigning a locus name is that it must be unique. For example, for GenBank records that have 6-character accessions (e.g., U12345), the locus name is usually the first letter of the genus and species names, followed by the accession number. For 8-character accessions (e.g., AF123456), the locus name is just the accession number.
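
The usual convention described above can be written out in a few lines. This Python sketch is illustrative only: the function name `conventional_locus_name` is hypothetical, and because the text says "usually", the result is a guess at the locus name, not a guarantee.

```python
import re


def conventional_locus_name(accession: str, genus: str, species: str) -> str:
    """Guess a GenBank locus name from the usual convention.

    6-character accessions: genus initial + species initial + accession.
    8-character accessions: the accession itself.
    Assumes genus and species are non-empty strings.
    """
    acc = accession.strip().upper()
    if re.fullmatch(r"[A-Z]\d{5}", acc):
        return (genus[0] + species[0]).upper() + acc
    if re.fullmatch(r"[A-Z]{2}\d{6}", acc):
        return acc
    raise ValueError(f"no locus convention known for accession {accession!r}")


if __name__ == "__main__":
    print(conventional_locus_name("U12345", "Homo", "sapiens"))
    print(conventional_locus_name("AF123456", "Homo", "sapiens"))
```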

The RefSeq database of reference sequences assigns formal locus names to each record, based on gene symbol. RefSeq is separate from the GenBank database, but contains cross-references to corresponding GenBank records.

Sequence Revision History
An NCBI revision history tool can be used to track all changes to an NCBI sequence given its accession or GI number: http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi.

 




Copyright © 2004-2010 Donaghey College of Information Science and Systems Engineering
Created and maintained by MidSouth Bioinformatics Center
 
This page was created on April 7, 2010.