UniProtKB/Swiss-Prot Protein Knowledgebase
Swiss-Prot headline

Release 56.1 of 02-Sep-2008

First draft of the complete human proteome available in UniProtKB/Swiss-Prot

The UniProt consortium is pleased to announce that a manually annotated representation of all the currently known human protein-coding genes is available in this release of UniProtKB/Swiss-Prot. This represents 20,325 entries. More than a third of these contain additional sequences representing isoforms generated by alternative splicing, alternative promoter usage and/or alternative translation initiation, resulting in close to 34,000 human protein sequences. Approximately 46,000 single amino acid polymorphisms (SAPs), mostly disease-linked, are also described, as well as 60,000 post-translational modifications (PTMs) (for additional statistics, click here).

It is not the first time that UniProtKB/Swiss-Prot has provided a fully annotated proteome set for a model organism (for example E.coli or S.cerevisiae) and there are many more planned in the near and more distant future (A.thaliana, B.subtilis, D.discoideum, mouse, rice, S.aureus, S.pombe, etc). But we do not expect that there will ever be anything as important as this proteome. For the first time, we can present to the life sciences community a clean set of what we believe to be a full (although still imperfect!) representation of human proteins. It is the ultimate goal of the life sciences to fully understand Homo sapiens at the molecular level and we hope this set will significantly contribute to this extraordinary adventure.

There are still many challenging tasks in front of us. We will create entries for newly discovered human proteins, review and update the existing set, increase the number of splice variants, explore the full range of PTMs and continue to build a comprehensive view of protein variation in the human population. The characterization at the molecular level will need to be placed in its physiological context: subcellular location, tissue expression, protein/protein interaction, etc. And last but not least, we all want to understand the role of all these actors of our life processes.

The way is paved, but the road will be long before we fully understand life at a molecular level.

Release 56.0 of 22-Jul-2008

A new major release is available (56.0)

Release 56.0 of 22-Jul-08 of UniProtKB/Swiss-Prot contains 392'667 sequence entries, comprising 141'217'034 amino acids abstracted from 172'036 references. 36'631 sequences have been added since release 55.0, the sequence data of 605 existing entries has been updated and the annotations of 356'036 entries have been revised.

The following improvements were carried out in the last 5 months:

Release 56.0 and TrEMBL release 39.0 are included in UniProt Knowledgebase release 14.0.

After almost one year of beta testing, the UniProt consortium is proud to announce the release of its new official unified website: a new interface, a new search engine and many new options to serve you better. The content of the various databases we provide is unchanged, except for all the improvements we keep carrying out with each new release. Many documents are available on the Documentation/help page, including FAQs. However, don't hesitate to contact us for any further questions, remarks or update requests.

Release 55.6 of 01-Jul-2008

Transient pleasures of the mind

Symmetry and round objects, including round numbers, easily fascinate the human mind. Thus, UniProtKB is happy to announce that we have a double set of round numbers to celebrate: UniProtKB/Swiss-Prot now contains over 50'000 cross-references to PDB and over 5'000 mammalian entries with experimental 3D-structures.

It is deeply satisfactory to see the 3D-structure of a protein. 3D-structures show the interactions between proteins and other macromolecules, and between proteins and small ligands, such as metal ions, substrates and inhibitors. Determining the 3D-structure is an important step for elucidating the mode of action of a well-characterized protein, and it provides a starting point for the classification of an uncharacterized protein and the prediction of its physiological role.

UniProtKB provides access to protein 3D-structures via cross-references to PDB (see for example P00734). The number of structures is constantly increasing, and quite frequently several structures have been determined for a given protein. Thus, the 50'000 cross-references to PDB in UniProtKB/Swiss-Prot correspond to more than 12'700 individual entries. Over 5'000 of these (about 40%) are from mammalian model organisms, including close to 3'300 human entries, while bacteria and archaea account for over 4'500 of the entries with links to PDB. Escherichia coli strain K12 is currently the best studied organism at the structural level, with 1'035 out of its 4'339 proteins (almost 25%) having at least one link to a PDB entry. Close to 6'000 additional links to PDB are in UniProtKB/TrEMBL, corresponding to another 3'500 entries.

Thanks to the efforts of individual laboratories and structural proteomics groups, the number of experimental 3D-structures is rapidly increasing, and so the symmetrical roundness of the present numbers is a very transient phenomenon. Soon for every new protein there may be a family member with an experimental 3D-structure, even for membrane proteins. That is definitely something to look forward to.

Release 55.5 of 10-Jun-2008

Over 100 cross-references in UniProtKB/Swiss-Prot

UniProtKB/Swiss-Prot was the first biomolecular database to include cross-references in its entries. As of this release, we provide our users with 101 explicit links (stored in the various distributed file formats, flat text, XML and RDF/XML) and 23 implicit links (available only from web servers, such as UniProt and ExPASy). Most cross-references can be found in the 'Cross-references' section of the entry (see for example Q9FK25), some are in the 'Sequence annotation' section (the Feature table in the flat file) (see for example cross-references to dbSNP in Q969T7). The dbxref.txt document provides a list of the databases cross-referenced in UniProtKB/Swiss-Prot. This document is available on the UniProt website and by ftp.

Additional links pepper almost every section of a UniProtKB/Swiss-Prot entry. They include cross-references to PubMed which are located in the 'References' section (see for example P0A790) and cross-references to the ENZYME database available through the EC numbers in the 'Names and origin' section (see for example Q00955). Moreover the 'Web resources' section is dedicated to databases or web pages that are specific for a single protein (see for example P04637). Note that the dbxref.txt document does not list these 'special' links.

Historically, a 'hundred' was a geographic division referring to the amount of land sufficient to sustain one hundred families. With over 120 cross-references, we hope to sustain many more research groups in quest of protein information.

Release 55.4 of 20-May-2008

Swiss-Prot in the Wonderland of protein names

Successful basic research requires various skills from scientists, not only creativity, but also precision, critical analysis of experimental results, reconsideration of the starting hypotheses, continuous controls and days, nights and weekends of - sometimes tedious - work in the lab. Thus when proteins are eventually purified, genes are cloned and a nice story is wrapped around the data, one of the rewards is to name the proteins/genes. There lies the fun.

Telling names can be useful for remembering a function or a phenotype. Interaction of Drosophila Cleopatra mutants with the asp gene product is lethal. Indeed, Cleopatra, Ancient Egypt's queen, allegedly committed suicide by way of an asp bite. Groucho mutants have more bristles than the norm on their face, much like Groucho Marx. Ken and Barbie protein mutants lack external genitalia... In Arabidopsis thaliana, Superman mutants have extra stamens (male genitals) in their flowers, and fans of the famous cartoon will not be surprised to learn that Kryptonite protein suppresses the function of Superman.

Acronyms are another part of the naming game. You would expect the RING1 protein to have a specific 3D structure related to its name, round, for instance. Actually, RING stands for "Really Interesting New Gene". In the same vein, you would not expect POSH to be any ordinary protein and yet all it contains are "Plenty Of SH3" domains! JAK1 kinase has two phosphate-transferring domains and was named after Janus, the Roman god of gates, usually depicted with two heads looking in opposite directions. However, JAK is also said to be 'Just Another Kinase', one among the hundreds of essential kinases described so far. And last, but not least, the Drosophila INDY protein refers to the movie "Monty Python and the Holy Grail", in which a live person about to be buried rightly protests: 'I'm Not Dead Yet!', which is hardly surprising since mutations in this gene result in a near doubling of the average adult life-span. For more amazing protein/gene names, see the excellent website established by Mikael Niku and Mikko Taipale.

Scientific creativity can be somewhat hampered by economic actors. The Pokemon oncogene for instance - which stands for POK erythroid myeloid ontogenic factor - had to be withdrawn after the US branch of Japanese video-game franchise Pokémon threatened researchers with legal action. The protein ended up with the far more sober - not to say boring - name of 'Zinc finger and BTB domain-containing protein 7A' (ZBTB7A).

Much ink has been spilled over the lack of standardization of protein names. Inconsistency among orthologs, family members and so on makes the systematic search through the literature a complicated task. UniProt provides a few guidelines for protein naming. Such a document should help to improve consistency, keeping a given protein's 'hypokeimenon', while not curbing creativity!

Release 55.3 of 29-Apr-2008

6 million entries in UniProtKB

Once upon a long, long time... This is how all fairy tales start, but was it really so long ago? No, it was December 2003 - just 4 and a half years ago, but it seems like ages - that UniProtKB was born. It was a beautiful baby, 1,220,020 entries fat and well supported by its 2 legs: the large TrEMBL and the small but knowledgeable Swiss-Prot. And the baby put on weight: on average 1,500 protein sequence entries per day in 2004 and during the first half of 2005. The more you have, the more you want, and from the middle of 2005 up to the beginning of 2007, UniProtKB was integrating about 3,500 new entries per day. And it hasn't stopped since: currently we are integrating approximately 5,000 entries per day and this number keeps growing. As a result, we are happy to announce that UniProtKB has reached the significant milestone of 6,074,524 entries. Note that this tremendous growth is not due to the submission of environmental samples that are stored in another UniProt database: UniMES.

May we all live happily ever after and extract knowledge from this flood of data!

Release 55.2 of 08-Apr-2008

Dictyostelium discoideum on the move

Dictyostelium discoideum is a social amoeba known for its ability to alternate between unicellular and multicellular forms. Thanks to the availability of powerful molecular genetic tools, it is a convenient model to study fundamental cellular processes, such as cytokinesis, motility, phagocytosis, chemotaxis, signal transduction and aspects of development, including cell sorting, pattern formation and cell-type determination. It is one of 9 nonmammalian model organisms recognized by the National Institutes of Health (NIH) for their utility in the study of fundamental molecular processes of medical importance.

The 34 Mb genome of Dictyostelium discoideum was sequenced and assembled by an international consortium in 2005. Its gene-dense chromosomes encode approximately 12,500 predicted proteins, a high proportion of which have long, repetitive amino acid tracts.

In order to improve the coverage of functional annotation of Dictyostelium discoideum proteins, the UniProt consortium and dictyBase jointly organized a one-week Dictyostelium discoideum protein annotation jamboree in the Swiss Institute of Bioinformatics in Geneva last month. During this special event, more than 1,000 proteins were annotated by UniProtKB curators and about 30 gene models were corrected by dictyBase curators. In addition, more than 300 gene and protein names were standardized.

The close collaboration between UniProtKB and dictyBase will continue until the completion of Dictyostelium discoideum proteome annotation, planned for 2010.

The current UniProtKB/Swiss-Prot release contains 1,803 fully annotated Dictyostelium discoideum entries, which represent about 15% of the complete proteome. A complete non-redundant set of Dictyostelium discoideum proteins can be retrieved from UniProtKB with the keyword 'Complete proteome'.

Release 55.1 of 18-Mar-2008

A small but deadly pathogen: Hepatitis B virus

The Hepatitis B virus (HBV) causes transient and chronic infections of the liver and constitutes a major cause of human disease. It is estimated that more than 5% of the global population carries the virus, and deaths from liver cancer caused by HBV probably exceed one million per year (see WHO factsheet). An effective vaccine has been available for nearly 20 years, but its high cost still hampers disease control in the developing world.

This killer virus has a surprisingly small genome, about 3.2 kb, which nevertheless encodes 5 proteins through overlapping open reading frames. It replicates by reverse-transcribing genomic RNA to partial dsDNA through a unique mechanism, and thus belongs to a particular family: the Hepadnaviridae.

The virus specifically infects hepatocytes, and most symptoms in an acute infection result from the killing of infected cells by the host immune system. In a few cases, the virus manages to down-regulate the host immunity and establishes a chronic infection. A viral protein secreted in blood is suspected to be involved in chronicity: the HBeAg protein may specifically deplete T-helper lymphocytes, thereby suppressing the ability to mount a strong cytotoxic response against infected hepatocytes.

Our current knowledge of the virus is rather poor due to the lack of cell culture systems allowing in vitro viral propagation. Much of what we know is derived from the study of other closely related Hepadnaviridae, such as the woodchuck hepatitis virus (WHV) and the ground squirrel hepatitis virus (GSHV).

In the current UniProtKB/Swiss-Prot release, all hepatitis B virus entries have been updated, and 51 strains representative of the 8 genotypes infecting humans have been annotated. Animal hepatitis B virus entries have also been revisited, notably WHV and GSHV.

Release 55.0 of 26-Feb-2008

New major release is available (55.0)

Release 55.0 of 26-Feb-08 of UniProtKB/Swiss-Prot contains 356'194 sequence entries, comprising 127'836'513 amino acids abstracted from 165'776 references. 80'183 sequences have been added since release 54.0, the sequence data of 1'411 existing entries has been updated and the annotations of 262'009 entries have been revised.

The following improvements were carried out in the last 7 months:

UniProt Knowledgebase release 13.0 includes Swiss-Prot release 55.0 and TrEMBL release 38.0.

Release 54.8 of 05-Feb-2008

Over 20,000 fungal proteins manually annotated in UniProtKB/Swiss-Prot

Almost exactly one year after the integration of the complete proteome of Saccharomyces cerevisiae into UniProtKB/Swiss-Prot (see news), we have increased the number of manually annotated fungal entries to more than 20,000.

The fungal kingdom includes very diverse organisms, from unicellular to multicellular, from microscopic to macroscopic. Fungi have essential roles in many ecological processes. They are required for nutrient cycling within ecosystems, since they recycle dead organic matter into useful nutrients. Many plants would not survive without symbiotic fungi called mycorrhizae, which live in their roots and supply essential nutrients. They are also economically important as they provide numerous drugs (such as penicillin), food (such as mushrooms) and are used for their ability to ferment different sugars to produce bread, wine, beer and even soy sauce.

Fungi are also responsible for a great number of severe plant and animal diseases. Fungal infections, also called mycotic infections, may affect the skin or the internal organs of the body. Severe mycotic infections, such as histoplasmosis and candidiasis, are potentially life-threatening. Fungal diseases are very difficult to treat since fungi are eukaryotic organisms that share many properties with animal or human cells. Plant diseases caused by fungi include rusts and smuts, as well as leaf, root, and stem rot. They can cause severe damage to crop production.

Moreover, many fungi are important model organisms for studying the genetics and molecular biology of eukaryotes.

It is therefore not surprising that many fungi were targeted for complete genome sequencing. No fewer than 32 complete fungal genomes have been submitted to public sequence databases to date. Using the S. cerevisiae and Schizosaccharomyces pombe fully annotated proteomes as templates, we are progressively annotating orthologous proteins in these newcomers, in order to provide our users with a high-quality fungal protein dataset that will better reflect the diversity of this kingdom.

Release 54.7 of 15-Jan-2008

Addition of more than 40'000 microbial entries derived from automated annotation

Thanks to genome sequencing efforts, there has been a tremendous rise in the number of submitted protein sequences. And this is only the beginning, as faster and cheaper sequencing methods will greatly increase the rate at which new genomes are sequenced.

Semi-automated annotation methods are necessary in order to provide the users with a maximum number of annotated protein sequences. The approach used by UniProtKB/Swiss-Prot differs from most other automated methods as the bulk of the annotation procedure is still performed manually, since we want to make sure that we produce high quality annotation with a minimal amount of incorrect inferences.

Our first automatic annotation project is called HAMAP, which stands for High-quality Automated and Manual Annotation of microbial Proteomes. In the context of this project, proteins from complete bacterial and archaeal proteomes, together with the related plastid proteins, are automatically annotated based on manually created family rules for complete protein annotation, with template-based feature propagation. We are very aware of the danger posed by automatic annotation procedures and have been extremely careful in the implementation of the pipeline, establishing many checks and conditional propagation in order to ensure that automatic annotation will produce data of a quality equal to that of manual curation.

At this release, we have begun the procedure to integrate automatically into UniProtKB/Swiss-Prot the entries annotated by the HAMAP automated pipeline; over 40'000 bacterial and archaeal entries were integrated. This is the largest number of entries ever integrated at one release.

It must be noted that the planned introduction of 'evidence tags' should allow us to unambiguously flag whether an information item has been derived manually or automatically. For the time being, all entries annotated by the HAMAP pipeline have a cross-reference to HAMAP (for an example see entry Q02JM4).

Release 54.6 of 04-Dec-2007

Complete proteome for Arabidopsis thaliana in UniProtKB

Arabidopsis thaliana was the first plant to have its genome completely sequenced. A first round of annotation was performed in 2001 by the Arabidopsis Genome Initiative. The genome was later reannotated and is now maintained by The Arabidopsis Information Resource (TAIR) which assumes primary responsibility for Arabidopsis genome annotation.

As the genome sequencing was being completed, Swiss-Prot initiated the Plant Proteome Annotation Program (PPAP) whose main focus is the annotation of Arabidopsis (and rice) plant-specific proteins and protein families.

This ongoing program has so far produced more than 6'200 manually annotated Arabidopsis protein sequences in UniProtKB/Swiss-Prot. In addition, close to 44'000 Arabidopsis entries are available in UniProtKB/TrEMBL with a certain level of redundancy. Thus, the total number of protein sequences in UniProtKB for this model plant is much higher than the current estimate of 27'029 protein-encoding genes (see TAIR7 release of April 2007). To get around this problem, a non-redundant set of Arabidopsis proteins, including nuclear, mitochondrial and chloroplastic proteins, was created as of this release and the selected entries have been labelled with the keyword 'Complete proteome' to allow easy retrieval.

The current complete proteome set contains a total of 29'315 entries: 6'241 Arabidopsis thaliana in UniProtKB/Swiss-Prot and 23'074 in UniProtKB/TrEMBL.

Arabidopsis thaliana is the third 'green plant' (Viridiplantae) for which a complete nonredundant protein set has been created in UniProtKB. The other two are the unicellular green algae Ostreococcus tauri and Ostreococcus lucimarinus.

Release 54.5 of 13-Nov-2007

Acanthamoeba polyphaga mimivirus, a "giant" virus in UniProtKB/Swiss-Prot

Mimivirus (for mimicking microbe) is a new viral genus containing a single identified species, Acanthamoeba polyphaga mimivirus (APMV), discovered in 1992 by Didier Raoult's lab within the amoeba Acanthamoeba polyphaga during work on legionellosis. The virion has a non-enveloped, icosahedral capsid with a diameter of 400 nm and protein filaments projecting from its surface. The capsid contains the internal core surrounded by an internal lipid layer. Its linear, double-stranded DNA genome is roughly 1.2 million bp in length, the largest viral genome known so far. Its replication cycle, genome and capsid structure place it into the nucleocytoplasmic large DNA viruses (NCLDVs), which include amongst others the poxviruses and iridoviruses.

This virus is amazing in many ways. It is the largest virus ever isolated, with a genome size and complexity comparable to that of a small bacterium. A thorough bioinformatics analysis carried out by the group of Jean-Michel Claverie uncovered 909 potential protein-coding genes. Some of these proteins belong to families that are shared with all or some NCLDVs, many have eukaryotic counterparts and there are quite a number of ORFans (no sequence similarity to proteins from other genomes). It was a surprise to find an appreciable number of genes coding for proteins involved in metabolism, DNA repair pathways and, most surprising, genes encoding a partially functional protein translation apparatus. Mimivirus does indeed encode four aminoacyl-tRNA synthetases (ArgRS, CysRS, MetRS, TyrRS), as well as various translation initiation, elongation and termination factors. It is very intriguing to find, in a virus, genes corresponding to central components of the protein translation machinery, a biochemical process widely thought to be an exclusive signature of cellular organisms.

The discovery of this amazing virus has led to the concept of "giant" virus and implies that there is an overlap in terms of particle dimension, genome size, and genetic complexity between the viral and cellular organism worlds.

A special effort has been made in the UniProtKB/Swiss-Prot database to provide the complete, fully annotated mimivirus proteome. We have also integrated all proteomics and structural information that has been made available by the groups of Jean-Michel Claverie and Chantal Abergel.

To get all UniProtKB mimivirus entries, click here.

Release 54.4 of 23-Oct-2007

More controlled vocabulary in the 'Subcellular location' subsection

Over 160'000 UniProtKB/Swiss-Prot entries (56%) contain a subcellular location description in the General Annotation section (CC lines in the flat file). We have standardized the content of these comments with the concomitant creation of a controlled vocabulary and a new, parsable flat-file format.

The subcellular location controlled vocabularies are stored in a new document (subcell.txt) which provides, for each individual UniProtKB location, topology or orientation term, the corresponding definition, as well as other relevant information, such as synonyms, hierarchies or mapped GO terms.

The format of the subcellular location subtopic has changed from free text to a more structured format. When required for the accurate description of a complex biological situation, free text is still used in the 'Note' (see for example O43918). In addition, since release 53.0, this subsection can occur more than once per entry, allowing specific annotation for each isoform, chain or peptide in separate subsections.
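As a rough illustration of what a parsable format makes possible, the sketch below splits a structured subcellular location comment into location, topology and orientation parts, plus an optional note. It assumes a simplified single-line layout (each location ends with a full stop, its parts are separated by '; ', and free text follows 'Note='). The function name and the sample string are hypothetical, and the real flat-file syntax has more cases (isoform or chain prefixes, line wrapping) that are not handled here.

```python
def parse_subcellular_location(comment):
    """Split a simplified, single-line subcellular location comment.

    Assumed layout (a simplification, not the full flat-file grammar):
      'Location[; Topology[; Orientation]]. ... Note=Free text.'
    """
    # Everything after 'Note=' is treated as free text.
    body, _, note = comment.partition("Note=")
    locations = []
    for chunk in body.split(". "):
        chunk = chunk.strip().rstrip(".")
        if not chunk:
            continue
        parts = [p.strip() for p in chunk.split("; ")]
        # Pad so that location, topology and orientation are always present.
        parts += [None] * (3 - len(parts))
        locations.append(
            {"location": parts[0], "topology": parts[1], "orientation": parts[2]}
        )
    return {"locations": locations, "note": note.strip() or None}


# Hypothetical comment text, for illustration only.
example = (
    "Cell membrane; Single-pass type I membrane protein. "
    "Nucleus. Note=Hypothetical free-text remark."
)
parsed = parse_subcellular_location(example)
```

Each location term recovered this way could then be looked up in subcell.txt for its definition, synonyms or mapped GO terms.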

Release 54.3 of 02-Oct-2007

Oryza sativa (rice) species separated into japonica and indica subspecies in UniProtKB/Swiss-Prot entries

Although it has been a rule in UniProtKB/Swiss-Prot to merge all protein sequences encoded by the same gene in one species into a single record to avoid redundancy, this rule sometimes has to be adapted to specific cases. For example, this rule applied to rice entries, causing sequences from various rice cultivars to be merged and entries tagged with the unique taxonomic identifier (ID) for Oryza sativa species: 4530.

However, O.sativa comprises 2 subspecies: japonica and indica. A classification at subspecies level is already effective in several databases, including UniProtKB/TrEMBL, and most scientists use it when submitting new sequences. In EMBL/DDBJ/GenBank, there are over 1.2 million japonica and almost 360,000 indica sequences, coming mainly from large scale genome, cDNA or EST sequencing projects. The completion of both japonica and indica genomes and the analysis of multiple sets of subspecies-specific transcripts revealed a significant number of sequence variations and a divergence of expression pattern between japonica and indica subspecies. In order to provide clear information to its users, UniProtKB/Swiss-Prot had to adopt this classification and separate indica and japonica subspecies in rice entries.

Most rice entries contained exclusively japonica sequences and were quickly updated with the appropriate taxonomic ID. But over 220 rice entries contained merged sequences of japonica and indica subspecies and had to be "de-merged". This task was undertaken by the PPAP (Plant Proteome Annotation Program) team. Common information was kept in both japonica and indica entries, while expression patterns or other subspecies-specific experimental evidence was transferred to the entry where it belongs. Today all rice entries are classified into either japonica or indica subspecies, with the exception of very few entries where the subspecies was not specified. When available, cultivars are indicated in the reference section. Each entry also provides cross-references to either japonica (cultivar nipponbare) or indica (cultivar 93-11) genomic sequences.

The gene nomenclature system ('Os' code) defined by RAP-DB and/or TIGR for the japonica cultivar nipponbare can be found in japonica entries in the gene names subsection (Ordered Locus Names). RAP-DB locus identifiers are listed in the rice.txt file.

To get all UniProtKB Japonica entries, click here.

To get all UniProtKB Indica entries, click here.

To get all UniProtKB rice entries, click here.

The mnemonic species identification code in the entry name makes it possible to quickly identify which subspecies the protein belongs to: ORYSJ is the code for japonica, ORYSI for indica and the old ORYSA code indicates that the subspecies is not specified. The list of rice cultivars can be found in the strains.txt file.
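The three codes above lend themselves to a simple lookup. The following is a minimal sketch, assuming entry names of the usual PROTEIN_SPECIES form. The function name, the dictionary and the entry names used to exercise it are hypothetical.

```python
# Species codes taken from the text above; the descriptions are informal labels.
SUBSPECIES_BY_CODE = {
    "ORYSJ": "Oryza sativa subsp. japonica",
    "ORYSI": "Oryza sativa subsp. indica",
    "ORYSA": "Oryza sativa (subspecies not specified)",
}


def rice_subspecies(entry_name):
    """Return the rice subspecies implied by an entry name, or None."""
    # The species code is whatever follows the last underscore.
    _, sep, species_code = entry_name.rpartition("_")
    if not sep:
        return None  # no underscore, so no species code
    return SUBSPECIES_BY_CODE.get(species_code.upper())


# Hypothetical entry names, for illustration only.
for name in ("XYZ1_ORYSJ", "XYZ1_ORYSI", "XYZ1_HUMAN"):
    print(name, "->", rice_subspecies(name))
```

Non-rice codes simply map to None, so the helper can be run over a mixed list of entry names.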

Release 54.2 of 11-Sep-2007

Yeast PDR5: the first adopted protein in UniProtKB/Swiss-Prot

While progress in laboratory techniques allows the production of an ever-increasing flood of data, these data are still insufficiently exploited. One reason for this bottleneck is the lack of efficient integration into databases, making data more difficult, sometimes almost impossible, to access. The current information flow consists of two steps. First, scientists providing knowledge encode it in the format of a given journal. Then database curators have to decode and standardize it to make it computer-parsable and usable for further research.

In order to reduce this time-consuming and error-prone process and to make the most of expert scientists, UniProtKB/Swiss-Prot proposes a new strategy called 'Adopt a Protein', where researchers can adopt one or more specific proteins. 'Foster parents' make sure that the information concerning their favourite protein(s) is up-to-date. UniProtKB/Swiss-Prot provides them with a draft with the correct sequence, up-to-date sequence analysis predictions and a description of the main topics that require annotation, such as protein names, bibliographic references, comments and protein features. The input of 'foster parents' is acknowledged in the entry.

The yeast Saccharomyces cerevisiae is a popular model organism used in hundreds of laboratories around the world and its genome has been fully sequenced and extensively studied over the past decade. Moreover, the yeast community has a long tradition of sharing information. Therefore, the yeast proteome has been chosen as a test platform to initiate the 'Adopt a Protein' scheme.

This release contains the first fully annotated adopted protein: PDR5. PDR5 is a 160-kDa yeast pleiotropic ABC efflux transporter of multiple drugs localized in the plasma membrane. It belongs to the ABC (ATP-binding cassette) transporter family, PDR subfamily. The PDR subfamily is specific to fungi and plants and exhibits distinctive structural features, such as an unusual alternation of nucleotide binding and membrane domains, a pair of extended extracellular loops and a degenerate ATP binding domain. Yeast strains lacking PDR5 are used for toxicity tests, whereas those overexpressing PDR5 are used for screening antifungal sensitizers.

PDR5 has been adopted by Professor André Goffeau from the Catholic University of Louvain (Belgium). We are grateful to him for committing precious time to help produce an annotation useful to the whole community. We hope that PDR5 is only the first member of a big adopted family! If you want to become a 'foster parent', please contact the UniProtKB/Swiss-Prot Fungal Proteome Annotation Program (FPAP).

Release 54.1 of 21-Aug-2007

More than 18'500 phosphorylation sites identified by mass spectrometry in UniProtKB/Swiss-Prot

Phosphorylation is a key reversible modification that regulates protein function, subcellular localization, stability, and interactions. It is believed that up to 30% of all eukaryotic proteins may be phosphorylated.

During the last few years, phosphoproteomics has greatly improved due to the optimization of enrichment protocols for phosphoproteins and phosphopeptides, better fractionation techniques using chromatography, and improvement of mass spectrometry instrumentation. Thanks to these developments, it is now possible to analyze entire phosphorylation sets rapidly. However, protein and phosphorylation site identification by mass spectrometry is crucially dependent on the quality and completeness of the biological resource used for analysis.

In UniProtKB/Swiss-Prot, we make a special effort to document post-translational modifications and especially phosphorylation sites, using data from the literature.

We have incorporated data from 38 high-quality phosphoproteomics studies which have allowed us to annotate or confirm 18'556 phosphorylation sites in 6'493 protein entries, mainly from human (45%), mouse (27%) and yeast (25%), but also from rat, Arabidopsis thaliana and bacteria. These high-throughput studies can be easily recognized among other UniProtKB references through the [LARGE SCALE ANALYSIS] tag appearing in the RP line.
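For readers working with the flat file, such references can be picked out with a simple line scan. The sketch below is only an illustration: the function name and the sample lines are made up, and it assumes that the tag sits on a single RP line rather than being wrapped across two.

```python
def large_scale_rp_lines(flat_file_text):
    """Return the RP lines that carry the [LARGE SCALE ANALYSIS] tag."""
    hits = []
    for line in flat_file_text.splitlines():
        # Flat-file lines start with a two-letter line code; RP is the one of interest.
        if line.startswith("RP") and "[LARGE SCALE ANALYSIS]" in line:
            hits.append(line)
    return hits


# Made-up reference lines, for illustration only.
sample = "\n".join([
    "RN   [1]",
    "RP   NUCLEOTIDE SEQUENCE [MRNA].",
    "RN   [2]",
    "RP   PHOSPHORYLATION [LARGE SCALE ANALYSIS] AT SER-10.",
])
hits = large_scale_rp_lines(sample)
```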

Click here to obtain the complete list of UniProtKB/Swiss-Prot entries having at least one phosphorylation site found in proteomic studies.

Release 54.0 of 24-Jul-2007

New major release is available (54.0)

Release 54.0 of 24-Jul-07 of UniProtKB/Swiss-Prot contains 276'256 sequence entries, comprising 101'466'206 amino acids abstracted from 158'294 references. 7'104 sequences have been added since release 53.0: this represents a 3% increase. In addition, the sequence data of 690 existing entries have been updated and the annotations of 269'152 entries have been revised.

The following improvements were carried out in the last 2 months:

UniProt Knowledgebase release 12.0 includes Swiss-Prot release 54.0 and TrEMBL release 37.0.

Release 53.3 of 10-Jul-2007

Knottins or how to knit in the protein world

Knottins (also called inhibitor cystine knots or ICKs) are small disulfide-rich proteins characterized by a special "disulfide through disulfide knot". This knot is obtained when one disulfide bridge crosses the macrocycle formed by two other disulfides and the interconnecting backbone (disulfide 3-6 goes through disulfides 1-4 and 2-5).

The knottin structure is found in many unrelated families, such as plant protease inhibitors, cyclotides, toxins from cone snails, spiders, insects, horseshoe crabs and scorpions, gurmarin-like peptides, agouti-related proteins, and antimicrobial peptides.

In collaboration with Laurent Chiche (CNRS, Montpellier), about 450 UniProtKB/Swiss-Prot entries have been updated with knottin structural information. They can be retrieved with the newly introduced keyword Knottin.

Examples:

Release 53.2 of 26-Jun-2007

Obesity in the spotlight

Over the last 40 years, overweight and obesity have become a central health issue in a growing number of countries. Obesity comorbidities are severe and include cardiovascular diseases, diabetes, musculoskeletal disorders and some cancers. The two fundamental causes of obesity are clearly identified as an increased intake of high-fat and energy-dense diets and a decrease in physical activity. However, there is growing evidence that certain gene products have a direct or indirect influence on body mass.

In 1999, the mouse Fto gene was cloned and called Fatso, because of its large size (at least 250 kb). By a curious coincidence, the human ortholog was recently shown to predispose to childhood and adult obesity. The main culprits are intronic variations in the FTO gene. Carriers of one or two inherited copies of the variants have an increased risk of obesity of 30% or 70%, respectively. The function of Fatso is not yet known. This protein, along with other proteins involved in the development of obesity, can be retrieved from UniProtKB/Swiss-Prot using the keyword Obesity.

Release 53.1 of 12-Jun-2007

4'000 bovine entries in UniProtKB/Swiss-Prot

UniProtKB/Swiss-Prot is happy to announce the annotation of over 4'000 entries of a very popular animal in Switzerland, almost a national symbol: Bos taurus, in other words the cow.

Those of you who have visited the Swiss Alps know that their gorgeous scenery is definitely associated with the sound of cowbells in summer pasture. Similarly, the modern biology landscape would be poorer without bovine sequences, obviously not in a decorative role, but as a key element for our understanding of human biology.

The domesticated cow is extensively used in biomedical research, as an animal model and also as a source of biological material. Remember that bovine insulin was the first sequenced protein and was used for decades to treat diabetes. The first draft of the bovine genome sequence was released in October 2004 by the Human Genome Sequencing Center of the Baylor College of Medicine. The human and bovine genomes are more similarly organized than when either is compared to the mouse. Despite its interest, only a few large-scale cDNA sequencing projects have been initiated. Currently more than 70% of the UniProtKB/Swiss-Prot bovine sequences come from translation of cDNA sequences produced by the NIH Mammalian Gene Collection and the Agricultural Research Service, US Department of Agriculture.

Release 53.0 of 29-May-2007

New major release is available (53.0)

Release 53.0 of 29-May-07 of UniProtKB/Swiss-Prot contains 269'293 sequence entries, comprising 98'902'758 amino acids abstracted from 156'204 references. 9'228 sequences have been added since release 52.0: this represents a 3.5% increase. In addition, the sequence data of 734 existing entries have been updated and the annotations of 210'454 entries have been revised.

The following improvements were carried out in the last 3 months:

UniProt Knowledgebase release 11.0 includes Swiss-Prot release 53.0 and TrEMBL release 36.0.

Until now, metagenomic and environmental sequences were missing from UniProt. This gap is now filled with the introduction of a new ftp directory, UniMES, which allows the download and subsequent analysis of these increasingly important sequences.

Release 52.5 of 15-May-2007

Links to Wikipedia

While UniProt is a central resource for biologists, some specialized information is beyond the scope of our database. Therefore we link UniProtKB entries to more specialized resources:

We recently added links to the free encyclopedia Wikipedia in the web resource section. Proteins with a link to Wikipedia are mainly of medical or pharmaceutical interest. Wikipedia articles may describe the discovery of the protein and its use in medicine.

Examples:

Release 52.4 of 01-May-2007

T Rex and us

We have introduced the oldest fossil protein sequence to date into UniProtKB/Swiss-Prot, i.e. several peptides from collagen (P0C2W2, P0C2W3, P0C2W4) which were extracted from a 68-million-year-old dinosaur: Tyrannosaurus rex. These collagen sequences were obtained by mass spectrometry analysis directly from soft tissue that remained in fossilized bones, which were unearthed from rocks in the Hell Creek Formation of eastern Montana, US.

Interestingly, Tyrannosaurus rex collagen is similar to chicken collagen, and similarities have also been found with frog and newt protein. The finding is consistent with the idea that we can trace a direct evolutionary line between birds and dinosaurs (for more information: PMID 17431180).

The discovery of protein in the soft tissue of dinosaur bone is a surprise - it was not thought that such organic material could survive this long. "The pathways of cellular decay are well known for modern organisms. And extrapolations predict that all organic matter vanishes within 100,000 years, maximum" (BBC news).

Until now, the oldest fossil protein sequence in UniProtKB/Swiss-Prot was a RuBisCO large subunit from a fossil leaf of a Miocene (17-20 million years old) Magnolia, P30828 (see headline release 43.1 of 13-Apr-2004).

You can get all these aged proteins by clicking on the keyword Extinct organism protein.

Other reference:

Protein Spotlight (May 2004) Small blast from the past

Release 52.3 of 17-April-2007

More than 630 F-box proteins from Arabidopsis thaliana in UniProtKB/Swiss-Prot

F-box proteins play a major role in the ubiquitin conjugation pathway. They are involved in the third step of this pathway. Most F-box proteins contain a conserved F-box domain near the N-terminus and a variable region. The F-box domain can interact with Cullin and one of the SKP1 proteins to form an E3 SCF (SKP1/Cullin/F-box) ubiquitin ligase complex. The variable region interacts with a specific protein, which is, in turn, ubiquitinated and thus targeted for degradation. This variable region confers the specificity of the SCF complex.

The whole set of Arabidopsis thaliana F-box proteins, more than 630 sequences, has been manually reviewed and integrated into UniProtKB/Swiss-Prot. About 120 wrong gene model predictions have been corrected, including 26 F-box proteins obtained by splitting erroneous gene predictions covering more than one gene. This represents one of the largest protein families from a single species ever integrated into UniProtKB/Swiss-Prot.

In A. thaliana, almost half of the F-box proteins contain a combination of different domains, which is used to define subgroups:

>300   FB     F-box alone
91     FBL    F-box associated with LRR-repeat
124    FBK    F-box associated with Kelch-repeat
30     FBD    F-box associated with FBD
41     FDL    F-box associated with FBD and LRR-repeat
4      FBLK   F-box associated with LRR-repeat and Kelch-repeat

Within this large protein family, fewer than 30 members have been characterized: their functions are varied and include flowering, circadian cycle, hormone signaling, and plant defense.

Related entries:

Release 52.2 of 03-April-2007

Update of a spider dermonecrotic toxin family

Loxosceles is the genus of spiders that includes the infamous brown recluse spider Loxosceles reclusa. These spiders, also called violin spiders or fiddleback spiders because of violin-like marks on their cephalothorax, are brownish-yellow in color, and spin small, irregular webs under rocks, or in nooks and crannies of your house. These spiders are found in the USA, South America, Europe and Africa. Their most characteristic feature is actually their eyes: most spiders have eight eyes, but Loxosceles have six, arranged in three pairs, or dyads, that sit side-by-side.

The bite of a Loxosceles spider is not deadly, but it is very unpleasant - the venom is necrotoxic, causing tissue to die and fall off. Pain usually doesn't begin until 6-12 hours after the bite occurs. Loxosceles' necrotoxic venom is cytotoxic and hemolytic. It contains at least 8 enzymes. The enzyme thought to be responsible for most of the destructive effects is called Sphingomyelinase D. This enzyme catalyzes the hydrolysis of sphingomyelin and causes hemolysis and dermonecrosis.

The annotation of this toxin family has just been updated in UniProtKB/Swiss-Prot (e.g. Q8I914 and P83045).

Release 52.1 of 20-March-2007

Koala genome invaded by a new retrovirus

Endogenous retroviruses are vestiges of ancestral viral infections that were incorporated long ago into a host's genome. Surprisingly, 8% of the human genome is composed of such "fossil" viruses (1). The most recent endogenization event is a porcine virus that entered its host approximately 5,000 years ago.

Recently a new endogenous retrovirus was identified in Australian koala populations.

Koalas were largely exterminated on mainland southern Australia in the late nineteenth century. Populations were established on a small number of islands in the early 1900s and have remained isolated since the 1920s. These populations have since been used to restock the mainland.

The new Koala retrovirus (KoRV) has only been found in mainland populations, suggesting that this virus entered the koala species in the last 100 years (2). This retrovirus is both endogenous and fully functional, meaning that it spreads both by contact and by heredity, and is still in the process of invading the koala genome. KoRV is very similar to Gibbon Ape Leukemia Virus (GALV), and these two retroviruses are thought to have diverged very recently. This suggests a scenario in which a monkey retrovirus has crossed species to enter a newly established koala population and has started to colonize the koala genome.

The KoRV is unique in that we are observing the initial entry of a new family of endogenous retrovirus into a wild host genome. The dynamic interaction between this virus and its new host provides a unique opportunity to study the process of endogenization and its impact on species development and evolution.

Related entries

References

1. Griffiths D.J.
Endogenous retroviruses in the human genome sequence
Genome Biology 2:reviews1017.1-1017.5 (2001).

2. Tarlinton R.E., Meers J., Young P.R.
Retroviral invasion of the koala genome
Nature 442:79-81 (2006).

Release 52.0 of 06-March-2007

New major release is available (52.0)

UniProt Knowledgebase release 10.0 includes Swiss-Prot release 52.0 and TrEMBL release 35.0.

Release 52.0 of 06-Mar-07 of UniProtKB/Swiss-Prot contains 260'175 sequence entries, comprising 95'002'661 amino acids abstracted from 152'564 references. 18'986 sequences have been added since release 51.0: this represents an increase of 7.3%. In addition, the annotations of 190'910 entries have been revised.

Many improvements were carried out in the last 4 months:

UniProtKB/Swiss-Prot (flat file version) reached 1 gigabyte (GB) in size with this major release! For comparison, the human genome contains 0.791175 GB of data (3.1647×10^9 base pairs, each represented by 2 bits) (Wikipedia).
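
The genome figure quoted above can be reproduced with a few lines of arithmetic. The following is only a minimal sketch in Python; it assumes a decimal gigabyte (10^9 bytes), which is what the quoted value implies, and the function name is purely illustrative.

```python
# Storage needed for the human genome at 2 bits per base pair.
BASE_PAIRS = 3.1647e9   # number of base pairs quoted in the text
BITS_PER_BASE = 2       # four bases (A, C, G, T) fit in 2 bits
BYTES_PER_GB = 1e9      # assumption: decimal gigabyte


def genome_size_gb(base_pairs: float = BASE_PAIRS) -> float:
    """Return the size in GB of a genome stored at 2 bits per base pair."""
    total_bits = base_pairs * BITS_PER_BASE
    return total_bits / 8 / BYTES_PER_GB


print(f"{genome_size_gb():.6f} GB")  # compare with the 0.791175 GB quoted above
```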

Release 51.7 of 20-Feb-2007

Complete human kinome in UniProtKB/Swiss-Prot

Phosphorylation by protein kinases is a universal and fundamental cell-signalling process in eukaryotic cells. A comprehensive catalog of predicted human kinases was published in 2002 (Manning et al.).

We have annotated the 518 protein kinases predicted to exist, and when necessary revised their sequences. The human kinome, as defined by Manning et al., is now complete in UniProtKB/Swiss-Prot!

These protein kinases are subdivided into 10 groups

In addition to these 518 protein kinases, one family of lipid kinases is currently being fully characterized: the phosphatidylinositol 3-kinase (PI3 kinase) family (PI3 kinome). This emerging family appears to also include the phosphatidylinositol 4-kinases (PI4 kinases). PI3 and PI4 kinases share the same catalytic kinase domain. However, this domain is only distantly related to the catalytic domain of the protein kinases, and as a consequence these enzymes belong to a separate family. This lipid kinase family will soon be integrated into UniProtKB/Swiss-Prot.

Mouse kinase orthologs are all in the process of being integrated into UniProtKB/Swiss-Prot. By providing annotated and up-to-date human and mouse kinomes to the scientific community, our knowledgebase becomes a central reference portal for kinases.

Release 51.6 of 06-Feb-2007

One million comment lines in UniProtKB/Swiss-Prot!

Annotation is the focal point of our effort to maintain and develop UniProtKB/Swiss-Prot. Much of our manual annotation is found in the comment lines, which aim to provide a summary of what is known about a protein. There are 27 different types of comment line, which are arranged according to what we designate as 'topics'.

Recently, we reached the milestone of 1 million CC topic lines. About 97% of UniProtKB/Swiss-Prot entries contain at least one CC topic line and, currently, there is an average of 4 different CC topic lines per entry.

Comment lines are mainly free text, but we have already set up a standardised format as well as the use of controlled vocabularies for several topics (ALTERNATIVE PRODUCTS, BIOPHYSICOCHEMICAL PROPERTIES, CATALYTIC ACTIVITY, DISEASE, INTERACTION, MASS SPECTROMETRY, PATHWAY, RNA EDITING, SIMILARITY, TOXIC DOSE...). Standardisation of two further topics - SUBCELLULAR LOCATION and CAUTION - is also under way (more: Forthcoming changes).

The most represented CC topics in UniProtKB/Swiss-Prot are:

Such a distribution reflects the type of experimental biological data which is available for a protein sequence nowadays in the scientific literature.

The data found in UniProtKB/Swiss-Prot are continuously updated and - since annotators are constantly improving their skills in literature-based information retrieval - the 'depth' of manual annotation is always increasing. This is highlighted by the fact that we have increased the average number of CC topics per entry from 3.5 to 4 since March 2004 (see also the release statistics).

Release 51.5 of 23-Jan-2007

Reintroduction of the initiator methionine

In UniProtKB/Swiss-Prot, the sequence data corresponds to the precursor form of a protein, i.e. before post-translational modifications such as cleavage of the signal peptide or other processing. However, for historical reasons, a notable exception was made: when the initiator methionine was post-translationally removed, the sequence stored in UniProtKB/Swiss-Prot did not include the methionine and instead started with the second residue.

As a consequence, our sequence data differed from that shown in other sequence databases where the initiator methionine is usually not removed. This discrepancy was confusing for users and was the subject of one of the most frequently asked questions to UniProtKB/Swiss-Prot.

This is no longer the case. With this release, the initiator methionine has been reintroduced into all UniProtKB/Swiss-Prot entries (over 10'000) from which it is cleaved. This was a major change, since all amino acid positions described in these entries have now been updated to reflect the new sequence numbering.

The cleavage of the initiator methionine is still indicated by the INIT_MET line in the feature table, but the sequence position is now 1 instead of 0. We also added the comment Removed in the description field of the INIT_MET line to indicate that the initiator methionine is indeed removed post-translationally.

Example P51487:

Previous format:

FT   INIT_MET      0      0       
FT   CHAIN         1    400       Phosrestin-1.
...
SQ   SEQUENCE   400 AA;  44781 MW;  DA786D7E9FFB4A29 CRC64;
      VVSVKVFKK ATPNGKVTFY LGRRHFIDHF DYIDPVDGVI VVDPDYLKNR KVFAQLATIY

New format:

FT   INIT_MET      1      1       Removed.
FT   CHAIN         2    401       Phosrestin-1.
...
SQ   SEQUENCE   401 AA;  44912 MW;  1212C2422CD35A94 CRC64;
     MVVSVKVFKK ATPNGKVTFY LGRRHFIDHF DYIDPVDGVI VVDPDYLKNR KVFAQLATIY
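
The renumbering shown in the P51487 example can be sketched in a few lines of Python. This is only an illustration of the principle, not the procedure actually used by UniProtKB/Swiss-Prot: the function name and the tuple layout (key, start, end, description) are hypothetical, and only the first sequence block of the example is used.

```python
from typing import List, Tuple

# Hypothetical in-memory form of a feature: (key, start, end, description).
Feature = Tuple[str, int, int, str]


def reintroduce_init_met(sequence: str,
                         features: List[Feature]) -> Tuple[str, List[Feature]]:
    """Prepend the initiator methionine and shift all positions by one."""
    new_sequence = "M" + sequence
    new_features: List[Feature] = []
    for key, start, end, description in features:
        if key == "INIT_MET":
            # Old convention: position 0. New convention: position 1,
            # with the description 'Removed.'.
            new_features.append((key, 1, 1, "Removed."))
        else:
            # Every other feature moves one residue towards the C-terminus.
            new_features.append((key, start + 1, end + 1, description))
    return new_sequence, new_features


old_sequence = "VVSVKVFKK"  # first block of the previous-format sequence
old_features = [("INIT_MET", 0, 0, ""), ("CHAIN", 1, 400, "Phosrestin-1.")]

new_sequence, new_features = reintroduce_init_met(old_sequence, old_features)
print(new_sequence)
for feature in new_features:
    print(feature)
```

Applied to the previous-format lines above, this yields the INIT_MET position 1 and the CHAIN range 2-401 shown in the new format.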

Release 51.4 of 10-Jan-2007

Complete yeast proteome in UniProtKB/Swiss-Prot

Brewer's yeast or baker's yeast are two common names for the species Saccharomyces cerevisiae, for which the scientifically correct name was first applied to a strain observed in malt circa 1837. These common names neatly reflect the major interests this organism holds for the majority of people. It is one of the earliest "domesticated" organisms, and while initially appreciated for its alcohol producing or dough leavening capabilities, the simple yeast soon became an important organism for research too.

The ease with which yeast can be cultivated and genetically manipulated made it a useful tool in the early days of biotechnological and biomedical research, where it was utilized for the production of pharmaceuticals and enzymes (a name that originates from the Greek 'en zymi' = in yeast). S.cerevisiae has subsequently proven to be an extremely useful experimental model system for the study of the basic biological structures and processes of the eukaryotic cell. It is therefore not surprising that it was one of the first eukaryotic species targeted by large-scale sequencing efforts, and in 1996, researchers were able to celebrate the completion of the first eukaryotic genome sequence.

One decade later, and coincident with the 20th anniversary of Swiss-Prot, yeast is again in the headlines, representing the first complete eukaryotic proteome integrated into Swiss-Prot, the manually curated section of the UniProt knowledgebase. In the current release of UniProtKB/Swiss-Prot there are more than 6000 yeast entries, covering every gene of the yeast genome believed to code for a protein. Each entry contains literature-curated annotations and numerous cross-references, including the locus identifier, which maps a protein to its corresponding genomic locus, and a cross-reference to the Saccharomyces Genome Database (SGD), the community-designated repository for the reference genome sequence. A summary of all yeast entries including these references is listed in the file yeast.txt.

In the 10 years since the initial release of the S.cerevisiae genome, the annotation of protein encoding genes has continually evolved. New open reading frames have been identified and existing predicted ORFs have been revised or retired. In collaboration with SGD we have revisited and updated all entries for which the protein sequence has been changed since the initial release in order to provide users with a set of yeast proteins that corresponds to the most current view of the yeast proteome.

Ten years of post-genomic research have yielded a wealth of information on yeast proteins and we will continually revisit yeast entries to update their functional annotation. S.cerevisiae continues to be at the forefront of experimental molecular biology, particularly in the field of proteomics, and the availability of the complete proteome in UniProtKB/Swiss-Prot will facilitate the mapping and integration of results from large-scale proteomic studies. S.cerevisiae will also serve in the future as one of the model systems for functional annotation in UniProtKB/Swiss-Prot. As one of the best-characterized of the eukaryotic organisms, its proteins will provide many templates for the creation and annotation of fungal-specific or broader eukaryotic protein families.

Release 51.3 of 12-Dec-2006

Major update of a re-emerging pathogen: Dengue virus

Dengue is a mosquito-borne virus found in tropical and sub-tropical regions around the world, predominantly in urban and semi-urban areas in Southeast Asia, Africa, and South America. Dengue virus is transmitted through the bite of Aedes aegypti mosquitoes.

In the 1970s, the disease had receded due to an active vector control program. But since the 1980s, both the virus and its vector have re-emerged and spread even more than before: the disease is now found in more than 100 countries. The reasons for this re-emergence may be the growth of urban areas and the discontinuation of the vector control program.

The virus is transmitted to humans by mosquito bite; it replicates in skin dendritic cells before infecting lymph nodes and blood cells. The symptoms are fever and pain that can last for up to 7 days. In rare cases, human infection leads to dengue haemorrhagic fever (DHF), a potentially lethal complication. Today DHF affects most Asian countries and has become a leading cause of hospitalisation and death among children in several of them.

Some 2500 million people -- two fifths of the world's population -- are now at risk from dengue. WHO currently estimates there may be 50 million cases of dengue infection worldwide every year. The mild autumn of 2006 favoured a prolonged spread of the vector and was responsible for a major outbreak of dengue in India, with many cases in New Delhi.

The growing number of dengue virus sequences (more than 3400 in UniProtKB/TrEMBL) and the absence of a taxonomic nomenclature do not facilitate the identification of medical samples.

In the current UniProtKB/Swiss-Prot release, a systematic nomenclature has been adopted for 28 representative dengue strains, indicating the country and the year of isolation besides the strain name.

Example: Dengue virus type 2 (strain TH-36)
becomes: Dengue virus type 2 (strain Thailand/TH-36/1958)
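
The naming pattern in this example can be expressed as a small formatting function. This is only a sketch: the function and parameter names are hypothetical, and the pattern (country/strain/year) is inferred from the single example above.

```python
def systematic_strain_name(virus: str, strain: str, country: str, year: int) -> str:
    """Build a name of the form 'virus (strain country/strain/year)'."""
    return f"{virus} (strain {country}/{strain}/{year})"


# Values taken from the example in the text.
print(systematic_strain_name("Dengue virus type 2", "TH-36", "Thailand", 1958))
```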

The virus (+)RNA genome codes for a single polyprotein, cleaved into more than 12 products. 32 representative dengue virus polyproteins have been annotated and are available from UniProtKB/Swiss-Prot (e.g. P33478).

Release 51.2 of 28-Nov-2006

All known human G protein-coupled receptor proteins in UniProtKB/Swiss-Prot

The Human Proteome Initiative (HPI) aims to annotate all known human protein sequences, as well as their mammalian orthologs. The G protein-coupled receptor proteins (GPCRs), also known as seven transmembrane receptors (7TM receptors), form one of the largest protein families in mammalian genomes. These proteins are involved in all types of stimulus-response pathways, from intercellular communication to physiological senses, including taste, smell, and vision (opsin receptors). Many diseases are linked to GPCRs and half of the drugs produced by the pharmaceutical industry target GPCRs. A special emphasis has been given to this family in the HPI project.

In the current release, all known and potential human G protein-coupled receptor proteins are annotated and integrated in UniProtKB/Swiss-Prot. 775 human GPCRs are now available in our knowledgebase. About half of all GPCRs are presumed to be involved in the sense of smell. For the remaining half, the active ligand has been documented when available, but about 20% of human GPCRs are still orphans. Most mouse and rat orthologs have also been annotated.

All G protein-coupled receptor proteins annotated in UniProtKB/Swiss-Prot are classified by family and listed in the file 7tmrlist.txt.

Release 51.1 of 14-Nov-2006

CD antigens: molecular markers of cell differentiation

The CD nomenclature was proposed and established in 1982 at the first International Workshop and Conference on Human Leukocyte Differentiation Antigens (HLDA). This nomenclature system was intended for the classification of monoclonal antibodies (mAbs), generated in many laboratories around the world, against various cell surface molecules on leukocytes (white blood cells). The data were collated and analyzed by the statistical procedure of 'cluster analysis'. This analytical method identified clusters of antibodies with very similar patterns of binding to leukocytes at various stages of differentiation: hence the use of the abbreviation 'CD' for 'cluster of differentiation'. CD antibodies are used widely for research, differential diagnosis, monitoring and treatment of disease.

The HLDA workshops assign each CD on the basis of the reactivity of at least two mAbs to one human antigen; the provisional indicator 'w' (for example CDw293) is sometimes given to an imperfectly characterized cluster or to a cluster represented by only one mAb.

Gradually the use of the CD nomenclature has expanded to many other cell types such as endothelial and stromal cells. Therefore the 8th HLDA conference (HLDA8) decided in 2004 that the acronym HLDA would be succeeded by HCDM for "Human Cell Differentiation Molecules".

With this release, all 361 currently defined human CD antigens are annotated and integrated in UniProtKB/Swiss-Prot. In the entries, the CD antigen designation is found as a synonym for the protein name (see for example CD305 antigen, alias Leukocyte-associated immunoglobulin-like receptor 1). The CD name is also propagated to all orthologous mammalian proteins, so that human CD antigens and their orthologs in other mammals can be easily retrieved.

Release 51.0 of 31-Oct-2006

New major release is available (51.0)

Release 51.0 of 31-Oct-06 of UniProtKB/Swiss-Prot contains 241'242 sequence entries, comprising 88'541'632 amino acids abstracted from 148'048 references.

19'061 sequences have been added since release 50.0, the sequence data of 1'336 existing entries has been updated and the annotations of 222'181 entries have been revised.

Many improvements were carried out in the last 5 months:

All the recent changes to the UniProt Knowledgebase format are described in detail in the continuously updated document:

http://www.expasy.org/sprot/relnotes/sp_news.html

UniProt Knowledgebase release 9.0 includes Swiss-Prot release 51.0 and TrEMBL release 34.0. For more information you can also read the release notes for the UniProt Knowledgebase, i.e. Swiss-Prot and TrEMBL.

Release 50.9 of 17-Oct-2006

Human polymorphisms: juggling with health and disease

Recent advances in genomics and proteomics promise to give new insights into the molecular mechanisms of diseases and hopefully will lead to the discovery of novel treatments. The integration of phenotype descriptions along with sequence data, genetic information, as well as physiological, biochemical and structural knowledge may help understand the chain of events leading from a molecular defect to a pathology. In this context, UniProtKB/Swiss-Prot provides the scientific community with a wealth of information on genetic diseases, disease-linked variants and polymorphisms.

In the current release, over 2'000 human entries contain a disease description in the comment section under the topic DISEASE. The disease description is short, but it is supplemented with links to the OMIM database, allowing the retrieval of more detailed information about genetic disorders. Additional links to gene-specific databases can be found in the 'WEB RESOURCE' topic.

At the sequence level, close to 28'500 human single amino acid polymorphisms (SAPs) are described, more than half of which are associated with a disease state and about 30% are linked to the Single Nucleotide Polymorphism database (dbSNP). SAPs are described in the feature table and characterized by a unique identifier (FTId), which gives access to the variant web pages. These pages display a synopsis of relevant information for a given variant, including references, sequence context, as well as residue conservation throughout evolution and structural data, when available (for an example click here). Mutations that cause major changes to a protein sequence (as is the case for most frameshift mutations) are not and will not be considered to be relevant to UniProtKB/Swiss-Prot, as their deleterious effect on a given protein function is usually obvious.

Finally, our medical annotation effort also consists of the creation of keywords to allow easy retrieval of proteins involved in complex disorders and genetically heterogeneous diseases. The top 10 UniProtKB/Swiss-Prot keywords describing a disease are: deafness (105 entries), obesity (57 entries), retinitis pigmentosa (40 entries), diabetes mellitus (39 entries), cardiomyopathy (36 entries), cataract (34 entries), epilepsy (33 entries), dwarfism (32 entries), albinism (25 entries) and Charcot-Marie-Tooth disease (18 entries). Currently about 100 "medical" keywords have been created and the list is growing.

Release 50.8 of 03-Oct-2006

Rice harvest 2006: over 1'000 rice proteins annotated

Rice (Oryza sativa) is the most important food crop in the world and part of the daily diet of over half of the human population. It is grown in 114 countries worldwide and provides 50-80% of the calorie intake in a number of Southeast Asian countries (see world rice statistics).

In the current release, over 1'000 rice entries have been completed in UniProtKB/Swiss-Prot. How?

Following the completion of the first genome sequence of the model plant Arabidopsis thaliana, in 2001 the Swiss-Prot group initiated the Plant Proteome Annotation Program, which focuses on the annotation of plant-specific proteins and protein families. Our major effort was directed towards Arabidopsis, but the completion of the Oryza sativa (cultivar Nipponbare) genome sequence by the IRGSP prompted us to broaden our focus.

Each manually annotated rice entry already contains the TIGR locus identifiers - which map each protein to the corresponding gene in the rice genome - and will soon also include RAP loci. Amongst the numerous cross-references in rice entries is the link to Gramene which gives access to comparative grass genomics. We also plan to link our entries to RAP-DB in the near future, which will provide links to genomic data and genome annotation.

We are currently concentrating on the annotation of well-characterized proteins for which experimental data are available. The function of a number of rice proteins reflects physiological trait adaptation and grain property evolution owing to centuries of selection by farmers (over 100'000 rice varieties exist throughout the world).

As an example, large areas of Southeast Asia are flooded during the monsoon season. Deepwater rice copes with this by way of rapid internode elongation (up to 25 cm/day), and expansin A4 contributes by causing the cell walls to slacken and expand.

What is more, a primary factor that decreases rice crop yield is coastal salinity and the accumulation of salts in irrigated land. Pokkali, an indica variety of lowland rice, is classified as highly tolerant, because it contains a specific potassium-sodium cotransporter (HKT2), which mediates increased potassium uptake with external sodium accumulation.

Finally, the grain texture of cooked rice is essential in various food cultures. A generic classification exists between long grain, medium grain and short grain rice, where the first is separate and fluffy and the last more moist, sticky and tender. The proportion of long chain amylopectin is correlated with firmer cooked rice. A starch synthase (SSII-3), which synthesizes long chain amylopectin, is barely active in the sticky cultivar japonica Nipponbare; however, a variation of 4 amino acids leads to an increased activity in firmer indica varieties.

All rice proteins annotated in UniProtKB/Swiss-Prot are classified by chromosome locus (Ordered locus name starting with "Os") and listed in the file rice.txt. In the future, we plan to manually annotate every rice gene family and to develop semi-automated annotation tools to complete rice proteome annotation.

Release 50.7 of 19-Sep-2006

In search of the origin of HIV-1: the 'missing link' revealed

The origin of Human immunodeficiency virus 1 (HIV-1) has been the subject of hot debate for more than twenty years. In 1999, American, Japanese and French researchers claimed to have discovered an indisputable link between a chimpanzee virus from central West Africa called SIVcpz (Simian Immunodeficiency Virus from chimpanzees) and HIV-1. SIVcpz is 70-90% identical to HIV-1 and does not appear to cause illness in chimpanzees.

However, since SIVcpz was only found in a few chimpanzees held in captivity, the possibility existed that another, as yet unidentified, species could be the natural reservoir of both HIV-1 and SIVcpz.

A recent study (Science 313, 523-526 (2006)) provides for the first time a clear picture of the origin of HIV-1 and the seeds of the AIDS pandemic. New strains of SIVcpz have been identified in wild chimpanzees from Cameroon. These new strains are more closely related to human HIV-1 than to any Simian viruses.

There are three HIV-1 lineages: M (Major), O (Outlier) and N (New). The new SIVcpz isolate MB66 turned out to be more closely related to HIV-1 group M than to any Simian virus (see a similarity search for SIVcpz MB66 gag-pol protein). Moreover, another wild virus, SIVcpz isolate EK505, is very closely related to HIV-1 group N. This suggests that at least two independent SIVcpz transfers from chimpanzee to man occurred in this region. HIV-1 group M presumably crossed species early in the 20th century. HIV-1 group N may have infected humans more recently.

The authors of the study also postulate that "given the extensive genetic diversity and phylogeographical clustering of SIVcpz now recognised and the vast areas of west central Africa not yet sampled, it is quite possible that still other SIVcpz lineages exist that could pose risks for human infection and prove problematic for HIV diagnostics and vaccines."

Proteins from SIVcpz isolates MB66 and EK505 are fully annotated and available from UniProtKB/Swiss-Prot.

Release 50.6 of 05-Sep-2006

A thing of beauty is a joy forever (*)

3D-structure information is now available for over 10'000 proteins in UniProtKB/Swiss-Prot.

Protein structures not only delight the eye, they shed light on protein architecture and provide proof for the existence of a given protein fold. They are indispensable for determining the interactions of a protein with its ligands (substrates, ions, cofactors or regulatory molecules) and provide solid proof for post-translational modifications. Likewise, 3D-structures pinpoint the exact position of residues that cause a genetic disease when mutated (example: Q8NBK3). They help to design experiments and make it possible to attribute a function to so-far hypothetical proteins (Q46856).

UniProtKB aims to be fully synchronized with PDB and provide access to information about protein 3D-structures via cross-references to PDB, and by giving high priority to the annotation of proteins with known 3D-structures. A semi-automated mapping procedure was established in collaboration with the Macromolecular Structure Database (MSD), so that the whole PDB archive could be mapped to UniProtKB.

3D-structures are now available for 10'006 entries in UniProtKB/Swiss-Prot, corresponding to 36'671 individual cross-references to PDB. These entries can be retrieved by a search with the keyword '3D-structure'.

(*) From John Keats' epic poem, Endymion, 1818

Release 50.5 of 22-Aug-2006

10'000 species in UniProtKB/Swiss-Prot!

We now have 10'000 different species represented in UniProtKB/Swiss-Prot for which protein entries are stored in the knowledgebase. Ten times more species are stored in UniProtKB/TrEMBL. Each species present in UniProtKB/Swiss-Prot is curated: the curation consists of verifying the validity of the scientific name, the consistency of the lineage and the existence of a common name and/or synonym. You think the taxonomy is indigestible? Have a look at the following recipe ;-)

Pizza recipe

Pizza is not a new program, it is really a delicious and tasteful recipe!

Pizza crust:   Toppings:   Homemade tomato sauce:
(*) Lactobacillus helveticus is used for the manufacture of these 2 cheeses

Add fresh Saccharomyces cerevisiae to the water and stir until dissolved. Add Beta vulgaris sugar, Olea europaea oil, salt and Triticum aestivum powder. On lightly floured board, knead dough until smooth and elastic. Place in a bowl and let rise in a warm place until volume has doubled.

Heat Olea europaea oil in a wide frying pan over medium heat; add Allium cepa and cook for about 10 minutes until softened, stirring often. Turn the heat up to high and add Allium sativum, herbs (Ocimum basilicum, Origanum vulgare and Petroselinum crispum) and Lycopersicon esculentum paste. Add Capsicum annuum powder and season to taste with salt.

Let simmer for at least 30 minutes.

Roll dough into a large circle, place on greased baking sheet, press around edges to form 2 cm rim. Cover with homemade tomato sauce. Layer toppings on dough in order listed. Bake at 240°C for 13 minutes until nicely coloured. You can top the pizza with a few leaves of Diplotaxis tenuifolia (it tastes hotter than Eruca sativa).

You uncovered 18 species in our recipe, but 9'982 other species are now in UniProtKB/Swiss-Prot.

ENJOY :)

Release 50.4 of 25-Jul-2006

Happy anniversary, Swiss-Prot!

On July 21st 1986, the first Swiss-Prot release was created. It contained close to 4'000 protein sequence entries and was produced by a single graduate student, Amos Bairoch, at the University of Geneva. In 1996, while Swiss-Prot was rapidly growing (60'000 entries) and was used worldwide, the granting agencies could not find a solution to finance it. Without the support of thousands of users, Swiss-Prot would not be celebrating its 20th anniversary today! This financial crisis was solved by the creation of the Swiss Institute of Bioinformatics (SIB), and additional resources were provided by license fees paid by commercial users, Swiss-Prot remaining freely accessible to the academic community.

The first Swiss-Prot annotators used to annotate protein sequences concomitantly with the submission of the nucleotide coding sequences to the EMBL database. However, the increase in submissions made it impossible to keep pace. In collaboration with the European Bioinformatics Institute (EBI), a solution was found in 1996 with the creation of TrEMBL, a computer-annotated supplement to Swiss-Prot, which contained roughly 60'000 entries in its first release.

In 2006, a staff of 60 annotators at the SIB and the EBI, supported by a dedicated programming team, maintains Swiss-Prot. Close to 250'000 entries are currently in the knowledgebase. Interestingly, 10 years were necessary to reach the first 50'000 protein sequence entries, while 50'000 proteins can now be manually annotated in about 18 months. In parallel, TrEMBL's exponential growth has resulted in a database containing close to 3 million entries.

Since 2002, both databases have been at the heart of the UniProt project, and together they constitute the UniProt Knowledgebase (UniProtKB), one of 3 UniProt components. UniProt is produced by a collaboration between 3 institutes: SIB, EBI and PIR (Protein Information Resource). This single, centralized, authoritative resource for protein sequences and functional information aims to make protein data available, to facilitate their retrieval and to provide new tools to help in their analysis. Since Swiss-Prot became UniProtKB/Swiss-Prot, access to the knowledgebase has once again been free for commercial users. Currently 160 people are involved in the UniProt services to the scientific community.

The means have changed, but a graduate student's 20-year-old key idea of sharing knowledge is still vivid, now more than ever.

Release 50.3 of 11-Jul-2006

Of mice and men: more than 10'000 orthologous sequence pairs in UniProtKB/Swiss-Prot

Comparisons of orthologous proteins between mammalian species contribute greatly to understanding the biological basis underlying disease susceptibility or responsiveness to drugs, or simply to understanding what makes us human and not simply another great ape.

Human protein sequences and those of all available mammalian orthologous sequences are annotated and compared in the frame of the UniProtKB/Swiss-Prot HPI annotation program (Human Proteomics Initiative). During the annotation process, sequence length, alternative splicing isoforms or even polymorphisms can be validated. In order to provide our users with a coherent view of mammalian proteomes, similar isoforms are shown for orthologous proteins from all mammalian species whenever possible.

The laboratory mouse is a widely used model organism and thus many murine sequences are available for annotation. It is currently the most highly represented non-human mammal with more than 11'000 entries, and 91% of these entries are orthologous to human proteins. Human-mouse orthologous pairs share 85% identity on average. About 36% of these pairs have identical sequence length and share 94% identity. The most highly conserved proteins are involved in core biological processes such as mRNA processing and transport, translation and ubiquitin-dependent protein degradation. In contrast, fast evolving proteins generally play roles in immunity, reproduction and signal transduction.

The percentage of identity between orthologous protein pairs in the most highly represented mammals in UniProtKB/Swiss-Prot is shown in the table below:

            Orangutan   Bovine   Mouse     Rat
Human           97.43    87.37   85.46   85.80
Orangutan                89.34   87.44   87.20
Bovine                           83.99   84.80
Mouse                                    93.48
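The figures in the table are percentages of identical residues between orthologous sequences. As a rough illustration only, the sketch below computes such a percentage for one pair of sequences that have already been aligned. The function name, the gap character and the choice of denominator (aligned columns without a gap) are assumptions made for this example; the text above does not state which convention was used for the table, and other conventions give slightly different values.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns in which neither sequence has a gap.

    Assumption: both inputs come from the same pairwise alignment and
    therefore have the same length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    compared = 0   # columns where both sequences have a residue
    identical = 0  # compared columns where the residues match

    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # skip gapped columns
        compared += 1
        if res_a == res_b:
            identical += 1

    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Toy aligned fragments (invented for illustration, not real proteins):
# 8 gap-free columns, 7 of them identical.
print(percent_identity("MKT-AYIAK", "MKTQAYLAK"))
```

Averaging this value over all orthologous pairs of two species would give one cell of a table like the one above, under the stated assumptions.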

UniProtKB/Swiss-Prot entries for orthologous proteins usually share the same protein mnemonic code in the ID line and thus can be easily identified.
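Since entry names have the form PROTEIN_SPECIES, the shared mnemonic can be used to collect candidate orthologs from a list of entry names. The following is a minimal sketch; the function name and the sample input are illustrative assumptions, and because orthologs only "usually" share a mnemonic, the grouping is a heuristic rather than proof of orthology.

```python
from collections import defaultdict


def group_by_mnemonic(entry_names):
    """Group entry names of the form PROTEIN_SPECIES by their protein mnemonic.

    Returns a dict mapping each protein mnemonic to the list of species
    codes seen for it, in input order.
    """
    groups = defaultdict(list)
    for name in entry_names:
        protein, sep, species = name.partition("_")
        if not sep or not protein or not species:
            raise ValueError(f"not a PROTEIN_SPECIES entry name: {name!r}")
        groups[protein].append(species)
    return dict(groups)


# Illustrative input list of entry names.
names = ["P53_HUMAN", "P53_MOUSE", "ALBU_HUMAN", "ALBU_BOVIN", "P53_RAT"]
for protein, species in group_by_mnemonic(names).items():
    print(protein, species)
```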

Release 50.2 of 27-June-2006

Looking for Titin

"I am looking for Titine", Charlie Chaplin sang in Modern Times. While for many people Titin brings back memories of this song, for the scientific community the meaning is completely different. Titin is a giant sarcomeric protein of roughly 35'000 aa. Protein analysis programs used to crash when encountering huge proteins, and the size limit for a protein to be integrated into UniProtKB/Swiss-Prot used to be under 10'000 aa. Modern times finally arrived and bioinformatics has improved by leaps and bounds. Programs are now able to deal with huge proteins and titin has finally been integrated into UniProtKB/Swiss-Prot.

Titin is a long (up to 1 micron), slender and flexible strand, frequently with a large globule at one end. It has a complex modular structure that varies depending on the splicing events. In its longest form it may contain up to 132 fibronectin type-III domains, 152 Ig-like domains, 9 Kelch, 17 RCC1, 14 TPR, 15 WD and 31 PEVK repeats and 1 protein kinase domain. Titin functions as a mechanical sensor through its interaction with many other proteins, such as myomesins, tropomyosins, myosins, actins, myopalladin, etc. By providing connections at the level of individual microfilaments, it contributes to the fine balance of forces between the two halves of the sarcomere and thus to muscle extensibility. In non-muscle cells, it seems to play a role in chromosome condensation and segregation during mitosis.

Needless to say, the titin-seeking of Charlie Chaplin was a legitimate demand, because all human beings need titin in their life.

Release 50.1 of 13-June-2006

Man gave names to all the... proteins

We have spent many years curating all kinds of proteins from all kinds of species. One recurring challenge is to offer an easily searchable and consistent knowledgebase dealing, in particular, with many ambiguities and discrepancies regarding protein names. Nomenclature is not only indispensable for communication, but also for literature search and entry retrieval. We feel that our experience in this field can be valuable, and that we can play a role in helping the standardization of protein nomenclature.

To take up this challenge, we created a new document which describes guidelines used by UniProtKB/Swiss-Prot annotators to give each entry the most appropriate name, called the "Recommended name" (RN). In short, an RN should follow the approved nomenclature, if it exists, and should be unique and attributed to all orthologs. Other rules deal mostly with the syntax of submitted protein names in order to have consistent and reproducible RNs in spite of the variability observed in various submissions. If our RN differs from the submitted one, the latter is kept as "alternative name". In this way we enhance the searchability, as well as the consistency, of our database.

We sincerely hope that researchers will adhere as much as possible to these guidelines for naming new proteins when publishing or submitting their data. This will make their results easily searchable, allow tracking of a given protein across related organisms and help us in our continuing effort to standardize nomenclature.

Release 50.0 of 30-May-2006

New major release is available (50.0)

Release 50.0 of 30-May-2006 of UniProtKB/Swiss-Prot contains 222'289 sequence entries, comprising 81'585'146 amino acids abstracted from 142'438 references.

15'220 sequences have been added since release 49.0, the sequence data of 953 existing entries has been updated and the annotations of 190'604 entries have been revised. This represents an increase of 8%.

Many improvements were carried out in the last 3 months:

All the recent