
                    SWISS-PROT RELEASE 29.0 RELEASE NOTES

                               1. INTRODUCTION

   1.1  Evolution

   Release 29.0 of SWISS-PROT contains 38303 sequence entries, comprising
   13'464'008 amino acids abstracted from 36638 references. This represents
   an increase of 7.7% in amino acids (6.4% in entries) over release 28.
   The recent growth of the data bank is summarized below.

   Release    Date   Number of entries   Number of amino acids

   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465
   9.0        11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399
   16.0       11/90              18364             5 986 949
   17.0       02/91              20024             6 524 504
   18.0       05/91              20772             6 792 034
   19.0       08/91              21795             7 173 785
   20.0       11/91              22654             7 500 130
   21.0       03/92              23742             7 866 596
   22.0       05/92              25044             8 375 696
   23.0       08/92              26706             9 011 391
   24.0       12/92              28154             9 545 427
   25.0       04/93              29955            10 214 020
   26.0       07/93              31808            10 875 091
   27.0       10/93              33329            11 484 420
   28.0       02/94              36000            12 496 420
   29.0       06/94              38303            13 464 008
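
   The growth between the last two releases can be recomputed from the
   table. The following Python sketch is an illustration only (the names
   RELEASES and percent_increase are ours and are not part of the
   distribution); it derives the relative increase in entries and in
   amino acids from the rows for releases 28.0 and 29.0.

```python
# Figures copied from the rows for releases 28.0 and 29.0 in the table above.
RELEASES = {
    "28.0": {"entries": 36000, "amino_acids": 12_496_420},
    "29.0": {"entries": 38303, "amino_acids": 13_464_008},
}


def percent_increase(old: int, new: int) -> float:
    """Return the relative growth from old to new, in percent."""
    return (new - old) / old * 100.0


entry_growth = percent_increase(
    RELEASES["28.0"]["entries"], RELEASES["29.0"]["entries"]
)
residue_growth = percent_increase(
    RELEASES["28.0"]["amino_acids"], RELEASES["29.0"]["amino_acids"]
)

print(f"entries:     +{entry_growth:.1f}%")
print(f"amino acids: +{residue_growth:.1f}%")
```

   The 7.7% figure corresponds to the amino acid count; the number of
   entries grew somewhat less.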

   1.2  Source of data

   Release 29.0  has been  updated using protein sequence data from release
   40.0 of  the PIR (Protein Identification Resource) protein data bank, as
   well as translation of nucleotide sequence data from release 38.0 of the
   EMBL Nucleotide Sequence Database.

   As an indication of the source of the sequence data in the SWISS-PROT
   data bank, we list here the statistics concerning the DR (Database
   cross-references) pointer lines:

   Entries with pointer(s) to only PIR entry(ies):           4682
   Entries with pointer(s) to only EMBL entry(ies):          5191
   Entries with pointer(s) to both EMBL and PIR entry(ies): 27691
   Entries with no pointer lines:                             739
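
   As a quick consistency check, the four categories can be summed and
   expressed as shares of the data bank. This is a minimal sketch using
   only the counts listed above; the dictionary labels are ours.

```python
# Counts copied from the DR pointer statistics above.
POINTER_COUNTS = {
    "PIR only": 4682,
    "EMBL only": 5191,
    "EMBL and PIR": 27691,
    "no pointer lines": 739,
}

total = sum(POINTER_COUNTS.values())
shares = {label: count / total * 100.0 for label, count in POINTER_COUNTS.items()}

for label, count in POINTER_COUNTS.items():
    print(f"{label:<18}{count:>6}  {shares[label]:5.1f}%")
print(f"{'total':<18}{total:>6}")
```

   The four counts add up to the 38303 entries of this release.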



      2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 28

   2.1  Sequences and annotations

   About 2320 sequences have been added since release 28, the sequence data
   of 351  existing entries  has been  updated and  the annotations of 7300
   entries have been revised.

   We are continuing the process of cleaning up the various representations
   of domains in the feature lines (especially the usage of the feature
   keys "DOMAIN", "REPEAT", "DNA_BIND", and "SITE"). We have also undertaken
   an overall revision of the CC topics "SUBCELLULAR LOCATION", "SUBUNIT",
   and "CAUTION".

   2.2  What's happening with the model organisms

   As we announced in the last two releases, we have selected a number of
   organisms that are the target of genome sequencing and/or mapping
   projects and for which we intend to:

   -  Be as  complete as  possible. All sequences available at a given time
      should be  immediately included  in SWISS-PROT.  This  also  includes
      sequence corrections and updates.
   -  Provide a high level of annotations.
   -  Provide cross-references to specialized database(s) that contain,
      among other data, some genetic information about the genes that code
      for these proteins.
   -  Provide specific indices or documents.

   What was  done since  the last  release or  in preparation  for the next
   release:

   -  We have  added Homo  sapiens (human) as the sixth model organism (see
      the next section for some additional information).
   -  In the  next release  we will  add Bacillus  subtilis as  the seventh
      organism. We will link SWISS-PROT to the SubtiList database currently
      being designed by Ivan Moszer of the Pasteur Institute in Paris.
   -  The next  release of  LISTA will  include accession  numbers for each
      gene entry;  we will  therefore be able to cross-reference SWISS-PROT
      to LISTA.

   Here is the current status of the model organisms:

   Organism        Database                    Index file       Number of
                   cross-referenced                             sequences
   --------------  --------------------------  --------------   ---------
   B.subtilis      SubtiList (in preparation)  In preparation         563
   C.elegans       WormPep                     CELEGANS.TXT           679
   D.discoideum    DictyDB                     DICTY.TXT              198
   D.melanogaster  FlyBase                     In preparation         600
   E.coli          EcoGene                     ECOLI.TXT             2674
   H.sapiens       MIM                         MIMTOSP.TXT           2862
   S.cerevisiae    LISTA (in preparation)      YEAST.TXT             1951


   2.3  Human genetic diseases

   We have made a major effort to incorporate into SWISS-PROT data
   relevant to human genetic diseases. This effort has mainly resulted in
   the following enhancements:

   a) In sequence entries associated with one or more genetic diseases, we
      have updated and expanded the annotations characterizing those
      diseases. These annotations are stored in the CC line topic
      'DISEASE'.

      Examples:

   CC   -!- DISEASE: DEFECTS IN GALNS ARE A CAUSE OF  MUCOPOLYSACCHARIDOSIS
   CC       TYPE IVA (MPS IVA) (ALSO KNOWN AS MORQUIO A SYNDROME) WHICH IS
   CC       CHARACTERIZED BY SPECIFIC SPONDYLOEPIPHYSEAL DYSPLASIA, SHORT
   CC       TRUNK DWARFISM, COXA VALGA, ODONTOID HYPOPLASIA, CORNEAL
   CC       OPACITIES, PRESERVATION OF INTELLIGENCE, AND EXCESSIVE URINARY
   CC       EXCRETION OF KERATAN SULFATE AND CHONDROITIN-6-SULFATE.

   CC   -!- DISEASE: DEFECTS IN KRT9 ARE A CAUSE OF EPIDERMOLYTIC
   CC       PALMOPLANTAR KERATODERMA (EPPK), AN AUTOSOMAL DOMINANT DISEASE
   CC       CHARACTERIZED BY DIFFUSE THICKENING OF THE EPIDERMIS ON THE
   CC       ENTIRE SURFACE OF PALMS AND SOLES SHARPLY BORDERED WITH
   CC       ERYTHEMATOUS MARGINS.


   b) We have entered in SWISS-PROT all the mutations linked with genetic
      diseases or polymorphisms, as long as they are not frameshift or
      nonsense mutations. These mutations are described in the feature
      table ('VARIANT' key) and the relevant references have been added.

      Partial example  (from entry  P07949 / KRET_HUMAN) describing the RET
      protein which is linked with the diseases MEN2A, MEN2B, MTC and HSCR:

   RN   [7]
   RP   VARIANT MEN2B MET-929.
   RM   94159102
   RA   HOFSTRA R.M.W., LANDSVATER R.M., CECCHERINI I., STULP R.P.,
   RA   STELWAGEN T., LUO Y., PASINI B., HOEPPENER J.W.M.,
   RA   VAN AMSTEL H.K.P., ROMEO G., LIPS C.J.M., BUYS C.H.C.M.;
   RL   NATURE 367:375-376(1994).
   RN   [8]
   RP   VARIANTS HSCR PRO-765; GLN-897 AND GLY-972.
   RM   94159103
   RA   ROMEO G., RONCHETTO P., LUO Y., BARONE V., SERI M., CECCHERINI I.,
   RA   PASINI B., BOCCIARDI R., LERONE M., KAARLAINEN H., MARTUCCIELLO G.;
   RL   NATURE 367:377-378(1994).


   FT   VARIANT     765    765       S -> P (IN HSCR).
   FT   VARIANT     897    897       R -> Q (IN HSCR).
   FT   VARIANT     929    929       T -> M (IN MEN2B).
   FT   VARIANT     972    972       R -> G (IN HSCR).
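
   The VARIANT lines shown above are regular enough to be read by a
   program. The following Python sketch parses them into records. The
   regular expression is inferred from these four example lines only and
   is an assumption; real entries may use other description layouts that
   it does not cover. The names Variant and parse_variant are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Variant(NamedTuple):
    start: int
    end: int
    original: str
    replacement: str
    disease: str


# Pattern inferred from the four example lines only (an assumption).
VARIANT_RE = re.compile(
    r"^FT\s+VARIANT\s+(\d+)\s+(\d+)\s+([A-Z]) -> ([A-Z]) \(IN ([^)]+)\)\.$"
)


def parse_variant(line: str) -> Optional[Variant]:
    """Parse one FT VARIANT line, or return None if it does not match."""
    match = VARIANT_RE.match(line.strip())
    if match is None:
        return None
    start, end, original, replacement, disease = match.groups()
    return Variant(int(start), int(end), original, replacement, disease)


EXAMPLE = """\
FT   VARIANT     765    765       S -> P (IN HSCR).
FT   VARIANT     897    897       R -> Q (IN HSCR).
FT   VARIANT     929    929       T -> M (IN MEN2B).
FT   VARIANT     972    972       R -> G (IN HSCR).
"""

variants = [v for v in map(parse_variant, EXAMPLE.splitlines()) if v is not None]
for v in variants:
    print(f"{v.original}{v.start}{v.replacement}  {v.disease}")
```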




   c) A new  CC topic  'POLYMORPHISM' has been implemented. Examples of its
      use:

   CC   -!- POLYMORPHISM: THE ALLELIC FORM OF THE ENZYME WITH GLN-191
   CC       HYDROLYZES PARAOXON WITH A LOW TURNOVER NUMBER AND THE ONE WITH
   CC       ARG-191 WITH A HIGH TURNOVER NUMBER.

   CC   -!- POLYMORPHISM: OVER 80 VARIANTS OF HUMAN DBP HAVE BEEN
   CC       IDENTIFIED. THE THREE MOST COMMON ALLELES ARE CALLED GC1F,
   CC       GC1S, AND GC2. THE SEQUENCE SHOWN IS THAT OF THE GC2 ALLELE.
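
   CC topics such as 'DISEASE' and 'POLYMORPHISM' can be collected by a
   program by looking for the '-!-' marker that opens each comment block.
   The following Python sketch assumes the layout seen in the examples
   above (a topic name followed by a colon, then continuation lines); the
   function name parse_cc_topics is hypothetical. The two example blocks
   come from different entries and are concatenated here only for the
   demonstration.

```python
from typing import Dict, List


def parse_cc_topics(lines: List[str]) -> Dict[str, List[str]]:
    """Group CC comment blocks by topic; each block starts with '-!-'."""
    topics: Dict[str, List[str]] = {}
    current: List[str] = []
    topic = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("CC"):
            continue
        body = line[2:].strip()
        if body.startswith("-!-"):
            # A new block begins: store the previous one, if any.
            if topic is not None:
                topics.setdefault(topic, []).append(" ".join(current))
            topic, _, text = body[3:].strip().partition(":")
            current = [text.strip()]
        elif topic is not None:
            current.append(body)
    if topic is not None:
        topics.setdefault(topic, []).append(" ".join(current))
    return topics


EXAMPLE = """\
CC   -!- POLYMORPHISM: THE ALLELIC FORM OF THE ENZYME WITH GLN-191
CC       HYDROLYZES PARAOXON WITH A LOW TURNOVER NUMBER AND THE ONE WITH
CC       ARG-191 WITH A HIGH TURNOVER NUMBER.
CC   -!- POLYMORPHISM: OVER 80 VARIANTS OF HUMAN DBP HAVE BEEN
CC       IDENTIFIED. THE THREE MOST COMMON ALLELES ARE CALLED GC1F,
CC       GC1S, AND GC2. THE SEQUENCE SHOWN IS THAT OF THE GC2 ALLELE.
"""

topics = parse_cc_topics(EXAMPLE.splitlines())
for text in topics["POLYMORPHISM"]:
    print(text)
```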

   d) New keywords have been introduced:

      - 'DISEASE MUTATION' is used  for sequences in which there is at least
        one known disease-inducing mutation.
      - 'POLYMORPHISM' is used in  each entry  where "neutral" variants have
        been found (at the level of the protein sequence).
      - 'CHROMOSOMAL TRANSLOCATION' is used to indicate proteins whose genes
        are known to be involved in chromosomal translocations.
      - Keywords have been implemented for genetic diseases linked with more
        than a single gene/protein. These keywords are:

   ALBINISM
   ALZHEIMER'S DISEASE
   AMYOTROPHIC LATERAL SCLEROSIS
   ATHEROSCLEROSIS
   AUTOIMMUNE ENCEPHALOMYELITIS
   AUTOIMMUNE UVEITIS
   BERNARD SOULIER SYNDROME
   CHARCOT-MARIE-TOOTH DISEASE
   CHRONIC GRANULOMATOUS DISEASE
   COCKAYNE'S SYNDROME
   DEJERINE-SOTTAS SYNDROME
   DIABETES
   DOWN'S SYNDROME
   DWARFISM
   ELLIPTOCYTOSIS
   EMPHYSEMA
   GAUCHER DISEASE
   GLYCOGEN STORAGE DISEASE
   GM2-GANGLIOSIDOSIS
   GOUT
   HEMOPHILIA
   HEREDITARY HEMOLYTIC ANEMIA
   HYPERLIPOPROTEINEMIA
   LEBER'S HEREDITARY OPTIC NEUROPATHY
   MAPLE SYRUP URINE DISEASE
   METACHROMATIC LEUCODYSTROPHY
   MUCOPOLYSACCHARIDOSIS
   PHENYLKETONURIA
   PSEUDOHERMAPHRODITISM
   RETINITIS PIGMENTOSA
   SCID
   SYSTEMIC LUPUS ERYTHEMATOSUS
   THROMBOPHILIA
   VON WILLEBRAND DISEASE
   XERODERMA PIGMENTOSUM

   e) The GDB list of genes has been used to update the GN (gene name) line
      of many SWISS-PROT entries.


   2.4  Changes in the DR line

   We  have  added  cross-references  from  SWISS-PROT  to  two  additional
   databases:

   -  The G-protein-coupled receptor database (GCRDb) prepared by Lee
      Frank Kolakowski at the Massachusetts General Hospital Renal Unit.
      Reference: Kolakowski L.F. Jr.; Receptors Channels In press(1994).

   -  The Maize  Genome Database  (MaizeDB) developed by the USDA-ARS Maize
      Genome Project  as part  of the National Agricultural Library's Plant
      Genome Research Program.


   These cross-references are present in the DR lines:

   Data bank identifier:    GCRDB
   Primary identifier  :    Unique identifier  (accession  number)  of  the
                            entry
   Secondary identifier:    None; a dash '-' is stored in that field
   Example             :    DR   GCRDB; GCR_0087; -.


   Data bank identifier:    MAIZEDB
   Primary identifier  :    'Gene-product' accession ID
   Secondary identifier:    None; a dash '-' is stored in that field
   Example             :    DR   MAIZEDB; 25342; -.
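
   The two DR examples above share the same three-field layout (data bank
   identifier, primary identifier, secondary identifier). A minimal Python
   sketch for splitting such a line follows. It assumes exactly this
   layout, with fields separated by semicolons and a final period;
   cross-references to other data banks may differ. The names
   CrossReference and parse_dr_line are hypothetical.

```python
from typing import NamedTuple


class CrossReference(NamedTuple):
    databank: str
    primary: str
    secondary: str


def parse_dr_line(line: str) -> CrossReference:
    """Split a DR line into its three fields, as laid out in the examples."""
    text = line.strip()
    if not text.startswith("DR ") or not text.endswith("."):
        raise ValueError(f"not a DR line: {line!r}")
    # Drop the 'DR' line code and the final period, then split on ';'.
    fields = [field.strip() for field in text[2:-1].split(";")]
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    return CrossReference(*fields)


for example in ("DR   GCRDB; GCR_0087; -.", "DR   MAIZEDB; 25342; -."):
    print(parse_dr_line(example))
```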


   We  have   removed  from   SWISS-PROT  cross-references   to  TFD   (the
   Transcription Factor  Database). The  main reason  for this  decision is
   that the  information stored  in the 'polypeptide' table of TFD does not
   expand on the data present in the corresponding SWISS-PROT entry.


   2.5  Status of the documentation files

   SWISS-PROT is distributed with a large number of documentation files.
   Some of these files have been available for a long time (the user
   manual, release notes, the various indices for authors, citations,
   keywords, etc.), but many have been created recently and we are
   continuously adding new files. The following table lists all the
   documents that are currently available or that will be added in the next
   few months.

   USERMAN .TXT   User manual
   RELNOTES.TXT   Release notes
   SHORTDES.TXT   Short description of entries in SWISS-PROT

   JOURLIST.TXT   List of abbreviations for journals cited
   KEYWLIST.TXT   List of keywords in use
   SPECLIST.TXT   List of organism identification codes
   EXPERTS .TXT   List of on-line experts for PROSITE and SWISS-PROT [1, 3]

   ACINDEX .TXT   Accession number index
   AUTINDEX.TXT   Author index
   CITINDEX.TXT   Citation index
   KEYINDEX.TXT   Keyword index
   SPEINDEX.TXT   Species index

   7TMRLIST.TXT   List of 7-transmembrane G-linked receptor entries
   CDLIST  .TXT   CD nomenclature for surface proteins of human leucocytes
   CELEGANS.TXT   Index  of   Caenorhabditis  elegans   entries  and  their
                  corresponding  gene   designations  and   WormPep  cross-
                  references
   DICTY   .TXT   Index  of  Dictyostelium  discoideum  entries  and  their
                  corresponding   gene   designations  and  DictyDB  cross-
                  references
   EC2DTOSP.TXT   Index of  Escherichia coli  Gene-protein database entries
                  referenced in SWISS-PROT
   ECOLI   .TXT   Index of  Escherichia coli  K12 chromosomal  entries  and
                  their corresponding EcoGene cross-reference [4]
   EMBLTOSP.TXT   Index of EMBL Database entries referenced in SWISS-PROT
   GLYCOSYL.TXT   Index of  glycosyl hydrolases  classified by  families on
                  the basis of sequence similarities [2]
   HOXLIST .TXT   Vertebrate homeobox proteins: nomenclature and index
   MIMTOSP .TXT   Index of MIM entries referenced in SWISS-PROT
   NOMLIST .TXT   List of nomenclature related references for proteins [1]
   PDBTOSP .TXT   Index of Brookhaven PDB entries referenced in SWISS-PROT
   PLASTID .TXT   List of chloroplast and cyanelle encoded proteins
   RESTRIC .TXT   List of restriction enzymes and methylases entries [1]
   RIBOSOMP.TXT   Index of ribosomal proteins classified by families on the
                  basis of sequence similarities [2]
   YEAST   .TXT   Index  of  Saccharomyces  cerevisiae  entries  and  their
                  corresponding gene designations
   YEAST11 .TXT   Yeast Chromosome XI entries [1]


   Notes:

   [1]  New in release 29.
   [2]  Will be available starting with release 30 in October 1994.
   [3]  The list  of on-line  experts used to be an appendix of the release
        notes. We now provide it as a separate document.
   [4]  The format  of this  file was slightly modified in this release; we
        added a field that  indicates  if the  3D  structure  of  an E.coli
        protein is available in PDB.


   2.6  The ExPASy World-Wide Web server

        2.6.1  Background information

   The World-Wide Web (WWW), which originated at CERN, is a powerful global
   information system merging networked information retrieval and
   hypertext. It gives access, using hypertext links, to the documents and
   information contained in all the existing WWW servers around the world,
   as well as to the data obtainable through other information retrieval
   systems like WAIS, Gopher, X500, etc. To access a WWW server, one has to
   run on a local computer a client program (a WWW browser), which displays
   hypertext documents. The user can then either request a keyword search
   or jump to another document by following a hypertext link. WWW has the
   outstanding advantages of extending the hypertext model to the whole
   world (by allowing hypertext jumps to documents anywhere on the
   Internet) and of being device and user-interface independent (browsers
   exist for a variety of computers and user interfaces, including Unix
   workstations running X Windows, Macintoshes and PCs with Microsoft
   Windows).

   The ExPASy  WWW server  allows access, using the user-friendly hypertext
   model,  to  the  SWISS-PROT,  PROSITE,  SWISS-2DPAGE  and  SWISS-3DIMAGE
   databases and,  through any  SWISS-PROT protein sequence entry, to other
   databases such  as EMBL, PROSITE, REBASE, FlyBase, GCRDb, MaizeDB, OMIM,
   PDB and Medline. Using a browser that is able to display images, one can
   also remotely access 2D gel image data from SWISS-2DPAGE.

   A WWW  server can  be accessed  on  the  internet  through  its  Uniform
   Resource Locator  (URL), the addressing system defined by the WWW model.
   The URL for the ExPASy WWW server is:

                           http://expasy.hcuge.ch/
   or
                            http://129.195.254.61/

   To access a WWW server, you need to run a browser (or client) program on
   your local computer. Browsers exist for a variety of machines and may be
   obtained by anonymous ftp. ExPASy can be used with any WWW browser, but
   we recommend NCSA Mosaic. It is a very flexible and powerful browser
   with a graphical user interface, available for Unix boxes using
   X11/Motif, for Apple Macintoshes and for Microsoft Windows. You can get
   it from the FTP site: ftp.ncsa.uiuc.edu.

   To access  all the  data available  from SWISS-2DPAGE,  the user's local
   computer needs  to run  an image  viewing program.  For most browsers on
   Unix workstations  the default  program is  xv, a shareware application.
   Similar Windows  or Apple  shareware or  public domain  applications are
   also available.

   For more  information on  the  ExPASy  WWW  server,  you  can  read  the
   following article:

      Appel R.D., Bairoch A., Hochstrasser D.F.
      A new  generation of  information retrieval tools for biologists: the
      example of the ExPASy WWW server.
      Trends Biochem. Sci. 19:258-260(1994).


   Or you can contact Dr. Ron Appel:

      Email: appel@cih.hcuge.ch
      Fax: +41-22-372 61 98


        2.6.2  Changes to the WWW ExPASy server

   There have been a number of changes to the server in the last three
   months. We want to list specifically the following enhancements:

   -  A direct  entry point to PROSITE has been implemented. It is possible
      to search in PROSITE by description (title of the entry), entry name,
      accession number, author name and by performing a full text search.
   -  It is  now possible to retrieve either the EMBL or GenBank version of
      a cross-referenced nucleotide sequence entry.
   -  Active cross-references  are now  provided to  GCRDb and MaizeDB (see
      section 2.4 above).
   -  New SWISS-PROT  documents such  as RESTRIC.TXT  or  YEAST11.TXT  (see
      section 2.5 above) are available as hypertext documents.


   2.7  Weekly updates of SWISS-PROT

   Since release 24, we have provided weekly updates of SWISS-PROT.

   The weekly  updates are  available by  anonymous FTP.  Three  files  are
   updated at each update:

   new_seq.dat    Contains all the new entries since the last full release.
   upd_seq.dat    Contains the entries for which the sequence data has been
                  updated since the last release.
   upd_ann.dat    Contains the  entries for  which one  or more  annotation
                  fields have been updated since the last release.

   Currently these  files are  available on  the  following  anonymous  ftp
   servers:

   Organization   ExPASy (Geneva University Expert Protein Analysis System)
   Address        expasy.hcuge.ch  (or 129.195.254.61)
   Directory      /databases/swiss-prot/updates

   Organization   National Center for Biotechnology Information (NCBI)
   Address        ncbi.nlm.nih.gov (or 130.14.20.1)
   Directory      /repository/swiss-prot/updates

   Organization   EMBL ftp server
   Address        ftp.embl-heidelberg.de (or 192.54.41.33)
   Directory      /pub/databases/swissprot/new
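
   For readers who want to script the retrieval of the update files, the
   following Python sketch builds the FTP locations from the list above
   and shows how an anonymous download could be done with the standard
   ftplib module. The host names and directories are copied from this
   document and date from 1994, so they may no longer be reachable; the
   names UPDATE_SERVERS, update_urls and fetch_updates are hypothetical.

```python
from ftplib import FTP
from typing import List

# Hosts and directories copied from the list above; they date from 1994
# and may no longer be reachable.
UPDATE_SERVERS = {
    "ExPASy": ("expasy.hcuge.ch", "/databases/swiss-prot/updates"),
    "NCBI": ("ncbi.nlm.nih.gov", "/repository/swiss-prot/updates"),
    "EMBL": ("ftp.embl-heidelberg.de", "/pub/databases/swissprot/new"),
}
UPDATE_FILES = ["new_seq.dat", "upd_seq.dat", "upd_ann.dat"]


def update_urls(server: str) -> List[str]:
    """Return the FTP URLs of the three update files on one server."""
    host, directory = UPDATE_SERVERS[server]
    return [f"ftp://{host}{directory}/{name}" for name in UPDATE_FILES]


def fetch_updates(server: str, timeout: float = 30.0) -> None:
    """Download the three update files into the current directory."""
    host, directory = UPDATE_SERVERS[server]
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()  # no arguments means anonymous login
        ftp.cwd(directory)
        for name in UPDATE_FILES:
            with open(name, "wb") as handle:
                ftp.retrbinary(f"RETR {name}", handle.write)


# Only the URLs are printed here; fetch_updates() is not called.
for url in update_urls("ExPASy"):
    print(url)
```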

   !!! Important notes !!!

   Although we try to follow a regular schedule, we do not promise to
   update these files every week. In some cases two weeks will elapse
   between two updates.



   Due to the current mechanism used to build a release, the entries that
   are provided in these updates are not guaranteed to be error free. Also,
   for the same reason, new entries do not contain an OC (Organism
   Classification) line.


                            3. ENZYME AND PROSITE

   3.1  The ENZYME data bank

   Release 16.0 of the ENZYME data bank is distributed with release 29 of
   SWISS-PROT. ENZYME release 16.0 contains information relative to 3546
   enzymes. For the first time we have integrated information directly sent
   to us by the Enzyme nomenclature subcommittee of the NCB-IUBMB.


   3.2  The PROSITE data bank

        3.2.1  Statistics for release 12

   Release 12.0 of the PROSITE data bank is distributed with release 29 of
   SWISS-PROT. Release 12 contains 785 documentation chapters that
   describe 1029 different patterns, rules and profiles/matrices. Since
   the last major release of PROSITE (release 11.0 of October 1993), 71 new
   chapters have been added and 338 chapters have been updated.

   Out of a total of 38303 entries in SWISS-PROT, 18786 are
   cross-referenced in PROSITE (excluding the false positives). This
   accounts for about 49% of the sequences in SWISS-PROT.
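
   The coverage percentage can be recomputed from the two counts given in
   this paragraph. A minimal Python sketch (variable names are ours):

```python
total_entries = 38303       # entries in SWISS-PROT release 29.0
cross_referenced = 18786    # entries cross-referenced in PROSITE

coverage = cross_referenced / total_entries * 100.0
print(f"{coverage:.2f}% of SWISS-PROT entries are cross-referenced in PROSITE")
```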

   The next  release of  PROSITE (12.1) will be distributed with release 30
   of SWISS-PROT.


        3.2.2  List of the new entries in release 12

      Ly-6 / u-PAR domain signature
      Nuclear transition protein 2 signatures
      Ribosomal protein L19 signature
      Ribosomal protein L20 signature
      Ribosomal protein L35 signature
      Ribosomal protein L1e signature
      Ribosomal protein S2 signatures
      Ribosomal protein S7e signature
      Ribosomal protein S21e signature
      Ribosomal protein S28e signature
      DnaA protein signature
      NAD-dependent glycerol-3-phosphate dehydrogenase signature
      FAD-dependent glycerol-3-phosphate dehydrogenase signatures
      Mannitol dehydrogenases signature
      Coproporphyrinogen III oxidase signature
      Bacterial-type phytoene dehydrogenase signature
      Ergosterol biosynthesis ERG4/ERG24 family signatures
      Transaldolase active site




<PAGE>


      Myristoyl-CoA:protein N-myristoyltransferase signatures
      PTS EIIB domains cysteine phosphorylation site signature
      Eukaryotic RNA polymerases 15 Kd subunits signature
      Protein phosphatase 2A regulatory subunit PR55 signatures
      Protein phosphatase 2C signature
      Glycosyl hydrolases family 16 signature
      Glycosyl hydrolases family 25 active sites signature
      Glycosyl hydrolases family 39 putative active site
      Ubiquitin carboxyl-terminal hydrolases family 2 signatures
      Glycoprotease family signature
      Dehydroquinase class I active site
      Dehydroquinase class II signature
      Imidazoleglycerol-phosphate dehydratase signatures
      Cysteine synthase/cystathionine beta-synthase P-phosphate
      Glyoxalase I signatures
      6-pyruvoyl tetrahydropterin synthase signatures
      Phosphomannose isomerase type I signatures
      Folylpolyglutamate synthase signatures
      Transposases, Mutator family, signature
      OHHL biosynthesis luxI family signature
      Succinate dehydrogenase cytochrome b subunit signatures
      Globins profile
      PTR2 family proton/oligopeptide symporters signatures
      glpT family of transporters signature
      Bacterial formate and nitrite transporters signatures
      Fungal hydrophobins signature
      G-protein coupled receptors family 3 signatures
      Antenna complexes alpha and beta subunits signatures
      Photosystem I psaG and psaK proteins signature
      ER lumen protein retaining receptor signatures
      Neuromedin U signature
      Urotensin II signature
      Neutrophil bactenecins signatures
      Gamma-thionins family signature
      Streptomyces subtilisin-type inhibitors signature
      Heat shock hsp20 proteins family profile
      Bacterial export FHIPEP family signature
      Cytochrome c oxidase assembly factor COX10/ctaB/cyoE signature
      Cyclin-dependent kinases regulatory subunits signatures
      ADP-ribosylation factors family signature
      SAR1 family signature
      Initiation factor 3 signature
      Transcription termination factor nusG signature
      BTG1 family signature
      G10 protein signatures
      Clathrin adaptor complexes medium chain signatures
      Clathrin adaptor complexes small chain signature
      Extracellular proteins SCP/Tpx-1/Ag5/PR-1/Sc7 signatures
      Oxysterol-binding protein family signature
      Serum amyloid A proteins signature
      Spermadhesins family signatures
      Syndecans signature
      Translationally controlled tumor protein signatures


        3.2.3  Status of profiles in PROSITE

   There are a number of protein families as well as functional or
   structural domains that cannot be detected using patterns due to their
   extreme sequence divergence. Typical examples of important functional
   domains which are weakly conserved are the immunoglobulin domains, the
   SH2 and SH3 domains, or the fibronectin type III domain. In such domains
   there are only a few sequence positions which are well conserved. Any
   attempt to build a consensus pattern for such regions will either fail
   to pick up a significant proportion of the protein sequences that
   contain such a region (false negatives) or will pick up too many
   proteins that do not contain the region (false positives). The use of
   techniques based on weight matrices or profiles allows the detection of
   such proteins or domains. Philipp Bucher and Kay Hofmann at ISREC in
   Lausanne and I are collaborating to include such methods in PROSITE.

   This is the first release of PROSITE to include weight matrices (also
   known as profiles). In this release only two profile entries are
   available (for the hsp20 family of small chaperones and for globins). We
   plan to add many new profiles for the next major release (release 13) as
   well as to replace some of the existing pattern entries by profiles.

   None of the many academic or commercial programs that have been
   developed to scan PROSITE can currently make use of the profile entries.
   We are therefore distributing, with PROSITE, the source code (C
   language) of two programs that should help software developers to
   implement profile-specific routines in their application(s):

   scan4prf   Loads a sequence from a file and scans it with all (or one)
              of the PROSITE profiles.

   srch4prf   Loads a profile from a file and scans for that profile in a
              SWISS-PROT data base file.

   These programs will soon be available in the respective /prosite
   directory of the NCBI and ExPASy anonymous FTP servers (for the
   addresses, see section 2.7).

   Important notice for software developers: the integration of profiles
   into PROSITE did not "break" the current format. The profile entries in
   the PROFILE.DAT file are tagged with the token "MATRIX" on the "ID"
   line; a new line-type "MA" is used in these entries to store all the
   parameters specific to weight matrices. The full description of the
   format of the "MA" line-type is available as a user's manual (file name:
   PROFILE.TXT) that is part of the PROSITE distribution files. The format
   of the PROSITE.DOC file has not been changed.
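
   To illustrate how a program might recognize profile entries, the
   following Python sketch selects the entries whose ID line carries the
   "MATRIX" token and collects their MA lines. Several points are
   assumptions not stated in these notes: that entries end with a '//'
   line, that the ID line has the form 'ID   NAME; TYPE.', and the sample
   data itself, which is invented (the real MA syntax is described in
   PROFILE.TXT). The function names are hypothetical.

```python
from typing import Dict, List


def split_entries(text: str) -> List[List[str]]:
    """Split a flat file into entries; '//' is assumed to end each entry."""
    entries: List[List[str]] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.startswith("//"):
            if current:
                entries.append(current)
            current = []
        elif line.strip():
            current.append(line)
    if current:
        entries.append(current)
    return entries


def profile_matrices(text: str) -> Dict[str, List[str]]:
    """Map each MATRIX entry name to the contents of its MA lines."""
    profiles: Dict[str, List[str]] = {}
    for entry in split_entries(text):
        id_line = next((line for line in entry if line.startswith("ID ")), "")
        fields = [field.strip(" .") for field in id_line[2:].split(";")]
        if len(fields) < 2 or fields[1] != "MATRIX":
            continue  # not a profile entry
        profiles[fields[0]] = [
            line[2:].strip() for line in entry if line.startswith("MA ")
        ]
    return profiles


# Hypothetical two-entry sample; names and MA contents are invented.
SAMPLE = """\
ID   EXAMPLE_PATTERN; PATTERN.
PA   A-x-C.
//
ID   EXAMPLE_PROFILE; MATRIX.
MA   /PLACEHOLDER: parameters-go-here;
MA   /PLACEHOLDER: more-parameters;
//
"""

print(profile_matrices(SAMPLE))
```

   A program that only understands patterns can use the same test in
   reverse and skip every entry tagged "MATRIX".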



        3.2.4  Author index file

   Starting with  this release, we distribute a file that contains an index
   of the authors (and on-line experts) referenced in the PROSITE.DOC file.
   The name of this file is 'PAUTINDX.TXT'.




                             WE NEED YOUR HELP !

   We welcome feedback from our users. We would especially appreciate it if
   you would notify us when you find that sequences belonging to your field
   of expertise are missing from the data bank. We would also like to be
   notified about annotations to be updated, if, for example, the function
   of a protein has been clarified or if new post-translational information
   has become available.



                         APPENDIX A: SOME STATISTICS



   A.1  Amino acid composition


        A.1.1  Composition in percent for the complete data bank

   Ala (A) 7.60   Gln (Q) 4.03   Leu (L) 9.21   Ser (S) 7.15
   Arg (R) 5.23   Glu (E) 6.27   Lys (K) 5.83   Thr (T) 5.82
   Asn (N) 4.47   Gly (G) 6.97   Met (M) 2.36   Trp (W) 1.29
   Asp (D) 5.28   His (H) 2.25   Phe (F) 4.00   Tyr (Y) 3.21
   Cys (C) 1.78   Ile (I) 5.58   Pro (P) 5.02   Val (V) 6.52

   Asx (B) 0.005  Glx (Z) 0.005  Xaa (X) 0.02


        A.1.2  Classification of the amino acids by their frequency

   Leu, Ala, Ser, Gly, Val, Glu, Lys, Thr, Ile, Asp, Arg, Pro, Asn, Gln,
   Phe, Tyr, Met, His, Cys, Trp
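
   The ranking in A.1.2 follows directly from the percentages in A.1.1.
   The following Python sketch (the dictionary name is ours) sorts the 20
   standard amino acids by decreasing frequency and reproduces the order
   given above.

```python
# Percentages copied from table A.1.1 (the 20 standard amino acids only).
COMPOSITION = {
    "Ala": 7.60, "Gln": 4.03, "Leu": 9.21, "Ser": 7.15,
    "Arg": 5.23, "Glu": 6.27, "Lys": 5.83, "Thr": 5.82,
    "Asn": 4.47, "Gly": 6.97, "Met": 2.36, "Trp": 1.29,
    "Asp": 5.28, "His": 2.25, "Phe": 4.00, "Tyr": 3.21,
    "Cys": 1.78, "Ile": 5.58, "Pro": 5.02, "Val": 6.52,
}

# Sort the three-letter codes by decreasing percentage.
ranking = sorted(COMPOSITION, key=COMPOSITION.get, reverse=True)
print(", ".join(ranking))
```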



   A.2  Distribution of the sequences by their organism of origin

   Total number of species represented in this release of SWISS-PROT: 4471


        A.2.1 Table of the frequency of occurrence of species


        Species represented 1x: 2010
                            2x:  735
                            3x:  402
                            4x:  255
                            5x:  182
                            6x:  196
                            7x:  102
                            8x:   79
                            9x:   88
                           10x:   45
                       11- 20x:  176
                       21- 50x:  121
                       51-100x:   37
                         >100x:   43


        A.2.2  Table of the most represented species

    Number   Frequency          Species
         1        2862          Human
         2        2674          Escherichia coli
         3        1951          Baker's yeast (Saccharomyces cerevisiae)
         4        1697          Mouse
         5        1565          Rat
         6         710          Bovine
         7         679          Caenorhabditis elegans
         8         633          Fruit fly (Drosophila melanogaster)
         9         563          Bacillus subtilis
        10         542          Chicken
        11         410          African clawed frog (Xenopus laevis)
        12         394          Salmonella typhimurium
        13         387          Rabbit
        14         339          Pig
        15         251          Vaccinia virus (strain Copenhagen)
        16         239          Maize
        17         229          Arabidopsis thaliana (Mouse-ear cress)
        18         221          Fission yeast (Schizosaccharomyces pombe)
        19         200          Bacteriophage T4
        20         198          Slime mold (Dictyostelium discoideum)
        21         197          Rice
        22         193          Human cytomegalovirus (strain AD169)
        23         185          Pseudomonas aeruginosa
        24         183          Vaccinia virus (strain WR)
        25         180          Tobacco
        26         174          Pea
        27         168          Wheat
        28         158          Barley
        29         146          Variola virus
        30         142          Dog
        31         139          Sheep
        32         137          Soybean
        33         134          Staphylococcus aureus
        34         131          Spinach
        35         127          Pseudomonas putida
        36         124          Neurospora crassa
        37         122          Marchantia polymorpha (Liverwort)
        38         121          Rhodobacter capsulatus
        39         119          Klebsiella pneumoniae
        40         111          Agrobacterium tumefaciens
        41         108          Bacillus stearothermophilus
        42         104          Tomato
        43         101          Rhizobium meliloti










<PAGE>




   A.3  Repartition of the sequences by size

               From   To  Number             From   To   Number
                  1-  50    2159             1001-1100      363
                 51- 100    3748             1101-1200      257
                101- 150    5182             1201-1300      196
                151- 200    3711             1301-1400      115
                201- 250    3240             1401-1500      113
                251- 300    2860             1501-1600       54
                301- 350    2687             1601-1700       55
                351- 400    2771             1701-1800       45
                401- 450    2065             1801-1900       51
                451- 500    2192             1901-2000       37
                501- 550    1560             2001-2100       19
                551- 600    1089             2101-2200       48
                601- 650     776             2201-2300       53
                651- 700     580             2301-2400       19
                701- 750     558             2401-2500       25
                751- 800     425             >2500          119
                801- 850     328
                851- 900     353
                901- 950     229
                951-1000     221
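
   The size classes used above (50-residue classes up to 1000 residues,
   100-residue classes from 1001 to 2500, and a single open class beyond
   that) can be reproduced with a short script. The following is only a
   minimal sketch: the function names size_class and size_distribution are
   hypothetical, and it is not the program used to produce these release
   notes. It assumes the sequence lengths have already been extracted
   from the entries.

```python
from collections import Counter


def size_class(length):
    """Return the A.3 size class label for a sequence length in residues.

    Hypothetical helper: the class boundaries are taken from the table
    above, not from any SWISS-PROT software.
    """
    if length < 1:
        raise ValueError("sequence length must be at least 1")
    if length > 2500:
        return ">2500"
    # 50-residue classes up to 1000, 100-residue classes from 1001 to 2500.
    width = 50 if length <= 1000 else 100
    low = (length - 1) // width * width + 1
    return f"{low}-{low + width - 1}"


def size_distribution(lengths):
    """Count how many sequence lengths fall into each size class."""
    return Counter(size_class(n) for n in lengths)


if __name__ == "__main__":
    # Illustrative lengths only; the first two are from the list of the
    # longest sequences, the others are made up.
    counts = size_distribution([5217, 5147, 120, 50, 1001])
    for label, number in counts.items():
        print(label, number)
```

   As a consistency check, the counts in the table add up to 38303, the
   number of sequence entries in this release.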


   Currently the ten longest sequences are:

                            HTS1_COCCA  5217 a.a.
                             FAT_DROME  5147 a.a.
                            RYNR_RABIT  5037 a.a.
                            RYNR_HUMAN  5032 a.a.
                            RYNC_RABIT  4969 a.a.
                            DYHC_DICDI  4725 a.a.
                             APB_HUMAN  4563 a.a.
                            APOA_HUMAN  4548 a.a.
                            RRPA_CVMJH  4488 a.a.
                            DYHC_TRIGR  4466 a.a.



















<PAGE>



           APPENDIX B: RELATIONSHIPS BETWEEN BIOMOLECULAR DATABASES

   The current  status of the relationships (cross-references) between some
   biomolecular databases is shown in the following schematic:

                                                       **********************
                        ***********************        * EPD [Euk. Promot.] *
                        *  EMBL Nucleotide    * <----> **********************
                        *  Sequence Data      *
******************      *  Library            *        **********************
* FLYBASE        * <--> *********************** <----- * ECD [E. coli map]  *
* [Drosophila    *                ^   ^  ^             **********************
* genomic d.b.]  * <---------+    |   |  |
******************           |    |   |  +------------ **********************
                             |    |   |                * TFD [Trans. fact.] *
                             |    |   |                **********************
******************           |    |   |
* MaizeDb        * <------+  |    |   |                **********************
******************        |  |    |   +--------------> * GCRDb [7TM recep.] *
                          |  |    |   |                **********************
******************        |  |    |   |     
* WormPep        *        |  |    |   |                **********************
* [C.elegans]    * <----+ |  |    |   |       +------> * DictyDB [D.disco.] *
******************      | |  |    |   |       |        **********************
                        | |  |    |   |       |
******************      | v  v    v   v       v        **********************
* REBASE         *      ***********************        * ENZYME [Nomencl.]  *
* [Restriction   * <--- *  SWISS-PROT         * <----- **********************
*  enzymes]      *      *  Protein Sequence   *            |
******************      *  Data Bank          *            v
                        ***********************        **********************
******************       ^  ^  |  |  ^   ^  |          * OMIM   [Diseases]  *
* EcoGene/EcoSeq *       |  |  |  |  |   |  +--------> **********************
* [E. coli]      * <-----+  |  |  |  |   |
******************          |  |  |  |   +-----------> **********************
                            |  |  |  |                 * ECO2DBASE     [2D] *
                            |  |  |  |                 **********************
******************          |  |  |  |
* PROSITE        * <--------+  |  |  +---------------> **********************
* [Patterns]     *             |  |                    * SWISS-2DPAGE  [2D] *
******************             |  +---------------+    **********************
             |                 v                  |
             |          ***********************   |    **********************
             +--------> * PDB [3D structures] *   +--> * Aarhus/Ghent  [2D] *
                        ***********************        **********************












<PAGE>
