Welcome to Allermatch.org™

Allermatch™ is a unique website where you can compare the amino acid sequence of a protein of interest with sequences of allergenic proteins. The website automatically carries out the procedures for predicting the potential allergenicity of proteins by bioinformatics approaches, as recommended by the Codex Alimentarius and the FAO/WHO Expert Consultation on allergenicity of foods derived through modern biotechnology [1,2]. The Allermatch™ website lets the user enter the input sequence and retrieve, with a few mouse-clicks, the outcomes of interest in an accurate, concise, and comprehensible format.

Important: read the disclaimer below before using this website

All sequences submitted will be treated confidentially

Contact Dr. Gijs Kleter for more details. Batch analysis of multiple sequences is also possible

Go to the search page immediately


Disclaimer

This public, non-commercial website has been constructed solely for the purpose of investigational use by scientific researchers, regulatory professionals, and other interested parties. It is not meant to be used for, among other things, commercial purposes or the automatic generation of commands for this website's facilities by other computers. While the Allermatch™ website aims to provide data that are as accurate as possible according to the current state of knowledge, website users are strongly recommended to check the data and outcomes that they retrieve from this website against original scientific literature and other sources of information. No rights or damages can be claimed based on data provided and/or generated by this website and its facilities, including, but not limited to, information, search facilities, and outcomes of search actions carried out with the aid of the data and search facilities made available to website users. The sequences of proteins stored in the website's database have been obtained from SwissProt accession files and are therefore subject to SwissProt's copyright (http://www.expasy.org/, see below). Website users should respect SwissProt's rights to the sequence information. Data obtained through this website may be published without the explicit written consent of the website owners, provided that the data characteristics are not modified and the website is acknowledged as the source of these data. The acknowledgement should include the website's name Allermatch™, its URL http://allermatch.org/, and the date of the website's last update at the time of use (see end of page below). Notwithstanding the previous statements, neither the authors' nor SwissProt's rights should be affected.

SwissProt copyright statement

Swiss-Prot is copyright. It is produced through a collaboration between the Swiss Institute of Bioinformatics and the EMBL Outstation - the European Bioinformatics Institute. There are no restrictions on its use by non-profit institutions as long as its content is in no way modified. Usage by and for commercial entities requires a license agreement. For information about the licensing scheme see: http://www.isb-sib.ch/announce/ or send an email to license@isb-sib.ch.

Acknowledgement

Financial support for this website from Plant Research International and the Dutch Ministry of Agriculture, Nature, and Food Quality (Research Programs 378, 390, and 404 [North-South]) is gratefully acknowledged.

About this website

In most nations, genetically engineered foods must be assessed for their safety before market approval is granted. An important issue in this safety assessment is the potential allergenicity of transgenic ("foreign") proteins that have been introduced into the food by genetic engineering. In other words, what is the chance that the foreign protein may cause allergic reactions after consumption of the genetically engineered food containing this protein?

Potential allergenicity is assessed in a step-by-step procedure described by the guidelines of the FAO/WHO Codex Alimentarius Commission for the safety assessment of foods derived from genetically engineered plants and micro-organisms [1]. One important step in this procedure is to determine, with the aid of computer programs, whether the primary structure (amino acid sequence) of the transgenic protein is similar to sequences of allergenic proteins, which are available from public protein sequence databases.

Two types of similarity are searched for:
  • Short identical stretches of 6-8 contiguous amino acids;
  • Larger stretches (80 amino acids long) containing a minimum of 35% (non-contiguous) identical amino acids.
The similar stretches that are identified this way may harbour potential binding sites (called epitopes) for IgE antibodies. IgE antibodies are allergy-related and involved in the binding of the allergen to mast cells, after which these cells release compounds, such as histamine, that cause the symptoms of allergy. Allergens must contain at least two IgE-binding epitopes to trigger a mast cell reaction.

To search for the two types of similarities, a recent Expert Consultation of the FAO/WHO, which was held in preparation for the Codex Alimentarius guidelines, devised the following procedure [2]:

6.1. Sequence Homology as Derived from Allergen Databases

The commonly used protein databases (PIR, SwissProt and TrEMBL) contain the amino acid sequences of most allergens for which this information is known. However, these databases are currently not fully up-to-date. A specialized allergen database is under construction.

Suggested procedure on how to determine the percent amino acid identity between the expressed protein and known allergens.

Step 1: obtain the amino acids sequences of all allergens in the protein databases (for SwissProt and TrEMBL: see http://expasy.ch/tools; for PIR see http://wwwnbrf.georgetown.edu/pirwww ) in FASTA-format (using the amino acids from the mature proteins only, disregarding the leader sequences, if any). Let this be data set (1).

Step 2: prepare a complete set of 80-amino acid length sequences derived from the expressed protein (again disregarding the leader sequence, if any). Let this be data set (2).

Step 3: go to EMBL internet address: http://www2.ebi.ac.uk and compare each of the sequences of the data set (2) with all sequences of data set (1), using the FASTA program on the web site for alignment with the default settings for gap penalty and width.

Cross-reactivity between the expressed protein and a known allergen (as can be found in the protein databases) has to be considered when there is:

1) more than 35 % identity in the amino acid sequence of the expressed protein (i.e. without the leader sequence, if any), using a window of 80 amino acids and a suitable gap penalty (using Clustal-type alignment programs or equivalent alignment programs)

or:

2) identity of 6 contiguous amino acids.

If any of the identity scores equals or exceeds 35 %, this is considered to indicate significant homology within the context of this assessment approach. The use of amino acid sequence homologies to identify prospective cross-reacting allergens in genetically modified foods has been discussed in more detail elsewhere (Gendel, 1998a; Gendel, 1998b).

The search facility on the Allermatch™ website automatically carries out the procedure recommended by the guidelines on protein sequences that are entered by the website visitor in FASTA format (one-letter code without residue numbers, see example sequence below). The visitor has the option to select the following outputs of interest:

  1. Alignment of 80-amino acids subsequences of the input sequence using a sliding window of 80-amino acids size. The step size is 1 amino acid, such that from a sequence of 100 amino acids, for example, 21 subsequences of 80 amino acids length are made (1-80, 2-81, 3-82 ... 20-99, 21-100). Each of these subsequences is aligned to database sequences. The FASTA computer algorithm is used for these sequence alignments, as recommended (see above; default values are used). With FASTA, "head to tail" alignments (from the start to the end of a sequence) are made of each subsequence with each database sequence. The default threshold for the number of identical amino acids is 35% in the alignment with an 80-amino acids window, which is considered a significant level of identity between the input sequence and the allergenic protein's sequence (see recommendations cited above). The identity presented by the website in the results of the alignments is therefore the % identical amino acids in the 80-amino acids window. The default threshold can be changed by the user. Input sequences shorter than 80 amino acids should not be aligned using this option.
  2. Full alignment of the whole input sequence with database sequences using the FASTA algorithm. This option can be used, for example, for input sequences shorter than 80 amino acids, for which the 80-amino acids sliding window option (see above) cannot be used. In addition, where an input sequence shows sufficient identity with many proteins over its entire length, this option may provide a good overview of the alignments between the input and database sequences.
  3. Exact hits of short identical stretches of, for example, 6 amino acids. To this end, a wordmatch algorithm is used, which searches for identical matches of a specified number of contiguous amino acids ("wordlength") between the input sequence and a given database sequence. The default value for the wordlength, which can be changed by the user, is 6 amino acids. Decreasing the wordlength is likely to result in a larger number of positive scores, while increasing it may yield fewer positive results.
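The sliding-window step of option 1 can be sketched in Python, the language in which the site's software is written. This is a minimal illustration and not the website's actual code; the function name sliding_windows and its parameters are assumptions made for this example. It only generates the subsequences; aligning each of them with FASTA is a separate step that is not shown.

```python
def sliding_windows(sequence, size=80, step=1):
    """Return (start, end, subsequence) tuples with 1-based, inclusive positions.

    Hypothetical helper for illustration; sequences shorter than `size`
    yield an empty list, matching the advice not to use this option for them.
    """
    return [
        (i + 1, i + size, sequence[i:i + size])
        for i in range(0, len(sequence) - size + 1, step)
    ]


# A placeholder sequence of 100 residues, as in the example in the text.
windows = sliding_windows("A" * 100)
print(len(windows))      # 21
print(windows[0][:2])    # (1, 80)
print(windows[-1][:2])   # (21, 100)
```

In general, a sequence of N amino acids (N at least 80) gives N - 80 + 1 windows when the step size is 1.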

The entered sequences will be compared to the sequences of allergenic proteins compiled in the database. These sequences have been extracted from the SwissProt list of allergens (http://www.expasy.org/cgi-bin/lists?allergen.txt; update November 14, 2003). (Putative) signal-, pro-, and transit-peptides, whose positions are indicated by SwissProt as "features", have been removed from these sequences, which yields the sequences of "mature" proteins. The total number of mature protein sequences in the database is currently 315.

Positive results of the procedure will be provided to the user. To our knowledge, the automatic feature of this website for comparing the sequence of a protein of interest to sequences of allergenic proteins following the recommended procedure is unique on the Internet. The details of the various options for sequence alignments are discussed in the following sections.

Example of matching an input sequence

(print this page so it can be consulted during the subsequent steps described below)

Input sequence
The following sequence is that of the mature protein of the allergen Zea m 14 from maize pollen. As
may be noticed, this sequence contains one-letter codes for each amino acid, while the complete
sequence is made up of 93 letters or amino acids:

aiscgqvasaiapcisyargqgsgpsagccsgvrslnnaarttadrraacnclknaaagvsglnagnaasipskcgvsipytiststdcsrvn

While the original protein sequence in the SwissProt database entry P19656 consisted of 120 amino acids, removal of the signal peptide comprising the first 27 amino acids has yielded this mature protein sequence containing 93 amino acids.
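The trimming described above can be illustrated with a short Python sketch. The helper name remove_leader is an assumption, and the 27-residue leader used here is a placeholder of "x" characters, not the real signal peptide of P19656, which is not reproduced on this page. Only the lengths (120, 27, and 93) are taken from the text.

```python
def remove_leader(precursor, leader_end):
    """Drop residues 1..leader_end (1-based, inclusive) and return the rest.

    Hypothetical helper; in practice leader_end would come from the
    signal peptide feature annotated in the SwissProt entry.
    """
    return precursor[leader_end:]


# Mature Zea m 14 sequence as given on this page (93 residues).
MATURE = (
    "aiscgqvasaiapcisyargqgsgpsagccsgvrslnnaarttadrraac"
    "nclknaaagvsglnagnaasipskcgvsipytiststdcsrvn"
)

# Placeholder leader of 27 residues; NOT the actual signal peptide.
precursor = "x" * 27 + MATURE

mature = remove_leader(precursor, 27)
print(len(precursor), len(mature))   # 120 93
```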

If users enter their own input sequences, numbers in this sequence should be removed, whereas
spaces and paragraph or line returns need not be removed. In addition, three-letter codes for amino
acids, such as IleSerCys... (first 3 residues of Zea m 14) should be changed into one-letter codes, for
example by using web-based conversion tools (for example, "Three-to-One",
http://www.ualberta.ca/~stothard/javascript/threeToOne.html).
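For users who prefer to prepare sequences locally, the clean-up described above might look like the following Python sketch. It is not the code behind the website or the "Three-to-One" tool; the function names are assumptions, and the mapping covers only the 20 standard amino acids. Unknown three-letter codes raise a KeyError, and trailing letters that do not form a complete triplet are ignored.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def three_to_one(text):
    """Convert e.g. 'IleSerCys' or 'Ile Ser Cys' to 'ISC' (hypothetical helper)."""
    letters = re.sub(r"[^A-Za-z]", "", text)       # drop digits, spaces, dashes
    codes = re.findall(r"[A-Za-z]{3}", letters)    # split into triplets
    return "".join(THREE_TO_ONE[code.capitalize()] for code in codes)


def strip_numbering(text):
    """Remove residue numbers and whitespace from a one-letter sequence."""
    return re.sub(r"[\d\s]", "", text)


print(three_to_one("IleSerCys"))                      # ISC
print(strip_numbering("1 aiscgqvasa 11 iapcis\n"))    # aiscgqvasaiapcis
```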

Entering an input sequence and selecting the alignment of interest
Enter the input sequence, by typing or copy-pasting it, in the searchbox (below "Copy Paste your
amino acid sequence here") of the Allermatch™ main page. With the cursor, select one of the following
options:
  • "Do an 80 amino acids sliding window alignment"
  • "Look for a small exact wordmatch"
  • "Do a full fasta alignment"
If the 80-amino acids sliding window has been chosen, the default threshold value of 35% identity may be modified by the user in the box next to "Cutoff Percentage (only applicable to the 80 amino acids sliding window)". The threshold is the lower limit for alignments that will be displayed in the following steps (alignments scoring below the threshold will therefore not be displayed).

If the option for a small exact wordmatch has been chosen, the default value 6 for the wordlength can be modified by the user in the box next to "Wordlength (only applicable to the exact wordmatch search)". The wordlength is the minimal number of amino acids in an exact match.

After having selected the options and thresholds (if applicable) of interest, click the "Go" button. The results will appear in the new page that is created in the same window on the user's screen. The various outcomes are discussed below for each of the specific options.

80 amino acids sliding window

Summary table

The new page that appears after starting the 80-amino acids sliding window alignment on the input sequence provides a table with a summary of the "hits", which are alignments scoring above the cutoff value. Each specific allergenic protein whose database sequence scored hits is presented in a new line, while data on this allergenic protein and the alignment are presented under the following column headings:
  • "Hit No", the rank of the allergenic protein by its best hit (see third column), such as 1, 2, or 3; the protein with the highest best hit ranks 1.
  • "Allergen id", the Allermatch™ identifier for the allergenic protein whose sequence is stored in the database
  • "Best hit (identity)", the highest number of identical amino acids in the hits, expressed as a percentage of 80 or more amino acids, for example 30% for 24 identical amino acids
  • "No of hits ident>....", the number of 80-amino acids subsequences (windows) of the input sequence that showed hits above the cut-off value with the database sequence of the allergenic protein 
  • "% of hits ident>.... ", the fraction (percentage, %) of the total number of analysed subsequences (windows) of the input sequence that showed hits above the cutoff value with the allergenic protein
  • "Full identity", identical amino acids in the FASTA alignment of the complete input sequence against the database sequence of the allergenic protein. The first number is the percentage of identical amino acids as part of the total length of the alignment, while the second number is the total length of this alignment expressed as number of amino acids (including non-identical amino acids).
  • "Swissprot", the SwissProt accession number, which is clickable and provides a link to the original accession file for the database sequence on the SwissProt website (the user's browser will exit the Allermatch™ website)
  • "Species name", Latin name of the organism from which the allergenic protein is derived
  • "Detailed information", the clickable "Go" button links to a page with specific details on the database sequence of the allergenic protein, as well as the complete FASTA alignment and the subsequences (windows) of the input sequence aligning to the database sequence. After having clicked on the "Go" button, a new page will appear in the same window on the user's screen.
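How the per-protein columns of the summary table relate to the individual window alignments can be sketched as follows. This is an illustration only, not the website's code: the function name summarize_windows is an assumption, the per-window identity values in the example are hypothetical, and a window is counted as a hit when its identity is equal to or above the cutoff.

```python
def summarize_windows(window_identities, cutoff=35.0):
    """Summarize per-window identity percentages for one allergen.

    Hypothetical helper: window_identities holds one percentage per
    80-amino acid window of the input sequence.
    """
    hits = [p for p in window_identities if p >= cutoff]
    n_windows = len(window_identities)
    return {
        "best_hit": max(window_identities, default=0.0),
        "n_hits": len(hits),
        "pct_hits": round(100.0 * len(hits) / n_windows, 2) if n_windows else 0.0,
    }


# Hypothetical identities for 14 windows, 5 of which reach the 35% cutoff.
identities = [36.59, 36.25, 35.0, 35.0, 36.25] + [33.75] * 9
print(summarize_windows(identities))
# {'best_hit': 36.59, 'n_hits': 5, 'pct_hits': 35.71}
```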

Detailed information

This page provides the following information:
  • The input sequence (amino acid sequence).
  • Details on the database sequence of the allergenic protein, including allergen name, species name, Swissprot accession, remarks (for example, signal-, pro-, or transit- peptides that have been removed from the sequence) and amino acid sequence.
  • The complete amino acid sequences of the input- and database- sequences are shown in this page. Below each one-letter code for amino acid residues in both of these sequences, a "#"-marking may be displayed. The residues marked with "#" were aligned with residues in the other sequence (database or input, respectively) in the 80-amino acids window alignments that had 35% or more identical amino acids in the window. Please note that these "#" markings also include nonidentical residues in both the input- and database- sequences that were aligned to each other. The 35% cut-off value is fixed for these "#" markings and cannot be changed by the user.
  • The full alignment between the complete input sequence (no 80-amino acids windows) and the allergenic protein.
By clicking the "Show all alignments" button in the upper right corner, all the separate hits, i.e. alignments of those 80-amino acid subsequences (windows) of the input sequence that scored equal to or above the cutoff value of 35% (fixed value, cannot be changed by the user), can be viewed. The new page that appears in the same window on the user's screen contains the same information as the previous page, in addition to the separate hits. After clicking "Hide all alignments", the previous page re-appears.

Example

For the input sequence Zea m 14, for example, the summary table lists 10 database sequences of allergenic proteins that score hits if the cutoff value equals 35%. Since the Zea m 14 sequence contains 93 amino acids, 14 subsequences (windows) of 80 amino acids have been generated (1-80, 2-81, ..., 13-92, 14-93). The highest-ranking database sequence in the table is Zea m 14 itself, because the same sequence is also stored in the Allermatch™ database. It shows a best hit of 100%, and all 14 windows of the input sequence scored hits, as expected. The lowest-ranking database sequence in the table is designated Par_j_2_a (Allermatch™ identifier), one of the two database sequences of the allergenic protein Par j 2 derived from weed pollen. The best hit for this sequence is 36.59% identity, while 5 of the 14 windows scored hits. The detailed information on the alignments with Par_j_2_a shows that a large part of both the input and database sequences is covered by the 80-amino acid sliding window alignments and the full alignment. Interestingly, all the sequences listed in the table are lipid transfer proteins, as mentioned in the original SwissProt accession files to which the table provides links.

Exact hits of small stretches of identical amino acids

Summary table

The new page that appears after starting the alignment of small identical stretches using WordMatch provides a table summarising the "hits", which are the exact matches whose length is equal to or above the wordlength, i.e. the minimal number of identical contiguous amino acids. Each of the database sequences of allergenic proteins that showed a hit with the input sequence is shown in a separate line of the table, while the data on the allergenic protein are shown under the following column headings:
  • "No", rank of the database sequence of the allergenic protein, while the sequence that scores the highest number of wordmatches ranks number 1.
  • "Allergen id", the Allermatch™ identifier for the allergenic protein whose sequence is stored in the database
  • "Number of exact wordmatches", the number of identical stretches of a given wordlength shared by the input- and database- sequences.
  • "% of exact wordmatches", the identical stretches of a given wordlength shared by the input- and database- sequences, expressed as percentage of the maximum number of stretches (nonidentical and identical) of the same wordlength that can be made from the input sequence.
  • "Swissprot", the SwissProt accession number, which is clickable and provides a link to the original accession file for the database sequence on the SwissProt website (the user's browser will exit the Allermatch™ website)
  • "Species name", Latin name of the organism from which the allergenic protein is derived
  • "Detailed information", after the "Go" button has been clicked on, a new page is created in the same window on the user's screen that contains information on the allergenic protein, and the hits of short identical stretches.
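The "% of exact wordmatches" column can be illustrated with a small calculation. The sketch below assumes that the maximum number of stretches that can be made from an input sequence of N amino acids is N - wordlength + 1, which is one reading of the description above; the function name is an assumption and this is not the website's code.

```python
def percent_exact_wordmatches(n_matches, input_length, wordlength=6):
    """Express n_matches as a percentage of all words in the input sequence.

    Assumes the input sequence yields input_length - wordlength + 1
    overlapping words of the given wordlength.
    """
    max_words = input_length - wordlength + 1
    if max_words <= 0:
        return 0.0
    return round(100.0 * n_matches / max_words, 2)


# A 93-residue input has 88 words of length 6; one shared word gives:
print(percent_exact_wordmatches(1, 93))    # 1.14
```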

Detailed information

This page provides the following information on the hits of the selected wordlength with a specific allergenic protein:

  • The input sequence (amino acid sequence)
  • Details on the database sequence of the allergenic protein, including allergen name, species name, Swissprot accession, remarks (for example, signal-, pro-, or transit- peptides that have been removed from the sequence) and amino acid sequence.
  • The complete amino acid sequences of the input- and database- sequences, while the "#"-symbols mark the residues within these sequences that are part of the exact hits with the wordlength of 6 amino acids (fixed wordlength, which does not change to the wordlength entered by the user).
  • Matches that are shorter than 6 amino acids may be found in the output of the full alignment (see below).

Example

For the Zea m 14 test sequence, the summary table mentions 7 database sequences of allergenic proteins, including Zea m 14 itself, if a wordlength of 6 is selected. Besides Zea m 14, the other 6 database sequences are allergenic proteins that are classified as lipid transfer proteins. The two last ranking database sequences are Pru av 3 and Pru ar 3 from cherry and apricot, respectively, each of which scored one hit. As can be inferred from the detailed information, the single identical stretch of 6 amino acids (acnclk) in Pru av 3 and Pru ar 3 is also present in the other 5 listed database sequences.
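A wordmatch search of this kind can be sketched in a few lines of Python. This is a minimal illustration of exact matching of contiguous stretches, not the WordMatch implementation used by the website; the function name exact_wordmatches is an assumption, and the target fragment "ggacnclkgg" is a made-up string that merely contains the stretch acnclk, not a real database sequence.

```python
def exact_wordmatches(query, target, wordlength=6):
    """Return (1-based start in query, word) for every word of the given
    length in `query` that also occurs somewhere in `target`."""
    target_words = {
        target[i:i + wordlength]
        for i in range(len(target) - wordlength + 1)
    }
    return [
        (i + 1, query[i:i + wordlength])
        for i in range(len(query) - wordlength + 1)
        if query[i:i + wordlength] in target_words
    ]


# Mature Zea m 14 sequence as given on this page.
ZEA_M_14 = (
    "aiscgqvasaiapcisyargqgsgpsagccsgvrslnnaarttadrraac"
    "nclknaaagvsglnagnaasipskcgvsipytiststdcsrvn"
)

# Hypothetical target fragment containing the shared stretch.
print(exact_wordmatches(ZEA_M_14, "ggacnclkgg"))   # [(49, 'acnclk')]
```

Lowering the wordlength in this sketch returns more matches and raising it returns fewer, in line with the behaviour described for the website's wordlength setting.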

Full alignment

The new page that appears after starting the full alignment contains the following information:
  • Bar diagram showing the number of hits for certain statistical scores (E, opt) of the FASTA alignments of the input sequence with the database sequences of allergenic proteins.
  • List of database sequences of allergenic proteins, ranked in descending order of best statistical scores for the alignment of these sequences with the input sequence.
  • Details of each specific alignment from the previous list, in the same order, showing the aligned sequences. The positions of identical residues and of similar residues (non-identical, but evolutionarily related) are indicated by two dots and one dot, respectively, between the aligned sequences.

Example

If Zea m 14 has been entered as input sequence, the highest scoring database sequences are the same as for the 80-amino acids sliding window alignment, i.e. lipid transfer proteins, in addition to the 3 database sequences of Par j 1, another lipid transfer protein.

About us

This website has been constructed through a joint effort of RIKILT - Institute of Food Safety and Plant Research International, both part of Wageningen University and Research Center in Wageningen, The Netherlands. Both partners also participate in the Allergy Consortium Wageningen (ACW; http://www.allergymatters.org).

RIKILT - Institute of Food Safety (http://www.rikilt.wur.nl) is specialised in food safety research, including the safety of genetically engineered foods and animal feed. For example, RIKILT develops advanced methods for the detection and safety testing of genetically engineered foods. In addition, RIKILT advises national and international authorities on the safety of genetically engineered foods and feed.

Participants for RIKILT in this project are Dr.ir. Gijs A. Kleter (gijs.kleter@wur.nl) and Dr. A.A.C.M. Peijnenburg (ad.peijnenburg@wur.nl). See also our publications on predicting potential IgE epitopes in novel proteins [3, 4].

Plant Research International (http://www.plant.wur.nl) carries out research in all fields of plant science, including plant biotechnology and genomics. It has a proven track record in the field of bioinformatics applied to genomics and proteomics of plants and other organisms, such as Arabidopsis thaliana and Lactobacillus plantarum, respectively.

Participants for Plant Research International in this project are Ir. M. Fiers (mark.fiers@wur.nl) and Mr. H. Nijland (herman.nijland@wur.nl)

Feedback

It is our goal to improve this website's facilities in future and to further extend it with a database of IgE epitopes as well as a module that predicts the antigenicity (likelihood of antibody binding) of identical stretches. Please send your comments and inquiries to gijs.kleter@wur.nl

References

  1. Codex Alimentarius Commission (2003) Codex Principles and Guidelines on Foods Derived from Biotechnology. Rome, Italy: Codex Alimentarius Commission, Joint FAO/WHO Food Standards Programme, Food and Agriculture Organization. ftp://ftp.fao.org/codex/standard/en/CodexTextsBiotechFoods.pdf

    This document recommends the following sequence comparisons between transgenic and allergenic proteins:

    8. The purpose of a sequence homology comparison is to assess the extent to which a newly expressed protein is similar in structure to a known allergen. This information may suggest whether that protein has an allergenic potential. Sequence homology searches comparing the structure of all newly expressed proteins with all known allergens should be done. Searches should be conducted using various algorithms such as FASTA or BLASTP to predict overall structural similarities. Strategies such as stepwise contiguous identical amino acid segment searches may also be performed for identifying sequences that may represent linear epitopes. The size of the contiguous amino acid search should be based on a scientifically justified rationale in order to minimize the potential for false negative or false positive results*. Validated search and evaluation procedures should be used in order to produce biologically meaningful results.

    9. IgE cross-reactivity between the newly expressed protein and a known allergen should be considered a possibility when there is more than 35% identity in a segment of 80 or more amino acids (FAO/WHO 2001) or other scientifically justified criteria. All the information resulting from the sequence homology comparison between the newly expressed protein and known allergens should be reported to allow a case-by-case scientifically based evaluation.

    * It is recognized that the 2001 FAO/WHO consultation suggested moving from 8 to 6 identical amino acid segments in searches. The smaller the peptide sequence used in the stepwise comparison, the greater the likelihood of identifying false positives, inversely, the larger the peptide sequence used, the greater the likelihood of false negatives, thereby reducing the utility of the comparison.

  2. FAO/WHO (2001) Joint FAO/WHO Expert Consultation on Foods Derived from Biotechnology - Allergenicity of Genetically Modified Foods - Rome, 22 - 25 January 2001. Rome: Food and Agriculture Organization of the United Nations. http://www.who.int/foodsafety/publications/biotech/en/ec_jan2001.pdf
  3. Kleter, G.A., Peijnenburg, A.A.C.M. (2002) Screening of transgenic proteins expressed in transgenic food crops for the presence of short amino acid sequences identical to potential, IgE-binding linear epitopes of allergens. BMC Structural Biology 2, 8. http://www.biomedcentral.com/1472-6807/2/8
  4. Kleter, G.A., Peijnenburg, A.A.C.M. (2003) Presence of potential allergy-related linear epitopes in novel proteins from conventional crops and the implication for the safety assessment of these crops with respect to the current testing of genetically modified crops. Plant Biotechnology Journal 1, 371-380.
(c) RIKILT-Institute of Food Safety and Plant Research International, 2004
Search algorithms installed and adapted by Ir. Mark Fiers (mark.fiers@wur.nl)
Website constructed by Mr. Herman Nijland (herman.nijland@wur.nl)
Allergen sequences compiled by Dr.ir. Gijs A. Kleter (gijs.kleter@wur.nl) and Dr. Ad A.C.M.
Peijnenburg (ad.peijnenburg@wur.nl)

The sequence search facility is provided by Applied Bioinformatics from Plant Research International. The software runs on a Sun V60 server running SUSE Linux. The software is written in Python and is served by an Apache webserver and mod_python.

January 29, 2004.

Questions: Dr. Gijs Kleter