Welcome to Allermatch.org™
Allermatch™ is a unique website where you can compare the amino acid
sequence of a protein of interest with sequences of allergenic
proteins. This website automatically carries out the procedures for
predicting the potential allergenicity of proteins by bioinformatics
approaches as recommended by the Codex Alimentarius and the FAO/WHO
Expert Consultation on allergenicity of foods derived through modern
biotechnology [1,2]. The Allermatch™ website lets the user enter an
input sequence quickly and easily and retrieve, with a few mouse
clicks, the outcomes of interest in an accurate, concise, and
comprehensible format.
Important: read the disclaimer below before using this website.
All sequences submitted will be treated confidentially.
Contact Dr. Gijs Kleter for more details. Batch analysis of multiple
sequences is also possible.
Disclaimer
This public, non-commercial website has been constructed solely for the
purpose of investigational use by scientific researchers, regulatory
professionals, and other interested parties. This website is not meant
to be used, among others, for commercial purposes or the automatic
generation of commands for this website's facilities by other
computers. While the Allermatch™ website aims at providing data that are
as accurate as possible to the current state of knowledge, website
users are strongly recommended to check the data and outcomes that they
retrieve from this website against original scientific literature and
other sources of information. No rights or damage can be claimed based
on data provided and/or generated by this website and its facilities,
including, but not limited to, information, search facilities, and
outcomes of search actions carried out with the aid of the data and
search facilities made available to website users. The sequences of
proteins stored in the website's database have been obtained from
SwissProt accession files and are therefore liable to SwissProt's
copyrights (http://www.expasy.org/,
see below). Website users should respect SwissProt's rights to the
sequence information. Data obtained through this website may be
published without the explicit written consent of the website owners,
provided that the data characteristics are not modified and the website
is acknowledged as the source of these data, while the acknowledgement
should include the website's name Allermatch™, its URL http://allermatch.org/, and the date
of its last update (see end of page below) during use. Notwithstanding
the previous statements, neither the authors' nor SwissProt's rights
should be affected.
SwissProt copyright statement
Swiss-Prot is copyright. It is produced through a collaboration between
the Swiss Institute of Bioinformatics and the EMBL Outstation - the
European Bioinformatics Institute. There are no restrictions on its use
by non-profit institutions as long as its content is in no way
modified. Usage by and for commercial entities requires a license
agreement. For information about the licensing scheme see: http://www.isb-sib.ch/announce/
or send an email to license@isb-sib.ch.
Acknowledgement
Financial support for this website from Plant Research International
and the Dutch Ministry of Agriculture, Nature, and Food Quality
(Research Programs 378, 390, and 404 [North-South]) is gratefully
acknowledged.
About this website
In most nations, genetically engineered foods must be assessed for
their safety before market approval is granted. An important issue in
this safety assessment is the potential allergenicity of transgenic
("foreign") proteins that have been introduced into the food by genetic
engineering. In other words, what is the chance that the foreign
protein may cause allergic reactions after consumption of the
genetically engineered food containing this protein?
Potential allergenicity is assessed during a step-by-step procedure
described by the guidelines of the FAO/WHO Codex Alimentarius
Commission for the safety assessment of foods derived from genetically
engineered plants and micro-organisms [1]. One
important step in this procedure is to determine, with the aid of
computer programs, whether the primary structure (amino acid sequence)
of the transgenic protein is similar to sequences of allergenic
proteins, which are available from public protein sequence
databases.
Two types of similarity are searched for:
- short identical stretches of 6-8 contiguous amino acids;
- larger stretches (80 amino acids long) containing a minimum of 35%
(non-contiguous) identical amino acids.
The similar stretches that are identified this way may harbour
potential binding sites (called epitopes) for IgE antibodies. IgE
antibodies are allergy-related and involved in the binding of the
allergen to mast cells, after which these cells release compounds, such
as histamine, that cause the symptoms of allergy. Allergens must
contain at least two IgE-binding epitopes to trigger a mast cell reaction.
To search for the two types of similarities, a recent Expert
Consultation of the FAO/WHO, which was held in preparation of the Codex
Alimentarius guidelines, devised the following procedure [2]:
6.1. Sequence Homology as Derived from Allergen Databases
The commonly used protein databases (PIR, SwissProt and TrEMBL) contain
the amino acid sequences of most allergens for which this information
is known. However, these databases are currently not fully up-to-date.
A specialized allergen database is under construction.
Suggested procedure on how to determine the percent amino acid identity
between the expressed protein and known allergens.
Step 1: obtain the amino acids sequences of all allergens in the
protein databases (for SwissProt and TrEMBL: see http://expasy.ch/tools; for PIR see http://wwwnbrf.georgetown.edu/pirwww
) in FASTA-format (using the amino acids from the mature proteins only,
disregarding the leader sequences, if any). Let this be data set (1).
Step 2: prepare a complete set of 80-amino acid length sequences
derived from the expressed protein (again disregarding the leader
sequence, if any). Let this be data set (2).
Step 3: go to EMBL internet address: http://www2.ebi.ac.uk
and compare each of the sequences of the data set (2) with all
sequences of data set (1), using the FASTA program on the web site for
alignment with the default settings for gap penalty and width.
Cross-reactivity between the expressed protein and a known allergen (as
can be found in the protein databases) has to be considered when there
is:
1) more than 35 % identity in the amino acid sequence of the expressed
protein (i.e. without the leader sequence, if any), using a window of
80 amino acids and a suitable gap penalty (using Clustal-type alignment
programs or equivalent alignment programs)
or:
2) identity of 6 contiguous amino acids.
If any of the identity scores equals or exceeds 35 %, this is
considered to indicate significant homology within the context of this
assessment approach. The use of amino acid sequence homologies to
identify prospective cross-reacting allergens in genetically modified
foods has been discussed in more detail elsewhere (Gendel, 1998a;
Gendel, 1998b).
The search facility on the Allermatch™ website automatically carries out
the procedure recommended by the guidelines on protein sequences that
are entered by the website visitor in FASTA format (one-letter code
without residue numbers, see example sequence below). The visitor has
the option to select the following outputs of interest:
- Alignment of 80-amino acids subsequences of the input sequence
using a sliding window of 80-amino acids size. The step size is 1 amino
acid, such that from a sequence of 100 amino acids, for example, 21
subsequences of 80 amino acids length are made (1-80, 2-81, 3-82 ...
20-99, 21-100). Each of these subsequences is aligned to database
sequences. The FASTA computer algorithm is used for these sequence
alignments, as recommended (see above; default values are used). With
FASTA, "head to tail" alignments (from the start to the end of a
sequence) are made of each subsequence with each database sequence. The
default threshold for the number of identical amino acids is 35% in the
alignment with an 80-amino acids window, which is considered a
significant level of identity between the input sequence and the
allergenic protein's sequence (see recommendations cited above). The
identity presented by the website in the results of the alignments is
therefore the % identical amino acids in the 80-amino acids window. The
default threshold can be changed by the user. Input sequences shorter
than 80 amino acids should not be aligned using this option.
- Full alignment of the whole input sequence with database
sequences using the FASTA algorithm. This option can be used, for
example, for input sequences shorter than 80 amino acids, for which the
option of the 80-amino acids sliding window (see above) cannot be used.
Also, where an input sequence shows sufficient identity with many
proteins over its entire sequence, this option may provide a good
overview of the alignments between the input and database
sequences.
- Exact hits of short identical stretches of, for example, 6 amino
acids. To this end, a wordmatch algorithm is used, which searches for
identical matches of a specified number of contiguous amino acids
("wordlength") between the input sequence and a given database
sequence. The default value for the wordlength, which can be changed by
the user, is 6 amino acids. Decreasing the wordlength likely results in
a larger number of positive scores, while increasing it may yield fewer
positive results.
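The window generation described for the first option can be sketched in Python as follows. This is an illustrative sketch only, not the website's actual code; the function name sliding_windows and its parameters are assumptions made for this example, and the 100-residue input is a placeholder rather than a real protein.

```python
def sliding_windows(sequence, size=80, step=1):
    """Return (start, end, subsequence) tuples with 1-based, inclusive positions."""
    if len(sequence) < size:
        # Sequences shorter than the window cannot be analysed with this option.
        return []
    return [
        (i + 1, i + size, sequence[i:i + size])
        for i in range(0, len(sequence) - size + 1, step)
    ]


# Placeholder sequence of 100 residues (not a real protein).
windows = sliding_windows("a" * 100)
print(len(windows))                    # number of 80-residue windows
print(windows[0][:2], windows[-1][:2])  # positions of first and last window
```

For a sequence of length n, the number of windows is n - 80 + 1, which gives the 21 windows (1-80 up to 21-100) mentioned above for n = 100.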
The entered sequences will be compared to the sequences of allergenic
proteins compiled in the database. These sequences of allergenic
proteins have been extracted from the SwissProt list of allergens (http://www.expasy.org/cgi-bin/lists?allergen.txt;
update November 14, 2003). (Putative) signal-, pro-, and
transit-peptides, whose positions are indicated by SwissProt as
"features", have been removed from these sequences, which yields the
sequences of "mature" proteins. The total number of mature protein
sequences in the database is currently 315.
Positive results of the procedure will be provided to the user. To our
knowledge, the automatic feature of this website for comparing the
sequence of a protein of interest to sequences of allergenic proteins
following the recommended procedure is unique on the Internet. The
details of the various options for sequence alignments are discussed in
the following sections.
Example of matching an input sequence
(print this page so it can be consulted during the subsequent steps
described below)
Input sequence
The following sequence is that of the mature protein of the allergen
Zea m 14 from maize pollen. As
may be noticed, this sequence contains one-letter codes for each amino
acid, while the complete
sequence is made up of 93 letters or amino acids:
aiscgqvasaiapcisyargqgsgpsagccsgvrslnnaarttadrraacnclknaaagvsglnagnaasipskcgvsipytiststdcsrvn
While the original protein sequence in the SwissProt database entry
P19656 consisted of 120 amino acids, removal of the signal peptide
comprising the first 27 amino acids has yielded this mature protein
sequence containing 93 amino acids.
If users enter their own input sequences, numbers in the sequence
should be removed, whereas spaces and paragraph or line returns need
not be removed. In addition, three-letter codes for amino acids, such
as AlaIleSer... (first 3 residues of Zea m 14), should be changed into
one-letter codes, for example by using web-based conversion tools
(for example, "Three-to-One",
http://www.ualberta.ca/~stothard/javascript/threeToOne.html).
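The clean-up steps described above can also be done locally. The following Python sketch is an illustration under assumptions: the function names are hypothetical, the table covers only the 20 standard amino acids, and it is not the conversion tool referred to above.

```python
import re

# Standard three-letter to one-letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def strip_numbers_and_whitespace(raw):
    """Remove residue numbers, spaces and line breaks from a pasted sequence."""
    return re.sub(r"[\d\s]+", "", raw)


def three_to_one(raw):
    """Convert a run of three-letter codes (e.g. 'AlaIleSer') to one-letter codes."""
    cleaned = strip_numbers_and_whitespace(raw)
    if len(cleaned) % 3 != 0:
        raise ValueError("length is not a multiple of three")
    codes = [cleaned[i:i + 3].capitalize() for i in range(0, len(cleaned), 3)]
    # Unknown codes raise KeyError rather than being silently dropped.
    return "".join(THREE_TO_ONE[code] for code in codes)


print(strip_numbers_and_whitespace("1 aiscgqvasa 11 iapcisyarg"))
print(three_to_one("AlaIleSer"))
```

The first call removes hypothetical residue numbers from the start of the Zea m 14 sequence shown above; the second converts its first three residues to one-letter codes.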
Entering an input sequence and selecting the alignment of interest
Enter the input sequence, by typing or copy-pasting it, in the
searchbox (below "Copy Paste your amino acid sequence here") of the
Allermatch™ main page. With the cursor, select one of the following
options:
- "Do an 80 amino acids sliding window alignment"
- "Look for a small exact wordmatch"
- "Do a full fasta alignment"
In case the 80-amino acids sliding window has been chosen, the default
threshold value of 35% identity may be modified by the user in the box
next to "Cutoff Percentage (only applicable to the 80 amino acids
sliding window)". The threshold is the lower limit for alignments that
will be displayed in the following steps (alignments scoring below the
threshold will therefore not be displayed).
If the option for a small exact wordmatch has been chosen, the default
value 6 for the wordlength can be modified by the user in the box next
to " Wordlength (only applicable to the exact wordmatch search)". The
wordlength is the minimal number of amino acids in an exact match.
After having selected the options and thresholds (if applicable) of
interest, click the "Go" button. The results will appear in a new page
that is created in the same window on the user's screen. The
various outcomes are discussed below for each of the specific options.
80 amino acids sliding window
Summary table
The new page that appears after starting the 80-amino acids sliding
window alignment on the input sequence provides a table with a summary
of the "hits", which are alignments scoring above the cutoff value.
Each specific allergenic protein whose database sequence scored hits is
presented in a new line, while data on this allergenic protein and the
alignment are presented under the following column headings:
- "Hit No", the rank of the best hit (see third column) of the
allergenic protein, such as 1, 2, or 3, while the rank for the highest
best hit is 1.
- "Allergen id", the Allermatch™ identifier for the allergenic
protein whose sequence is stored in the database
- "Best hit (identity)", highest number of identical amino acids in
the hits, expressed as percentage of 80- or more- amino acids, for
example 30% for 24 identical amino acids
- "No of hits ident>....", the number of 80-amino acids
subsequences (windows) of the input sequence that showed hits above the
cut-off value with the database sequence of the allergenic protein
- "% of hits ident>.... ", the fraction (percentage, %) of the
total number of analysed subsequences (windows) of the input sequence
that showed hits above the cutoff value with the allergenic protein
- "Full identity", identical amino acids in the FASTA alignment of
the complete input sequence against the database sequence of the
allergenic protein. The first number is the percentage of identical
amino acids as part of the total length of the alignment, while the
second number is the total length of this alignment expressed as number
of amino acids (including non-identical amino acids).
- "Swissprot", the Swissprot accession number, which is clickable
and provides a link to the original accession file for the database
sequence on the Swissprot website (the user's browser will exit the
Allermatch™ website)
- "Species name", Latin name of the organism from which the
allergenic protein is derived
- "Detailed information", the clickable "Go" button links to a page
with specific details on the database sequence of the allergenic
protein, as well as the complete FASTA alignment and the subsequences
(windows) of the input sequence aligning to the database sequence.
After having clicked on the "Go" button, a new page will appear in the
same window on the user's screen.
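How the columns "Best hit (identity)", "No of hits" and "% of hits" relate to the individual window alignments can be sketched in Python. This is a minimal illustration, not the website's implementation: the function names are hypothetical, the per-window identities are invented, and a window is assumed to count as a hit when its identity is equal to or above the cutoff.

```python
def identity_percentage(identical, alignment_length=80):
    """Identity as a percentage of the aligned window length."""
    return 100.0 * identical / alignment_length


def summarise_windows(window_identities, cutoff=35.0):
    """Summarise per-window identity percentages for one database sequence.

    Assumption: a window is a hit when its identity is >= the cutoff.
    """
    if not window_identities:
        return {"best_hit": None, "no_of_hits": 0, "pct_of_hits": 0.0}
    hits = [pct for pct in window_identities if pct >= cutoff]
    return {
        "best_hit": max(hits) if hits else None,
        "no_of_hits": len(hits),
        "pct_of_hits": 100.0 * len(hits) / len(window_identities),
    }


# 24 identical residues in an 80-residue window, as in the column description.
print(identity_percentage(24))

# Hypothetical identities for 14 windows, 5 of which reach the 35% cutoff.
example = [36.25, 35.0, 36.25, 37.5, 35.0] + [30.0] * 9
print(summarise_windows(example))
```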
Detailed information
This page provides the following information:
- The input sequence (amino acid sequence).
- Details on the database sequence of the allergenic protein,
including allergen name, species name, Swissprot accession, remarks
(for example, signal-, pro-, or transit- peptides that have been
removed from the sequence) and amino acid sequence.
- The complete amino acid sequences of the input- and database-
sequences are shown in this page. Below each one-letter code for amino
acid residues in both of these sequences, a "#"-marking may be
displayed. The residues marked with "#" were aligned with residues in
the other sequence (database or input, respectively) in the 80-amino
acids window alignments that had 35% or more identical amino acids in
the window. Please note that these "#" markings also include
nonidentical residues in both the input- and database- sequences that
were aligned to each other. The 35% cut-off value is fixed for these
"#" markings and cannot be changed by the user.
- The full alignment between the complete input sequence (no
80-amino acids windows) and the allergenic protein.
By clicking the "Show all alignments" button in the upper right corner,
all the separate hits, i.e. alignments of those 80-amino acid
subsequences (windows) of the input sequence that scored equal to- or
above- the cutoff value of 35% (fixed value, cannot be changed by the
user), can be viewed. The new page that appears in the same window on
the user's screen contains the same information as the previous page,
in addition to the separate hits. After clicking "Hide all alignments",
the previous page re-appears.
Example
For the input sequence Zea m 14, for example, the summary table lists
10 database sequences of allergenic proteins that score hits if the
cutoff value equals 35%. Since the Zea m 14 sequence contains 93 amino
acids, 14 subsequences (windows) of 80-amino acids have been generated
(1-80, 2-81, ...., 13-92, 14-93). The highest ranking database sequence
in the table is Zea m 14 itself, because the same sequence is also
stored in the Allermatch™ database. It shows a best hit of 100%, and
all 14 windows of the input sequence scored hits, as expected. The
last ranking database sequence in the table is designated Par_j_2_a
(Allermatch™ identifier), one of the two database sequences of the
allergenic protein Par j 2 derived from weed pollen. The best hit for
this sequence is 36.59% identity, while 5 of the 14 windows scored
hits. The detailed information on the alignments with Par_j_2_a shows
that a large part of both the input and database sequences is covered
by the 80-amino acid sliding window and full alignments. Interestingly,
all the sequences listed in the table are lipid transfer proteins, as
mentioned in the original SwissProt accession files to which the table
provides links.
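The window arithmetic in this example can be checked with a few lines of Python. The sketch below uses the Zea m 14 input sequence given earlier on this page; the variable names are chosen for this illustration only.

```python
# Mature Zea m 14 sequence as shown in the "Input sequence" section.
ZEA_M_14 = (
    "aiscgqvasaiapcisyargqgsgpsagccsgvrslnnaarttadrraacnclknaaagvsglnagnaasipskcgvsipytiststdcsrvn"
)
WINDOW = 80

# Number of 80-residue windows with a step size of 1.
n_windows = len(ZEA_M_14) - WINDOW + 1
first = (1, WINDOW)
last = (n_windows, n_windows + WINDOW - 1)
print(len(ZEA_M_14), n_windows, first, last)

# Par_j_2_a: 5 of the windows scored hits, expressed as a percentage.
print(round(100.0 * 5 / n_windows, 2))
```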
Exact hits of small stretches of identical amino acids
Summary table
The new page that appears after starting the alignment of small
identical stretches using WordMatch provides a table summarising the
"hits", which are the alignments equal to- or above- the wordlength,
i.e. the minimal number of identical contiguous amino acids. Each of
the database sequences of allergenic proteins that showed a hit with
the input sequence is shown in a separate line of the table, while the
data on the allergenic protein are shown under the following column
headings:
- "No", rank of the database sequence of the allergenic protein,
while the sequence that scores the highest number of wordmatches ranks
number 1.
- "Allergen id", the Allermatch™ identifier for the allergenic
protein whose sequence is stored in the database
- "Number of exact wordmatches", the number of identical stretches
of a given wordlength shared by the input- and database- sequences.
- "% of exact wordmatches", the identical stretches of a given
wordlength shared by the input- and database- sequences, expressed as
percentage of the maximum number of stretches (nonidentical and
identical) of the same wordlength that can be made from the input
sequence.
- "Swissprot", the Swissprot accession number, which is clickable
and provides a link to the original accession file for the database
sequence on the Swissprot website (the user's browser will exit the
Allermatch™ website)
- "Species name", Latin name of the organism from which the
allergenic protein is derived
- "Detailed information", after the "Go" button has been clicked
on, a new page is created in the same window on the user's screen that
contains information on the allergenic protein, and the hits of short
identical stretches.
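The two wordmatch columns can be illustrated with a short Python sketch. It rests on an assumption about the counting rule: every position in the input sequence whose word of the chosen wordlength also occurs in the database sequence is counted once, and the percentage is taken over all words that can be made from the input sequence. The function name and the short target fragment are hypothetical; the website's own wordmatch algorithm may count differently.

```python
# Mature Zea m 14 sequence as shown in the "Input sequence" section.
ZEA_M_14 = (
    "aiscgqvasaiapcisyargqgsgpsagccsgvrslnnaarttadrraacnclknaaagvsglnagnaasipskcgvsipytiststdcsrvn"
)


def count_wordmatches(query, target, wordlength=6):
    """Return (number of matching words, percentage of all words in the query)."""
    query, target = query.lower(), target.lower()
    total = len(query) - wordlength + 1  # words that can be made from the query
    if total <= 0:
        return 0, 0.0
    target_words = {
        target[i:i + wordlength] for i in range(len(target) - wordlength + 1)
    }
    matches = sum(
        1 for i in range(total) if query[i:i + wordlength] in target_words
    )
    return matches, 100.0 * matches / total


# The input sequence against itself: every word matches.
print(count_wordmatches(ZEA_M_14, ZEA_M_14))

# Against a hypothetical fragment sharing a single 6-residue stretch.
print(count_wordmatches(ZEA_M_14, "xxacnclkxx"))
```

With a wordlength of 6, a 93-residue input sequence contains 93 - 6 + 1 = 88 words, so a single shared stretch corresponds to 1 of 88 words.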
Detailed information
This page provides the following information on the hits of the
selected wordlength with a specific allergenic protein:
- The input sequence (amino acid sequence)
- Details on the database sequence of the allergenic protein,
including allergen name, species name, Swissprot accession, remarks
(for example, signal-, pro-, or transit- peptides that have been
removed from the sequence) and amino acid sequence.
- The complete amino acid sequences of the input- and database-
sequences, while the "#"-symbols mark the residues within these
sequences that are part of the exact hits with the wordlength of 6
amino acids (fixed wordlength, which does not change to the wordlength
entered by the user).
- Matches that are shorter than 6 amino acids may be found in the
output of the full alignment (see below).
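The "#" marking of residues that belong to exact hits can be sketched as follows. This Python example is an assumption-laden illustration, not the website's code: the function name is hypothetical, the query is a short fragment of the Zea m 14 input sequence, and the target is an invented fragment.

```python
def mark_word_hits(query, target, wordlength=6):
    """Return a line with '#' under every query residue that is part of an exact hit."""
    target_words = {
        target[i:i + wordlength] for i in range(len(target) - wordlength + 1)
    }
    marks = [" "] * len(query)
    for i in range(len(query) - wordlength + 1):
        if query[i:i + wordlength] in target_words:
            # Mark all residues covered by this matching word.
            marks[i:i + wordlength] = "#" * wordlength
    return "".join(marks)


query = "rraacnclknaa"    # fragment of the Zea m 14 input sequence
target = "ggacnclkgg"     # hypothetical database fragment
print(query)
print(mark_word_hits(query, target))
```

Overlapping matching words simply extend the marked region, so a shared stretch longer than the wordlength appears as one continuous run of "#" symbols.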
Example
For the Zea m 14 test sequence, the summary table mentions 7 database
sequences of allergenic proteins, including Zea m 14 itself, if a
wordlength of 6 is selected. Besides Zea m 14, the other 6 database
sequences are allergenic proteins that are classified as lipid transfer
proteins. The two last ranking database sequences are Pru av 3 and Pru
ar 3 from cherry and apricot, respectively, each of which scored one
hit. As can be inferred from the detailed information, the single
identical stretch of 6 amino acids (acnclk) in Pru av 3 and Pru ar 3 is
also present in the other 5 listed database sequences.
Full alignment
The new page that appears after starting the full alignment contains
the following information:
- Bar diagram showing the number of hits for certain statistical
scores (E, opt) of the FASTA alignments of the input sequence with the
database sequences of allergenic proteins.
- List of database sequences of allergenic proteins, ranked in
descending order of best statistical scores for the alignment of these
sequences with the input sequence.
- Details of each specific alignment from the previous list, in the
same order, showing the aligned sequences, while the positions of
identical- and similar (non identical, but evolutionarily related)-
residues are indicated by two dots and one dot, respectively, between
the aligned sequences
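The two-dot marking of identical residues can be illustrated with a short Python sketch. It works on two sequences that have already been aligned (gaps shown as "-") and marks identities only; the single-dot marking of similar residues is not reproduced here because it depends on the substitution matrix used by the alignment program. The function name and the aligned fragments are hypothetical.

```python
def identity_line(aligned_a, aligned_b):
    """Return a marker line with ':' between identical residues of two aligned strings.

    Gap characters ('-') are never counted as identical.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    return "".join(
        ":" if a == b and a != "-" else " "
        for a, b in zip(aligned_a, aligned_b)
    )


# Hypothetical aligned fragments, with '-' marking a gap.
top = "acn-clk"
bottom = "acnqclr"
marks = identity_line(top, bottom)
print(top)
print(marks)
print(bottom)

# Percentage of identical residues over the total length of the alignment.
print(round(100.0 * marks.count(":") / len(marks), 1))
```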
Example
If Zea m 14 has been entered as input sequence, the highest scoring
database sequences are the same as for the 80-amino acids sliding
window alignment, i.e. lipid transfer proteins, in addition to the 3
database sequences of Par j 1, another lipid transfer protein.
About us
This website has been constructed through a joint effort of RIKILT - Institute of Food Safety and Plant Research International,
both part of Wageningen University and Research Center in Wageningen,
The Netherlands. Both partners also participate in the Allergy
Consortium Wageningen (ACW; http://www.allergymatters.org).
RIKILT - Institute of Food Safety (http://www.rikilt.wur.nl)
is specialised in food safety research, including the safety of
genetically engineered foods and animal feed. For example, RIKILT
develops advanced methods for detection- and safety testing- of
genetically engineered foods. In addition, RIKILT advises national and
international authorities on the safety of genetically engineered foods
and feed.
Participants for RIKILT in this project are Dr.ir. Gijs A. Kleter (gijs.kleter@wur.nl) and Dr.
A.A.C.M. Peijnenburg (ad.peijnenburg@wur.nl).
See also our publications on predicting potential IgE epitopes in novel
proteins [3, 4]
Plant Research International (http://www.plant.wur.nl)
carries out research in all fields of plant science, including plant
biotechnology and genomics. It has a proven track record in the field
of bioinformatics applied to genomics and proteomics of plants and
other organisms, such as Arabidopsis thaliana and Lactobacillus
plantarum, respectively.
Participants for Plant Research International in this project are Ir.
M. Fiers (mark.fiers@wur.nl) and
Mr. H. Nijland (herman.nijland@wur.nl)
Feedback
It is our goal to improve this website's facilities in the future and to
further extend it with a database of IgE epitopes as well as a module
that predicts the antigenicity (likelihood of antibody binding) of
identical stretches. Please send your comments and inquiries to gijs.kleter@wur.nl
References
- Codex Alimentarius Commission (2003) Codex
Principles and Guidelines on Foods Derived from Biotechnology.
Rome, Italy: Codex Alimentarius Commission, Joint FAO/WHO Food
Standards Programme, Food and Agriculture Organization.
ftp://ftp.fao.org/codex/standard/en/CodexTextsBiotechFoods.pdf
This document recommends the following sequence comparisons between
transgenic and allergenic proteins:
8. The purpose of a sequence homology comparison is to assess the
extent to which a newly expressed protein is similar in structure to a
known allergen. This information may suggest whether that protein has
an allergenic potential. Sequence homology searches comparing the
structure of all newly expressed proteins with all known allergens
should be done. Searches should be conducted using various algorithms
such as FASTA or BLASTP to predict overall structural similarities.
Strategies such as stepwise contiguous identical amino acid segment
searches may also be performed for identifying sequences that may
represent linear epitopes. The size of the contiguous amino acid search
should be based on a scientifically justified rationale in order to
minimize the potential for false negative or false positive results*.
Validated search and evaluation procedures should be used in order to
produce biologically meaningful results.
9. IgE cross-reactivity between the newly expressed protein and a known
allergen should be considered a possibility when there is more than 35%
identity in a segment of 80 or more amino acids (FAO/WHO 2001) or other
scientifically justified criteria. All the information resulting from
the sequence homology comparison between the newly expressed protein
and known allergens should be reported to allow a case-by-case
scientifically based evaluation.
* It is recognized that the 2001 FAO/WHO consultation suggested moving
from 8 to 6 identical amino acid segments in searches. The smaller the
peptide sequence used in the stepwise comparison, the greater the
likelihood of identifying false positives, inversely, the larger the
peptide sequence used, the greater the likelihood of false negatives,
thereby reducing the utility of the comparison.
- FAO/WHO (2001) Joint FAO/WHO Expert
Consultation on Foods Derived from Biotechnology - Allergenicity of
Genetically Modified Foods - Rome, 22 - 25 January 2001. Rome: Food and
Agriculture Organisation of the United Nations. http://www.who.int/foodsafety/publications/biotech/en/ec_jan2001.pdf
- Kleter, G.A., Peijnenburg, A.A.C.M. (2002)
Screening of transgenic proteins expressed in transgenic food crops for
the presence of short amino acid sequences identical to potential,
IgE-binding linear epitopes of allergens. BMC Structural Biology 2, 8. http://www.biomedcentral.com/1472-6807/2/8
- Kleter, G.A., Peijnenburg, A.A.C.M. (2003)
Presence of potential allergy-related linear epitopes in novel proteins
from conventional crops and the implication for the safety assessment
of these crops with respect to the current testing of genetically
modified crops. Plant Biotechnology Journal 1, 371-380.
(c) RIKILT-Institute of Food Safety and Plant Research International,
2004
Search algorithms installed and adapted by Ir. Mark Fiers (mark.fiers@wur.nl)
Website constructed by Mr. Herman Nijland (herman.nijland@wur.nl)
Allergen sequences compiled by Dr.ir. Gijs A. Kleter (gijs.kleter@wur.nl) and
Dr. Ad A.C.M. Peijnenburg (ad.peijnenburg@wur.nl)
The sequence search facility is provided by Applied Bioinformatics from
Plant Research International. The software runs on a Sun V60 server
running SUSE Linux. The software is written in Python and is served by an Apache web server with mod_python.
January 29, 2004.