Results Interpretation
"OK, I've got a match, but is it the right match?"
Confidence in a peptide mass fingerprint result may come from having independent supporting
evidence. For example, if the analyte originated from
a spot at approximately 40 kDa on a 2D gel separation of yeast proteins,
then the anticipated result of a peptide mass fingerprint is a 40 kDa yeast protein. If the
top scoring protein fits this expectation, the search is deemed "successful".
If the top scoring match is a 200 kDa protein from a different species, the
initial reaction is likely to be that the search has "failed".
While this is a reasonable approach, Mascot provides additional guidance in the
form of a significance level. By default, the significance level is set at 5%. That is,
if the score for a particular match exceeds the threshold corresponding to this level, there is less than
a 1 in 20 chance that the observed match is a random event.
If the score is substantially above the threshold, look carefully
before dismissing the result as spurious. Conversely,
if the score is below the threshold, examine the match sceptically.
In most cases, there is prior knowledge of the origin of a sample, so
it is only natural to look for matches to proteins from a particular species
or kingdom. While a Mascot search can be restricted to a particular species,
the taxonomy filter should be used with care:
-
Many sequence databases do not provide species information in a systematic
and rigorous form
-
Contaminants can never be ruled out, and could come from any species, e.g.
BSA or keratins
-
Unless the genome of the species of interest is completely sequenced, there
is no guarantee that the true sequence of the analyte protein is actually present in
the database. If it is missing, then high scoring matches from other species
are of interest because they are likely to be homologous to the unknown.
It is the uncertainty in the mass of the intact protein which is the Achilles
heel of a peptide mass fingerprint. This uncertainty is unavoidable, even when
an accurate experimental
mass for the intact protein is available, because it is unlikely that the
mass of the expressed and processed protein will be exactly the same as that
of the sequence entry in the protein database. A peptide mass fingerprint
can only provide the
statistically most probable identification. This is a great step
forward over simply counting peptide mass matches, which can only work
when a ceiling is placed on the intact protein mass. Otherwise, the mega-proteins
always come out top of the list due to random matches. Unfortunately, even
with an ideal scoring algorithm, there may be insufficient matching
mass values for a confident identification without making assumptions about
the intact protein mass or the species.
One method of improving the specificity of a peptide mass fingerprint was
first proposed by Peter James [James, 1994]. Simply do
additional digests using different proteases. Seeing the same protein with
a high score in two independent digests provides a similar degree of confidence
to seeing multiple peptide matches in an MS/MS ions search.
In an MS/MS ions search, confidence that a protein (as
opposed to a peptide) has been identified correctly
comes largely from getting multiple matches to peptides from the
same protein. An MS/MS ions search on data from an isolated peptide may actually be
less specific than a good peptide mass fingerprint. This is because a peptide mass
fingerprint has two significant advantages which are often overlooked:
-
Unsuspected post-translational modifications cause only marginal data loss
in a peptide mass fingerprint. For example, if the cysteines have been modified by acrylamide,
but this has not been specified in the search, only mass values from
the cysteine-containing peptides are "wasted". As long as there is a reasonable
number of mass values, the chances of a successful search are still good. In contrast,
a single unsuspected post-translational modification in an MS/MS ions spectrum eliminates
the peptide completely.
-
A peptide mass fingerprint provides "shotgun" coverage of the entire protein.
An MS/MS ions search on data from
a single peptide can only ever identify the peptide. This peptide may be
unique to one protein, or may be common to a number of proteins.
In Mascot, the score for an MS/MS match is based on the absolute probability (P)
that the observed match between the experimental data and the database sequence
is a random event. The reported score is -10Log(P), where Log is the logarithm to base 10. So, during a
search, if 1.5 x 10^5 peptides fell within the mass tolerance window about the
precursor mass, and the significance threshold was chosen to be 0.05 (a 1 in 20
chance of a false positive), this would translate into a score threshold of 65.
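The arithmetic behind that figure can be checked with a short sketch. It assumes the per-match probability is the chosen significance level divided by the number of candidate peptides, which is one reading that reproduces the numbers quoted above; the function name `identity_threshold` is illustrative and is not part of Mascot.

```python
import math


def identity_threshold(num_candidates: float, alpha: float = 0.05) -> float:
    """Score a match must exceed for its chance of being random to be below alpha.

    Assumption: alpha is shared equally across all candidate peptides that
    fall inside the precursor mass tolerance window.
    """
    p = alpha / num_candidates      # probability allowed for any single match
    return -10 * math.log10(p)      # same -10Log(P) transform as the score


threshold = identity_threshold(1.5e5, alpha=0.05)
print(round(threshold))
```

With 1.5 x 10^5 candidates and a significance level of 0.05, the value rounds to 65, in agreement with the text.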
If the quality of an MS/MS spectrum is poor, particularly if the signal to noise
ratio is low, a match to the "correct" sequence might not exceed this absolute
threshold. Even so, the match to the correct sequence could have a relatively
high score, which is well differentiated from the quasi-normal distribution of
1.5 x 10^5 random scores. In other words, the score is an outlier. This would
indicate that the match is not a random event and, on inspection, such matches
are often found to be either the correct match or a match to a close homologue.
For this reason, Mascot also attempts to characterise the distribution of random
scores, and provide a second, lower threshold to highlight the presence of any
outlier. The lower, relative threshold is reported as the "homology" threshold
while the higher, absolute threshold is reported as the "identity" threshold.
Or, to put it another way: An ideal MS/MS spectrum would have one or more
complete fragment ion series, no unassigned noise peaks, and perfect mass
accuracy. If we had such ideal data, we would not need probability based
scoring. Of course, real mass spectra are never ideal, and we never get a
perfect match. So, given that the match is not perfect, we try to provide
guidance as to whether the score indicates (i) a meaningless, random match, (ii)
that the MS/MS spectrum represents either the sequence in the database or
something very similar (significant homology), or (iii) a high probability
that the MS/MS spectrum represents
the exact sequence in the database (identity or extensive homology).
Unlike a BLAST search, in the absence of ideal data, we can never be certain
that we have identity, so we prefer the term "identity or extensive homology".
The protein score in a Peptide Summary is derived from the ions scores.
For a search that contains a small number of queries, the protein score is the sum of the
unique ions scores, that is, excluding the scores for duplicate matches, which are shown in parentheses.
A small correction is applied to reduce the contribution of low-scoring random matches.
This correction is a function of the total number of molecular mass matches for each query
and the width of the peptide tolerance window. This correction is usually very small,
except in no enzyme searches.
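As a rough illustration of the standard protein score, the sketch below sums the ions scores for one protein while counting each distinct peptide only once. Two points are assumptions rather than documented behaviour: that the highest score is the one kept among duplicates, and that the small correction for random matches can be ignored. The peptide sequences and scores are made up for the example.

```python
def protein_score(matches):
    """Sum unique ions scores for one protein.

    matches: list of (peptide_sequence, ions_score) pairs.
    Assumption: among duplicate matches to the same peptide, only the
    highest score counts. The small random-match correction is omitted.
    """
    best = {}
    for sequence, score in matches:
        if score > best.get(sequence, float("-inf")):
            best[sequence] = score
    return sum(best.values())


# Hypothetical matches: the second LVNELTEFAK entry is a duplicate.
matches = [("LVNELTEFAK", 52.0), ("LVNELTEFAK", 31.0), ("YLYEIAR", 40.0)]
total = protein_score(matches)
print(total)
```

Here the duplicate score of 31 is excluded, so the total is 52 + 40.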
This protein score works well, and provides a logical order to the report. If multiple
queries match to a single protein, but the individual ions scores are below threshold, the
combined ions scores can still place the protein high in the report. However, the
standard protein score is less satisfactory for searches with very large numbers of queries,
such as MudPIT data sets. For each MS/MS query, Mascot
retains up to 10 peptide matches. When the number of queries is comparable with the number
of entries in the database, this means that there can be random, low-scoring matches
for every entry. Although the average number of random matches per entry might be low, the actual number
will follow a distribution, and some entries will have large numbers of low scoring matches,
leading to large protein scores.
While it is obvious from a detailed study of the report that these are meaningless matches,
it would be better to eliminate them entirely. So, if the ratio between the number of queries
and the number of entries in the database exceeds a
pre-determined threshold, the basis for calculating the protein score is changed. Only
those ions scores that exceed one or both significance thresholds contribute to the score,
so that low scoring, random matches have no effect. This gives a much cleaner report for a large
scale search. This threshold is 0.001 by default, and can be changed on a global
basis in the configuration file, mascot.dat, or changed for a single report by using the
format controls at the top of the report. Note
that, when calculating this threshold, if a taxonomy filter is being used, the number of
entries in the database is the number remaining after the taxonomy filter.
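The switch between the two scoring modes can be sketched as follows. This is an illustration under assumptions, not Mascot's implementation: each ions score is paired with hypothetical homology and identity thresholds, a score counts in large search mode if it exceeds at least one of them, and handling of duplicate matches is left out for brevity.

```python
def large_search_mode(num_queries, num_entries, ratio_threshold=0.001):
    """True when the query-to-entry ratio exceeds the switch-over threshold.

    num_entries should be the count remaining after any taxonomy filter.
    """
    return num_queries / num_entries > ratio_threshold


def protein_score(ions, num_queries, num_entries):
    """ions: list of (score, homology_threshold, identity_threshold) tuples."""
    if large_search_mode(num_queries, num_entries):
        # Only scores above at least one significance threshold contribute.
        kept = [s for s, hom, ident in ions if s > hom or s > ident]
    else:
        # Standard mode: every ions score contributes.
        kept = [s for s, _, _ in ions]
    return sum(kept)


# Hypothetical ions scores with their thresholds.
ions = [(60.0, 30.0, 45.0), (15.0, 30.0, 45.0), (35.0, 30.0, 45.0)]
small = protein_score(ions, num_queries=10, num_entries=50_000)
large = protein_score(ions, num_queries=100, num_entries=50_000)
print(small, large)
```

In the second call the ratio is 0.002, so the score of 15 is dropped and only the two scores above a threshold are summed.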
Reporting the results from a search which includes multiple queries can be complex,
because it is not always clear which peptide "belongs" to which protein.
The use of red and bold
typefaces is intended to highlight the most logical assignment
of peptides to proteins. The first time a peptide match to a query appears in the report, it is
shown in bold face. Whenever the top ranking peptide match appears, it is shown in red.
This means that protein hits with peptide matches that are both bold and red are the most
likely assignments. These
hits represent the highest scoring protein that contains one or more top ranking peptide
matches.
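The bold and red rules can be expressed as a small sketch. It assumes the report is an ordered list of protein hits, each holding (query, rank) pairs, and that "first appearance" is tracked per query; the data layout and function names are hypothetical.

```python
def annotate(report):
    """Mark each peptide match as bold and/or red.

    report: ordered list of (protein, [(query_id, rank), ...]).
    bold: first time a match for this query appears in the report.
    red:  the match is the top ranking (rank 1) match for its query.
    """
    seen_queries = set()
    annotated = []
    for protein, matches in report:
        rows = []
        for query_id, rank in matches:
            bold = query_id not in seen_queries
            seen_queries.add(query_id)
            red = rank == 1
            rows.append((query_id, bold, red))
        annotated.append((protein, rows))
    return annotated


def has_bold_red(rows):
    """True if at least one match in the hit is both bold and red."""
    return any(bold and red for _, bold, red in rows)


report = [
    ("P1", [(1, 1), (2, 1)]),
    ("P2", [(1, 1), (3, 2)]),  # query 1 already seen; query 3 is not rank 1
]
result = annotate(report)
flags = {protein: has_bold_red(rows) for protein, rows in result}
print(flags)
```

In this example P1 has bold red matches, while P2 has a red match that is not bold and a bold match that is not red, so P2 would not count as a most likely assignment.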
The Peptide Summary report seeks to answer the question: "which minimal set of
proteins completely accounts for the peptide matches found in the experimental
data?". This approach is sometimes referred to as Occam's Razor or the Principle
of Parsimony. Proteins that span
the same set of peptides, or a subset, are collapsed into a single entry on the hit list.
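The collapsing step can be illustrated with a simplified sketch. It is a greedy grouping by peptide set, assuming proteins are considered from the largest peptide set to the smallest; it is not a full minimal-set solver and is not taken from Mascot itself. Protein and peptide names are placeholders.

```python
def collapse(protein_peptides):
    """Group proteins whose peptide set equals, or is a subset of, a larger hit.

    protein_peptides: dict mapping protein name -> set of matched peptides.
    Returns a list of hits, each with a lead protein and its members.
    """
    hits = []
    # Largest peptide sets first; ties broken by name for a stable order.
    ordered = sorted(protein_peptides.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for name, peptides in ordered:
        for hit in hits:
            if peptides <= hit["peptides"]:   # same set or a subset
                hit["members"].append(name)
                break
        else:
            hits.append({"lead": name, "peptides": set(peptides), "members": [name]})
    return hits


proteins = {
    "A": {"p1", "p2", "p3"},
    "B": {"p1", "p2", "p3"},  # same set as A
    "C": {"p1", "p2"},        # subset of A
    "D": {"p4"},              # independent evidence
}
hits = collapse(proteins)
print([(h["lead"], h["members"]) for h in hits])
```

Proteins B and C add no peptide evidence beyond A, so they fold into the entry for A, while D remains a separate hit.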
Clear-cut matches can usually be expected in the first few hits
from a large and complex search. More ambiguous cases may be found further down the list. For
example, the highest scoring protein might contain a poorly scoring peptide match, well below
the first rank.
However, the fact that the protein is known to be present because of other matches would suggest that
the low score reflects poor quality data rather than an incorrect match. The only reason to
doubt such an assignment would be if a substantially better match, with a score
above the significance threshold, could be found in a different protein.
Searches of LC-MS/MS data sets can produce an overwhelming volume of data. It may
not be feasible to study every single peptide match; one has to automatically
separate the good matches from the bad as efficiently as possible. The other
problem associated with MudPIT-type searches is that the result report may become so
large that it is slow to load and scroll in a standard web browser.
The format controls near the top of the report
can help streamline the results from a large search by eliminating most of the "junk".
These options can also be selected by adding URL switches to
the report URL.
Large search mode protein scoring: By default, large searches
will switch to using more aggressive protein scoring. This removes
many of the junk protein hits, which have high protein scores but no high scoring
peptide matches.
Require Bold Red: Requiring a protein hit to include at least one bold
red peptide match is a good way to remove duplicate homologous proteins from a report.
You can turn this on using a checkbox in the format controls.
The down-side is that you may sometimes throw out the wrong protein! For example, imagine you
are searching with a taxonomy of mammals but are mainly interested in yeti proteins. If the
same strong peptide matches are found in a yeti protein and also in the human homologue,
and one or more junk peptide matches prevent the two proteins collapsing into a single hit, but give the
human protein a slightly higher score, that is the one that will feature in the report.
Ignore Ions Score Below: You can minimise the previous problem by judicious use of the
Ions score cut-off field. By setting this to (say) 20, you cut out all of the very low
scoring, random peptide matches. This means that homologous proteins are more likely to collapse
into a single hit, avoiding the need to choose between them.
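A small sketch shows why the cut-off helps. The peptide names and scores are invented, and treating "ignore below 20" as "keep scores of 20 or more" is an assumption about the boundary case.

```python
def apply_cutoff(matches, cutoff):
    """Return the set of peptides whose ions score is at least the cut-off.

    matches: dict mapping peptide -> ions score for one protein.
    """
    return {peptide for peptide, score in matches.items() if score >= cutoff}


# Hypothetical homologues sharing two strong matches, each with its own junk match.
human = {"pepA": 55.0, "pepB": 48.0, "junk1": 9.0}
yeti = {"pepA": 55.0, "pepB": 48.0, "junk2": 7.0}

same_before = apply_cutoff(human, 0) == apply_cutoff(yeti, 0)
same_after = apply_cutoff(human, 20) == apply_cutoff(yeti, 20)
print(same_before, same_after)
```

Without a cut-off the two peptide sets differ, so the proteins stay as separate hits. With a cut-off of 20 the junk matches disappear and both proteins span the same peptides, so they can collapse into one hit.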
Suppress the pop-ups: The JavaScript pop-up windows, which show the top 10 peptide matches
for each query, are very useful, but they make the HTML report much larger and slower to load
in a web browser. If you have a report that never seems to load, or is very slow to scroll, try
using the radio buttons to suppress pop-ups.