Results Interpretation

"OK, I've got a match, but is it the right match?"

Peptide Mass Fingerprint

Confidence in a peptide mass fingerprint result may come from having independent supporting evidence. For example, if the analyte originated from a spot at approximately 40 kDa on a 2D gel separation of yeast proteins, then the anticipated result of a peptide mass fingerprint is a 40 kDa yeast protein. If the top scoring protein fits this expectation, the search is deemed "successful". If the top scoring match is a 200 kDa protein from a different species, the initial reaction is likely to be that the search has "failed".

While this is a reasonable approach, Mascot provides additional guidance in the form of a significance level. By default, the significance level is set at 5%. That is, if the score for a particular match exceeds the score threshold that corresponds to this significance level, there is less than a 1 in 20 chance that the observed match is a random event.

If the score is substantially above the significance level, look carefully before dismissing the result as spurious. Conversely, if the score is below the significance level, examine the match sceptically.

In most cases, there is prior knowledge of the origin of a sample, so it is only natural to look for matches to proteins from a particular species or kingdom. While a Mascot search can be restricted to a particular species, the taxonomy filter should be used with care:

Many sequence databases do not provide species information in a systematic and rigorous form
Contaminants can never be ruled out, and could come from any species, e.g. BSA or keratins
Unless the genome of the species of interest is completely sequenced, there is no guarantee that the true sequence of the analyte protein is actually present in the database. If it is missing, then high scoring matches from other species are of interest because they are likely to be homologous to the unknown.

It is the uncertainty in the mass of the intact protein which is the Achilles heel of a peptide mass fingerprint. This uncertainty is unavoidable, even when an accurate experimental mass for the intact protein is available, because it is unlikely that the mass of the expressed and processed protein will be exactly the same as that of the sequence entry in the protein database. A peptide mass fingerprint can only provide the statistically most probable identification. This is a great step forward over simply counting peptide mass matches, which can only work when a ceiling is placed on the intact protein mass. Otherwise, the mega-proteins always come out top of the list due to random matches. Unfortunately, even with an ideal scoring algorithm, there may be insufficient matching mass values for a confident identification without making assumptions about the intact protein mass or the species.

One method of improving the specificity of a peptide mass fingerprint was first proposed by Peter James [James, 1994]: simply perform additional digests using different proteases. Seeing the same protein with a high score in two independent digests provides a similar degree of confidence to seeing multiple peptide matches in an MS/MS ions search.

MS/MS Ions Search

In an MS/MS ions search, confidence that a protein (as opposed to a peptide) has been identified correctly comes largely from getting multiple matches to peptides from the same protein. An MS/MS ions search on data from an isolated peptide may actually be less specific than a good peptide mass fingerprint. This is because a peptide mass fingerprint has two significant advantages which are often overlooked:

Unsuspected post-translational modifications cause only marginal data loss in a peptide mass fingerprint. For example, if the cysteines have been modified by acrylamide, but this has not been specified in the search, only mass values from the cysteine-containing peptides are "wasted". As long as there is a reasonable number of mass values, the chances of a successful search are still good. In contrast, a single unsuspected post-translational modification in an MS/MS ions spectrum eliminates the peptide completely.
A peptide mass fingerprint provides "shotgun" coverage of the entire protein. An MS/MS ions search on data from a single peptide can only ever identify the peptide. This peptide may be unique to one protein, or may be common to a number of proteins.

Ions score significance thresholds

In Mascot, the score for an MS/MS match is based on the absolute probability (P) that the observed match between the experimental data and the database sequence is a random event. The reported score is -10Log(P), where Log is the base-10 logarithm. So, during a search, if 1.5 x 10^5 peptides fell within the mass tolerance window about the precursor mass, and the significance threshold was chosen to be 0.05 (a 1 in 20 chance of a false positive), P would have to be below 0.05 / (1.5 x 10^5), or about 3.3 x 10^-7, which translates into a score threshold of 65.
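The arithmetic behind this example can be checked with a short sketch. It only restates the relationship described above (score = -10Log(P), with the significance level divided by the number of candidate peptides); the function name identity_threshold and its parameters are illustrative and are not part of Mascot.

```python
import math


def identity_threshold(num_candidates: int, significance: float = 0.05) -> float:
    """Score a match must exceed so that the chance of a random match
    among `num_candidates` peptides stays below `significance`.

    Hypothetical helper: it applies score = -10 * log10(P) to
    P = significance / num_candidates, as in the worked example.
    """
    if num_candidates <= 0:
        raise ValueError("num_candidates must be positive")
    if not 0.0 < significance < 1.0:
        raise ValueError("significance must lie between 0 and 1")
    required_p = significance / num_candidates
    return -10.0 * math.log10(required_p)


# 1.5 x 10^5 candidate peptides at a 5% significance level
print(round(identity_threshold(150_000, 0.05)))  # 65
```

Because the threshold grows with the logarithm of the number of candidates, a tenfold increase in candidate peptides raises it by 10.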

If the quality of an MS/MS spectrum is poor, particularly if the signal to noise ratio is low, a match to the "correct" sequence might not exceed this absolute threshold. Even so, the match to the correct sequence could have a relatively high score, which is well differentiated from the quasi-normal distribution of 1.5 x 10^5 random scores. In other words, the score is an outlier. This would indicate that the match is not a random event and, on inspection, such matches are often found to be either the correct match or a match to a close homologue. For this reason, Mascot also attempts to characterise the distribution of random scores, and provide a second, lower threshold to highlight the presence of any outlier. The lower, relative threshold is reported as the "homology" threshold while the higher, absolute threshold is reported as the "identity" threshold.

Or, to put it another way: An ideal MS/MS spectrum would have one or more complete fragment ion series, no unassigned noise peaks, and perfect mass accuracy. If we had such ideal data, we would not need probability based scoring. Of course, real mass spectra are never ideal, and we never get a perfect match. So, given that the match is not perfect, we try to provide guidance as to whether the score indicates (i) a meaningless, random match, (ii) that the MS/MS spectrum represents either the sequence in the database or something very similar (significant homology), or (iii) that there is a high probability that the MS/MS spectrum represents the exact sequence in the database (identity or extensive homology).

Unlike a BLAST search, in the absence of ideal data, we can never be certain that we have identity, so we prefer the term "identity or extensive homology".

More about protein scores

The protein score in a Peptide Summary is derived from the ions scores. For a search that contains a small number of queries, the protein score is the sum of the unique ions scores, that is, excluding the scores for duplicate matches, which are shown in parentheses. A small correction is applied to reduce the contribution of low-scoring random matches. This correction is a function of the total number of molecular mass matches for each query and the width of the peptide tolerance window. It is usually very small, except in no-enzyme searches.
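As a rough illustration of the summation described above, the sketch below keeps only the best ions score for each peptide and adds them up. It assumes that a "duplicate" is a second match to the same peptide sequence, and it deliberately omits the small correction for random matches, because its formula is not given here. The function name and the data are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def standard_protein_score(matches: Iterable[Tuple[str, float]]) -> float:
    """Sum the highest ions score per peptide sequence.

    `matches` holds (peptide sequence, ions score) pairs for one protein.
    Lower-scoring duplicates of a peptide are ignored (assumption: these
    are the matches a report would show in parentheses). The small
    random-match correction is NOT modelled.
    """
    best: Dict[str, float] = {}
    for peptide, score in matches:
        if score > best.get(peptide, float("-inf")):
            best[peptide] = score
    return sum(best.values())


example = [("LVNELTEFAK", 52.0), ("LVNELTEFAK", 31.0), ("YLYEIAR", 40.0)]
print(standard_protein_score(example))  # 92.0
```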

This protein score works well, and provides a logical order to the report. If multiple queries match to a single protein, but the individual ions scores are below threshold, the combined ions scores can still place the protein high in the report. However, the standard protein score is less satisfactory for searches with very large numbers of queries, such as MudPIT data sets. For each MS/MS query, Mascot retains up to 10 peptide matches. When the number of queries is comparable with the number of entries in the database, this means that there can be random, low-scoring matches for every entry. Although the average number of random matches per entry might be low, the actual number will follow a distribution, and some entries will have large numbers of low scoring matches, leading to large protein scores.

While it is obvious from a detailed study of the report that these are meaningless matches, it would be better to eliminate them entirely. So, if the ratio between the number of queries and the number of entries in the database exceeds a pre-determined threshold, the basis for calculating the protein score is changed. Only those ions scores that exceed one or both significance thresholds contribute to the score, so that low scoring, random matches have no effect. This gives a much cleaner report for a large scale search. The ratio threshold is 0.001 by default, and can be changed on a global basis in the configuration file, mascot.dat, or changed for a single report by using the format controls at the top of the report. Note that, when calculating this ratio, if a taxonomy filter is being used, the number of entries in the database is the number remaining after the taxonomy filter.
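The switch between the two scoring modes can be sketched as follows. The default ratio of 0.001 comes from the text above; everything else (function names, the tuple layout, and the choice to add the full ions score of each qualifying match) is an assumption made for illustration, since the exact contribution of each qualifying score is not stated here.

```python
from typing import Iterable, Optional, Tuple

# (ions score, identity threshold, homology threshold or None if not reported)
Match = Tuple[float, float, Optional[float]]


def use_large_search_scoring(num_queries: int, num_entries: int,
                             ratio_threshold: float = 0.001) -> bool:
    """True when queries / entries exceeds the ratio threshold.

    `num_entries` should be the count remaining after any taxonomy filter.
    """
    if num_entries <= 0:
        raise ValueError("num_entries must be positive")
    return num_queries / num_entries > ratio_threshold


def thresholded_protein_score(matches: Iterable[Match]) -> float:
    """Sum only ions scores that exceed at least one significance threshold.

    Assumption: a qualifying match contributes its full ions score.
    """
    total = 0.0
    for score, identity, homology in matches:
        passes_identity = score > identity
        passes_homology = homology is not None and score > homology
        if passes_identity or passes_homology:
            total += score
    return total


print(use_large_search_scoring(50_000, 500_000))  # True  (ratio 0.1)
print(use_large_search_scoring(300, 500_000))     # False (ratio 0.0006)

matches = [(70.0, 65.0, 40.0), (45.0, 65.0, 40.0),
           (12.0, 65.0, 40.0), (50.0, 65.0, None)]
print(thresholded_protein_score(matches))  # 115.0
```

In this example the matches scoring 12 and 50 drop out: the first is below both thresholds, and the second is below the identity threshold and has no homology threshold to fall back on.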

Grouping of peptide matches to protein hits

Reporting the results from a search which includes multiple queries can be complex, because it is not always clear which peptide "belongs" to which protein. The use of red and bold typefaces is intended to highlight the most logical assignment of peptides to proteins. The first time a peptide match to a query appears in the report, it is shown in bold face. Whenever the top ranking peptide match appears, it is shown in red. This means that protein hits with peptide matches that are both bold and red are the most likely assignments. These hits represent the highest scoring protein that contains one or more top ranking peptide matches.
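The bold and red rules can be expressed in a few lines. The sketch below assumes the protein hits are already listed in report order and that each peptide match is described by a query number and its rank for that query; these names and structures are hypothetical and are not the Mascot result file format.

```python
from typing import Dict, List, Set, Tuple

# One protein hit: (protein accession, [(query number, rank of the match), ...])
Hit = Tuple[str, List[Tuple[int, int]]]
# One marked match: (query number, is_bold, is_red)
Marked = Tuple[int, bool, bool]


def mark_matches(hits_in_report_order: List[Hit]) -> Dict[str, List[Marked]]:
    """Apply the two typeface rules to every peptide match.

    bold: first time any match to this query appears in the report
    red:  the match is the top ranking (rank 1) match for its query
    """
    seen_queries: Set[int] = set()
    marked: Dict[str, List[Marked]] = {}
    for protein, matches in hits_in_report_order:
        rows: List[Marked] = []
        for query, rank in matches:
            is_bold = query not in seen_queries
            is_red = rank == 1
            seen_queries.add(query)
            rows.append((query, is_bold, is_red))
        marked[protein] = rows
    return marked


def has_bold_red(rows: List[Marked]) -> bool:
    """True if the protein hit contains at least one bold red match."""
    return any(is_bold and is_red for _, is_bold, is_red in rows)


hits = [("P1", [(1, 1), (2, 1)]), ("P2", [(2, 1), (3, 2)])]
marked = mark_matches(hits)
print(has_bold_red(marked["P1"]))  # True
print(has_bold_red(marked["P2"]))  # False
```

Here P2 has one red match that is not bold (query 2 was already shown under P1) and one bold match that is not red (rank 2), so it contains no bold red match.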

The Peptide Summary report seeks to answer the question: "which minimal set of proteins completely accounts for the peptide matches found in the experimental data?". This approach is sometimes referred to as Occam's Razor or the Principle of Parsimony. Proteins that span the same set of peptides, or a subset, are collapsed into a single entry on the hit list.
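The collapsing rule in the last sentence can be sketched with set comparisons. This is a simplified illustration of the same-set-or-subset rule only, not Mascot's grouping code; the function name and the example data are hypothetical.

```python
from typing import Dict, List, Set


def collapse_hits(protein_peptides: Dict[str, Set[str]]) -> List[List[str]]:
    """Group proteins whose peptide set equals, or is a subset of, the
    peptide set of a group's first (lead) protein.

    Proteins are visited from the largest peptide set to the smallest,
    with ties broken by name so the result is deterministic.
    """
    ordered = sorted(protein_peptides,
                     key=lambda name: (-len(protein_peptides[name]), name))
    groups: List[List[str]] = []
    for protein in ordered:
        peptides = protein_peptides[protein]
        for group in groups:
            lead = group[0]
            if peptides <= protein_peptides[lead]:  # same set or subset
                group.append(protein)
                break
        else:
            groups.append([protein])  # no lead protein covers it
    return groups


example = {
    "A": {"pep1", "pep2", "pep3"},
    "B": {"pep1", "pep2", "pep3"},  # same set as A
    "C": {"pep2"},                  # subset of A
    "D": {"pep4"},                  # unrelated
}
print(collapse_hits(example))  # [['A', 'B', 'C'], ['D']]
```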

Clear-cut matches can usually be expected in the first few hits from a large and complex search. More ambiguous cases may be found further down the list. For example, the highest scoring protein might contain a poorly scoring peptide match, well below the first rank. However, the fact that the protein is known to be present because of other matches would suggest that the low score reflects poor quality data rather than an incorrect match. The only reason to doubt such an assignment would be if a substantially better match, with a score above the significance threshold, could be found in a different protein.

Large MS/MS Searches

Searches of LC-MS/MS data sets can produce an overwhelming volume of data. It may not be feasible to study every single peptide match; one has to automatically separate the good matches from the bad as efficiently as possible. The other problem associated with MudPIT-type searches is that the result report may become so large that it is slow to load and scroll in a standard web browser.

The format controls near the top of the report can help streamline the results from a large search by eliminating most of the "junk". These options can also be selected by adding URL switches to the report URL.

Large search mode protein scoring: By default, large searches will switch to using more aggressive protein scoring. This removes many of the junk protein hits, which have high protein scores but no high scoring peptide matches.

Require Bold Red: Requiring a protein hit to include at least one bold red peptide match is a good way to remove duplicate homologous proteins from a report. You can turn this on using a checkbox in the format controls. The down-side is that you may sometimes throw out the wrong protein! For example, imagine you are searching with a taxonomy of mammals but are mainly interested in yeti proteins. If the same strong peptide matches are found in a yeti protein and also in the human homologue, and one or more junk peptide matches prevent the two proteins collapsing into a single hit, but give the human protein a slightly higher score, that is the one that will feature in the report.

Ignore Ions Score Below: You can minimise the previous problem by judicious use of the Ions score cut-off field. By setting this to (say) 20, you cut out all of the very low scoring, random peptide matches. This means that homologous proteins are more likely to collapse into a single hit, avoiding the need to choose between them.
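A small sketch shows why the cut-off helps. Two homologous proteins share the same strong peptide matches, but each also has a different junk match, so neither peptide set contains the other. Once matches scoring below the cut-off are ignored, the two sets become identical and the proteins can collapse into one hit. The protein names, peptide strings and scores are invented for illustration.

```python
from typing import Dict, List, Set, Tuple


def apply_ions_score_cutoff(
    protein_matches: Dict[str, List[Tuple[str, float]]],
    cutoff: float = 20.0,
) -> Dict[str, Set[str]]:
    """Return the peptides of each protein whose ions score is at least
    `cutoff`; lower scoring matches are ignored."""
    return {
        protein: {peptide for peptide, score in matches if score >= cutoff}
        for protein, matches in protein_matches.items()
    }


matches = {
    "yeti_protein": [("STRONGPEPA", 60.0), ("STRONGPEPB", 55.0), ("JUNKONE", 5.0)],
    "human_protein": [("STRONGPEPA", 60.0), ("STRONGPEPB", 55.0), ("JUNKTWO", 9.0)],
}

unfiltered = apply_ions_score_cutoff(matches, cutoff=0.0)
filtered = apply_ions_score_cutoff(matches, cutoff=20.0)

# Without the cut-off the junk matches keep the peptide sets apart
print(unfiltered["yeti_protein"] == unfiltered["human_protein"])  # False
# With the cut-off both proteins span the same set of peptides
print(filtered["yeti_protein"] == filtered["human_protein"])      # True
```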

Suppress the pop-ups: The JavaScript pop-up windows that show the top 10 peptide matches for each query are very useful, but they make the HTML report much larger and slower to load in a web browser. If you have a report that never seems to load, or is very slow to scroll, try using the radio buttons to suppress pop-ups.