Scoring Schemes

Mowse

The Mowse scoring algorithm is described in [Pappin, 1993].

The first stage of a Mowse search is to compare the calculated peptide masses for each entry in the sequence database with the set of experimental data. Each calculated value which falls within a given mass tolerance of an experimental value counts as a match. A molecular weight range for the intact protein can be used as a pre-filter.
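The matching step described above can be sketched in a few lines of Python. This is only an illustration: the function names, the mass values, and the choice to treat the tolerance as an absolute window in Da are assumptions, not details taken from the Mowse or Mascot implementation.

```python
def in_protein_mass_window(protein_mass, low, high):
    """Pre-filter: keep an entry only if its intact mass lies in [low, high]."""
    return low <= protein_mass <= high


def matching_masses(calculated, experimental, tolerance_da):
    """Return the calculated peptide masses that fall within tolerance_da
    of at least one experimental mass."""
    return [
        calc
        for calc in calculated
        if any(abs(calc - exp) <= tolerance_da for exp in experimental)
    ]


# Illustrative values only, not real data.
calculated = [842.51, 1045.56, 1479.80, 2211.10]
experimental = [842.50, 1479.95, 2211.15]

tight = matching_masses(calculated, experimental, tolerance_da=0.1)
loose = matching_masses(calculated, experimental, tolerance_da=1.0)
print(len(tight), len(loose))
```

Widening the tolerance can only add matches, which is why the weighting described next matters more than the raw count.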

Rather than just counting the number of matching peptides, Mowse uses empirically determined factors to assign a statistical weight to each individual peptide match. The matrix of weighting factors is calculated during the database build stage, as follows:

A frequency factor matrix, F, is created, in which each row represents an interval of 100 Da in peptide mass, and each column an interval of 10 kDa in intact protein mass. As each sequence entry is processed, the appropriate matrix elements f_i,j are incremented so as to accumulate statistics on the size distribution of peptide masses as a function of protein mass. The elements of F are then normalised by dividing the elements of each 10 kDa column by the largest value in that column to give the Mowse factor matrix M:
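As a rough sketch of this build step, the following Python accumulates the frequency counts and normalises each protein-mass column by its largest element. The data layout (a list of protein mass and peptide mass pairs), the sparse dictionary representation, and all names are assumptions made for illustration.

```python
from collections import defaultdict

PEPTIDE_BIN_DA = 100.0      # row width: 100 Da of peptide mass
PROTEIN_BIN_DA = 10_000.0   # column width: 10 kDa of intact protein mass


def build_mowse_factors(entries):
    """entries: iterable of (protein_mass, [peptide_masses]).
    Returns a sparse matrix {(i, j): m_ij} of normalised frequencies."""
    freq = defaultdict(int)
    for protein_mass, peptide_masses in entries:
        j = int(protein_mass // PROTEIN_BIN_DA)
        for mass in peptide_masses:
            i = int(mass // PEPTIDE_BIN_DA)
            freq[(i, j)] += 1

    # Largest count in each protein-mass column.
    col_max = defaultdict(int)
    for (i, j), count in freq.items():
        col_max[j] = max(col_max[j], count)

    # Divide every element by the maximum of its own column.
    return {(i, j): count / col_max[j] for (i, j), count in freq.items()}


# Illustrative entries only.
entries = [
    (26_000.0, [850.0, 860.0, 1450.0]),
    (27_500.0, [880.0, 2210.0]),
    (54_000.0, [1020.0]),
]
factors = build_mowse_factors(entries)
```

After normalisation, the most common peptide-mass interval in each column has a factor of 1, and rarer intervals have smaller factors.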

$$m_{i,j} = \frac{f_{i,j}}{\max_{k} f_{k,j}}$$

After searching the experimental mass values against a calculated peptide mass database, the score for each entry is calculated according to:

$$\mathrm{Score} = \frac{50\,000}{M_{Prot} \times \prod_{n} m_{i,j}}$$

Where M_Prot is the molecular weight of the entry and the product term is calculated from the Mowse factor elements for each match between the experimental data and peptide masses calculated from the entry.
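A minimal sketch of the score calculation follows. It assumes the commonly quoted form of the Mowse score, a constant of 50,000 divided by the product of the protein molecular weight and the Mowse factors of the matched peptides. The constant, the function name, and the example values are assumptions for illustration.

```python
import math


def mowse_score(protein_mass, matched_factors, scale=50_000.0):
    """Score one database entry.

    protein_mass    -- molecular weight of the entry (M_Prot), in Da
    matched_factors -- Mowse factor m_ij for each matched peptide mass
    scale           -- assumed constant numerator (hypothetical default)
    """
    product = math.prod(matched_factors)  # an empty product is 1
    return scale / (protein_mass * product)


# Two matches, one in a common mass interval and one in a rarer interval.
score = mowse_score(26_000.0, [0.5, 0.25])
```

Because the factors are at most 1, each additional match can only raise the score, and matches in rare peptide-mass intervals raise it the most.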

Probability Based Scoring

Mascot incorporates a probability based implementation of the Mowse algorithm. The Mowse algorithm is an excellent starting point because it accurately models the behaviour of a proteolytic enzyme. By casting the Mowse score into a probabilistic framework, we gain a number of additional benefits:

A simple rule can be used to judge whether a result is significant or not.
Different types of matching (peptide masses and fragment ions) can be combined in a single search.
Scores from different searches and on different databases can be compared.
Search parameters can be optimised more readily by iteration.

Matches using mass values (either peptide masses or MS/MS fragment ion masses) are always handled on a probabilistic basis. The total score is the absolute probability that the observed match is a random event. Reporting probabilities directly can be confusing, partly because they encompass a very wide range of magnitudes, and partly because a "high" score is a "low" probability, which can be ambiguous. For this reason, we report scores as -10*LOG₁₀(P), where P is the absolute probability. A probability of 10^-20 thus becomes a score of 200.
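The conversion between probability and reported score is simple enough to show directly. The function names below are illustrative only.

```python
import math


def probability_to_score(p):
    """Reported score for an absolute probability p: -10 * log10(p)."""
    return -10.0 * math.log10(p)


def score_to_probability(score):
    """Inverse conversion: the probability that corresponds to a score."""
    return 10.0 ** (-score / 10.0)


score = probability_to_score(1e-20)   # the example from the text, about 200
p = score_to_probability(score)       # back to about 1e-20
```

Every increase of 10 in the score corresponds to a tenfold decrease in the probability that the match is random.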

Significance Level

Given an absolute probability that a match is random, and knowing the size of the sequence database being searched, it becomes possible to provide an objective measure of the significance of a result. A commonly accepted threshold is that an event is significant if it would be expected to occur at random with a frequency of less than 5%. This is the value which is reported on the master results page.

The master results page for a typical peptide mass fingerprint search reports that "Scores greater than 67 are significant (p<0.05)". The histogram of the score distribution looks like this:

The protein with the high score of 108 is a 26 kDa heat shock protein from yeast. This is a nice result because the highest score is highly significant, leaving little room for doubt.

(It may be useful to think of the score histogram as a highly magnified view of the extreme tail of the distribution of scores for all the entries in the sequence database. In this case, the histogram covers 50 entries out of 257,964. Scores in the green region are inside this tail, and are of no significance. A real match, which is a non-random event, gives a score which is well clear of the tail.)
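The numbers in this example are consistent with a simple correction for database size: a match is significant if its probability is below 0.05 divided by the number of entries. The sketch below rests on that assumption. The exact calculation used to produce the reported threshold may differ, for example in how many entries it counts.

```python
import math


def significance_threshold(n_entries, alpha=0.05):
    """Score above which a match would be expected at random less than
    alpha of the time, assuming n_entries independent random trials."""
    return -10.0 * math.log10(alpha / n_entries)


threshold = significance_threshold(257_964)
print(round(threshold))  # rounds to 67, the value quoted on the results page
```

Under this assumption, a tenfold larger database raises the threshold by 10.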

It is important to distinguish between a significant match and the best match. Ideally, the correct match is both the best match and a significant match. However, significance is a function of data quality. It may be that there are just not enough mass values, or that the mass measurement accuracy is not good enough, to get a significant match. This doesn't mean that the best match isn't correct; it just means that you must study the result more critically.

To illustrate the difference between a significant match and a correct match, try repeating the search in the example, but with the mass tolerance increased from ±0.1 Da to ±1.0 Da. The discrimination of the search is greatly reduced, and the score for the correct match falls close to the significance level:

The best match is still correct, but it is barely significant. If we did 20 such searches, we could expect one of them to give a score this high by chance alone, because there is such a huge number of entries in the sequence database. The correct match remains at the top of the hit list even when the mass tolerance is increased to ±2.0 Da, but because the score is well below the significance threshold, there could be no confidence in this match if it were an unknown. Increase the mass tolerance to ±2.5 Da, and the match is finally lost. The highest score is 48 for a random match.

Even if this were an unknown, it is clear from the significance level that this is not a useful match, and there is no danger of this result becoming a false positive.

Expectation Values

Each protein score in a peptide mass fingerprint, and each ions score in an MS/MS search, is accompanied by an expectation value. This is the number of matches with equal or better scores that are expected to occur by chance alone. It is directly equivalent to the E-value in a Blast search result. For a score that is exactly on the default significance threshold (p<0.05), the expectation value is also 0.05. Increase the score by 10 and the expectation value drops to 0.005. The lower the expectation value, the more significant the score.
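The relationship described here, 0.05 at the threshold and a factor of ten for every 10 score units, can be written as a one-line function. This is a sketch derived from those two statements. The function and parameter names are assumptions.

```python
def expectation_value(score, threshold, alpha=0.05):
    """Expected number of chance matches scoring at least `score`,
    given the score `threshold` that corresponds to p < alpha."""
    return alpha * 10.0 ** ((threshold - score) / 10.0)


at_threshold = expectation_value(67.0, threshold=67.0)   # 0.05
ten_above = expectation_value(77.0, threshold=67.0)      # about 0.005
top_hit = expectation_value(108.0, threshold=67.0)       # well below 0.05
```

With the threshold of 67 from the earlier example, the top score of 108 gives an expectation value of a few parts per million, which is why that result leaves little room for doubt.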

Mass Tolerances

If the number of matched mass values is constant, the score in a peptide mass fingerprint will be inversely related to the mass tolerance, as shown in the example above. This is not the case for an MS/MS ions search, where increasing the peptide mass tolerance will have no effect on the ions score. This is because the ions score comes from the MS/MS fragment ion matches. Opening up the peptide mass tolerance means that Mascot has to test many more peptides, so the search takes longer and the discrimination is reduced, but the ions score remains unchanged.

Of course, if the peptide mass tolerance is set too tightly, in an effort to improve discrimination, one or more of the peptide matches may be lost, which will dramatically reduce the overall score.

Limitations

Like any statistical approach, probability based scoring depends on assumptions and models.

One of these assumptions is that the entries in the sequence databases are random sequences. This is not always a good assumption. Some of the most glaring examples involve extended repeats, such as AAC62527, porcine submaxillary apomucin. Although the molecular weight of this protein is 1.2 MDa, over 80% of the sequence is composed of an identical 7 kDa repeat. It is difficult to know how to treat such cases. If a single experimental peptide mass is allowed to match to multiple calculated masses, then a single experimental mass which matches within a repeat will give a huge and meaningless score. But, if duplicate matches are not permitted, it will be virtually impossible to get a match to such a protein because the number of measurable mass values is too small to give a statistically significant score.

Another assumption is that the experimental measurements are independent determinations. This will not be true if the data include multiple mass values for the same peptide, even if these are from ions with different charge states in an electrospray LC-MS run. Good peak detection and thresholding (in both mass and time domains for LC-MS) are essential for any scoring algorithm to give meaningful results.
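One way to restore independence is to convert each observed m/z to a neutral mass and merge values that agree within a small window, so that each peptide contributes a single mass value. The sketch below assumes positive-mode protonated ions and an arbitrary merge tolerance. It is not a description of any particular peak-picking tool.

```python
PROTON_MASS_DA = 1.007276  # mass of a proton, in Da


def neutral_mass(mz, charge):
    """Neutral peptide mass from an observed m/z and charge state."""
    return (mz - PROTON_MASS_DA) * charge


def collapse_charge_states(peaks, tolerance_da=0.05):
    """peaks: list of (mz, charge). Returns sorted neutral masses with
    values closer than tolerance_da to the last kept mass dropped."""
    unique = []
    for mass in sorted(neutral_mass(mz, z) for mz, z in peaks):
        if not unique or mass - unique[-1] > tolerance_da:
            unique.append(mass)
    return unique


# The first two peaks are the 2+ and 3+ ions of the same peptide.
peaks = [(740.4073, 2), (493.9406, 3), (421.7585, 2)]
masses = collapse_charge_states(peaks)
print(len(masses))  # two independent mass values remain
```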

Sequence Query Scoring

Amino acid sequence or composition information, if included as seq(…) or comp(…) qualifiers, is treated as a filter on the candidate sequences. Ambiguous sequence or composition data can be used (in a manner similar to a regular expression search in computing) but it still functions as a filter, not a probabilistic match of the type found in a Blast or Fasta search.
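The filtering behaviour can be illustrated with an ordinary regular expression. This is an analogy only: the pattern syntax below is Python's, not the syntax accepted inside a seq(…) qualifier, and the candidate peptides are invented.

```python
import re


def filter_candidates(candidates, pattern):
    """Keep only the candidate sequences that contain the pattern.
    A candidate either passes or is discarded; no score is assigned."""
    regex = re.compile(pattern)
    return [seq for seq in candidates if regex.search(seq)]


# [IL] expresses an ambiguity: isoleucine and leucine have the same mass.
candidates = ["ALDELLK", "AIDEILK", "GVDEVAK"]
kept = filter_candidates(candidates, r"[IL]DE[IL]")
```

The point of the sketch is that a filter is all-or-nothing: a candidate that fails the pattern is removed, however well its masses match.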

In contrast, tag(…) and etag(…) qualifiers are scored probabilistically. That is, the more qualifiers that match, the higher the score, but not all qualifiers are required to match.