Data File Format
A Mascot data file is a plain text (ASCII) file containing peak list information and,
optionally, search parameters.
For a Peptide Mass Fingerprint, the file should contain a list
of peptide mass values, one per line, optionally followed by white space
and a peak area or intensity value.
The peak list formats of a wide range of instrument data systems are
directly compatible with these requirements. In addition, Mascot will automatically
recognise the following formats:
For an MS/MS Ions Search, the data file must contain one or more
MS/MS peak lists. In the Mascot generic format (MGF), each MS/MS
dataset is a list of pairs of mass and intensity values, delimited by
BEGIN IONS and END IONS statements. The following
formats are also supported for MS/MS data:
A data file may include embedded search parameters.
Most embedded parameters can only appear once, at the head of the data file.
In a Mascot generic format file, a few parameters can appear within an MS/MS
dataset.
If there is a conflict between the values of the embedded parameters and values
entered into search form fields, the embedded parameters always take precedence.
The search form fields are essentially defaults for values missing from the
data file.
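The precedence rule can be pictured as a dictionary merge. The following is only a sketch of the rule as described above, not Mascot's own code; the function name effective_parameters and the sample values are hypothetical.

```python
def effective_parameters(form_fields, embedded):
    """Combine search form values with embedded parameters.

    Form values act as defaults; an embedded value wins wherever
    the same parameter is present in both places.
    """
    merged = dict(form_fields)   # start from the form defaults
    merged.update(embedded)      # embedded parameters take precedence
    return merged


# Hypothetical values, for illustration only
form_fields = {"CLE": "Trypsin", "CHARGE": "2+", "PFA": "0"}
embedded = {"CLE": "Lys-C", "PFA": "1"}
print(effective_parameters(form_fields, embedded))
```

CHARGE is missing from the data file here, so the form value is kept, while CLE and PFA come from the data file.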
The following paragraphs illustrate the data file formats by means of examples.
The rules which Mascot follows when parsing a data file provide an
alternative description of what is and is not acceptable.
The Mascot generic format for a data file submitted to Mascot is as follows (square brackets
indicate optional elements and should not be included in an actual data file):
[Embedded Parameter(s)]
Query 1
[Query 2]
.
.
.
[Query N]
Blank lines can be used anywhere, to improve readability.
Comment lines beginning with
one of the symbols #, ;, ! or / can be included, but only outside of the
BEGIN IONS and END IONS statements that delimit an MS/MS dataset.
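As an illustration of the blank line and comment rules, here is a minimal sketch of a pre-processing step. The name strip_comments is hypothetical, and the sketch assumes that the BEGIN IONS and END IONS statements are matched exactly as written.

```python
COMMENT_CHARS = "#;!/"


def strip_comments(lines):
    """Yield the meaningful lines of a data file.

    Blank lines are dropped everywhere. Comment lines are dropped only
    outside BEGIN IONS ... END IONS; inside an MS/MS dataset every
    non-blank line is passed through unchanged, because comments are
    not permitted there and a later stage should decide what to do.
    """
    inside_ions = False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            inside_ions = True
        elif line == "END IONS":
            inside_ions = False
        elif not inside_ions and line[0] in COMMENT_CHARS:
            continue
        yield line
```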
Peptide Mass Fingerprint
In the case of a Peptide Mass Fingerprint, each query is just a single peptide
mass value, with an optional second value for peak area or intensity. For example:
764.2
1231.0
1284
1944.8
2020.2
2100.35
Or
764.2 2010
1231.0 2345
1284 456
1944.8 1012
2020.2 23
2100.35 566
If your MS data system outputs additional values on each line,
these will be ignored.
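A reader for such lines might look like the following sketch. The function name parse_pmf_line is hypothetical; the code simply takes the first value as the mass, the second (if present) as the intensity, and ignores anything after that.

```python
def parse_pmf_line(line):
    """Parse one peptide mass fingerprint query line.

    Returns (mass, intensity); intensity is None when the line holds
    only a mass. Any further values on the line are ignored.
    """
    fields = line.split()
    mass = float(fields[0])
    intensity = float(fields[1]) if len(fields) > 1 else None
    return mass, intensity


for line in ["764.2", "1284 456", "2100.35 566 17 extra"]:
    print(parse_pmf_line(line))
```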
There are two ways to change the default search parameters: by using the search form
fields, or by placing embedded parameters at the beginning of the data file.
For example:
COM=Digest #A6345
CLE=Lys-C
CHARGE=1+
PFA=1
764.2 2010
1231.0 2345
1284 456
1944.8 1012
2020.2 23
2100.35 566
The embedded parameters (COM, CLE, CHARGE, PFA) override the entries in the
corresponding form fields, if any. All of the other search parameters default to
the search form settings.
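Separating the embedded parameters from the queries can be sketched as follows. The name split_header and the regular expression are assumptions made for illustration: a parameter is taken to be a label made of letters, digits and underscores, followed directly by an = symbol, and parameters are only recognised before the first query line.

```python
import re

# Assumed shape of a parameter line: LABEL=value, no space before '='
PARAM_RE = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)=(.*)")


def split_header(lines):
    """Split a peptide mass fingerprint file into (parameters, queries)."""
    params, queries = {}, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = PARAM_RE.fullmatch(line)
        if match and not queries:
            # Still at the head of the file: record the parameter
            params[match.group(1)] = match.group(2)
        else:
            queries.append(line)
    return params, queries


example = """COM=Digest #A6345
CLE=Lys-C
CHARGE=1+
PFA=1
764.2 2010
1231.0 2345
"""
params, queries = split_header(example.splitlines())
print(params)
print(queries)
```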
A peptide mass fingerprint data file can only contain peptide mass fingerprint queries.
Sequence queries or MS/MS datasets are not permitted.
MS/MS Ions Search
For an MS/MS
Ions Search, each query represents a complete MS/MS spectrum, and is
delimited by a pair of statements: BEGIN IONS and END IONS.
The search form defaults can be overridden
by including embedded parameters at the beginning of the data file. Parameters specified in the
search form or the data file header apply to the entire search.
Within each MS/MS query, the mass of the precursor peptide must be specified using the
PEPMASS parameter. Additional parameters within
each query are optional, and can be used to specify:
- TITLE for spectrum identification
- CHARGE state of the precursor peptide
- TOL peptide tolerance
- TOLU peptide tolerance units
- SEQ for a sequence qualifier (multiple SEQ qualifiers are allowed)
- COMP for a composition qualifier (only one COMP qualifier is allowed)
- TAG for a sequence tag (multiple TAG qualifiers are allowed)
- ETAG for an error tolerant sequence tag (multiple ETAG qualifiers are allowed)
- SCANS scan number or range
- RTINSECONDS retention time or range (in seconds)
- INSTRUMENT ion series to be matched
- IT_MODS variable modifications
Parameters within an MS/MS query only apply locally, to the one spectrum. In the case of
the CHARGE parameter, this means that you can have a global CHARGE setting, either from
the search form or from a parameter at the head of the data file, as well as a local
setting in one or more of the MS/MS queries. This can be useful if the mass spectrometer
data system cannot always determine precursor charge state correctly. For example, the global
setting could be 2+ and 3+. When an unambiguous charge state can be determined,
the correct charge is written to the local CHARGE parameter.
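The interplay of global and local CHARGE settings can be sketched like this. The function names are hypothetical, and only positive charge values written in the style used on this page (for example 2+ and 3+) are handled.

```python
import re


def parse_charges(text):
    """Turn a CHARGE value such as '2+ and 3+' into [2, 3]."""
    return [int(n) for n in re.findall(r"(\d+)\+", text)]


def charges_for_query(global_charge, local_params):
    """Return the charge states to consider for one MS/MS query.

    A local CHARGE parameter, when present, replaces the global
    setting for that one spectrum only.
    """
    text = local_params.get("CHARGE", global_charge)
    return parse_charges(text)


print(charges_for_query("2+ and 3+", {}))                # no local setting
print(charges_for_query("2+ and 3+", {"CHARGE": "3+"}))  # local setting wins
```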
Parameters within an MS/MS query must always be at the beginning, immediately following
the BEGIN IONS tag.
They cannot appear within or following the fragment ion list. For example:
COM=10 pmol digest of Sample X15
ITOL=1
ITOLU=Da
MODS=Carbamidomethyl (C)
IT_MODS=Oxidation (M)
MASS=Monoisotopic
USERNAME=Lou Scene
USEREMAIL=leu@altered-state.edu
CHARGE=2+ and 3+
BEGIN IONS
TITLE=Spectrum 1
PEPMASS=983.6
846.60 73
846.80 44
847.60 67
.
.
.
1640.10 291
1640.60 54
1895.50 49
END IONS
BEGIN IONS
TITLE=Spectrum 2
PEPMASS=1084.9
SCANS=3
RTINSECONDS=25
345.10 237
370.20 128
460.20 108
.
.
.
1673.30 1007
1674.00 974
1675.30 79
END IONS
BEGIN IONS
TITLE=Spectrum 3
PEPMASS=1244.7
SCANS=4-5,7,29-34
RTINSECONDS=26-27,40,95-97
.
.
.
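The layout shown above can be read with a small parser along the following lines. This is a sketch under stated assumptions rather than a description of how Mascot itself parses the file: the name parse_mgf and the returned structure are hypothetical, every fragment line is assumed to hold a mass and an intensity, and header lines that are not parameters are simply skipped.

```python
def parse_mgf(lines):
    """Parse Mascot generic format text into (header, spectra).

    header  : dict of parameters found before the first BEGIN IONS
    spectra : list of {"params": dict, "peaks": [(mass, intensity), ...]}
    """
    header, spectra = {}, []
    current = None  # spectrum being read, or None outside BEGIN/END IONS
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is None:
                raise ValueError("END IONS without BEGIN IONS")
            if "PEPMASS" not in current["params"]:
                raise ValueError("MS/MS query without PEPMASS")
            spectra.append(current)
            current = None
        elif current is None:
            if line[0] in "#;!/":
                continue  # comment line outside an MS/MS dataset
            name, sep, value = line.partition("=")
            if sep:
                header[name] = value
        elif "=" in line:
            # Local parameters must precede the fragment ion list
            if current["peaks"]:
                raise ValueError("parameter after fragment ions: " + line)
            name, _, value = line.partition("=")
            current["params"][name] = value
        else:
            fields = line.split()
            current["peaks"].append((float(fields[0]), float(fields[1])))
    return header, spectra
```

Raising an error for a parameter that follows a fragment line is one way to enforce the rule that local parameters come immediately after BEGIN IONS.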
Most MS data systems have some form of ASCII peak list export, which will generate
a file requiring only minor modification with a text editor to conform to this
format.
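Where editing by hand is impractical, an exported peak list can be wrapped programmatically. The sketch below writes a single MS/MS query; the function name write_mgf_query and its arguments are hypothetical, and only the TITLE, PEPMASS and CHARGE parameters described above are emitted.

```python
def write_mgf_query(title, pepmass, peaks, charge=None):
    """Format one spectrum as a Mascot generic format MS/MS query.

    peaks is a sequence of (mass, intensity) pairs. charge, if given,
    is a positive integer and is written as a local CHARGE parameter.
    """
    lines = ["BEGIN IONS", f"TITLE={title}", f"PEPMASS={pepmass}"]
    if charge is not None:
        lines.append(f"CHARGE={charge}+")
    # Parameters first, then the fragment ion list
    lines.extend(f"{mass} {intensity}" for mass, intensity in peaks)
    lines.append("END IONS")
    return "\n".join(lines) + "\n"


print(write_mgf_query("Spectrum 1", 983.6, [(846.6, 73), (846.8, 44)], charge=2))
```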
Fragment ion intensity information is very important. Mascot will iteratively select
sub-sets of the most intense peaks,
looking for the group which most clearly discriminates the score of the top matched
protein. Precursor intensity can be specified by including a second value on the
PEPMASS line, delimited by white space.
N.B. There is an upper limit of 10,000 peaks per individual MS/MS spectrum.
If you see an error message reporting that this limit has been exceeded, it almost certainly
means that your data are profile data, and not peak lists. Mascot has a peptide mass limit of
16 kDa or 255 residues. This makes it very unlikely that a single MS/MS spectrum
could ever contain more than 1000 genuine peaks, never mind 10,000.
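A data file can be checked for this problem before it is submitted. The sketch below is a hypothetical pre-flight check: the 10,000 limit is the one stated above, while the 1000-peak warning threshold is only a heuristic borrowed from the remark about genuine peaks and is not something Mascot is known to enforce.

```python
MAX_PEAKS = 10_000       # upper limit per MS/MS spectrum, as stated above
SUSPICIOUS_PEAKS = 1000  # assumed warning level, not a Mascot setting


def classify_peak_count(n_peaks):
    """Return 'over limit', 'suspicious' or 'ok' for a spectrum."""
    if n_peaks > MAX_PEAKS:
        return "over limit"   # would be rejected; probably profile data
    if n_peaks > SUSPICIOUS_PEAKS:
        return "suspicious"   # accepted, but worth checking peak picking
    return "ok"


for n in (350, 4200, 25000):
    print(n, classify_peak_count(n))
```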
It is possible for an MS/MS ions search data file in the Mascot generic format to include
sequence queries and peptide mass fingerprint queries.
This is not allowed if the file contains MS/MS data in a proprietary format.
Mixing proprietary formats within one file is not allowed either.
Here is a rather baroque example:
# following lines define parameters.
# NB no spaces allowed on either side of the = symbol
COM=My favourite protein has been eaten by an enzyme
CLE=Trypsin
CHARGE=2+
# following line will be treated as a peptide mass
1024.6
# following line is a sequence query, which must
# conform precisely to sequence query syntax rules
2321 seq(n-ACTL) comp(2[C])
# so is this
1896 ions(345.6:24.7,347.8:45.4, ... ,1024.7:18.7)
# An MS/MS ions query is delimited by the tags
# BEGIN IONS and END IONS. Space(s)
# are used to separate mass and intensity values
BEGIN IONS
TITLE=The first peptide - dodgy peak detection, so extra wide tolerance
PEPMASS=896.05 25674.3
CHARGE=3+
TOL=3
TOLU=Da
SEQ=n-AC[DHK]
COMP=2[H]0[M]3[DE]*[K]
240.1 3
242.1 12
245.2 32
.
.
.
1623.7 55
1624.7 23
END IONS
Search parameters can be embedded into the data file or entered in the search form query window using
the following parameter labels. In the absence of an embedded parameter, the default value is
the setting of the corresponding search form field.
The FORMAT parameter is used to identify proprietary MS/MS dataset
formats. It can appear once only, at the start of the file. If there is no FORMAT parameter,
the default is Mascot generic format (MGF).
If the peak list format is not MGF, then parameters can only
appear once, in the data file header, before the peak list begins.
For an MGF peak list, parameters with a tick in the Header column of the table below can
appear in the header and those with a tick in the Local column can appear in the local scope of
a single MS/MS query (spectrum), that is, after the BEGIN IONS line and before the fragment mass and
intensity values.
|