Data File Format

A Mascot data file is a plain text (ASCII) file containing peak list information and, optionally, search parameters.

For a Peptide Mass Fingerprint, the file should contain a list of peptide mass values, one per line, optionally followed by white space and a peak area or intensity value. The peak list formats of a wide range of instrument data systems are directly compatible with these requirements. In addition, Mascot will automatically recognise the following formats:

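For an MS/MS Ions Search, the data file must contain one or more MS/MS peak lists. In the Mascot generic format (MGF), each MS/MS dataset is a list of pairs of mass and intensity values, delimited by BEGIN IONS and END IONS statements. The following formats, each described later on this page, are also supported for MS/MS data: Finnigan (.ASC), Sequest (.DTA), Micromass (.PKL), PerSeptive (.PKS), Sciex API III, Bruker (.XML) and mzData (.XML).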

A data file may include embedded search parameters. Most embedded parameters can only appear once, at the head of the data file. In a Mascot generic format file, a few parameters can appear within an MS/MS dataset.

If there is a conflict between the values of the embedded parameters and values entered into search form fields, the embedded parameters always take precedence. The search form fields are essentially defaults for values missing from the data file.

The following paragraphs illustrate the data file formats by means of examples. The rules which Mascot follows when parsing a data file provide an alternative description of what is and is not acceptable.

Mascot Generic Format

The Mascot generic format for a data file submitted to Mascot is shown below. Square brackets indicate optional elements and should not be included in an actual data file.

[Embedded Parameter(s)]
 Query 1
[Query 2]
 .
 .
 .
[Query N]

Blank lines can be used anywhere, to improve readability.

Comment lines beginning with one of the symbols #;!/ can be included, but only outside of the BEGIN IONS and END IONS statements that delimit an MS/MS dataset.

Peptide Mass Fingerprint

In the case of a Peptide Mass Fingerprint, each query is just a single peptide mass value, with an optional second value for peak area or intensity. For example:

764.2
1231.0
1284
1944.8
2020.2
2100.35

Or, with a peak area or intensity value on each line:

764.2 2010
1231.0 2345
1284 456
1944.8 1012
2020.2 23
2100.35 566

If your MS data system outputs additional values on each line, these will be ignored.
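
The relaxed line syntax described above can be sketched in Python as follows. This is only an illustration, not Mascot's own parser: the function name parse_pmf_line is hypothetical, and the sketch assumes that a non-numeric second field simply means no intensity was supplied.

```python
def parse_pmf_line(line):
    """Return (mass, intensity or None) for a peak line, or None otherwise."""
    fields = line.split()
    if not fields:
        return None  # blank line
    try:
        mass = float(fields[0])
    except ValueError:
        return None  # not a peak line, e.g. an embedded parameter
    intensity = None
    if len(fields) > 1:
        try:
            intensity = float(fields[1])
        except ValueError:
            pass  # assumption: a non-numeric second field means no intensity
    # Any further values on the line are ignored.
    return mass, intensity


for text in ["764.2 2010", "1284", "2100.35 566 extra"]:
    print(parse_pmf_line(text))
```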

There are two ways to change default search parameters. One way is using the search form fields. The other is to place embedded parameters at the beginning of the data file. For example:

COM=Digest #A6345
CLE=Lys-C
CHARGE=1+
PFA=1
764.2 2010
1231.0 2345
1284 456
1944.8 1012
2020.2 23
2100.35 566

The embedded parameters (COM, CLE, CHARGE, PFA) over-ride the entries in the corresponding form fields, if any. All of the other search parameters default to the search form settings.
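
As a rough Python sketch of this precedence, the header parameters can be read into a dictionary and laid over the form settings. The function names and the form_fields dictionary are hypothetical, chosen for illustration; Mascot's internal handling is not documented here.

```python
def read_header_parameters(lines):
    """Collect LABEL=value lines that precede the first peak line."""
    params = {}
    for line in lines:
        text = line.rstrip("\n")
        if not text.strip() or text[0] in "#;!/":
            continue  # blank lines and comment lines are skipped
        if text.lstrip()[0].isdigit():
            break  # a line starting with a number: the peak list begins
        label, sep, value = text.partition("=")
        if sep:
            params[label.upper()] = value  # labels are not case sensitive
    return params


def effective_parameters(form_fields, embedded):
    """Embedded parameters take precedence over search form fields."""
    merged = dict(form_fields)
    merged.update(embedded)
    return merged


data_file = ["COM=Digest #A6345", "CLE=Lys-C", "CHARGE=1+", "PFA=1", "764.2 2010"]
form_fields = {"CLE": "Trypsin", "PFA": "0", "TOL": "1.0"}  # hypothetical form values
print(effective_parameters(form_fields, read_header_parameters(data_file)))
```

In this sketch CLE and PFA come from the data file, while TOL, which is not embedded, keeps the form value.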

A peptide mass fingerprint data file can only contain peptide mass fingerprint queries. Sequence queries or MS/MS datasets are not permitted.

MS/MS Ions Search

For an MS/MS Ions Search, each query represents a complete MS/MS spectrum, and is delimited by a pair of statements: BEGIN IONS and END IONS.

The search form defaults can be over-ridden by including embedded parameters at the beginning of the data file. Parameters specified in the search form or the data file header apply to the entire search. Within each MS/MS query, the mass of the precursor peptide must be specified using the PEPMASS parameter. Additional parameters within each query are optional, and can be used to specify:

  • TITLE for spectrum identification
  • CHARGE state of the precursor peptide
  • TOL peptide tolerance
  • TOLU peptide tolerance units
  • SEQ for a sequence qualifier (multiple SEQ qualifiers are allowed)
  • COMP for a composition qualifier (only one COMP qualifier is allowed)
  • TAG for a sequence tag (multiple TAG qualifiers are allowed)
  • ETAG for an error tolerant sequence tag (multiple ETAG qualifiers are allowed)
  • SCANS scan number or range
  • RTINSECONDS retention time or range (in seconds)
  • INSTRUMENT ion series to be matched
  • IT_MODS variable modifications

Parameters within an MS/MS query only apply locally, to the one spectrum. In the case of the CHARGE parameter, this means that you can have a global CHARGE setting, either from the search form or from a parameter at the head of the data file, as well as a local setting in one or more of the MS/MS queries. This can be useful if the mass spectrometer data system cannot always determine precursor charge state correctly. For example, the global setting could be 2+ and 3+. When an unambiguous charge state can be determined, the correct charge is written to the local CHARGE parameter.
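
The interplay of global and local CHARGE settings might be sketched in Python like this. Both function names are hypothetical, and the parsing of values such as "2+ and 3+" is an assumption based on the examples on this page.

```python
import re


def charge_for_query(global_charge, local_params):
    """Use the local CHARGE if the query has one, else the global setting."""
    return local_params.get("CHARGE", global_charge)


def parse_charge(value):
    """Turn a CHARGE value such as '2+ and 3+' or '1-' into signed integers."""
    charges = []
    for digits, sign in re.findall(r"(\d+)([+-])", value):
        charges.append(int(digits) if sign == "+" else -int(digits))
    return charges


global_charge = "2+ and 3+"
print(parse_charge(charge_for_query(global_charge, {})))                # no local setting
print(parse_charge(charge_for_query(global_charge, {"CHARGE": "3+"})))  # local setting wins
```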

Parameters within an MS/MS query must always be at the beginning, immediately following the BEGIN IONS tag. They cannot appear within or following the fragment ion list. For example:

COM=10 pmol digest of Sample X15
ITOL=1
ITOLU=Da
MODS=Carbamidomethyl (C)
IT_MODS=Oxidation (M)
MASS=Monoisotopic
USERNAME=Lou Scene
USEREMAIL=leu@altered-state.edu
CHARGE=2+ and 3+

BEGIN IONS
TITLE=Spectrum 1
PEPMASS=983.6
846.60 73
846.80 44
847.60 67
.
.
.
1640.10 291
1640.60 54
1895.50 49
END IONS

BEGIN IONS
TITLE=Spectrum 2
PEPMASS=1084.9
SCANS=3
RTINSECONDS=25
345.10 237
370.20 128
460.20 108
.
.
.
1673.30 1007
1674.00 974
1675.30 79
END IONS

BEGIN IONS
TITLE=Spectrum 3
PEPMASS=1244.7
SCANS=4-5,7,29-34
RTINSECONDS=26-27,40,95-97
.
.
.

Most MS data systems have some form of ASCII peak list export, which will generate a file requiring only minor modification with a text editor to conform to this format.
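
Where a text editor is impractical, a short script can do the same job. The following Python sketch writes one MS/MS query in the layout shown above; format_mgf_query is a hypothetical helper, and only TITLE, PEPMASS and CHARGE are emitted.

```python
def format_mgf_query(title, pepmass, peaks, charge=None):
    """Return one MS/MS query as Mascot generic format text."""
    lines = ["BEGIN IONS", f"TITLE={title}", f"PEPMASS={pepmass}"]
    if charge is not None:
        lines.append(f"CHARGE={charge}")
    # Local parameters come first, followed by the fragment ion list.
    lines.extend(f"{mz} {intensity}" for mz, intensity in peaks)
    lines.append("END IONS")
    return "\n".join(lines) + "\n"


print(format_mgf_query("Spectrum 1", 983.6, [(846.6, 73), (846.8, 44)], charge="2+"))
```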

Fragment ion intensity information is very important. Mascot will iteratively select sub-sets of the most intense peaks, looking for the group which most clearly discriminates the score of the top matched protein. Precursor intensity can be specified by including a second value on the PEPMASS line, delimited by white space.

N.B. There is an upper limit of 10,000 peaks per individual MS/MS spectrum. If you see an error message reporting that this limit has been exceeded, it almost certainly means that your data are profile data, and not peak lists. Mascot has a peptide mass limit of 16 kDa or 255 residues. This makes it very unlikely that a single MS/MS spectrum could ever contain more than 1000 genuine peaks, never mind 10,000.

It is possible for an MS/MS ions search data file in the Mascot generic format to include sequence queries and peptide mass fingerprint queries. This is not allowed if the file contains proprietary format MS/MS data, and neither is mixing proprietary formats.

Here is a rather baroque example:

# following lines define parameters. 
# NB no spaces allowed on either side of the = symbol
COM=My favourite protein has been eaten by an enzyme
CLE=Trypsin
CHARGE=2+
# following line will be treated as a peptide mass
1024.6
# following line is a sequence query, which must 
# conform precisely to sequence query syntax rules
2321 seq(n-ACTL) comp(2[C])
# so is this
1896 ions(345.6:24.7,347.8:45.4, ... ,1024.7:18.7)
# An MS/MS ions query is delimited by the tags 
# BEGIN IONS and END IONS. Space(s)  
# are used to separate mass and intensity values
BEGIN IONS
TITLE=The first peptide - dodgy peak detection, so extra wide tolerance
PEPMASS=896.05 25674.3
CHARGE=3+
TOL=3
TOLU=Da
SEQ=n-AC[DHK]
COMP=2[H]0[M]3[DE]*[K]
240.1 3
242.1 12
245.2 32
.
.
.
1623.7 55
1624.7 23
END IONS

Embedded Search Parameters

Search parameters can be embedded into the data file or entered in the search form query window using the following parameter labels. In the absence of an embedded parameter, the default value is the setting of the corresponding search form field.

The FORMAT parameter is used to identify proprietary MS/MS dataset formats. It can appear once only, at the start of the file. If there is no FORMAT parameter, the default is Mascot generic format (MGF).

If the peak list format is not MGF, then parameters can only appear once, in the data file header, before the peak list begins.

For an MGF peak list, parameters with a tick in the Header column of the table below can appear in the header, and those with a tick in the Local column can appear in the local scope of a single MS/MS query (spectrum), that is, after the BEGIN IONS line and before the fragment mass and intensity values.

Name               Description                           Header  Local  Choices/Range                                  Notes
-----------------  ------------------------------------  ------  -----  ---------------------------------------------  --------------------------------------
ACCESSION          Database entries to be searched       ✓              List of double quoted, comma separated values
CHARGE             Peptide charge                        ✓       ✓      1-                                             M-H- on PMF form
                                                                        Mr
                                                                        1+                                             MH+ on PMF form
                                                                        8- to 8+ and combinations                      Not PMF
CLE                Enzyme                                ✓              Trypsin                                        Default
                                                                        etc., as defined in enzymes file
COM                Search title                          ✓                                                             Applies to the whole search
CUTOUT             Precursor removal                     ✓              Pair of comma separated integers               MIS only
COMP               Amino acid composition                        ✓
DB                 Database                              ✓              As defined in mascot.dat
DECOY              Perform decoy search                  ✓              0 (false)                                      Default
                                                                        1 (true)
ERRORTOLERANT      Error tolerant                        ✓              0 (false)                                      Default
                                                                        1 (true)                                       Not PMF
ETAG               Error tolerant sequence tag                   ✓                                                     A single query can have multiple ETAGs
FORMAT             MS/MS data file                       ✓              Mascot generic                                 Default
                                                                        Sequest (.DTA)
                                                                        Finnigan (.ASC)
                                                                        Micromass (.PKL)
                                                                        PerSeptive (.PKS)
                                                                        Sciex API III
                                                                        Bruker (.XML)
                                                                        mzData (.XML)
FRAMES             NA translation                        ✓              Comma separated list of frames                 Default is 1,2,3,4,5,6
INSTRUMENT         MS/MS ion series                      ✓       ✓      Default                                        Default
                                                                        ESI-QUAD-TOF
                                                                        etc., as defined in fragmentation_rules
IT_MODS            Variable Mods                         ✓       ✓      As defined in unimod.xml
ITOL               Fragment ion tol.                     ✓              Unit dependent
ITOLU              Units for ITOL                        ✓              Da
                                                                        mmu
MASS               Mono. or average                      ✓              Monoisotopic
                                                                        Average
MODS               Fixed Mods                            ✓              As defined in unimod.xml
PEP_ISOTOPE_ERROR  Misassigned 13C                       ✓              0 to 2                                         MIS only
PEPMASS            Peptide mass                                  ✓      >100                                           optionally followed by intensity
PFA                Partials                              ✓              integer, 0 to 9                                default 1
PRECURSOR          Precursor m/z                         ✓              >100
QUANTITATION       Quantitation method                   ✓              as defined in quantitation.xml                 MIS only
REPORT             Maximum hits                          ✓              AUTO or integer
REPTYPE            Type of report                        ✓              protein
                                                                        peptide                                        Default for MIS
                                                                        archive                                        MIS only
                                                                        concise                                        Default for PMF
                                                                        select                                         MIS only
                                                                        unassigned                                     MIS only
RTINSECONDS        Retention time or range (in seconds)          ✓      a[[-b][,c[-d]]]                                MIS only
SCANS              Scan number or range                          ✓      v[[-w][,x[-y]]]                                MIS only
SEARCH             Type of search                        ✓              PMF
                                                                        SQ                                             = MIS
                                                                        MIS                                            = SQ
SEG                Protein mass (kDa)                    ✓              Empty or >0
SEQ                Amino acid sequence                           ✓                                                     A single query can have multiple SEQs
TAG                Sequence tag                                  ✓                                                     A single query can have multiple TAGs
TAXONOMY           Taxonomy                              ✓              As defined in taxonomy file
TITLE              Query title                                   ✓                                                     Applies to a single spectrum
TOL                Peptide mass tol.                     ✓       ✓      Unit dependent
TOLU               Units for TOL                         ✓       ✓      %
                                                                        ppm
                                                                        mmu
                                                                        Da
USER00 to USER12   Uncommitted parameters                ✓
USEREMAIL          User email                            ✓
USERNAME           User name                             ✓
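
The range syntax used by SCANS and RTINSECONDS, for example 4-5,7,29-34, could be read with a few lines of Python. This is a sketch under the assumption that all values are non-negative numbers; parse_ranges is a hypothetical name.

```python
def parse_ranges(value):
    """Parse 'a-b,c,d-e' into a list of (start, end) tuples of floats."""
    ranges = []
    for part in value.split(","):
        start, sep, end = part.partition("-")
        low = float(start)
        high = float(end) if sep else low  # a single value is a one-point range
        ranges.append((low, high))
    return ranges


print(parse_ranges("4-5,7,29-34"))     # SCANS from the example above
print(parse_ranges("26-27,40,95-97"))  # RTINSECONDS from the example above
```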

Proprietary MS/MS Peak List Formats

Finnigan (ASC) Files

Files in this format are created by the LIST command on the ICIS data system. The header block for each MS/MS dataset begins with a "LIST:" field. The text in this field is used by Mascot to identify the query, equivalent to an embedded TITLE parameter.

The ASC file header does not specify a charge state for the precursor peptide. This can be specified (globally) on the search form, or by an embedded CHARGE parameter at the head of the data file.

The precursor peptide m/z value is parsed from the "Mode:" field. Mascot uses the prevailing CHARGE value to calculate Mr from the observed m/z.

A blank line to delimit MS/MS datasets is optional.

Example of Finnigan ASC format:

LIST: dp210198b                      21-Jan-98 DERIVED SPECTRUM    #9
Samp: Spot 6483 from Gel 29A44                 Start : 18:37:54   100
Mode: ESI +DAU 808.3 @ 25eV UP LR
Oper: Administrator                            Inlet :
Base: 798.9             Inten : 25525          Masses: 225 > 2000
Norm: 798.9             RIC   : 181489         #peaks: 586
Peak: 1000.00 mmu
Data: +/1>99
                            0
  No.        Mass   Intensity     %RA    %RIC  Flags
    1       229.3           8    0.03    0.00   #
    2       230.3           9    0.04    0.00   #
    3       259.9           8    0.03    0.00   #

{{{{{{{{{{{{{{{{{{{
{ data edited out {
{{{{{{{{{{{{{{{{{{{

  583      1831.0           5    0.02    0.00   #
  584      1878.3           5    0.02    0.00   #
  585      1881.8           8    0.03    0.00   #

LIST: dp210198a                      21-Jan-98 DERIVED SPECTRUM    #9
Samp: Spot 6483 from Gel 29A44                 Start : 18:27:30    95
Mode: ESI +DAU 973.9 @ 25eV AVER UP LR
Oper: Administrator                            Inlet :
Base: 974.5             Inten : 191564         Masses: 270 > 1800
Norm: 974.5             RIC   : 341387         #peaks: 593
Peak: 1000.00 mmu
Data: +/1>95
                            0
  No.        Mass   Intensity     %RA    %RIC  Flags
    1       297.9          10    0.01    0.00   #
    2       326.7           8    0.00    0.00   #
    3       345.1         237    0.12    0.07   #

{{{{{{{{
{ Etc. {
{{{{{{{{
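
For illustration, the precursor m/z could be pulled out of a "Mode:" line with a regular expression, as in the Python sketch below. The rule used here, taking the first number that contains a decimal point, is an assumption that happens to fit the two examples above; the exact rule Mascot applies is not stated on this page.

```python
import re


def precursor_mz_from_mode(line):
    """Return the first decimal number in a 'Mode:' line, or None."""
    if not line.startswith("Mode:"):
        return None
    match = re.search(r"\d+\.\d+", line)  # assumption: m/z is the first decimal number
    return float(match.group()) if match else None


print(precursor_mz_from_mode("Mode: ESI +DAU 808.3 @ 25eV UP LR"))
print(precursor_mz_from_mode("Mode: ESI +DAU 973.9 @ 25eV AVER UP LR"))
```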

Sequest (DTA) Files

Sequest users can create these files from Finnigan LCQ data using the lcq_dta.exe or extract_msn.exe utilities. Further information can be found here.

The DTA format is very simple. The first line contains the singly protonated peptide mass (MH+) and the peptide charge state as a pair of space separated values. Subsequent lines contain space separated pairs of fragment ion m/z and intensity values.

N.B. In a DTA file, the precursor peptide mass is an MH+ value independent of the charge state. In Mascot generic format, the precursor peptide mass is an observed m/z value, from which Mr or [M+nH]n+ is calculated using the prevailing charge state. For example, in Mascot:

PEPMASS=1000
CHARGE=2+

... means that the relative molecular mass Mr is 1998. This is equivalent to a DTA file which starts:

1999 2
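
The arithmetic behind this example can be written out as a short Python sketch. The proton mass constant is an approximate value supplied for illustration, and the function names are hypothetical. Only positive charge states are considered.

```python
PROTON_MASS = 1.00728  # approximate mass of a proton in Da (assumed constant)


def mr_from_mz(mz, charge):
    """Relative molecular mass from an observed m/z and positive charge."""
    return (mz - PROTON_MASS) * charge


def dta_mh_from_mz(mz, charge):
    """Singly protonated mass (MH+), as written on the first line of a DTA file."""
    return mr_from_mz(mz, charge) + PROTON_MASS


print(round(mr_from_mz(1000, 2)))      # 1998
print(round(dta_mh_from_mz(1000, 2)))  # 1999
```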

The DTA format uses the file name to identify the dataset. An example of a file name would be "Myoglobin_digest.0012.0015.3.dta". This corresponds to scans 12 to 15 of an LC-MS run, averaged together, and a peptide charge state of 3+.
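
A file name of this kind can be split into its parts with a few lines of Python. The sketch assumes the name always ends in first scan, last scan, charge and the dta extension, as in the example; parse_dta_name is a hypothetical helper.

```python
def parse_dta_name(filename):
    """Split 'base.first.last.charge.dta' into (base, first, last, charge)."""
    base, first, last, charge, ext = filename.rsplit(".", 4)
    if ext.lower() != "dta":
        raise ValueError(f"not a DTA file name: {filename}")
    return base, int(first), int(last), int(charge)


print(parse_dta_name("Myoglobin_digest.0012.0015.3.dta"))
```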

While it is perfectly possible to submit a native DTA file to Mascot, each file contains only a single MS/MS data set. If you have a series of related datasets, such as from an LC-MS experiment, it is much better to concatenate the DTA files into a single data file so that the queries can be scored and reported collectively.

Remember to include at least one blank line between each MS/MS dataset. A delimiter between datasets is essential because the DTA format is relatively unstructured. Without a delimiter, the first line of a new dataset (peptide mass, charge) might be just another line from the previous dataset (fragment ion mass, intensity).
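
A minimal Python sketch of such a concatenation is shown below. It works on the text of each DTA file rather than on the files themselves, so reading and writing files is left to the caller; concatenate_dta is a hypothetical name.

```python
def concatenate_dta(datasets):
    """Join DTA texts with exactly one blank line between datasets."""
    cleaned = [text.strip("\n") for text in datasets]
    return "\n\n".join(cleaned) + "\n"


first = "1999 2\n345.1 237\n370.2 128\n"   # illustrative values only
second = "1500.2 3\n460.2 108\n"
print(concatenate_dta([first, second]))
```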

Utilities to concatenate DTA files automatically can be downloaded from the Xcalibur help page.

Micromass (PKL) Files

QTof users can export peak list data in either DTA or PKL format using the Micromass ProteinLynx package. Further information can be found here.

The PKL format is similar to the DTA file format, but supports multiple MS/MS datasets in a single file. The first line of a PKL dataset contains the observed m/z, intensity, and charge state of the precursor peptide as a triplet of space separated values. Subsequent lines contain space separated pairs of fragment ion m/z and intensity values.

Multiple MS/MS datasets are delimited by at least one blank line.
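
The PKL layout described above is simple enough to read with a short Python sketch. The function name parse_pkl and the dictionary keys are hypothetical, and the sample values are illustrative only.

```python
def parse_pkl(text):
    """Return a list of datasets parsed from PKL-style text."""
    datasets = []
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            current = None  # a blank line ends the current dataset
            continue
        if current is None:
            # First line of a dataset: precursor m/z, intensity, charge.
            mz, intensity, charge = fields[:3]
            current = {"mz": float(mz), "intensity": float(intensity),
                       "charge": int(charge), "fragments": []}
            datasets.append(current)
        else:
            # Subsequent lines: fragment ion m/z and intensity.
            current["fragments"].append((float(fields[0]), float(fields[1])))
    return datasets


sample = "808.3 25525 2\n229.3 8\n230.3 9\n\n973.9 191564 2\n297.9 10\n"
for dataset in parse_pkl(sample):
    print(dataset["mz"], dataset["charge"], len(dataset["fragments"]))
```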

PerSeptive (.PKS)

PSD peak lists exported from Grams as .PKS files contain data from a single PSD spectrum. Since the .PKS format does not include details of the precursor peptide m/z, this information must be entered manually into the PRECURSOR and CHARGE form fields. This limitation also means that multiple spectra cannot be merged into a single data file.

Example of the .PKS file format:

"Peak Table"
OP=0
Center X   Peak Y   Left X   Right X   Time X  Mass Difference  Name
STD.Misc   Height   Left Y   Right Y   %Height,Width,%Area,%Quan,H/A
818.39992      4265.0000 818.39992 818.39992 81554.550 0          818.3999
C  0.?     0         4265.0000 4265.0000 
820.42154      3765.0000 820.42154 820.42154 81616.547 0          820.4215
C  0.?     0         3765.0000 3765.0000 
842.38252      2571.0000 842.10681 842.62999 82290.021 0          842.3825
C  0.?     0         1800.0000 1800.0000 
{{{{{{{{
{ Etc. {
{{{{{{{{

Sciex API III

Peak lists exported from PE Sciex API III contain data from a single MS/MS spectrum. Since the file format does not include details of the precursor peptide m/z, this information must be entered manually into the PRECURSOR and CHARGE form fields. This limitation also means that multiple spectra cannot be merged into a single data file.

Example of PE Sciex peak list format:

287.50	650	287.5
301.00	1150	301.0
305.00	1150	305.0
315.00	6550	315.0
321.00	16,000	321.0
333.00	3050	333.0
333.50	1800	333.5
370.00	1550	370.0
{{{{{{{{
{ Etc. {
{{{{{{{{
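
Note the value 16,000 in the example, which uses a comma as thousands separator. A Python sketch of reading such lines follows. It assumes that the first column is m/z and the second is intensity, which this page does not state explicitly, and both function names are hypothetical.

```python
def parse_number(text):
    """Parse a US-formatted number: period decimal, optional comma thousands."""
    return float(text.replace(",", ""))


def parse_sciex_line(line):
    """Return (m/z, intensity) from the first two columns of a peak line."""
    fields = line.split()
    return parse_number(fields[0]), parse_number(fields[1])


print(parse_sciex_line("321.00\t16,000\t321.0"))
```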

Bruker (.XML)

Bruker XMASS and flexAnalysis save peak lists in a simple XML format. A DTD or XSD for the format is not publicly available. For each peak, Mascot takes the m/z value from the <mass> element and the intensity from the <absi> element.

The file format for MS/MS does not include details of the precursor peptide m/z, so this information must be entered manually into the PRECURSOR and CHARGE form fields. This limitation also means that multiple spectra cannot be merged into a single data file.
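
Because no schema is published, the following Python sketch makes assumptions about the document structure: the wrapper element names pklist and pk in the sample are hypothetical, and only the <mass> and <absi> elements come from the description above. The code looks for any element that has both as direct children.

```python
import xml.etree.ElementTree as ET


def parse_bruker_peaks(xml_text):
    """Return (m/z, intensity) pairs from <mass> and <absi> child elements."""
    root = ET.fromstring(xml_text)
    peaks = []
    for element in root.iter():
        mass = element.find("mass")
        absi = element.find("absi")
        if mass is not None and absi is not None:
            peaks.append((float(mass.text), float(absi.text)))
    return peaks


# Hypothetical sample: wrapper element names are assumptions.
sample = (
    "<pklist>"
    "<pk><mass>818.4</mass><absi>4265</absi></pk>"
    "<pk><mass>820.4</mass><absi>3765</absi></pk>"
    "</pklist>"
)
print(parse_bruker_peaks(sample))
```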

mzData (.XML)

Mascot supports mzData version 1.05. Follow the link for a schema document and further information.

The Rules

  1. Filename extensions are not significant.
  2. Numeric values must be non-localised US ASCII. That is, the decimal separator must be a period and the thousands separator, if any, must be a comma. Leading white space is acceptable on lines that start with a number.
  3. Parameter labels are not case sensitive. Parameter values may be case sensitive. Case is preserved for parameter values which are free text strings. There must be no leading space before a parameter label and no space either side of the = symbol.
  4. Parameters at the head of the data file apply to the entire search and over-ride the default settings provided by the search form fields.
  5. In the absence of a FORMAT parameter, the default format is Mascot generic.
  6. Mascot generic format permits an MS/MS search to include peptide mass fingerprint queries and sequence queries.
  7. In Mascot generic format, each MS/MS spectrum is delimited by BEGIN IONS and END IONS statements. There is a line for each fragment ion peak, containing an m/z and intensity value, separated by white space. Fragment ion m/z values must be positive, non-zero values. Intensities must be positive values. Any additional values or text are ignored.
  8. Parameters between the BEGIN IONS and END IONS statements only apply to the local MS/MS query. A PEPMASS parameter is required, all others are optional. Parameters within an MS/MS query must appear before the fragment ion data. If an MS/MS query has no fragment ions, it is treated as a PMF query.
  9. Most parameters can only appear at the head of the file, prior to any query data. The exceptions are PEPMASS, TITLE, SCANS, and RTINSECONDS, which can only appear within an MS/MS query block, and CHARGE, INSTRUMENT, IT_MODS, TOL, and TOLU, which can appear in either place. SEQ, COMP, TAG and ETAG can appear within an MS/MS query block or as qualifiers to a mass value using the Sequence Query syntax.
  10. Blank lines can be used anywhere to improve readability.
  11. Lines that start with one of the symbols # ; ! / are comment lines and are ignored. Comments cannot be used between the BEGIN IONS and END IONS statements delimiting an MS/MS query block.
  12. A SEARCH type (PMF, SQ or MIS) must be defined. The default is determined by the search form used to upload the file. Like any other parameter, this can be over-ridden by including a SEARCH parameter in the file header.
  13. A peptide mass fingerprint (PMF) search can only contain PMF queries. This allows for a relaxed syntax in which any line starting with a number is assumed to be a query. The first number is parsed as a peptide m/z value and the second number, if any, is parsed as a peak area or intensity. The rest of the line is ignored. Peptide m/z values must be equivalent to 100 <= Mr <= 16000.
  14. MS/MS searches can contain MS/MS data in proprietary formats only if this is declared with a FORMAT parameter. Mixing proprietary formats, or including non-MS/MS queries in a proprietary format file, is not allowed.
  15. User parameters are any parameters named USER\d\d (where \d is a digit) or any name beginning with an underscore except for the following, which are reserved: _INTEGRA_* _DAEMON_* _DISTILLER_* _SERVER_*. User parameters cannot be used between the BEGIN IONS and END IONS statements delimiting an MS/MS query block.
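
Several of these rules (3, 7, 8, 10 and 11) can be illustrated with a small Python reader for Mascot generic format. This is a sketch, not Mascot's parser: parse_mgf is a hypothetical name, leading white space is tolerated more generously than rule 3 allows, and PMF or sequence queries outside BEGIN IONS and END IONS are silently skipped.

```python
COMMENT_CHARS = "#;!/"


def parse_mgf(text):
    """Return (header parameters, list of MS/MS queries) from MGF text."""
    header, queries, current = {}, [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # rule 10: blank lines are allowed anywhere
        upper = line.upper()
        if upper == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif upper == "END IONS":
            if current is None or "PEPMASS" not in current["params"]:
                raise ValueError("query without PEPMASS")  # rule 8
            queries.append(current)
            current = None
        elif current is None:
            if line[0] in COMMENT_CHARS:
                continue  # rule 11: comments outside query blocks
            label, sep, value = line.partition("=")
            if sep:
                header[label.upper()] = value  # rule 3: labels ignore case
        elif "=" in line and not line[0].isdigit():
            if current["peaks"]:
                raise ValueError("parameter after fragment ions")  # rule 8
            label, _, value = line.partition("=")
            current["params"][label.upper()] = value
        else:
            mz, intensity = line.split()[:2]  # rule 7: extra values ignored
            current["peaks"].append((float(mz), float(intensity)))
    return header, queries


sample = """COM=Test
CHARGE=2+ and 3+

# a comment
BEGIN IONS
TITLE=Spectrum 1
pepmass=983.6
846.60 73
846.80 44
END IONS
"""
header, queries = parse_mgf(sample)
print(header)
print(queries[0]["params"], len(queries[0]["peaks"]))
```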
 
 
Copyright © 2007 Matrix Science Ltd. All Rights Reserved.