Export search results
This utility enables Mascot search results to be exported in a variety of "machine readable" formats. When
used interactively, the file format is chosen and customised using a web browser form, displayed by
choosing Export Search Results in the format controls of a results
report and pressing Format As. In addition, the utility can be executed by scripts, with the options specified
on the command line.
The information contained in the XML and CSV formats is identical. XML is ideal for importing into a relational database.
CSV can be opened in spreadsheets such as Microsoft Excel.
For a Peptide Mass Fingerprint, the result information is structured in a very similar way to a Concise Protein
Summary report. For search results that include MS/MS data, the information is structured in a very similar way
to a Peptide Summary report. That is, peptide matches are assigned to protein hits, and
proteins that contain the same set of peptides, or a sub-set, are grouped together into a single hit.
If the Include same-set protein hits checkbox is clear, then only the primary protein for each hit is exported.
For a PMF, Include same-set protein hits should be checked and Include sub-set protein hits set to 1 to get the
exact equivalent of a Concise Protein Summary report.
Precise details for individual data items, such as the data type and whether it is optional, can be found in the
XML schema. The schema introduced with Mascot 2.1 is mascot_search_results_1.xsd (documentation).
The need to add additional data structures for Mascot 2.2, including quantitation results, would have broken
this schema, so a new schema has been created: mascot_search_results_2.xsd (documentation).
For general XML Schema considerations, see the section further down this page. Documentation was
auto-generated using xs3p.
The CSV file contains
identical data, organised for display as a spreadsheet. The column headers of tables are the same as the XML
element names, but the row headers are plain text words and phrases. (If you need to change the delimiter to
something other than a comma, edit export_dat_2.pl and change the value of $delimiter, near the top of the script.)
Usage
For interactive use, the controls are divided into blocks, with the first block corresponding to the
format controls of a results report.
The Optional Search Information
block controls which ancillary information is exported. Most of the options are self-explanatory.
The data items in the Header section are
- Search title
- Timestamp (W3C Date and Time format, e.g. 2005-03-12T08:29:11Z)
- User
- Email
- Report URI (URL or relative path, if executed at command line)
- MS data path
- Search type
- Mascot version
- Database
- Fasta file
- Total sequences
- Total residues
- Sequences after taxonomy filter
- Number of entries searched in error tolerant mode (if applicable)
- Number of queries
- Warning messages from the search (as required)
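A script that reads the Timestamp item from the header could parse it as follows. This is a minimal sketch that assumes the timestamp always has the form shown in the example above (UTC with a trailing Z and no fractional seconds); parse_header_timestamp is a hypothetical helper name.

```python
from datetime import datetime, timezone


def parse_header_timestamp(text: str) -> datetime:
    """Parse a timestamp such as 2005-03-12T08:29:11Z into an aware UTC datetime."""
    # Assumption: the trailing Z (UTC) is always present, so it is matched literally.
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


stamp = parse_header_timestamp("2005-03-12T08:29:11Z")
print(stamp.isoformat())  # 2005-03-12T08:29:11+00:00
```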
For speed and efficiency, leave the checkboxes marked with asterisks
under Optional Protein Hit Information unchecked. (See Optional Protein Hit
Information for further information on the use of these checkboxes.)
pepXML is the interchange format for database search results used in the
Institute for Systems Biology Trans-Proteomic Pipeline.
The pepXML format is only applicable to MS/MS search results, and represents "raw" peptide match
data. Information is exported for all matches to all queries (MS/MS spectra). For each match,
extensive information is provided for the first protein in which the peptide is found and more limited
information for all the other proteins. This can make the output file very large.
Precise details for individual data items, such as the data type and whether it is optional, can be found in the
XML schema. Schema documentation has been generated by xs3p. For general
XML Schema considerations, see the section further down this page.
Usage
For speed and efficiency, leave all the checkboxes
under Optional Protein Hit Information unchecked. (See Optional Protein Hit
Information for further information on the use of these checkboxes.)
Limitations
- Where elements and attributes are required by the schema, but the data is not available from Mascot,
zero length strings are output. For example, the base_name, raw_data_type and
raw_data attributes of an msms_run_summary element.
- The schema includes extensive information for the first protein in which a peptide match is found, even
though this may not be the preferred or final assignment.
- The amino acid residues that bracket a peptide are only available
if the result file is from Mascot 2.1 or later.
- The num_matched_ions attribute of the search_hit element is the number of mass values
used to score the match, not the total number of mass values that could be matched to
all the calculated ion series.
- In a search_result element, the start_scan and end_scan attributes are always set to 0.
- modification_info elements are only exported for variable modifications, not for fixed.
DTASelect is an application that was written by David L. Tabb
at The Scripps Research Institute. Originally intended for analysing Sequest results, it groups peptide matches
into proteins and allows a variety of filters to be applied. Although DTASelect includes built-in support for Mascot result files,
the information in the result file is not fully utilised and the interface is prone to break with new Mascot releases.
Choosing DTASelect in this export utility creates a
DTASelect intermediate file, DTASelect.txt, containing a more complete picture of the search results. This intermediate file
is then read by DTASelect to create filtered reports.
The output file is compatible with DTASelect 1.9 only. DTASelect format is only applicable to MS/MS search results.
Usage
For speed and efficiency, it is advisable to choose MudPIT scoring, an ions score cut-off of 10, and leave all the checkboxes
under Optional Protein Hit Information unchecked. (See Optional Protein Hit
Information for further information on the use of these checkboxes.) Save the exported file to a directory,
make this the current directory, and execute DTASelect.
The DTASelect spectrum filters, which can be
supplied on the command line or taken from DTASelect.params, should include the following changes to the defaults:
- --Mascot: to set Mascot mode
- -1 10.0: to set the minimum ions score for 1+ peptides to 10
- -2 10.0: to set the minimum ions score for 2+ peptides to 10
- -3 10.0: to set the minimum ions score for 3+ peptides to 10
- -d 20: to set the minimum for (1 / expectation value) to 20
- -p 1: to set the distinct peptide threshold to 1
- --mw 100.0: to set the minimum protein mass to 100
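The filter changes listed above can be assembled into a single argument string, for example when DTASelect is launched from a script. The sketch below only builds the arguments; how DTASelect itself is started is not shown and depends on the local installation. The names MASCOT_FILTERS, dtaselect_filter_args and max_expectation are hypothetical. The last function illustrates that a minimum of 20 for (1 / expectation value) corresponds to a maximum expectation value of 0.05.

```python
import shlex

# Filter changes from the list above as (option, value) pairs; None marks a bare flag.
MASCOT_FILTERS = [
    ("--Mascot", None),
    ("-1", "10.0"),
    ("-2", "10.0"),
    ("-3", "10.0"),
    ("-d", "20"),
    ("-p", "1"),
    ("--mw", "100.0"),
]


def dtaselect_filter_args(filters=MASCOT_FILTERS):
    """Flatten the (option, value) pairs into a command line argument list."""
    args = []
    for option, value in filters:
        args.append(option)
        if value is not None:
            args.append(value)
    return args


def max_expectation(min_signif: float) -> float:
    """-d sets a minimum for 1 / expectation value, which caps the expectation value."""
    return 1.0 / min_signif


print(shlex.join(dtaselect_filter_args()))
print(max_expectation(20))  # 0.05
```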
In a DTASelect report of Mascot results, the following columns are different from those in a
DTASelect report of Sequest results:
- Filename: Mascot result filename, query number and precursor charge, separated by periods
- IonsScore: Mascot ions score
- Signif: 1 / expectation value
- SpR: Peptide match rank, between 1 (highest) and 10 (lowest)
- SpScore: Identity threshold score
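A script that post-processes a DTASelect report might split the Filename column back into its parts. This is a sketch under assumptions: the example value F004651.dat.123.2 is invented for illustration, the query number and charge are assumed to be plain integers, and parse_filename_field is a hypothetical helper.

```python
def parse_filename_field(field: str):
    """Split 'result file.query.charge' into its three components."""
    # Split from the right, because the result filename itself contains a period.
    result_file, query, charge = field.rsplit(".", 2)
    return result_file, int(query), int(charge)


print(parse_filename_field("F004651.dat.123.2"))  # ('F004651.dat', 123, 2)
```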
Limitations
- The output file is compatible with DTASelect 1.9 only
- Hyperlinks to Sequest utilities will not work
- The number of tryptic termini for a peptide is not available
- The amino acid residues that bracket a peptide are only available if the result file
is from Mascot 2.1 or later. For result files from earlier versions, question marks are displayed
- DTASelect reports do not display variable terminus modifications
Only a limited amount of information about a protein hit is saved to a Mascot result file. For example, the protein
sequence is not saved because this would make the result files unacceptably large. When missing information is required for
a Mascot report, it has to be retrieved from the compressed database files.
Even though a single call for missing information may take only a fraction of a second, and is not noticeable when loading
a Mascot report, this can become a problem if creating an export file requires thousands of calls. It is important to be
aware of this, and not waste time retrieving information that is not actually required. This is a particular issue
for an export format that represents "raw" result information, like pepXML. A list of all the proteins
that contain all the peptides that had any matches to any of the spectra can be an extremely long list.
Description
The Fasta description line is saved for all peptide mass fingerprint protein hits. For an MS/MS search, Mascot tries
to guess which protein hits will appear in the reports and saves their Fasta description lines to the result file. However,
the actual hit list depends on many factors, and some hits may be missed, requiring the descriptions to be retrieved
from the compressed database files.
Protein Mass
The protein mass is saved for all peptide mass fingerprint protein hits. For an MS/MS search, Mascot tries
to guess which protein hits will appear in the reports and saves their masses to the result file. However,
the actual hit list depends on many factors, and some hits may be missed, requiring the masses to be retrieved
from the compressed database files.
On the Matrix Science public web site, the description and mass of a protein can only be exported if this information was
saved to the result file. The following protein hit information options are not available on the public web site,
and attempting to use them will have no effect.
Percent coverage
Percent coverage is never saved to the result file. It is calculated on the fly from the length and the set of
peptides assigned to the protein.
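As an illustration of this calculation, the sketch below computes coverage as the percentage of residues that fall inside at least one assigned peptide. It assumes 1-based, inclusive start and end positions (as in the pep_start and pep_end fields); it is not taken from the Mascot source, and percent_coverage is a hypothetical name.

```python
def percent_coverage(protein_length: int, peptides) -> float:
    """peptides: iterable of (start, end) pairs, 1-based and inclusive."""
    covered = set()
    for start, end in peptides:
        covered.update(range(start, end + 1))
    # Ignore any positions that fall outside the protein sequence.
    covered = {pos for pos in covered if 1 <= pos <= protein_length}
    return 100.0 * len(covered) / protein_length


# Two overlapping peptides (residues 1-30 in total) plus residues 101-120.
print(percent_coverage(200, [(1, 20), (11, 30), (101, 120)]))  # 25.0
```

Overlapping peptides are counted once, which is why a set of positions is used rather than a sum of peptide lengths.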
Length in residues
Length in residues is never saved to the result file. It must be retrieved from the compressed database files.
pI
pI is never saved to the result file. The protein sequence must be retrieved from the compressed database files
and the pI value calculated.
Taxonomy
Taxonomy is never saved to the result file. It must be retrieved from the compressed database files.
Taxonomy ID
Taxonomy ID is never saved to the result file. It must be retrieved from the compressed database files.
Protein sequence
The entire protein sequence is never saved to the result file. It must be retrieved from the compressed database files.
Result file conversion can be automated by using the export script as a command line utility. It must be executed
in the cgi directory on a Mascot server. The command line arguments are URL-style name=value pairs, for example
export_dat_2.pl do_export=1 export_format=XML file=../data/20050223/F004651.dat
The Mascot 2.1 script, which exports an XML file conforming to mascot_search_results_1.xsd, is called export_dat.pl.
For backward compatibility, this script is supplied and supported as a component of Mascot 2.2.
A new script, called export_dat_2.pl, creates an XML file conforming to mascot_search_results_2.xsd. This script is
the one selected when you choose Export search results from a Mascot result report, and should be used by any
new applications.
Required Arguments
- do_export: must be 1 to export results
- export_format: XML or CSV or pepXML or DTASelect
- file: relative or absolute path to result file
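When conversion is driven from another program, the name=value pairs can be assembled programmatically. The following is a minimal sketch, not part of Mascot: build_export_command and run_export are hypothetical helpers, cgi_dir stands for the path of the cgi directory on your Mascot server, and launching the script through a perl executable on the PATH is an assumption about the local setup. Where the script writes the exported data is not assumed here, so run_export simply returns the completed process.

```python
import subprocess

EXPORT_FORMATS = {"XML", "CSV", "pepXML", "DTASelect"}


def build_export_command(result_file, export_format="XML",
                         script="export_dat_2.pl", **options):
    """Return the script name followed by URL-style name=value arguments."""
    if export_format not in EXPORT_FORMATS:
        raise ValueError(f"unknown export format: {export_format}")
    args = {"do_export": 1, "export_format": export_format, "file": result_file}
    args.update(options)  # optional arguments, e.g. _sigthreshold=0.05
    return [script] + [f"{name}={value}" for name, value in args.items()]


def run_export(command, cgi_dir):
    """Run the command in the cgi directory (assumes perl is on the PATH)."""
    return subprocess.run(["perl"] + command, cwd=cgi_dir,
                          capture_output=True, text=True, check=True)


command = build_export_command("../data/20050223/F004651.dat")
print(" ".join(command))
# export_dat_2.pl do_export=1 export_format=XML file=../data/20050223/F004651.dat
```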
Formatting Arguments
- _ignoreionsscorebelow: MS/MS ions scores below this value are set to zero, default set in mascot.dat (0)
- _mudpit (export_dat.pl only): number of queries at which protein score switches to MudPIT scoring, default set in mascot.dat (1000)
- _server_mudpit_switch (export_dat_2.pl only): if queries / entries greater than this value, switch to MudPIT scoring, default set in mascot.dat (0.001)
- _requireboldred: 1 to report protein hits only if they include at least one bold, red peptide match, default set in mascot.dat (0)
- _showallfromerrortolerant: 1 to display all hits from an error tolerant search, including garbage, default 0
- _onlyerrortolerant (export_dat_2.pl only): 1 to display only error tolerant matches from an automatic error tolerant search, default 0
- _noerrortolerant (export_dat_2.pl only): 1 to suppress error tolerant matches from an automatic error tolerant search, default 0
- show_same_sets: 1 to display all proteins that match the same set of peptides, default 0
- _showsubsets: display protein hits that are missing up to this fraction of the protein score of the main hit, default set in mascot.dat
- _sigthreshold: probability significance threshold, default 0.05
- report: max number of hits to be reported, 0 = AUTO, default taken from search parameters
- unigene: UniGene index species to be used to cluster hits
- _show_decoy_report (export_dat_2.pl only): 1 to display report for the decoy results in an automatic decoy search
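The _server_mudpit_switch rule above amounts to a simple ratio test. The sketch below illustrates the arithmetic only; it assumes that "entries" means the number of database entries searched, and uses_mudpit_scoring is a hypothetical name, not a Mascot function.

```python
def uses_mudpit_scoring(num_queries: int, num_entries: int,
                        switch: float = 0.001) -> bool:
    """True if queries / entries is greater than the switch value."""
    return num_queries / num_entries > switch


# 500 queries against 250,000 entries gives a ratio of 0.002.
print(uses_mudpit_scoring(500, 250_000))  # True
# 100 queries against 250,000 entries gives a ratio of 0.0004.
print(uses_mudpit_scoring(100, 250_000))  # False
```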
Search Level Information
Set these options to 1 to include the corresponding block. For export_dat_2.pl, you must also include
search_master=1. If search_master is missing or set to 0, this disables output of all the following except
show_unassigned.
- show_format: format parameters
- show_header: search level information
- show_masses: residue and element masses
- show_params: search parameters
- show_mods (export_dat_2.pl only): modifications information
- show_decoy (export_dat_2.pl only): automatic decoy search statistics
- show_queries (export_dat.pl only): query information
- show_unassigned: peptide matches not assigned to protein hits
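Because export_dat_2.pl ignores most of these options unless search_master=1 is also present, a wrapper script can add the master switch automatically. This is a sketch with hypothetical names (SEARCH_LEVEL_OPTIONS, search_level_args); show_queries is left out because it applies to export_dat.pl only.

```python
# Search level options accepted by export_dat_2.pl, per the list above.
SEARCH_LEVEL_OPTIONS = {
    "show_format", "show_header", "show_masses", "show_params",
    "show_mods", "show_decoy", "show_unassigned",
}


def search_level_args(*blocks):
    """Return name=value arguments, adding search_master=1 when it is needed."""
    unknown = set(blocks) - SEARCH_LEVEL_OPTIONS
    if unknown:
        raise ValueError(f"unknown search level options: {sorted(unknown)}")
    args = [f"{block}=1" for block in blocks]
    # show_unassigned is the one option that works without the master switch.
    if any(block != "show_unassigned" for block in blocks):
        args.insert(0, "search_master=1")
    return args


print(search_level_args("show_header", "show_params"))
# ['search_master=1', 'show_header=1', 'show_params=1']
print(search_level_args("show_unassigned"))
# ['show_unassigned=1']
```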
Protein Hit Fields
Set these options to 1 to include the corresponding field. Only selected fields are available if the
format is pepXML or DTASelect. For export_dat_2.pl, you must also include
protein_master=1. If protein_master is missing or set to 0, this disables output of the following fields
(and also disables output of the peptide fields).
- prot_desc: Fasta title / description line
- prot_score: protein score
- prot_thresh: protein score significance threshold (PMF only)
- prot_expect: protein score expectation value (PMF only)
- prot_mass: protein mass
- prot_matches: number of assigned peptide matches
- prot_cover: percentage of protein sequence covered by assigned peptide matches
- prot_len: protein sequence length
- prot_pi: calculated pI value for protein
- prot_tax_str: protein taxonomy description
- prot_tax_id: protein taxonomy ID number
- prot_seq (export_dat_2.pl only): complete protein sequence
- prot_empai (export_dat_2.pl only): emPAI
- prot_quant (export_dat_2.pl only): protein ratios from quantitation method
Peptide Match Fields
Set these options to 1 to include the corresponding field. Only selected fields are available if the
format is pepXML or DTASelect. For export_dat_2.pl, you must also include
peptide_master=1. If peptide_master is missing or set to 0, this disables output of the following fields.
- pep_exp_mr: observed relative molecular mass
- pep_exp_z: observed charge state
- pep_calc_mr: calculated relative molecular mass
- pep_delta: (pep_calc_mr - pep_exp_mr)
- pep_start: 1 based residue number for peptide start in protein
- pep_end: 1 based residue number for peptide end in protein
- pep_miss: number of missed enzyme cleavage sites
- pep_score: peptide match score
- pep_homol: peptide score homology threshold
- pep_ident: peptide score identity threshold
- pep_expect: peptide match expectation value
- pep_rank (export_dat.pl only): rank of peptide match (1 - 10)
- pep_seq: peptide sequence
- pep_frame: peptide frame number (nucleic acid sequence databases only)
- pep_var_mod: variable modifications used to get the match as comma separated string
- pep_num_match (export_dat_2.pl only): number of fragment ion matches used for scoring
- pep_scan_title (export_dat_2.pl only): scan title from peak list
- pep_quant (export_dat_2.pl only): peptide ratios from quantitation method
Query Fields (export_dat_2.pl only)
Set these options to 1 to include the corresponding field.
Only selected fields are available if the
format is pepXML or DTASelect. For export_dat_2.pl, you must also include
query_master=1. If query_master is missing or set to 0, this disables output of the following fields.
- query_title: scan title from peak list
- query_qualifiers: seq, comp, tag, etc.
- query_params: search parameters in local scope of a single query
- query_peaks: peak list
- query_raw: all peptide matches for this query
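The protein, peptide and query fields each depend on a master switch when export_dat_2.pl is used, and the peptide fields additionally depend on protein_master. A wrapper could derive the required switches from the field names, as in this sketch. MASTER_FOR_PREFIX and field_args are hypothetical names, and the sketch does not check that a field name is one of those listed above.

```python
# Field name prefix mapped to the master switch it requires in export_dat_2.pl.
MASTER_FOR_PREFIX = {
    "prot_": "protein_master",
    "pep_": "peptide_master",
    "query_": "query_master",
}


def field_args(fields):
    """Return name=value arguments with the necessary master switches first."""
    masters = []
    for field in fields:
        for prefix, master in MASTER_FOR_PREFIX.items():
            if field.startswith(prefix):
                required = [master]
                # Peptide fields are also disabled unless protein_master=1.
                if master == "peptide_master":
                    required.insert(0, "protein_master")
                for name in required:
                    if name not in masters:
                        masters.append(name)
                break
        else:
            raise ValueError(f"unrecognised field name: {field}")
    return [f"{m}=1" for m in masters] + [f"{f}=1" for f in fields]


print(field_args(["pep_seq", "pep_score", "query_title"]))
# ['protein_master=1', 'peptide_master=1', 'query_master=1',
#  'pep_seq=1', 'pep_score=1', 'query_title=1']
```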
Versioning
The Mascot Search Results XML schema uses versioning to avoid applications breaking when the schema is updated.
The schema definition is identified by a major version number and a minor version number.
If a change to the schema could make invalid any instance document that was valid against the previous schema,
the major version number will be incremented. An example of such a change would be adding a new type or element
that is not optional. If a change cannot break the validity of any existing document, such as adding a new type
or element that is optional, then the minor version number will be incremented.
There will be a separate schema file and namespace for each major version, and the file name contains the major
version number. The schema also includes the major and minor version numbers as attributes of the
top level element. An application that parses an instance document should compare the major and minor
version attribute values against those which it was coded to support. It should not rely on an XML
parser to verify the version numbers against the schema encoded restrictions, since the schema definition
file used by the parser may be newer than when the application was written.
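The version comparison described above might look like the following in an application that reads the exported XML. This is a sketch under assumptions: the attribute names majorVersion and minorVersion and the root element name in the sample are placeholders that should be checked against the schema, and SUPPORTED_MAJOR and SUPPORTED_MINOR stand for whatever versions the application was written for.

```python
import xml.etree.ElementTree as ET

# Versions this hypothetical application was coded to support.
SUPPORTED_MAJOR = 2
SUPPORTED_MINOR = 0


def check_schema_version(xml_text, major_attr="majorVersion",
                         minor_attr="minorVersion"):
    """Return True if fully supported, False if a newer minor version is found.

    Raises ValueError when the major version differs, since such a document
    may not be readable at all.
    """
    root = ET.fromstring(xml_text)
    major = int(root.attrib[major_attr])
    minor = int(root.attrib[minor_attr])
    if major != SUPPORTED_MAJOR:
        raise ValueError(f"unsupported major version {major}")
    # A newer minor version only adds optional content, so the document can
    # still be read, but it may contain data this application ignores.
    return minor <= SUPPORTED_MINOR


sample = '<mascot_search_results majorVersion="2" minorVersion="1"/>'
print(check_schema_version(sample))  # False
```

The check is done in application code, rather than left to a validating parser, for the reason given above: the schema file available to the parser may be newer than the application.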
Validation
The instance documents created by this export utility have been validated against the corresponding schema
definitions using XMLSpy. The following web tools can also
be used:
No complex software is ever completely free of bugs. If you find an XML file created by the
Export Search Results utility that fails to validate against the corresponding schema definition, please
email full details to support@matrixscience.com and
we will try to fix the problem as rapidly as possible.
On the other hand, if the XML file validates, but an error is reported by the application reading
the file, then this is a bug in the application. In the first instance, please report this to the
authors of the application.
Useful Resources
Standards and design:
Programming: