4.6 Annotation

The annotation consists of collecting MS peak lists and then formula and/or compound annotation:

Note that compound annotation is normally not dependent upon formula annotation. However, formula data can be used to improve ranking of candidates afterwards by the addFormulaScoring() function, which will be discussed later in this section. Furthermore, suspect annotation is not mandatory, and may use data from peak lists, formulae and/or comounds.

4.6.1 MS peak lists

Algorithm Usage Remarks
mzR generateMSPeakLists(algorithm = "mzr", ...) Uses mzR for spectra retrieval. Recommended default.
DataAnalysis generateMSPeakLists(algorithm = "bruker", ...) Loads data after automatically generating MS and MS/MS spectra in DataAnalysis
DataAnalysis FMF generateMSPeakLists(algorithm = "brukerfmf", ...) Uses spectra from the find molecular features algorithm.

The recommended default algorithm is mzr: this algorithm is generally faster and is not limited to a vendor data format as it will read the open mzML and mzXML file formats. On the other hand, when DataAnalysis is used with Bruker data the spectra can be automatically background subtracted and there is no need for file conversion. Note that the brukerfmf algorithm only works when findFeatures() was called with the bruker algorithm.

When generateMSPeakists() is called it will

  1. Find all MS and MS/MS spectra that ‘belong’ to a feature. For MS spectra this means that all spectra close to the retention time of a feature will be collected. In addition, for MS/MS normally only spectra will be considered that have a precursor mass close to that of the feature (however, this can be disabled for data that was recorded with data independent acquisition (DIA, MS^E, bbCID, …)).
  2. Average all MS and MS/MS spectra to produce peak lists for each feature.
  3. Average all peak lists for features within the same group.

Data from either (2) or (3) is used for subsequent annotation steps. Formula calculation can use either (as a trade-off between possibly more accurate results by outlier removal vs speed), whereas compound annotation will always use data from (3) since annotating single features (as opposed to their groups) would take a very long time.

There are several common function arguments to generateMSPeakLists() that can be used to optimize its behaviour:

Argument Algorithm(s) Remarks
maxMSRtWindow mzr, bruker Maximum time window +/- the feature retention time (in seconds) to collect spectra for averaging. Higher values may significantly increase processing times.
precursorMzWindow mzr Maximum precursor m/z search window to find MS/MS spectra. Set to NULL to disable (i.e. for DIA experiments).
topMost mzr Only retain feature data for no more than this amount analyses with highest intensity. For instance, a value of 1 will only keep peak lists for the feature with highest intensity in a feature group.
bgsubtr bruker Perform background subtraction (if the spectra type supports this, e.g. MS and bbCID)
minMSIntensity, minMSMSIntensity bruker, brukerfmf Minimum MS and MS/MS intensity. Note that DataAnalysis reports many zero intensity peaks so a value of at least 1 is recommended.
MSMSType bruker The type of spectra that should be used for MSMS: "BBCID" for bbCID experiments, otherwise "MSMS" (the default).

In addition, several parameters can be set that affect spectral averaging. These parameters are passed as a list to the avgFeatParams (mzr algorithm only) and avgFGroupParams arguments, which affect averaging of feature and feature group data, respectively. Some typical parameters include:

  • clusterMzWindow: Maximum m/z window used to cluster mass peaks when averaging. The better the MS resolution, the lower this value should be.
  • topMost: Retain no more than this amount of most intense mass peaks. Useful to filter out ‘noisy’ peaks.
  • minIntensityPre / minIntensityPost: Mass peaks below this intensity will be removed before/after averaging.

See ?generateMSPeakLists for all possible parameters.

A suitable list object to set averaging parameters can be obtained with the getDefAvgPListParams() function.

# lower default clustering window, other settings remain default
avgPListParams <- getDefAvgPListParams(clusterMzWindow = 0.001)

# Apply to both feature and feature group averaging
plists <- generateMSPeakLists(fGroups, "mzr", avgFeatParams = avgPListParams, avgFGroupParams = avgPListParams)

4.6.2 Formulae

Formulae can be automatically calculated for all features using the generateFormulas() function. The following algorithms are currently supported:

Algorithm Usage Remarks
GenForm generateFormulas(algorithm = "genform", ...) Bundled with patRoon. Reasonable default.
SIRIUS generateFormulas(algorithm = "sirius", ...) Requires MS/MS data.
DataAnalysis generateFormulas(algorithm = "bruker", ...) Requires FMF features (i.e. findFeatures(algorithm = "bruker", ...)). Uses SmartFormula algorithms.

Calculation with GenForm is often a good default. It is fast and basic rules can be applied to filter out obvious non-existing formulae. A possible drawback of GenForm, however, is that may become slow when many candidates are calculated, for instance, due to a relative high feature m/z (e.g. >600) or loose elemental restricitions. More thorough calculation is performed with SIRIUS: this algorithm often yields fewer and often more plausible results. However, SIRIUS requires MS/MS data (hence features without will not have results) and formula prediction may not work well for compounds that structurally deviate from the training sets used by SIRIUS. Calculation with DataAnalysis is only possible when features are obtained with DataAnalysis as well. An advantage is that analysis files do not have to be converted, however, compared to other algorithms calculation is often relative slow.

There are two methods for formula assignment:

  1. Formulae are first calculated for each individual feature within a feature group. These results are then pooled, outliers are removed and remaining formulae are assigned to the feature group (i.e. calculateFeatures = TRUE).
  2. Formulae are directly calculated for each feature group by using group averaged peak lists (see previous section) (i.e. calculateFeatures = FALSE).

The first method is more thorough and the possibility to remove outliers may sometimes result in better formula assignment. However, the second method is much faster and generally recommended for large number of analyses.

By default, formulae are either calculated by only MS/MS data (SIRIUS) or with both MS and MS/MS data (GenForm/Bruker). The latter also allows formula calculation when no MS/MS data is present. Furthermore, with Bruker algorithms, data from both MS and MS/MS formula data can be combined to allow inclusion of candidates that would otherwise be excluded by e.g. poor MS/MS data. However, a disadvantage is that formulae needs to be calculated twice. The MSMode argument (listed below) can be used to customize this behaviour.

An overview of common parameters that are typically set to customize formula calculation is listed below.

Argument Algorithm(s) Remarks
relMzDev genform, sirius The maximum relative m/z deviation for a formula to be considered (in ppm).
elements genform, sirius Which elements to consider. By default "CHNOP". Try to limit possible elements as much as possible.
calculateFeatures genform, sirius Whether formulae should be calculated first for all features (see discussion above) (always TRUE with DataAnalysis).
featThresholdAnn All Minimum relative amount (0-1) that a candidate formula for a feature group should be found among all annotated features (e.g. 1 means that a candidate is only considered if it was assigned to all annotated features).
adduct All The adduct to consider for calculation (e.g. "[M+H]+", "[M-H]-", more details in the adduct section). Don’t set this when adduct annotations are available.
MSMode genform, bruker Whether formulae should be generated only from MS data ("ms"), MS/MS data ("msms") or both ("both"). The latter is default, see discussion above.
profile sirius Instrument profile, e.g. "qtof", "orbitrap", "fticr".

Some typical examples:

formulasGF <- generateFormulas(fGroups, mslists, "genform") # GenForm, default settings
formulasGF2 <- generateFormulas(fGroups, mslists, "genform", calculateFeatures = FALSE) # direct feature group assignment (faster)
formulasSIR <- generateFormulas(fGroups, mslists, "sirius", elements = "CHNOPSClBr") # SIRIUS, common elements for pollutant
formulasSIR2 <- generateFormulas(fGroups, mslists, "sirius", adduct = "[M-H]-") # SIRIUS, negative ionization
formulasBr <- generateFormulas(fGroups, mslists, "bruker", MSMode = "MSMS") # Only consider MSMS data (SmartFormula3D)

4.6.3 Compounds

An important step in a typical non-target workflow is structural identification for features of interest, as this information may finally reveal what a feature is. In a first step all possible candidate structures for a feature are obtained from a database (based on e.g. monoisotopic mass or formula). These candidates are then ranked, for instance, by matching the feature MS/MS data with in-silico or library MS/MS spectra or its relevance to the environment.

Structure assignment in patRoon is performed automatically for all feature groups with the generateCompounds() function. Currently, this function supports the following algorithms:

Algorithm Usage Remarks
MetFrag generateCompounds(algorithm = "metfrag", ...) Supports many databases (including offline and custom), matching MS/MS data with in-silico and library MS/MS data, and many other scorings to rank candidates.
SIRIUS with CSI:FingerID generateCompounds(algorithm = "sirius", ...) Matches with in-silico MS/MS data, incorporates formula annotations to improve candidate selection.
Library generateCompounds(algorithm = "library", ...) Obtains candidates by matching MS/MS data with an offline MS library, e.g. obtained from MassBank.eu or MoNA.

All algorithms rank their candidates by matching MS/MS data with in-silico generated MS/MS data (MetFrag and SIRIUS) and/or experimental MS/MS data from an MS library (MetFrag with MoNA scoring and Library algorithm). The latter may yield better candidates, and the Library algorithm is also generally much faster. However, in-silico annotation is not limited by the availability of experimental MS/MS data.

Compound annotation is often a relative time and resource intensive procedure. For this reason, annotation occurs for each feature group and not individual features. Nevertheless, it is not uncommon that this is the most time consuming step in the workflow. For this reason, prioritization of features is highly important, even more so to avoid ‘abusing’ servers when an online database is used for compound retrieval.

4.6.3.1 Database selection for MetFrag and SIRIUS

Selecting the right database is important for proper candidate assignment. If the ‘right’ chemical compound is not present in the used database, it is impossible to assign the correct structure. Luckily, however, several large databases such as PubChem and ChemSpider are openly available which contain tens of millions of compounds. On the other hand, these databases may also lead to many unlikely candidates and therefore more specialized (or custom databases) may be preferred. Which database will be used is dictated by the database argument to generateCompounds(), currently the following options exist:

Database Algorithm(s) Remarks
pubchem "metfrag", "sirius" PubChem is currently the largest compound database and is used by default.
chemspider "metfrag" ChemSpider is another large database. Requires security token from here (see next section).
comptox "metfrag" The EPA CompTox contains many compounds and scorings relevant to environmental studies. Needs manual download (see next section).
pubchemlite "metfrag" A specialized subset of the PubChem database. Needs manual download (see next section).
for-ident "metfrag" The FOR-IDENT (STOFF-IDENT) database for water related substances.
kegg "metfrag", "sirius" The KEGG database for biological compounds
hmdb "metfrag", "sirius" The HMDB contains many human metabolites.
bio "sirius" Selects all supports biological databases.
csv, psv, sdf "metfrag" Custom database (see next section). CSV example.

4.6.3.2 Configuring MetFrag databases and scoring

Some extra configuration may be necessary when using certain databases with MetFrag. In order to use the ChemSpider database a security token should be requested and set with the chemSpiderToken argument to generateCompounds(). The CompTox and PubChemLite databases need to be manually downloaded from CompTox (or variations with smoking or wastewater metadata) and PubChemLite (or the PubChem derived OECD PFAS database). The file location of this and other local databases (csv, psv, sdf) needs to be manually configured, see the examples below and/or ?generateCompounds for more information on how to do this.

# PubChem: the default
compsMF <- generateCompounds(fGroups, mslists, "metfrag", adduct = "[M+H]+")

# ChemSpider: needs security token
compsMF2 <- generateCompounds(fGroups, mslists, "metfrag", database = "chemspider",
                              chemSpiderToken = "MY_TOKEN_HERE", adduct = "[M+H]+")

# CompTox: set global option to database path
options(patRoon.path.MetFragCompTox = "~/CompTox_17March2019_SelectMetaData.csv")
compsMF3 <- generateCompounds(fGroups, mslists, "metfrag", database = "comptox", adduct = "[M+H]+")

# CompTox: set database location without global option
compsMF4 <- generateCompounds(fGroups, mslists, "metfrag", database = "comptox", adduct = "[M+H]+",
                              extraOpts = list(LocalDatabasePath = "~/CompTox_17March2019_SelectMetaData.csv"))

# Same, but for custom database
compsMF5 <- generateCompounds(fGroups, mslists, "metfrag", database = "csv", adduct = "[M+H]+",
                              extraOpts = list(LocalDatabasePath = "~/mydb.csv"))

An example of a custom .csv database can be found here.

With MetFrag compound databases are not only used to retrieve candidate structures but are also used to obtain metadata for further ranking. Each database has its own scorings, a table with currently supported scorings can be obtained with the compoundScorings() function (some columns omitted):

name metfrag database default
score Score TRUE
fragScore FragmenterScore TRUE
metFusionScore OfflineMetFusionScore TRUE
individualMoNAScore OfflineIndividualMoNAScore TRUE
numberPatents PubChemNumberPatents pubchem TRUE
numberPatents Patent_Count pubchemlite TRUE
pubMedReferences PubChemNumberPubMedReferences pubchem TRUE
pubMedReferences ChemSpiderNumberPubMedReferences chemspider TRUE
pubMedReferences NUMBER_OF_PUBMED_ARTICLES comptox TRUE
pubMedReferences PubMed_Count pubchemlite TRUE
extReferenceCount ChemSpiderNumberExternalReferences chemspider TRUE
dataSourceCount ChemSpiderDataSourceCount chemspider TRUE
referenceCount ChemSpiderReferenceCount chemspider TRUE
RSCCount ChemSpiderRSCCount chemspider TRUE
formulaScore FALSE
RF_SMILES FALSE
RF_SIRFP FALSE
LC50_SMILES FALSE
LC50_SIRFP FALSE
smartsInclusionScore SmartsSubstructureInclusionScore FALSE
smartsExclusionScore SmartsSubstructureExclusionScore FALSE
suspectListScore SuspectListScore FALSE
retentionTimeScore RetentionTimeScore FALSE
CPDATCount CPDAT_COUNT comptox TRUE
TOXCASTActive TOXCAST_PERCENT_ACTIVE comptox TRUE
dataSources DATA_SOURCES comptox TRUE
pubChemDataSources PUBCHEM_DATA_SOURCES comptox TRUE
EXPOCASTPredExpo EXPOCAST_MEDIAN_EXPOSURE_PREDICTION_MG/KG-BW/DAY comptox TRUE
ECOTOX ECOTOX comptox TRUE
NORMANSUSDAT NORMANSUSDAT comptox TRUE
MASSBANKEU MASSBANKEU comptox TRUE
TOX21SL TOX21SL comptox TRUE
TOXCAST TOXCAST comptox TRUE
KEMIMARKET KEMIMARKET comptox TRUE
MZCLOUD MZCLOUD comptox TRUE
pubMedNeuro PubMedNeuro comptox TRUE
CIGARETTES CIGARETTES comptox TRUE
INDOORCT16 INDOORCT16 comptox TRUE
SRM2585DUST SRM2585DUST comptox TRUE
SLTCHEMDB SLTCHEMDB comptox TRUE
THSMOKE THSMOKE comptox TRUE
ITNANTIBIOTIC ITNANTIBIOTIC comptox TRUE
STOFFIDENT STOFFIDENT comptox TRUE
KEMIMARKET_EXPO KEMIMARKET_EXPO comptox TRUE
KEMIMARKET_HAZ KEMIMARKET_HAZ comptox TRUE
REACH2017 REACH2017 comptox TRUE
KEMIWW_WDUIndex KEMIWW_WDUIndex comptox TRUE
KEMIWW_StpSE KEMIWW_StpSE comptox TRUE
KEMIWW_SEHitsOverDL KEMIWW_SEHitsOverDL comptox TRUE
ZINC15PHARMA ZINC15PHARMA comptox TRUE
PFASMASTER PFASMASTER comptox TRUE
peakFingerprintScore AutomatedPeakFingerprintAnnotationScore FALSE
lossFingerprintScore AutomatedLossFingerprintAnnotationScore FALSE
agroChemInfo AgroChemInfo pubchemlite FALSE
bioPathway BioPathway pubchemlite FALSE
drugMedicInfo DrugMedicInfo pubchemlite FALSE
foodRelated FoodRelated pubchemlite FALSE
pharmacoInfo PharmacoInfo pubchemlite FALSE
safetyInfo SafetyInfo pubchemlite FALSE
toxicityInfo ToxicityInfo pubchemlite FALSE
knownUse KnownUse pubchemlite FALSE
disorderDisease DisorderDisease pubchemlite FALSE
identification Identification pubchemlite FALSE
annoTypeCount FPSum pubchemlite TRUE
annoTypeCount AnnoTypeCount pubchemlite TRUE
annotHitCount AnnotHitCount pubchemlite TRUE
libMatch TRUE

The first two columns contain the generic and original MetFrag naming schemes for each scoring type. While both naming schemes can be used, the generic is often shorter and harmonized with other algorithms (e.g. SIRIUS). The database column specifies for which databases a particular scoring is available (empty if not database specific). Most scorings are selected by default (as specified by the default column), however, this behaviour can be customized by using the scoreTypes argument:

# Only in-silico and PubChem number of patents scorings
compsMF1 <- generateCompounds(fGroups, mslists, "metfrag", adduct = "[M+H]+",
                              scoreTypes = c("fragScore" "numberPatents"))

# Custom scoring in custom database
compsMF2 <- generateCompounds(fGroups, mslists, "metfrag", adduct = "[M+H]+",
                              database = "csv",
                              extraOpts = list(LocalDatabasePath = "~/mydb.csv"),
                              scoreTypes = c("fragScore", "myScore", "myScore2"))

By default ranking is performed with equal weight (i.e. 1) for all scorings. This can be changed by the scoreWeights argument, which should be a vector containing the weights for all scorings following the order of scoreTypes, for instance:

compsMF <- generateCompounds(fGroups, mslists, "metfrag", adduct = "[M+H]+",
                             scoreTypes = c("fragScore" "numberPatents"),
                             scoreWeights = c(1, 2))

Sometimes thousands or more structural candidates are found when annotating a feature group. In this situation processing all these candidates will too involving (especially when external databases are used). To avoid this a default cut-off is set: when the number of candidates exceed a certain amount the search will be aborted and no results will be reported for that feature group. The maximum number of candidates can be set with the maxCandidatesToStop argument. The default value is relative conservative, especially for local databases it may be useful to increase this number.

4.6.3.3 MetFrag error and timeout handling

The use of online databases has the drawback that an error may occur, for instance, as a result of a connection error or when the aforementioned maximum number of candidates is reached (maxCandidatesToStop argument). By default, the processing is restarted if an error has occurred (configured by the errorRetries argument). Similarly, the timeoutRetries and timeout arguments can be used to avoid being ‘stuck’ on obtaining results, for instance, due to an unstable internet connection. If no compounds could be assigned due to an error a warning will be issued. In this case it is best to see what went wrong by manually checking the log files, which by default are stored in the log/metfrag folder.

4.6.3.4 Annotation with the Library algorithm

To use the Library algorithm we first need to load an MS library. Currently, MS libraries in the MSP and MoNA JSON formats are supported. Note that the former format is not so well standardized, and the support in patRoon was mainly tailored for MSP files from MassBank.eu and MoNA. To load the MS library the loadMSLibrary() function is used:

mslibrary <- loadMSLibrary("~/MassBank_NIST.msp", "msp") # MassBank.eu MSP library
mslibrary <- loadMSLibrary("~/MoNA-export-CASMI_2016.msp", "msp") # MoNA MSP library
mslibrary <- loadMSLibrary("~/MoNA-export-MassBank.json", "json") # MoNA JSON library

NOTE Currently it is only possible to load formula annotated MS/MS peaks with the MoNA JSON format.

Once loaded, the MS library can be post-processed with various filtering, subsetting and export functionality, which may be useful for more tailored compound annotation. This is further discussed in the advanced chapter.

The compound annotation is performed with generateCompounds():

compsLib <- generateCompounds(fGroups, mslists, "library", MSLibrary = mslibrary)

# set minimum MS/MS spectral match for candidates to 0.5
compsLib <- generateCompounds(fGroups, mslists, "library", MSLibrary = mslibrary, minSim = 0.5)

4.6.3.5 Formula scoring

Ranking of candidate structures may further be improved by incorporating formula information by using the addFormulaScoring() function:

comps <- addFormulaScoring(coms, formulas, updateScore = TRUE)

Here, corresponding formula and explained fragments will be used to calculate a formulaScore for each candidate. Note that SIRIUS candidates are already based on calculated formulae, hence, running this function on SIRIUS results is less sensible unless scoring from another formula calculation algorithm is desired.

4.6.3.6 Further options and parameters

There are many more options and parameters that affect compound annotation. For a full overview please have a look at the reference manual (e.g. by running ?generateCompounds).

4.6.4 Suspect annotation

The data obtained during the previously described annotation steps can be used to improve a suspect screening workflow. The annotateSuspects() method uses the annotation data to calculate various annotation properties for each suspect, such as their rank in formula/compound candidates, which fragments from the suspect list were matched, and a rough indication of the identification level according to Schymanski et al. (2014)

fGroupsSusp <- annotateSuspects(fGroupsSusp, MSPeakLists = mslists,
                                formulas = formulas, compounds = compounds)
#> Annotating 5 suspects...
#> ================================================================================

The calculation of identification levels is performed by a set of pre-defined rules. The genIDLevelRulesFile() can be used to inspect the default rules or to create your own rules file, which can subsequently passed to annotateSuspects() with the IDFile argument. See ?annotateSuspects for more details on the file format and options. The default identification levels can be summarized as follows:

Level Description Rules
1 Target match Retention time deviates <12 seconds from suspect list. At least 3 (or all if the suspect list contains less) fragments from the suspect list must match.
2a Good MS/MS library match Suspect is top ranked in the compounds results. The individualMoNAScore (MetFrag) or libMatch (Library algorithm) is at least 0.9 and no other candidates were matched with the MS library.
3a Fair library match The individualMoNAScore or libMatch is at least 0.4.
3b Known MS/MS match At least 3 (or all if the suspect list contains less) fragments from the suspect list must match.
3c Good in-silico MS/MS match The annotation MS/MS similarity (annSimComp column) is at least 0.7.
4a Good formula MS/MS match Suspect is top ranked formula candidate, annotation MS/MS similarity (annSimForm column) is at least 0.7 and isotopic match (isoScore) of at least 0.5. The latter two scores are at least 0.2 higher than next best ranked candidate.
4b Good formula isotopic pattern match Suspect is top ranked formula candidate and isotopic match (isoScore) of at least 0.9 and at least 0.2 higher than next best ranked candidate.
5 Unknown All else.

In general, the more data provided by the suspect list and to annotateSuspects(), the better identification level estimation works. For instance, when considering the default rules, either the fragments_mz or fragments_formula column is necessary to be able assign a level 3b. Similarly, the suspect list needs retention times (as well as fragment data) to be able to assign level 1. As you can imagine, providing the annotation workflow objects (i.e. MSPeakLists, formulas, compounds) to annotateSuspects() is necessary for calculation of most levels.

The annotateSuspects() function will log decisions for identification level assignments to the log/ sub-directory in the current working directory. This is useful to inspect level assignments and especially useful when you customized any rules.

NOTE: The current identification level rules are only optimized for GenForm and MetFrag annotation algorithms.

References

Schymanski, Emma L., Junho Jeon, Rebekka Gulde, Kathrin Fenner, Matthias Ruff, Heinz P. Singer, and Juliane Hollender. 2014. “Identifying Small Molecules via High Resolution Mass Spectrometry: Communicating Confidence.” Environmental Science and Technology 48 (4): 2097–98. https://doi.org/10.1021/es5002105.