4.5 Features
Collecting features from the analyses consists of finding all features, grouping them across analyses (optionally after retention time alignment) and, if desired, suspect screening.
4.5.1 Finding features
Several algorithms are available for finding features. These are listed in the table below alongside their usage and general remarks.
| Algorithm | Usage | Input | Remarks |
|---|---|---|---|
| OpenMS | findFeatures(algorithm = "openms", ...) | centroid (mzML) | Uses FeatureFinderMetabo |
| XCMS | findFeatures(algorithm = "xcms", ...) | centroid (mzML or mzXML) | Uses xcms::xcmsSet() |
| XCMS3 | findFeatures(algorithm = "xcms3", ...) | centroid (mzML or mzXML) | Uses xcms::findChromPeaks() |
| piek | findFeatures(algorithm = "piek", ...) | raw data loaded with msdata | Introduced in patRoon 3.0 |
| enviPick | findFeatures(algorithm = "envipick", ...) | centroid (mzXML) | Uses enviPick::enviPickwrap() |
| KPIC2 | findFeatures(algorithm = "kpic2", ...) | centroid (mzML or mzXML) | |
| SIRIUS | findFeatures(algorithm = "sirius", ...) | centroid (mzML or mzXML) | |
| SAFD | findFeatures(algorithm = "safd", ...) | centroid or profile (mzXML) | Experimental |
| DataAnalysis | findFeatures(algorithm = "bruker", ...) | raw (Bruker .d) | Uses Find Molecular Features from DataAnalysis (Bruker only) |
The feature finding algorithms have different requirements on the data file types and formats (Input column). See the data conversion section to perform data conversion.
When using the XCMS algorithms, both the ‘classical’ interface and the latest XCMS3 interface are supported. The implementation and functionality are largely the same within patRoon; however, the latter is recommended because the classical interface is no longer developed.
The performance of these algorithms usually depends heavily on the data and the parameter settings that are used. Since obtaining a good feature dataset is crucial for the rest of the workflow, it is highly recommended to experiment with different settings. This process can also be automated; see the feature optimization section for more details. Some common parameters to look at are listed in the table below. Many more parameters can be set; please see the reference documentation for these (e.g. ?findFeatures).
| Algorithm | Common parameters |
|---|---|
| OpenMS | noiseThrInt, chromSNR, chromFWHM, mzPPM, minFWHM, maxFWHM (see ?findFeaturesOpenMS) |
| XCMS / XCMS3 | peakwidth, mzdiff, prefilter, noise (assuming default centWave algorithm, see ?findPeaks.centWave / ?CentWaveParam) |
| piek | genEICParams, peakParams (see next section and ?findFeaturesPiek) |
| enviPick | dmzgap, dmzdens, drtgap, drtsmall, drtdens, drtfill, drttotal, minpeak, minint, maxint (see ?enviPickwrap) |
| KPIC2 | kmeans, level, min_snr (see ?findFeatures and ?getPIC / ?getPIC.kmeans) |
| SIRIUS | The sirius algorithm is currently parameterless |
| SAFD | mzRange, maxNumbIter, resolution, minInt (see ?findFeaturesSAFD) |
| DataAnalysis | See Find -> Parameters… -> Molecular Features in DataAnalysis. |
NOTE: DataAnalysis feature settings have to be configured in DataAnalysis prior to calling findFeatures().
Some examples of finding features are shown below.
# The anaInfo variable contains analysis information, see the previous section
# Finding features
fListOMS <- findFeatures(anaInfo, "openms") # OpenMS, with default settings
fListOMS2 <- findFeatures(anaInfo, "openms", noiseThrInt = 500, chromSNR = 10) # OpenMS, adjusted minimum intensity and S/N
fListXCMS <- findFeatures(anaInfo, "xcms", ppm = 10) # XCMS
fListXCMS3 <- findFeatures(anaInfo, "xcms3", xcms::CentWaveParam(peakwidth = c(5, 15))) # XCMS3
fListEP <- findFeatures(anaInfo, "envipick", minint = 1E3) # enviPick
fListKPIC2 <- findFeatures(anaInfo, "kpic2", kmeans = TRUE, level = 1E4) # KPIC2
fListSIRIUS <- findFeatures(anaInfo, "sirius") # SIRIUS
4.5.1.1 Feature detection with piek
The piek algorithm was introduced in patRoon 3.0 and extends the work done by Dietrich et al. (2021). This algorithm finds features from extracted ion chromatograms (EICs) and can use various peak detection algorithms to do so. The EICs are generated from fixed-width m/z bins, so it is important to define a proper m/z range for the compounds of interest. Feature detection can be sped up by filtering the EIC bins with one of the following:
- data from a suspect list (the format follows that of suspect lists for suspect screening workflows)
- data from precursors detected in data-dependent MS/MS experiments (DDA)
This will considerably limit the number of EICs that need to be generated and processed.
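The effect of binning and filtering can be illustrated with a small base R sketch. This is not how patRoon implements piek: the helper names makeBins() and filterBins() and the bin width are assumptions chosen purely for illustration.

```r
# Illustrative sketch only (hypothetical helpers, not part of patRoon)

# Create fixed-width m/z bins covering mzRange
makeBins <- function(mzRange, mzStep) {
    starts <- seq(mzRange[1], mzRange[2] - mzStep, by = mzStep)
    data.frame(mzmin = starts, mzmax = starts + mzStep)
}

# Keep only the bins that contain at least one m/z of interest
filterBins <- function(bins, mzs) {
    keep <- vapply(seq_len(nrow(bins)), function(i) {
        any(mzs >= bins$mzmin[i] & mzs < bins$mzmax[i])
    }, logical(1))
    bins[keep, , drop = FALSE]
}

bins <- makeBins(mzRange = c(100, 200), mzStep = 0.5)
suspMz <- c(120.0556, 137.0709, 146.0600) # example suspect m/z values
binsFiltered <- filterBins(bins, suspMz)

nrow(bins)         # 200 bins without filtering
nrow(binsFiltered) # 3 bins remain after filtering
```

In this toy case only three of the 200 bins would need an EIC, which is the kind of reduction the filters aim for.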
The getPiekEICParams() function is used to create the parameters that are used to generate the EICs. Its output is passed to the genEICParams argument of the findFeatures() function. Similarly, the getDefPeakParams() function is used to create the parameters that are used to detect peaks, and its output is passed to the peakParams argument of findFeatures().
Some examples are shown below:
# find features from m/z bins with piek's native peak detection algorithm
genEICParams <- getPiekEICParams()
peakParams <- getDefPeakParams("chrom", "piek")
fList <- findFeatures(anaInfo, "piek", genEICParams = genEICParams, peakParams = peakParams)
# as above, but customize binning
genEICParams <- getPiekEICParams(mzRange = c(50, 1000), mzStep = 0.01)
peakParams <- getDefPeakParams("chrom", "piek")
fList <- findFeatures(anaInfo, "piek", genEICParams = genEICParams, peakParams = peakParams)
# find features from a suspect list with XCMS peak detection
# see the suspect screening section for more details on the suspect list format
genEICParams <- getPiekEICParams(filter = "suspects")
peakParams <- getDefPeakParams("chrom", "xcms3")
fList <- findFeatures(anaInfo, "piek", genEICParams = genEICParams, peakParams = peakParams,
suspects = suspList, adduct = "[M+H]+")
# use DDA MS/MS data to find features
# only find features with data between two and ten minutes retention time.
# increase minimum spectrum TIC to focus on features with high intensity MS/MS data
genEICParams <- getPiekEICParams(filter = "ms2", retRange = c(120, 600), minTIC = 1E5)
peakParams <- getDefPeakParams("chrom", "openms")
fList <- findFeatures(anaInfo, "piek", genEICParams = genEICParams, peakParams = peakParams)
For more details, please see the reference manual (?findFeaturesPiek).
4.5.2 Feature grouping
For grouping features across analyses several algorithms are supported.
| Algorithm | Usage | Remarks |
|---|---|---|
| OpenMS | groupFeatures(algorithm = "openms", ...) | Uses the FeatureLinkerUnlabeled algorithm (and MapAlignerPoseClustering for retention alignment) |
| XCMS | groupFeatures(algorithm = "xcms", ...) | Uses the xcms::group() and xcms::retcor() functions |
| XCMS3 | groupFeatures(algorithm = "xcms3", ...) | Uses the xcms::groupChromPeaks() and xcms::adjustRtime() functions |
| KPIC2 | groupFeatures(algorithm = "kpic2", ...) | Uses the KPIC2 package |
| greedy | groupFeatures(algorithm = "greedy", ...) | Introduced in patRoon 3.0 |
| SIRIUS | groupFeatures(anaInfo, algorithm = "sirius") | Finds and groups features with SIRIUS |
NOTE: Grouping features with the sirius algorithm will perform both finding and grouping features with SIRIUS. This algorithm cannot work with features from another algorithm.
Just like finding features, each algorithm has its own set of parameters. Often the defaults are a good start, but it is recommended to have a look at them. See ?groupFeatures for more details. The algorithm used to group features does not have to match the algorithm that was used to find the features.
Some examples of grouping features are shown below.
# Group features, using the fList objects created in the previous feature finding section
fGroupsOMS <- groupFeatures(fListOMS, "openms") # OpenMS grouping, default settings
fGroupsOMS2 <- groupFeatures(fListOMS2, "openms", rtalign = FALSE) # OpenMS grouping, no RT alignment
fGroupsOMS3 <- groupFeatures(fListXCMS, "openms", maxGroupRT = 6) # group XCMS features with OpenMS, adjusted grouping parameter
# group enviPick features with XCMS3, disable minFraction
fGroupsXCMS <- groupFeatures(fListEP, "xcms3",
                             xcms::PeakDensityParam(sampleGroups = anaInfo$replicate,
                                                    minFraction = 0))
# group with KPIC2 and set some custom grouping/aligning parameters
fGroupsKPIC2 <- groupFeatures(fListKPIC2, "kpic2", groupArgs = list(tolerance = c(0.002, 18)),
alignArgs = list(move = "loess"))
# greedy algorithm with custom tolerances and weights
fGroupsGreedy <- groupFeatures(fListXCMS3, "greedy", rtWindow = 5, mzWindow = 0.003,
scoreWeights = c(retention = 0.5, mz = 3, mobility = 1, intensity = 1))
fGroupsSIRIUS <- groupFeatures(anaInfo, "sirius") # find/group features with SIRIUS
4.5.3 Suspect screening
After features have been grouped, a so-called suspect screening step may be performed. During this step a suspect list is used to screen the detected features and match them to the suspects in the list. Suspect screening can simplify the identification of unknown features, and can simplify the overall workflow by removing the features without matches. The screenSuspects() function is used for this purpose, for instance:
# Perform a very basic suspect screening workflow. The suspects are matched by m/z values, and any non-hits are removed.
suspects <- data.frame(name = c("1H-benzotriazole", "N-Phenyl urea", "2-Hydroxyquinoline"),
mz = c(120.0556, 137.0709, 146.0600))
fGroupsSusp <- screenSuspects(fGroups, suspects, onlyHits = TRUE)
4.5.3.1 Suspect list format
The example above has a very simple suspect list with just three compounds. The format of the suspect list is quite flexible, and can contain the following columns:
- name: The name of the suspect. Mandatory and should be unique and file-name compatible (if not, the name will be automatically renamed to make it compatible).
- rt: The retention time in seconds. Optional. If specified, any feature groups with a different retention time will not be considered to match suspects.
- mz, SMILES, InChI, formula, neutralMass: at least one of these columns must hold data for each suspect row. The mz column specifies the ionized mass of the suspect. If this is not available then data from any of the other columns is used to determine the suspect mass.
- adduct: The adduct of the suspect. Optional. Set this if you are sure that a suspect should be matched by a particular adduct ion and no data in the mz column is available.
- fragments_mz and fragments_formula: optional columns that may assist ID confidence estimation.
- mobility and CCS columns: these are for IMS workflows and can increase the confidence of a suspect match. This is discussed here.
In most cases a suspect list is best made as a csv file, which can then be imported with e.g. the read.csv() function. This is exactly what happens when you specify a suspect list when using the newProject() function.
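As a minimal sketch of this csv round trip, the base R example below writes a small suspect list to a file and reads it back. The temporary file path is only used for illustration; in practice the csv file would typically be prepared by hand or in a spreadsheet program.

```r
# Minimal sketch: store a suspect list as a csv file and read it back
suspects <- data.frame(name = c("1H-benzotriazole", "N-Phenyl urea", "2-Hydroxyquinoline"),
                       mz = c(120.0556, 137.0709, 146.0600),
                       stringsAsFactors = FALSE)

suspFile <- tempfile(fileext = ".csv") # temporary path, for illustration only
write.csv(suspects, suspFile, row.names = FALSE)

# import the suspect list again, keeping names as character strings
suspList <- read.csv(suspFile, stringsAsFactors = FALSE)
str(suspList)
```

The resulting data.frame has the same name and mz columns as the list it was written from.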
Quite often, the ionized masses are not readily available and have to be calculated. In this case, data in any of the SMILES/InChI/formula/neutralMass columns should be provided. Whenever possible, it is strongly recommended to fill in the SMILES column (or InChI), as this will assist ID confidence estimation. Applying this to the above example:
suspects <- data.frame(name = c("1H-benzotriazole", "N-Phenyl urea", "2-Hydroxyquinoline"),
SMILES = c("[nH]1nnc2ccccc12", "NC(=O)Nc1ccccc1", "Oc1ccc2ccccc2n1"))
fGroupsSusp <- screenSuspects(fGroups, suspects, adduct = "[M+H]+", onlyHits = TRUE)
#> Calculating/Validating chemical data... Done!
#> ================================================================================
#> Found 3/3 suspects (100.00%) with 5 hits in total
NOTE: It is highly recommended to install OpenBabel to automatically validate and amend chemical properties such as SMILES, InChI, formulae etc in the suspect list.
Since suspect matching now occurs by the neutral mass, it is required that the adduct information for the feature groups is set. This is done either by setting the adduct argument of screenSuspects() or by feature group adduct annotations.
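The arithmetic behind neutral mass matching can be sketched in base R for the simple case of a singly charged [M+H]+ ion. This is only an illustration and not how patRoon calculates or matches masses internally: the neutral masses are monoisotopic values worked out by hand from the formulas, and the feature m/z values, the helper name mzToNeutral() and the tolerance are assumptions.

```r
# Illustrative sketch only (not patRoon code)
protonMass <- 1.007276 # mass of a proton (Da)

# neutral monoisotopic masses of the example suspects (calculated from their formulas)
suspNeutral <- c("1H-benzotriazole" = 119.0483,   # C6H5N3
                 "N-Phenyl urea" = 136.0637,      # C7H8N2O
                 "2-Hydroxyquinoline" = 145.0528) # C9H7NO

# convert a measured [M+H]+ m/z to a neutral mass
mzToNeutral <- function(mz) mz - protonMass

# measured m/z values of some hypothetical feature groups
featMz <- c(120.0557, 137.0708, 250.1234)
featNeutral <- mzToNeutral(featMz)

# match by neutral mass with an absolute tolerance (assumed value)
tol <- 0.002
hits <- lapply(suspNeutral, function(nm) which(abs(featNeutral - nm) <= tol))
hits # indices of matching feature groups per suspect
```

Without the adduct, the measured m/z values could not be converted to neutral masses, which is why the adduct information is required in this situation.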
Finally, when the adduct is known for a suspect it can be specified in the suspect list:
# Aldicarb is measured with a sodium adduct.
suspects <- data.frame(name = c("1H-benzotriazole", "N-Phenyl urea", "Aldicarb"),
SMILES = c("[nH]1nnc2ccccc12", "NC(=O)Nc1ccccc1", "CC(C)(C=NOC(=O)NC)SC"),
adduct = c("[M+H]+", "[M+H]+", "[M+Na]+"))
fGroupsSusp <- screenSuspects(fGroups, suspects, onlyHits = TRUE)
To summarize:
- If a suspect has data in the mz column it will be directly matched with the m/z value of a feature group.
- Otherwise, if the suspect has data in the adduct column, the m/z value for the suspect is calculated from its neutral mass and the adduct and then matched with the m/z of a feature group.
- Otherwise, suspects and feature groups are matched by their neutral mass.
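The order of these rules can be expressed as a small base R helper that reports, for each suspect row, which kind of matching would apply. The function matchMode() and the example suspect list are hypothetical and only mirror the summary above; this is not patRoon's implementation.

```r
# Illustrative sketch only: decide per suspect how it would be matched,
# following the three rules summarized above (not patRoon code)
matchMode <- function(suspects) {
    # TRUE for rows where the given column exists and holds a non-empty value
    hasVal <- function(col) {
        if (is.null(suspects[[col]]))
            return(rep(FALSE, nrow(suspects)))
        v <- suspects[[col]]
        !is.na(v) & nzchar(as.character(v))
    }
    ifelse(hasVal("mz"), "mz",
           ifelse(hasVal("adduct"), "adduct", "neutralMass"))
}

suspects <- data.frame(name = c("A", "B", "C"), # hypothetical suspects
                       mz = c(120.0556, NA, NA),
                       adduct = c("", "[M+Na]+", ""),
                       stringsAsFactors = FALSE)
modes <- matchMode(suspects)
modes # "mz" "adduct" "neutralMass"
```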
The fragments_mz and fragments_formula columns in the suspect list can be used to specify known fragments for a suspect, which can help ID confidence estimation for suspects. The former specifies the ionized m/z of known MS/MS peaks, whereas the second specifies known formulas. Multiple values can be given by separating them with a semicolon:
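As a sketch of what such a suspect list could look like, the example below adds both columns to a data.frame and shows how the semicolon-separated values can be split again in base R. The fragment values are placeholders for illustration and are not verified reference data.

```r
# Illustrative suspect list; the fragment values are placeholders, not reference data
suspects <- data.frame(name = c("1H-benzotriazole", "N-Phenyl urea"),
                       SMILES = c("[nH]1nnc2ccccc12", "NC(=O)Nc1ccccc1"),
                       fragments_mz = c("92.0495;65.0386", ""),
                       fragments_formula = c("", "C6H8N;C7H6NO"),
                       stringsAsFactors = FALSE)

# split the semicolon-separated values into one vector per suspect
fragMz <- lapply(strsplit(suspects$fragments_mz, ";", fixed = TRUE), as.numeric)
fragForm <- strsplit(suspects$fragments_formula, ";", fixed = TRUE)

fragMz[[1]]   # two fragment m/z values for the first suspect
fragForm[[2]] # two fragment formulas for the second suspect
```

An empty string simply means that no fragments are specified for that suspect in that column.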
4.5.3.2 Removing feature groups without hits
Note that any feature groups that were not matched to a suspect are not removed by default. If you want to remove these, set the onlyHits argument of screenSuspects() to TRUE (onlyHits = TRUE), as was done in the examples earlier in this section.
The advantage of removing non-hits is that it may significantly reduce the complexity of your dataset. On the other hand, retaining all features allows you to mix a full non-target analysis with a suspect screening workflow. The filter() function (discussed here) can also be used to remove feature groups without a hit at a later stage.
4.5.3.3 Combining screening results
The amend argument of screenSuspects() can be used to combine screening results from different suspect lists.
fGroupsSusp <- screenSuspects(fGroups, suspects)
fGroupsSusp <- screenSuspects(fGroupsSusp, suspects2, onlyHits = TRUE, amend = TRUE)
In this example the suspect lists defined in suspects and suspects2 are both used for screening. By setting amend=TRUE the original screening results (i.e. from suspects) are preserved. Note that onlyHits should only be set in the final call to screenSuspects to ensure that all feature groups are screened.
4.5.4 Importing feature data
The importFeatures() and importFeatureGroups() functions can be used to import existing feature data. This is useful when you have already performed feature finding and/or grouping with another software or package, or when you want to use features from a previous workflow run.
The most important input types are:
| Input type | Remarks |
|---|---|
| "xcms" | Imports an xcmsSet object from XCMS. |
| "xcms3" | Imports an XCMSnExp object from XCMS. |
| "kpic2" | Imports a PIC set object from KPIC2. |
| "table" | Imports data from a table (e.g. data.frame or .csv file). |
The table input type allows the import of data from any external feature detection algorithm, and follows the format of the as.data.table() function introduced later.
A workflow where data is first exported and later re-imported can be useful to modify feature data outside patRoon. For instance, the export functionality can be used to export XCMS feature data, process it with XCMS and then re-import it with importFeatures() or importFeatureGroups() to continue the patRoon workflow. Similarly, the as.data.table() function can be used to export feature data to a data.table, which can then be modified and re-imported.
Some examples are shown below:
fListXCMSImp <- importFeatures(xset, "xcms", anaInfo) # import XCMS xcmsSet object
fGroupsXCMS3 <- importFeatureGroups(xs, "xcms3", anaInfo) # import XCMSnExp object with grouped features
# export, modify and re-import features
fListTab <- as.data.table(fList)
fListTab[, mz := mz * 1.0001] # modify the m/z values
fList <- importFeatures(fListTab, "table", anaInfo) # re-import the modified features
See the reference manual for more details, especially for tabular data import (?importFeatures, ?importFeatureGroups).