4.3 Features

Collecting features from the analyses consists of finding all features, grouping them across analyses (optionally after retention time alignment) and, if desired, suspect screening.

4.3.1 Finding and grouping features

Several algorithms are available for finding features. These are listed in the table below alongside their usage and general remarks.

Algorithm	Usage	Remarks
OpenMS	`findFeatures(algorithm = "openms", ...)`	Uses the FeatureFinderMetabo algorithm
XCMS	`findFeatures(algorithm = "xcms", ...)`	Uses `xcms::xcmsSet()` function
XCMS (import)	`importFeatures(algorithm = "xcms", ...)`	Imports an existing `xcmsSet` object
XCMS3	`findFeatures(algorithm = "xcms3", ...)`	Uses `xcms::findChromPeaks()` from the new XCMS3 interface
XCMS3 (import)	`importFeatures(algorithm = "xcms3", ...)`	Imports an existing `XCMSnExp` object
enviPick	`findFeatures(algorithm = "envipick", ...)`	Uses `enviPick::enviPickwrap()`
KPIC2	`findFeatures(algorithm = "kpic2", ...)`	Uses the KPIC2 `R` package
KPIC2 (import)	`importFeatures(algorithm = "kpic2", ...)`	Imports features from KPIC2
SIRIUS	`findFeatures(algorithm = "sirius", ...)`	Uses SIRIUS to find features
SAFD	`findFeatures(algorithm = "safd", ...)`	Uses the SAFD algorithm (experimental)
DataAnalysis	`findFeatures(algorithm = "bruker", ...)`	Uses Find Molecular Features from DataAnalysis (Bruker only)

The performance of these algorithms usually depends heavily on the data and the parameter settings that are used. Since obtaining a good feature dataset is crucial for the rest of the workflow, it is highly recommended to experiment with different settings (this process can also be automated, see the feature optimization section for more details). Some common parameters to look at are listed in the table below. Many more parameters can be set; please see the reference documentation for these (e.g. ?findFeatures).

Algorithm	Common parameters
OpenMS	`noiseThrInt`, `chromSNR`, `chromFWHM`, `mzPPM`, `minFWHM`, `maxFWHM` (see `?findFeatures`)
XCMS / XCMS3	`peakwidth`, `mzdiff`, `prefilter`, `noise` (assuming default `centWave` algorithm, see `?findPeaks.centWave` / `?CentWaveParam`)
enviPick	`dmzgap`, `dmzdens`, `drtgap`, `drtsmall`, `drtdens`, `drtfill`, `drttotal`, `minpeak`, `minint`, `maxint` (see `?enviPickwrap`)
KPIC2	`kmeans`, `level`, `min_snr` (see `?findFeatures` and `?getPIC` / `?getPIC.kmeans`)
SIRIUS	The `sirius` algorithm is currently parameterless
SAFD	`mzRange`, `maxNumbIter`, `resolution`, `minInt` (see `?findFeatures`)
DataAnalysis	See Find -> Parameters… -> Molecular Features in DataAnalysis.

NOTE: Support for SAFD is still experimental and some extra work is required to set everything up. Please see the reference documentation for this algorithm (?findFeatures).

NOTE: DataAnalysis feature settings have to be configured in DataAnalysis prior to calling findFeatures().

Similarly, for grouping features across analyses several algorithms are supported.

Algorithm	Usage	Remarks
OpenMS	`groupFeatures(algorithm = "openms", ...)`	Uses the FeatureLinkerUnlabeled algorithm (and MapAlignerPoseClustering for retention alignment)
XCMS	`groupFeatures(algorithm = "xcms", ...)`	Uses the `xcms::group()` and `xcms::retcor()` functions
XCMS (import)	`importFeatureGroupsXCMS(...)`	Imports an existing `xcmsSet` object.
XCMS3	`groupFeatures(algorithm = "xcms3", ...)`	Uses `xcms::groupChromPeaks()` and `xcms::adjustRtime()` functions
XCMS3 (import)	`importFeatureGroupsXCMS3(...)`	Imports an existing `XCMSnExp` object.
KPIC2	`groupFeatures(algorithm = "kpic2", ...)`	Uses the KPIC2 package
KPIC2 (import)	`importFeatureGroupsKPIC2(...)`	Imports a `PIC set` object
SIRIUS	`groupFeatures(anaInfo, algorithm = "sirius")`	Finds and groups features with SIRIUS
ProfileAnalysis	`importFeatureGroups(algorithm = "brukerpa", ...)`	Import `.csv` file exported from Bruker ProfileAnalysis
TASQ	`importFeatureGroups(algorithm = "brukertasq", ...)`	Imports a Global result table (exported to Excel file and then saved as `.csv` file)

NOTE: Grouping features with the sirius algorithm will perform both finding and grouping features with SIRIUS. This algorithm cannot work with features from another algorithm.

Just like finding features, each grouping algorithm has its own set of parameters. The defaults are often a good start, but it is recommended to have a look at them. See ?groupFeatures for more details.

When using the XCMS algorithms, both the ‘classical’ interface and the latest XCMS3 interface are supported. Currently, both interfaces are mostly the same regarding functionality and implementation. However, since future development of XCMS is primarily focused on the latter, this interface is recommended.

Some examples of finding and grouping features are shown below.

# The anaInfo variable contains analysis information, see the previous section

# Finding features
fListOMS <- findFeatures(anaInfo, "openms") # OpenMS, with default settings
fListOMS2 <- findFeatures(anaInfo, "openms", noiseThrInt = 500, chromSNR = 10) # OpenMS, adjusted minimum intensity and S/N
fListXCMS <- findFeatures(anaInfo, "xcms", ppm = 10) # XCMS
fListXCMSImp <- importFeatures(anaInfo, "xcms", xset) # import XCMS xcmsSet object
fListXCMS3 <- findFeatures(anaInfo, "xcms3", xcms::CentWaveParam(peakwidth = c(5, 15))) # XCMS3
fListEP <- findFeatures(anaInfo, "envipick", minint = 1E3) # enviPick
fListKPIC2 <- findFeatures(anaInfo, "kpic2", kmeans = TRUE, level = 1E4) # KPIC2
fListSIRIUS <- findFeatures(anaInfo, "sirius") # SIRIUS

# Grouping features
fGroupsOMS <- groupFeatures(fListOMS, "openms") # OpenMS grouping, default settings
fGroupsOMS2 <- groupFeatures(fListOMS2, "openms", rtalign = FALSE) # OpenMS grouping, no RT alignment
fGroupsOMS3 <- groupFeatures(fListXCMS, "openms", maxGroupRT = 6) # group XCMS features with OpenMS, adjusted grouping parameter
# group enviPick features with XCMS3, disable minFraction
fGroupsXCMS <- groupFeatures(fListEP, "xcms3",
                             xcms::PeakDensityParam(sampleGroups = anaInfo$group,
                                                    minFraction = 0))
# group with KPIC2 and set some custom grouping/aligning parameters
fGroupsKPIC2 <- groupFeatures(fListKPIC2, "kpic2", groupArgs = list(tolerance = c(0.002, 18)),
                              alignArgs = list(move = "loess"))
fGroupsSIRIUS <- groupFeatures(anaInfo, "sirius") # find/group features with SIRIUS

4.3.2 Suspect screening

After features have been grouped, a so-called suspect screening step may be performed to find features that may correspond to suspects within a given suspect list. The screenSuspects() function is used for this purpose, for instance:

suspects <- data.frame(name = c("1H-benzotriazole", "N-Phenyl urea", "2-Hydroxyquinoline"),
                       mz = c(120.0556, 137.0709, 146.0600))
fGroupsSusp <- screenSuspects(fGroups, suspects)

4.3.2.1 Suspect list format

The example above has a very simple suspect list with just three compounds. The format of the suspect list is quite flexible, and can contain the following columns:

name: The name of the suspect. Mandatory; it should be unique and file-name compatible (if not, the suspect will be automatically renamed to make it compatible).
rt: The retention time in seconds. Optional. If specified any feature groups with a different retention time will not be considered to match suspects.
mz, SMILES, InChI, formula, neutralMass: at least one of these columns must hold data for each suspect row. The mz column specifies the ionized mass of the suspect. If this is not available then data from any of the other columns is used to determine the suspect mass.
adduct: The adduct of the suspect. Optional. Set this if you are sure that a suspect should be matched by a particular adduct ion and no data in the mz column is available.
fragments_mz and fragments_formula: optional columns that may assist suspect annotation.

In most cases a suspect list is best made as a csv file, which can then be imported with e.g. the read.csv() function. This is exactly what happens when you specify a suspect list when using the newProject() function.
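As a minimal sketch of the csv route, the snippet below writes a small suspect list to a temporary file and reads it back with read.csv(). The temporary file and the variable names suspFile and suspectsFromCSV are only illustrative assumptions; in practice the csv file would typically be prepared by hand or exported from a spreadsheet.

```r
# Example suspect list with the mandatory name column and a SMILES column
suspects <- data.frame(name = c("1H-benzotriazole", "N-Phenyl urea", "2-Hydroxyquinoline"),
                       SMILES = c("[nH]1nnc2ccccc12", "NC(=O)Nc1ccccc1", "Oc1ccc2ccccc2n1"),
                       stringsAsFactors = FALSE)

# Write it to a csv file (a temporary file here, purely for illustration)
suspFile <- tempfile(fileext = ".csv")
write.csv(suspects, suspFile, row.names = FALSE)

# Import the csv file again; keep text columns as character vectors
suspectsFromCSV <- read.csv(suspFile, stringsAsFactors = FALSE)
print(suspectsFromCSV)
```

The resulting data.frame has the same columns as the one built in code and could be passed to screenSuspects() in the same way.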

Quite often, the ionized masses are not readily available and have to be calculated. In this case, data in any of the SMILES/InChI/formula/neutralMass columns should be provided. Whenever possible, it is strongly recommended to fill in the SMILES column (or InChI), as this will assist annotation. Applying this to the above example:

suspects <- data.frame(name = c("1H-benzotriazole", "N-Phenyl urea", "2-Hydroxyquinoline"),
                       SMILES = c("[nH]1nnc2ccccc12", "NC(=O)Nc1ccccc1", "Oc1ccc2ccccc2n1"))
fGroupsSusp <- screenSuspects(fGroups, suspects, adduct = "[M+H]+")

#> Calculating/Validating chemical data... Done!
#> ================================================================================
#> Found 3/3 suspects (100.00%)

NOTE: It is highly recommended to install OpenBabel to automatically validate and amend chemical properties (such as SMILES, InChI and formulae) in the suspect list.

Since suspects are now matched by their neutral mass, adduct information for the feature groups must be available. This is provided either through the adduct argument of screenSuspects() or through feature group adduct annotations.

Finally, when the adduct is known for a suspect it can be specified in the suspect list:

# Aldicarb is measured with a sodium adduct.
suspects <- data.frame(name = c("1H-benzotriazole", "N-Phenyl urea", "Aldicarb"),
                       SMILES = c("[nH]1nnc2ccccc12", "NC(=O)Nc1ccccc1", "CC(C)(C=NOC(=O)NC)SC"),
                       adduct = c("[M+H]+", "[M+H]+", "[M+Na]+"))
fGroupsSusp <- screenSuspects(fGroups, suspects)

To summarize:

If a suspect has data in the mz column it will be directly matched with the m/z value of a feature group.
Otherwise, if the suspect has data in the adduct column, the m/z value for the suspect is calculated from its neutral mass and the adduct and then matched with the m/z of a feature group.
Otherwise, suspects and feature groups are matched by their neutral mass.
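The decision order summarized above can be sketched in a few lines of base R. This is only an illustration of the logic, not the actual implementation: suspectMZ() and adductShift are hypothetical names, the mass shifts are the monoisotopic masses of a proton and a sodium cation, and the neutral mass used for aldicarb was calculated from its formula (C7H14N2O2S).

```r
# Mass added to the neutral mass for two common positive adducts
# (monoisotopic H or Na mass minus one electron mass)
adductShift <- c("[M+H]+" = 1.007276, "[M+Na]+" = 22.989221)

# Hypothetical helper: determine the m/z used for matching a single suspect
suspectMZ <- function(mz = NA, neutralMass = NA, adduct = NA) {
    if (!is.na(mz))
        return(mz) # 1. data in the mz column is used directly
    if (!is.na(adduct))
        return(neutralMass + adductShift[[adduct]]) # 2. neutral mass + adduct
    NA_real_ # 3. no m/z available: match by neutral mass instead
}

suspectMZ(mz = 120.0556) # mz column given
suspectMZ(neutralMass = 190.0776, adduct = "[M+Na]+") # aldicarb, sodium adduct
suspectMZ(neutralMass = 190.0776) # no adduct: NA, fall back to neutral mass matching
```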

The fragments_mz and fragments_formula columns in the suspect list can be used to specify known fragments for a suspect, which can help suspect annotation. The former specifies the ionized m/z of known MS/MS peaks, whereas the latter specifies known fragment formulas. Multiple values can be given by separating them with a semicolon:

suspects <- data.frame(name = c("1H-benzotriazole", "N-Phenyl urea", "2-Hydroxyquinoline"),
                       SMILES = c("[nH]1nnc2ccccc12", "NC(=O)Nc1ccccc1", "Oc1ccc2ccccc2n1"),
                       fragments_formula = c("C6H6N", "C6H8N;C7H6NO", ""),
                       fragments_mz = c("", "", "118.0652"))
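To make the semicolon convention concrete, the base R sketch below splits such columns into one vector of fragments per suspect. It only illustrates how the cell contents are structured and is not necessarily how the screening code parses them internally; fragFormulas and fragMZs are hypothetical names.

```r
suspects <- data.frame(name = c("1H-benzotriazole", "N-Phenyl urea", "2-Hydroxyquinoline"),
                       fragments_formula = c("C6H6N", "C6H8N;C7H6NO", ""),
                       fragments_mz = c("", "", "118.0652"),
                       stringsAsFactors = FALSE)

# Split each cell on semicolons; empty cells give zero-length vectors
fragFormulas <- setNames(strsplit(suspects$fragments_formula, ";", fixed = TRUE),
                         suspects$name)
fragMZs <- setNames(lapply(strsplit(suspects$fragments_mz, ";", fixed = TRUE), as.numeric),
                    suspects$name)

lengths(fragFormulas) # number of known fragment formulas per suspect
lengths(fragMZs)      # number of known fragment m/z values per suspect
```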

4.3.2.2 Removing feature groups without hits

Note that any feature groups that were not matched to a suspect are not removed by default. If you want to remove these, you can use the onlyHits parameter:

fGroupsSusp <- screenSuspects(fGroups, suspects, onlyHits = TRUE) # remove any non-hits immediately

The advantage of removing non-hits is that it may significantly reduce the complexity of your dataset. On the other hand, retaining all features allows you to mix a full non-target analysis with a suspect screening workflow. The filter() function (discussed elsewhere in this document) can also be used to remove feature groups without a hit at a later stage.

4.3.2.3 Combining screening results

The amend function argument to screenSuspects can be used to combine screening results from different suspect lists.

fGroupsSusp <- screenSuspects(fGroups, suspects)
fGroupsSusp <- screenSuspects(fGroupsSusp, suspects2, onlyHits = TRUE, amend = TRUE)

In this example the suspect lists defined in suspects and suspects2 are both used for screening. By setting amend=TRUE the original screening results (i.e. from suspects) are preserved. Note that onlyHits should only be set in the final call to screenSuspects to ensure that all feature groups are screened.