5.3 Subsetting

The previous section discussed the filter() generic function, which performs various data cleaning operations. A more general way to select data is subsetting, where you manually specify which parts of an object should be retained. Subsetting is supported for all workflow objects and is performed with the R subset operator ("["). This operator takes one or two arguments, referred to as i and j. What they select depends on the class of the object:

Class          Argument i      Argument j      Remarks
features       analyses
featureGroups  analyses        feature groups
MSPeakLists    analyses        feature groups  peak lists for feature groups are only re-averaged after subsetting on analyses if reAverage = TRUE (see the NOTE below)
formulas       feature groups
compounds      feature groups
components     components      feature groups

For objects that support two-dimensional subsetting (e.g. featureGroups, MSPeakLists), either the i or the j argument may be omitted. Furthermore, unlike when subsetting a data.frame, the positions of i and j do not change when only one argument is specified: a single argument is always interpreted as i.

df[1, 1] # subset data.frame by first row/column
df[1] # subset by first column
df[1, ] # subset by first row

fGroups[1, 1] # subset by first analysis/feature group
fGroups[, 1] # subset by first feature group (i.e. column)
fGroups[1] # subset by first analysis (i.e. row)
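
The difference from data.frame indexing can be reproduced with a small stand-in class. The sketch below is only an illustration: ToyGroups and its ints slot are hypothetical and not part of patRoon, and the method merely assumes the convention described above (i selects analyses/rows, j selects feature groups/columns, and a missing argument keeps everything along that dimension).

```r
library(methods)

# Hypothetical toy class (NOT part of patRoon): an intensity matrix with
# analyses as rows and feature groups as columns.
setClass("ToyGroups", representation(ints = "matrix"))

setMethod("[", "ToyGroups", function(x, i, j, ..., drop = TRUE) {
    # a missing i or j means "keep everything" along that dimension
    if (missing(i)) i <- seq_len(nrow(x@ints))
    if (missing(j)) j <- seq_len(ncol(x@ints))
    x@ints <- x@ints[i, j, drop = FALSE]
    x
})

m <- matrix(1:6, nrow = 2,
            dimnames = list(c("ana1", "ana2"), c("g1", "g2", "g3")))
tg <- new("ToyGroups", ints = m)

byAnalysis <- tg[1]    # single argument is i: first analysis, all groups
byGroup <- tg[, 1]     # only j given: all analyses, first group
byName <- tg["ana2", c("g1", "g3")] # character input works via the dimnames
```

Because the single-argument form always maps to i, code that subsets by analysis reads the same for one- and two-dimensional classes.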

The subset operator allows three types of input:

  • A logical vector: elements are selected if corresponding values are TRUE.
  • A numeric vector: select elements by numeric index.
  • A character vector: select elements by their name.

When a logical vector is used as input, it is recycled if necessary. For instance, the following selects the first, third, fifth, etc. analyses.

fGroups[c(TRUE, FALSE)]
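
The three input types and the recycling of logical vectors can be tried out on a plain named vector. This is a minimal sketch in base R; it assumes that the workflow objects follow the same indexing rules in this respect. The vector ints and its names are made up for illustration.

```r
# toy named vector standing in for the analyses of a workflow object
ints <- c(ana1 = 10, ana2 = 20, ana3 = 30, ana4 = 40, ana5 = 50)

# logical: c(TRUE, FALSE) is recycled to TRUE, FALSE, TRUE, FALSE, TRUE
byLogical <- ints[c(TRUE, FALSE)]

# numeric: select by position
byNumeric <- ints[c(1, 3, 5)]

# character: select by name
byName <- ints[c("ana1", "ana3", "ana5")]

# all three approaches select the same elements (ana1, ana3 and ana5)
stopifnot(identical(byLogical, byNumeric), identical(byNumeric, byName))
```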

To select by a character vector you need to know the name of each element. These names can be obtained with, for instance, the groupNames() (feature group names), analyses() (analysis names) and names() (names of components, or of feature groups for featureGroups objects) generic functions.

Some more examples of common subsetting operations are shown below.

# select first three analyses
fList[1:3]

# select first three analyses and first 500 feature groups
fGroups[1:3, 1:500]

# select all feature groups from first component
fGroupsNT <- fGroups[, componNT[[1]]$group]

# only keep feature groups with formula annotation results
fGroupsForms <- fGroups[, groupNames(formulas)]

# only keep feature groups with either formula or compound annotation results
fGroupsAnn <- fGroups[, union(groupNames(formulas), groupNames(compounds))]

# select first 15 components
components[1:15]

# select by name
components[c("CMP1", "CMP5")]

# only retain feature groups in components for which compound annotations are
# available
components[, groupNames(compounds)]

In addition, feature groups can also be subset by given replicate groups or annotation/componentization results (similar to filter()). Similarly, suspect screening results can also be subset by given suspect names.

# equivalent to filter(fGroups, rGroups = ...)
fGroups[rGroups = c("repl1", "repl2")]
# equivalent to filter(fGroups, results = ...)
fGroups[results = compounds]
# only keep feature groups assigned to given suspects
fGroupsSusp[suspects = c("1H-benzotriazole", "2-Hydroxyquinoline")]

NOTE As of patRoon 2.0, MS peak lists are not re-generated after a subsetting operation (unless the reAverage parameter is explicitly set to TRUE). The reason for this change is that re-averaging invalidates any formula/compound annotation data (e.g. used for plotting and reporting) that were generated prior to the subset operation.

5.3.1 Prioritization workflow

An important use case of subsetting is prioritization of data. For instance, after statistical analysis only certain feature groups are deemed relevant for the rest of the workflow. A common prioritization workflow is illustrated below:

During the first step the workflow object is converted to a suitable format, most often with the as.data.frame() function. The converted data then serves as input for the prioritization strategy. Finally, the results are used to select the data of interest in the original object.

A very simplified example of such a process is shown below.

featTab <- as.data.frame(fGroups, average = TRUE)

# prioritization: sort by (averaged) intensity of the "standard" replicate group
# (from high to low) and then obtain the feature group identifiers of the top 5.
featTab <- featTab[order(featTab$standard, decreasing = TRUE), ]
groupsOfInterest <- featTab$group[1:5]

# subset the original data
fGroups <- fGroups[, groupsOfInterest]

# fGroups now only contains the feature groups for which intensity values in the
# "standard" replicate group were in the top 5
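
The final selection step does not depend on which prioritization strategy produced the group names. As a minimal sketch on made-up data, the following replaces the top-5 criterion with a fold-change threshold. Everything here is hypothetical: the table featTab, the matrix ints that stands in for the workflow object, the replicate group columns standard and solvent, and the threshold minFoldChange. With real data the table would come from as.data.frame() and the last step would subset fGroups instead of a matrix.

```r
# toy table of averaged intensities: one row per feature group,
# one column per (hypothetical) replicate group
featTab <- data.frame(
    group = c("g1", "g2", "g3", "g4"),
    standard = c(5e4, 2e5, 1e3, 9e4),
    solvent = c(4e4, 1e4, 9e2, 2e4),
    stringsAsFactors = FALSE
)

# toy stand-in for the original object: analyses (rows) x feature groups (columns)
ints <- matrix(1:8, nrow = 2,
               dimnames = list(c("ana1", "ana2"), featTab$group))

# prioritization: keep groups that are at least minFoldChange times more
# intense in "standard" than in "solvent"
minFoldChange <- 3
keep <- featTab$standard >= minFoldChange * featTab$solvent
groupsOfInterest <- featTab$group[keep]

# subset the original data by feature group name (the j argument)
intsSub <- ints[, groupsOfInterest, drop = FALSE]
```

Here only g2 and g4 pass the threshold, so intsSub retains those two columns for both analyses.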