6.2 Generating sets workflow data

As shown in the previous section, generating workflow data in a sets workflow largely follows what was discussed in the previous chapters. The same generator functions are used:

Workflow step	Function	Output S4 class
Grouping features	`groupFeatures()`	`featureGroupsSet`
Suspect screening	`screenSuspects()`	`featureGroupsScreeningSet`
MS peak lists	`generateMSPeakLists()`	`MSPeakListsSet`
Formula annotation	`generateFormulas()`	`formulasSet`
Compound annotation	`generateCompounds()`	`compoundsSet`
Componentization	`generateComponents()`	algorithm dependent

(The data pre-treatment and feature finding steps have been omitted because they are not specific to sets workflows.)

While the same function generics are used to generate data, the class of the output objects differs (e.g. formulasSet instead of formulas). However, since all these classes inherit from their non-sets workflow counterparts, using the workflow data in a sets workflow is nearly identical to what was discussed in the previous chapters (this is discussed further in the next section).
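The effect of this inheritance can be illustrated with a minimal sketch in plain R. The class names `toyFormulas` and `toyFormulasSet` and the generic `nCandidates()` are hypothetical stand-ins invented for this example; they are not the actual patRoon class definitions. The sketch only shows the general S4 mechanism: a method written for a parent class is also dispatched for objects of a derived class.

```r
library(methods)

# Hypothetical toy classes; these are NOT the real patRoon class definitions
setClass("toyFormulas", slots = c(candidates = "character"))
setClass("toyFormulasSet", contains = "toyFormulas",
         slots = c(sets = "character"))

# A method that is only defined for the parent class ...
setGeneric("nCandidates", function(obj) standardGeneric("nCandidates"))
setMethod("nCandidates", "toyFormulas", function(obj) length(obj@candidates))

# ... is also dispatched for an object of the derived "sets" class
fSet <- new("toyFormulasSet",
            candidates = c("C6H6O", "C7H8"),
            sets = c("positive", "negative"))

isParent <- is(fSet, "toyFormulas") # does the object inherit from the parent?
n <- nCandidates(fSet)              # uses the method written for toyFormulas
```

By the same mechanism, code written against the non-sets classes can generally be expected to keep working when it receives the corresponding sets class.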

As discussed before, an important step is the neutralization of features. Other workflow steps also have internal mechanics to deal with data from different sets:

Workflow step	Handling of set data
Finding/Grouping features	Neutralization of m/z values
Suspect screening	Merging results from screening performed for each set
Componentization	Algorithm dependent (discussed below)
MS peak lists	MS data is obtained and stored per set. The final peak lists are combined (not averaged)
Formula/Compound annotation	Annotation is performed for each set separately and used to generate a final consensus
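The neutralization mentioned in the first row of the table can be sketched with some simple arithmetic. The example below is an illustration in plain R and not patRoon's implementation: the m/z values are hypothetical, only the singly charged adducts [M+H]+ and [M-H]- are considered, and the variable names are made up for this example.

```r
# Mass of a proton in Da (hydrogen atom mass minus the mass of one electron)
protonMass <- 1.007276

# Hypothetical features of the same compound, measured in two sets
features <- data.frame(
    set    = c("positive", "negative"),
    adduct = c("[M+H]+", "[M-H]-"),
    mz     = c(181.07066, 179.05611),
    stringsAsFactors = FALSE
)

# m/z shift introduced by each adduct (assumption: singly charged ions only)
adductShift <- c("[M+H]+" = protonMass, "[M-H]-" = -protonMass)

# Neutralize: remove the adduct shift to obtain the neutral mass
features$neutralMass <- features$mz - unname(adductShift[features$adduct])

print(features)
```

After neutralization both features end up at practically the same neutral mass, which is what makes it possible to group features that were measured with different ionization modes.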

In most cases the algorithms of the workflow steps are first performed for each set, and the resulting data is then merged. To illustrate the importance of this, consider these examples:

A suspect screening with a suspect list that contains known MS/MS fragments
Annotation where MS/MS fragments are used to predict the chemical formula
Componentization in order to establish adduct assignments for the features

In all cases, the data used is highly dependent on the MS method (e.g. polarity) that was used to acquire the sample data. Nevertheless, all the steps needed to obtain and combine set data are performed automatically in the background and are therefore largely invisible.

NOTE Because feature groups in sets workflows always have adduct annotations, it is never required to specify the adduct or ionization mode when generating annotations or components, or when performing suspect screening (i.e. the adduct/ionization arguments should not be specified).

6.2.1 Componentization

When componentization algorithms related to adduct/isotope annotation (e.g. CAMERA, RAMClustR and cliqueMS) or the nontarget algorithm are used, componentization occurs per set and the final object (a componentsSet or componentsNTSet) contains all the components together. Since these algorithms are highly dependent upon MS data polarity, no attempt is made to merge components from different sets.

The other componentization algorithms work on the complete data. For more details, see the reference manual (?generateComponents).

6.2.2 Formula and compound annotation

For formula and compound annotation, the data generated for each set is combined to generate a set consensus. The annotation tables are merged, scores are averaged and candidates are re-ranked. More details can be found in the reference manual (e.g. ?generateCompounds). In addition, it is possible to only keep candidates that exist in a minimum number of sets. For this, the setThreshold and setThresholdAnn arguments can be used:

# candidate must be present in all sets
formulas <- generateFormulas(fGroups, mslists, "genform", setThreshold = 1)
# candidate must be present in all sets with annotation data
compounds <- generateCompounds(fGroups, mslists, "metfrag", setThresholdAnn = 1)

In the first example, a formula candidate for a feature group is only kept if it was found for all of the sets. In the second example, a compound candidate is only kept if it was present in all of the sets with annotation data available. The following examples of a common positive/negative sets workflow illustrate the differences:

Candidate	Sets with annotations	Sets with candidate	`setThreshold=1`	`setThresholdAnn=1`
#1	`+`, `-`	`+`, `-`	Keep	Keep
#2	`+`, `-`	`+`	Remove	Remove
#3	`+`	`+`	Remove	Keep
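The decisions in the table above can be reproduced with a small sketch in plain R. It assumes that both thresholds act as minimum fractions: `setThreshold` relative to all sets, and `setThresholdAnn` relative to only those sets that have annotation data for the feature group. This reading is inferred from the table and the description above; the code is an illustration, not patRoon's implementation, and all variable names are hypothetical.

```r
# All sets in a typical positive/negative sets workflow
sets <- c("+", "-")

# Hypothetical candidates, mirroring the table above:
# 'annotated' = sets with annotation data, 'present' = sets with the candidate
candidates <- list(
    "#1" = list(annotated = c("+", "-"), present = c("+", "-")),
    "#2" = list(annotated = c("+", "-"), present = "+"),
    "#3" = list(annotated = "+",         present = "+")
)

# Fraction of ALL sets in which the candidate was found
fracAll <- sapply(candidates, function(cd) length(cd$present) / length(sets))

# Fraction of the sets WITH annotation data in which the candidate was found
fracAnn <- sapply(candidates,
                  function(cd) length(cd$present) / length(cd$annotated))

setThreshold <- 1
setThresholdAnn <- 1

result <- data.frame(
    candidate = names(candidates),
    keepSetThreshold = unname(fracAll >= setThreshold),
    keepSetThresholdAnn = unname(fracAnn >= setThresholdAnn)
)
print(result)
```

Candidate #3 shows the difference between the two arguments: it is missing from half of all sets, but it is present in every set for which annotation data is available.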

For more information refer to the reference manual (e.g. ?generateCompounds).