7.2 Linking parent and transformation product features
This section discusses one of the most important steps in a TP screening workflow: linking feature groups of parents with those of candidate transformation products. During this step, TP components are made, where each component consists of one or more feature groups of detected TPs for a particular parent. Note that componentization was already introduced before, but for very different algorithms. However, the data format for TP componentization is quite similar. After componentization, several filters are available to clean and prioritize the data. These can even allow workflows without obtaining potential TP candidates in advance, which is discussed in a later subsection (Omitting transformation product input).
7.2.1 Componentization
Like other algorithms, the generateComponents generic function is used to generate TP components, by setting the algorithm parameter to "tp".
The following arguments are of importance:
| Argument | Remarks |
|---|---|
| fGroups | The input feature groups for the parents |
| fGroupsTPs | The input feature groups for the TPs |
| ignoreParents | Set to TRUE to ignore feature groups in fGroupsTPs that also occur in fGroups |
| TPs | The input transformation products, i.e. as generated by generateTPs() |
| MSPeakLists, formulas, compounds | Annotation objects used for similarity calculation between the parent and its TPs |
| minRTDiff | The minimum retention time difference (seconds) of a TP for it to be considered to elute differently than its parent. |
7.2.1.1 Feature group input
The fGroups, fGroupsTPs and ignoreParents arguments are used by the componentization algorithm to identify which feature groups can be considered as parents and which as TPs. Three scenarios are possible:
- fGroups=fGroupsTPs and ignoreParents=FALSE: in this case no distinction is made, and all feature groups are considered a parent or TP (default if fGroupsTPs is not specified).
- fGroups and fGroupsTPs contain different subsets of the same featureGroups object and ignoreParents=FALSE: only the feature groups in fGroups/fGroupsTPs are considered as parents/TPs.
- As above, but with ignoreParents=TRUE: the same distinction is made as above, but any feature groups in fGroupsTPs are ignored for TP candidate selection if also present in fGroups.
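The effect of ignoreParents in the second and third scenario can be pictured as a plain set operation on feature group names. The following base R sketch is only an illustration of that idea and not how the componentization algorithm is implemented; the helper selectTPCandidates and the group names are hypothetical.

```r
# Hypothetical feature group names detected in each sample group
groupsBefore <- c("M228_R353", "M226_R342", "M214_R341")  # potential parents
groupsAfter  <- c("M226_R342", "M214_R341", "M200_R310")  # potential TPs

# Illustrative helper (not part of any package): which groups remain TP candidates?
selectTPCandidates <- function(parents, tps, ignoreParents = FALSE) {
    if (ignoreParents) {
        setdiff(tps, parents)  # third scenario: drop the overlap with the parents
    } else {
        tps                    # second scenario: keep every group in the TP subset
    }
}

candidatesKeep   <- selectTPCandidates(groupsBefore, groupsAfter)
candidatesIgnore <- selectTPCandidates(groupsBefore, groupsAfter, ignoreParents = TRUE)
print(candidatesIgnore)
```

With these hypothetical names, only the group that is absent from the parent subset remains a TP candidate when ignoreParents is TRUE.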
The first scenario is often used if it is unknown which feature groups may be parents or which are TPs. Furthermore, this scenario may also be used if the dataset is sufficiently simple, for instance, because a suspect screening with the results from convertToSuspects (discussed in the previous section) would reliably discriminate between parents and TPs. A workflow with the first scenario is demonstrated in the second example.
In all other cases it is recommended to use either the second or third scenario, since making a prior distinction between parent and TP feature groups greatly simplifies the dataset and reduces false positives. A relatively simple example where this can be used is when there are two sample groups: before and after treatment.
componTP <- generateComponents(algorithm = "tp",
                               fGroups = fGroups[ni = treatment == "before"],
                               fGroupsTPs = fGroups[ni = treatment == "after"])

In this example, only those feature groups with features present in the “before” treatment group are considered as parents, and those in “after” may be considered as a TP (see this section on how to group sample analyses with metadata). Since it is likely that there will be some overlap in feature groups between both sample groups, the ignoreParents flag can be used to not consider any of the overlap for TP assignments:
componTP <- generateComponents(algorithm = "tp",
                               fGroups = fGroups[ni = treatment == "before"],
                               fGroupsTPs = fGroups[ni = treatment == "after"],
                               ignoreParents = TRUE)

More sophisticated ways are of course possible to provide an upfront distinction between parent/TP feature groups. In the fourth example a workflow is demonstrated where fold changes are used.
NOTE The feature groups specified for fGroups/fGroupsTPs must always originate from the same featureGroups object.
If TPs were generated with an algorithm that requires parent input (see Obtaining TPs), then it is often mandatory that a suspect screening of parents and TPs is performed prior to componentization. This is necessary for the componentization algorithm to map the feature groups that belong to a particular parent or TP. Note that this step should not be performed for the ann_form and ann_comp algorithms, as these algorithms already provide the necessary mappings. The convertToSuspects function is used to prepare the suspect list:
# perform suspect screening
# NOTE: set includeParents to TRUE since both the parents and TPs should be screened
# NOTE: for the ann_form and ann_comp algorithms no suspect screening is necessary
suspects <- convertToSuspects(TPs, includeParents = TRUE)
fGroupsScr <- screenSuspects(fGroups, suspects, onlyHits = TRUE)
# do the componentization
# a similar distinction between fGroups/fGroupsScr as discussed above can of course also be done
componTP <- generateComponents(fGroups = fGroupsScr, ...)

If a parent screening was already performed in advance, for instance when the input parents to generateTPs are screening results, the screening results for parents and TPs can also be combined. The second example demonstrates this.
Note that in the case a parent suspect is matched to multiple feature groups, a component is made for each match. Similarly, if multiple feature groups match to the same TP suspect, all of them will be incorporated in the component.
When TPs were generated with the logic algorithm, a suspect screening must also be carried out in advance. However, in this case it is not necessary to include the parents (since each parent equals a feature group, no mapping is necessary). The onlyHits argument to screenSuspects must not be set, so that the feature groups of the parents are kept.
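A minimal sketch of this screening step is given below. It assumes that patRoon is installed and that TPsLogic holds the output of generateTPs() with the logic algorithm; the wrapper function screenLogicTPs and the object names are hypothetical and only serve to keep the snippet self-contained.

```r
# Hypothetical wrapper: screen logic TPs while keeping the parent feature groups
screenLogicTPs <- function(fGroups, TPsLogic) {
    # parents are left out of the suspect list: each parent already equals a feature group
    suspects <- patRoon::convertToSuspects(TPsLogic, includeParents = FALSE)
    # onlyHits = FALSE (i.e. not set to TRUE) keeps feature groups without a suspect hit,
    # which includes the parents
    patRoon::screenSuspects(fGroups, suspects, onlyHits = FALSE)
}

# usage (assuming fGroups and TPsLogic exist in the workspace):
# fGroupsScr <- screenLogicTPs(fGroups, TPsLogic)
```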
7.2.1.2 Annotation similarity calculation
If additional annotation data for parents and TPs is given to the componentization algorithm, it will be used to calculate various similarity metrics. Often, the chemical structure for a transformation product is similar to that of its parent. Hence, there is a good chance that a parent and its TPs also share similar MS/MS data.
Firstly, if MS peak lists are provided, then the spectrum similarity is calculated between each parent and its potential TP candidates. This is performed with all three alignment shifts (see the spectrum similarity section for more details).
In case formulas and/or compounds objects are given as input to generateComponents(), then a parent/TP comparison is made by counting the number of fragments and neutral losses that they share (based on the formulae assigned to the MS/MS fragments). The counts are calculated in two different ways:
- from the matches between the parent and the TP candidate
- from the matches between the parent and all the annotation candidates in the input formulas/compounds object for the TP feature group
The latter is mainly used (and only available) in workflows where componentization is performed without previously generated TPs. To improve the usefulness of the total similarity metric, it is highly recommended to pre-treat the annotation objects with e.g. the topMost filter. Both calculation methods pool the data from the input formulas and compounds and only count unique fragment/neutral loss matches.
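The two counting strategies can be illustrated with a small base R sketch. This is only a conceptual illustration under simplifying assumptions (fragments are represented by their formula strings, and all names and values are hypothetical); it is not the code used by generateComponents().

```r
# Hypothetical fragment formulae annotated for the parent (note the duplicate)
parentFrags <- c("C7H12N5", "C4H7N4S", "C3H5N2", "C4H7N4S")

# Hypothetical annotation candidates of one TP feature group with their fragment formulae
tpCandidates <- list(
    cand1 = c("C7H12N5", "C3H5N2", "C2H4N"),
    cand2 = c("C4H7N4S", "C3H5N2")
)

# count unique formulae that occur in both sets
countMatches <- function(a, b) length(intersect(unique(a), unique(b)))

# (1) matches between the parent and one specific TP candidate
fragMatches <- countMatches(parentFrags, tpCandidates$cand1)

# (2) total matches: pool the fragments of all candidates of the TP feature group
totFragMatches <- countMatches(parentFrags, unlist(tpCandidates))

print(c(fragMatches = fragMatches, totFragMatches = totFragMatches))
```

The sketch also shows why pre-filtering candidates matters: every additional candidate in the pool can only increase the total count, so a long candidate list makes the total metric less discriminating.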
7.2.2 Processing data
The output of TP componentization is an object of the componentsTPs class. This derives from the ‘regular’ components class; therefore, all the data processing functionality described before (extraction, subsetting, filtering etc.) is also valid for TP components.
For the as.data.table() method function (and as.data.frame()) the candidates argument can be used to specify if individual candidates for each feature group in the component should be included in the output:
# only output feature group data
as.data.table(componTP, candidates = FALSE)[name == "CMP2", .(name, parent_name, group)]
#>      name   parent_name         group
#>    <char>        <char>        <char>
#> 1:   CMP2 Dimethametryn M228_R353_323
#> 2:   CMP2 Dimethametryn M226_R342_312
#> 3:   CMP2 Dimethametryn M214_R341_264
# also include the candidates for each feature group
as.data.table(componTP, candidates = TRUE)[name == "CMP2", .(name, parent_name, group, candidate_name, formula)]
#>      name   parent_name         group    candidate_name   formula
#>    <char>        <char>        <char>            <char>    <char>
#> 1:   CMP2 Dimethametryn M228_R353_323 Dimethametryn-TP2  C9H17N5S
#> 2:   CMP2 Dimethametryn M226_R342_312 Dimethametryn-TP5 C10H19N5O
#> 3:   CMP2 Dimethametryn M214_R341_264 Dimethametryn-TP9  C8H15N5S
Several additional filters are available to prioritize the data:
| Filter | Remarks |
|---|---|
| retDirMatch | If TRUE only keep TPs with an expected chromatographic retention direction compared to the parent. |
| minSpecSim, minSpecPrec, minSpecSimBoth | The minimum spectrum similarity between the parent and TP. Calculated with no, "precursor" and "both" alignment shifting (see spectrum similarity). |
| minFragMatches, minNLMatches | Minimum number of formula fragment/neutral loss matches between parent and TP (discussed in previous section). |
| minTotFragMatches, minTotNLMatches | As above, but for total matches. |
| formulas | A formulas object used to further verify candidate TPs that were generated by the logic algorithm. |
The retDirMatch filter compares the expected and observed retention time direction of a TP in order to decide if it should be kept. The direction is a value of either -1 (TP elutes before parent), +1 (TP elutes after parent) or 0 (TP elutes very close to the parent or its direction is unknown). The expected directions are taken from the generated transformation products. In most cases the log P values of a TP and its parent are compared. Here, it is assumed that lower log P values result in earlier elution (as is typical for reversed phase LC). For the logic algorithm the retention time direction is taken from the transformation rules table. Note that specifying a large enough value for the minRTDiff argument to generateComponents is important to ensure that some tolerance exists when comparing the retention time directions of parent and TPs. Furthermore, the TPStructParams argument to generateTPs() can be used to tweak the calculation of expected retention time directions from log P values (see ?getDefTPStructParams). This filter does nothing if either the observed or expected direction is zero.
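The decision logic described above can be sketched in a few lines of base R. The functions retDirection and keepByRetDir are hypothetical helpers written for illustration only; the tolerance of 20 seconds and the strict comparison against minRTDiff are assumptions and may differ from the actual implementation.

```r
# Observed direction: -1 (TP elutes earlier), +1 (later) or 0 (within the tolerance)
retDirection <- function(rtParent, rtTP, minRTDiff = 20) {
    d <- rtTP - rtParent
    ifelse(abs(d) < minRTDiff, 0, sign(d))
}

# Keep a TP if either direction is zero (filter does nothing) or if both directions agree
keepByRetDir <- function(expectedDir, observedDir) {
    expectedDir == 0 | observedDir == 0 | expectedDir == observedDir
}

# Hypothetical retention times (seconds) of one parent and four TP feature groups
obs  <- retDirection(rtParent = 353, rtTP = c(342, 341, 420, 300), minRTDiff = 20)
keep <- keepByRetDir(expectedDir = c(-1, 1, -1, -1), observedDir = obs)
print(data.frame(observed = obs, keep = keep))
```

In this sketch the first two TPs elute within the tolerance and are therefore never removed, which illustrates why the choice of minRTDiff directly controls how strict the filter is.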
When TP data was generated with the logic algorithm it is recommended to use the formulas filter. This filter uses formula annotations to verify that (1) a parent feature group contains the elements that are subtracted during the transformation and (2) the TP feature group contains the elements that were added during the transformation. Since the ‘right’ candidate formula is most likely not yet known, this filter looks at all candidates. Therefore, it is recommended to filter the formulas object, for instance, with the topMost filter.
Finally, the plotGraph() method function, which was introduced for exploring transformation hierarchies of structure TPs, can also incorporate componentization results to simplify the plot and mark TP hits.
7.2.3 Omitting transformation product input
The TPs argument to generateComponents can also be omitted (i.e. TPs=NULL). In this case every feature group from fGroupsTPs is considered to be a potential TP for the potential parents specified for fGroups. An advantage is that the screening workflow is not limited to any known TPs or transformations. However, such a workflow places high demands on the prioritization steps before and after the componentization, which are needed to rule out the many false positives that may occur.
When no transformation data is supplied it is crucial to make a prior distinction between parent and TP feature groups. Afterwards, the MS/MS spectral and other annotation similarity filters mentioned in the previous section may be helpful to further prioritize data.
The fourth example demonstrates such a workflow.
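As a rough outline, such a workflow could look like the sketch below. It assumes that patRoon is installed, that the parent and TP feature groups were already separated (e.g. by sample group or fold changes) and that MS peak lists are available. The wrapper componentizeWithoutTPs, its argument names and the similarity threshold of 0.5 are hypothetical choices for illustration.

```r
# Hypothetical wrapper: TP componentization without transformation product input
componentizeWithoutTPs <- function(fGroupsParents, fGroupsTPs, MSPeakLists, minSpecSim = 0.5) {
    compon <- patRoon::generateComponents(
        algorithm = "tp",
        fGroups = fGroupsParents,   # subset with the potential parents
        fGroupsTPs = fGroupsTPs,    # subset with the potential TPs
        ignoreParents = TRUE,       # do not consider overlapping groups as TPs
        TPs = NULL,                 # no prior TP candidates
        MSPeakLists = MSPeakLists   # enables spectrum similarity calculation
    )
    # prioritize afterwards: only keep TPs with sufficiently similar MS/MS spectra
    patRoon::filter(compon, minSpecSim = minSpecSim)
}

# usage (assuming the input objects exist in the workspace):
# componTP <- componentizeWithoutTPs(fGroupsBefore, fGroupsAfter, mslists)
```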
7.2.4 Reporting TP components
The TP components can be reported with the report function. This is done by setting the components function argument (i.e. in the same way as for all other component types). The results will be displayed in a customized format that allows easy exploration of each parent with its TPs. In addition, the TPs argument can be set to include additional data such as transformation pathways.