Obtain transformation products (TPs) from compound annotation candidates

Transforms and prioritizes compound annotation candidates to obtain TPs.

generateTPsAnnComp(
  parents,
  compounds,
  TPsRef = NULL,
  fGroupsComps = NULL,
  minRTDiff = 20,
  minFitFormula = 0.94,
  minFitCompound = 0,
  minSimSusp = 0,
  minFitCompOrSimSusp = c(0.54, 0.65),
  extraOptsFMCSR = NULL,
  skipInvalid = TRUE,
  prefCalcChemProps = TRUE,
  neutralChemProps = FALSE,
  neutralizeTPs = TRUE,
  TPStructParams = getDefTPStructParams(),
  parallel = TRUE
)

Arguments

parents

The parents for which transformation products should be obtained. This can be either:

a suspect list (see suspect screening for more information)
the output of screenSuspects, in which case the suspect hits are used as parents

The parents need to have SMILES or InChI information available.

compounds

The compounds object containing the compound candidates.

TPsRef

A transformationProductsStructure object containing suspect TP candidates obtained for the same parents from another TP generation algorithm (e.g. generateTPsBioTransformer). This is used for the calculation of simSusp (see Details). Set to NULL to skip its calculation.

fGroupsComps

The featureGroups object for which the compounds were generated. This is used to obtain retention times for the calculation of retention order directions. Set to NULL to skip this calculation.

minRTDiff

Minimum retention time difference (in seconds) between the parent and a TP that is required to calculate the retention order direction. Candidates with unexpected retention orders are filtered out.

minFitFormula, minFitCompound, minSimSusp

Thresholds to filter out unlikely candidates. For fitFormula: see generateTPsAnnForm, for the others see the Details section.

minFitCompOrSimSusp

A numeric vector of length two specifying the thresholds for fitCompound or simSusp, respectively.

extraOptsFMCSR

A list with additional options passed to the fmcsR::fmcs function. The following defaults are set: au=1, bu=4, matching.mode="aromatic".

skipInvalid

If set to TRUE then parents for which insufficient information (e.g. SMILES) is available will be skipped (with a warning).

prefCalcChemProps

If TRUE then calculated chemical properties such as the formula and InChIKey are preferred over what is already present in the parent suspect list. For efficiency reasons it is recommended to set this to TRUE. See the Validating and calculating chemical properties section for more details.

neutralChemProps

If TRUE then the neutral form of the molecule is considered to calculate SMILES, formulae etc. Enabling this may improve feature matching when considering common adducts (e.g. [M+H]+, [M-H]-). See the Validating and calculating chemical properties section for more details.

neutralizeTPs

If TRUE then all resulting TP structure information is neutralized. This argument has a similar meaning as neutralChemProps. It defaults to TRUE for prediction algorithms, as these may output charged molecules. NOTE: if neutralization results in duplicate TPs, i.e. when the neutral form of the TP was also generated by the algorithm, then the neutralized TP will be removed.

TPStructParams

Parameters that influence the calculation of structural properties. See getDefTPStructParams.

parallel

If set to TRUE then code is executed in parallel through the future package. Please see the parallelization section in the handbook for more details.

Value

generateTPsAnnComp returns an object of the class transformationProductsAnnComp. Please see its documentation for e.g. filtering steps that can be performed on this object.

Details

This function uses compound annotations to obtain transformation products. This function is called when calling generateTPs with algorithm="ann_comp".
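As a minimal sketch of the generic entry point mentioned above, the wrapper below forwards its inputs to generateTPs with algorithm="ann_comp". The wrapper name runAnnCompTPs is hypothetical, and the parents, compounds and fGroupsComps objects are assumed to have been created earlier in the workflow. Only arguments listed in the usage section are passed on.

```r
# Hypothetical convenience wrapper (not part of the package).
# Defining it does not require the package; calling it does.
runAnnCompTPs <- function(parents, compounds, fGroupsComps = NULL) {
    patRoon::generateTPs(
        algorithm = "ann_comp",       # dispatches to generateTPsAnnComp
        parents = parents,            # suspect list or screenSuspects output
        compounds = compounds,        # compound annotation candidates
        fGroupsComps = fGroupsComps,  # NULL skips retention order directions
        minRTDiff = 20                # default from the usage section
    )
}
```

Calling generateTPsAnnComp directly with the same arguments should be equivalent.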

The generateTPsAnnComp function implements the approach for screening unknown TPs from compound candidates, as described in (Helmus et al. 2025). This algorithm does not rely on any known or predicted TPs and is therefore suitable for 'full non-target' workflows. All compound candidates are considered as potential TPs and are ranked by the TP score:

$$\text{TP score} = \max(\text{fitCompound}, \text{simSusp}) + \text{annSim}$$

With:

annSim: the annotation similarity
fitCompound: the structural fit of the compound candidate into the parent (or vice versa, maximum is taken). Calculated as the "Overlap coefficient" with fmcsR::fmcs. The molecular data is prepared with rcdk and ChemmineR.
simSusp: the maximum structural similarity with TP suspect candidates for this parent (i.e. obtained from other algorithms of generateTPs). The calculation is configured by the TPStructParams.
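The TP score defined above can be illustrated with a few lines of base R. This is only a sketch: the candidate table and its values are made up, and the column names simply mirror the terms of the formula rather than the actual layout of the returned object.

```r
# Hypothetical candidates with made-up scores
cands <- data.frame(
    candidate   = c("A", "B", "C"),
    annSim      = c(0.80, 0.40, 0.65),
    fitCompound = c(0.90, 0.30, 0.55),
    simSusp     = c(0.50, 0.70, 0.60),
    stringsAsFactors = FALSE
)

# TP score = max(fitCompound, simSusp) + annSim, evaluated per candidate
cands$TPScore <- pmax(cands$fitCompound, cands$simSusp) + cands$annSim

# Rank candidates from the highest to the lowest score
cands <- cands[order(cands$TPScore, decreasing = TRUE), ]
print(cands)
```

Because the larger of the two structural terms is used, a candidate can rank highly through a good fit with the parent or through similarity with a known TP suspect.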

To speed up the calculation process, several thresholds are applied to rule out unlikely candidates. These thresholds default to those derived in (Helmus et al. 2025). Nevertheless, calculations can take a very long time (multiple hours), especially when processing large numbers of candidates from e.g. PubChem.
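The sketch below shows one plausible reading of how the structural thresholds could combine. It assumes that the individual thresholds must all be met, whereas minFitCompOrSimSusp is satisfied when either score reaches its respective limit. This interpretation and the made-up candidate table are assumptions; the actual implementation may differ.

```r
# Hypothetical candidates with made-up scores
cands <- data.frame(
    candidate   = c("A", "B", "C", "D"),
    fitCompound = c(0.90, 0.30, 0.50, 0.60),
    simSusp     = c(0.20, 0.70, 0.60, 0.10),
    stringsAsFactors = FALSE
)

# Default values from the usage section
minFitCompound <- 0
minSimSusp <- 0
minFitCompOrSimSusp <- c(0.54, 0.65)

# Assumed logic: both individual thresholds AND
# (fitCompound threshold OR simSusp threshold) of the combined filter
keep <- cands$fitCompound >= minFitCompound &
    cands$simSusp >= minSimSusp &
    (cands$fitCompound >= minFitCompOrSimSusp[1] |
         cands$simSusp >= minFitCompOrSimSusp[2])

print(cands[keep, ])
```

Under these assumptions candidate C is removed, since it reaches neither of the two combined limits.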

Unlike most other TP generation algorithms, no additional suspect screening step is required.

Note

Setting parallel=TRUE can speed up calculations considerably on multi-core systems, but will also add to RAM usage. Furthermore, parallelization is only favorable for long calculations due to the overhead of setting up multiple R processes. Note that the parallel workers must be on the same system, i.e. this will not work on e.g. clusters.
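A minimal sketch of a local setup with the future package is given below. The choice to leave one core free and the variable names are assumptions made for illustration; the call to generateTPsAnnComp is commented out because it needs parents and compounds objects from an existing workflow.

```r
# Determine a worker count; detectCores() may return NA on some platforms
nCores <- parallel::detectCores()
nWorkers <- if (is.na(nCores)) 1L else max(1L, nCores - 1L)

if (requireNamespace("future", quietly = TRUE)) {
    # Background R sessions on the local machine (same system only)
    future::plan("multisession", workers = nWorkers)

    # 'parents' and 'compounds' are assumed to exist in the session:
    # TPs <- generateTPsAnnComp(parents, compounds, parallel = TRUE)

    # Switch back to sequential processing to release the workers
    future::plan("sequential")
}
```

Since every worker is a separate R process, lowering nWorkers is a simple way to limit RAM usage.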

It is possible that candidates are equal to their parent. To remove these, the removeParentIsomers filter can be used afterwards.

Validating and calculating chemical properties

Chemical properties such as SMILES, InChIKey and formulae in the parent suspect list are automatically validated and calculated if missing/invalid.

The internal validation/calculation process performs the following steps:

Validation of SMILES, InChI, InChIKey and formula data (if present). Invalid entries will be set to NA.
If neutralChemProps=TRUE then chemical data (SMILES, formulae etc.) is neutralized by (de-)protonation (using the --neutralized option of OpenBabel). An additional column molNeutralized is added to mark those molecules that were neutralized. Note that neutralization requires either SMILES or InChI data to be available.
The SMILES and InChI data are used to calculate missing or invalid SMILES, InChI, InChIKey and formula data. If prefCalcChemProps=TRUE then existing InChIKey and formula data is overwritten by calculated values whenever possible.
The chemical formulae that were not calculated are verified and normalized. This process may be time-consuming, and can largely be avoided by setting prefCalcChemProps=TRUE.
Neutral masses are calculated for missing values (prefCalcChemProps=FALSE) or whenever possible (prefCalcChemProps=TRUE).
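The effect of prefCalcChemProps in the steps above can be sketched in base R. The helper mergeProps and the example formulae are hypothetical and only illustrate the difference between overwriting existing values and filling in missing ones; they do not reproduce the internal code.

```r
# Hypothetical helper: combine existing and calculated property values
mergeProps <- function(existing, calculated, prefCalc) {
    if (prefCalc) {
        # Prefer calculated values, fall back to existing ones
        ifelse(is.na(calculated), existing, calculated)
    } else {
        # Keep existing values, only fill in what is missing
        ifelse(is.na(existing), calculated, existing)
    }
}

# Made-up formulae: from the suspect list vs. calculated from SMILES
existing   <- c("C6H6", NA, "C2H5OH")
calculated <- c("C6H6", "CH4O", "C2H6O")

mergeProps(existing, calculated, prefCalc = TRUE)
mergeProps(existing, calculated, prefCalc = FALSE)
```

With prefCalc = FALSE the third entry keeps its original notation, which is the kind of value that would still need to be verified and normalized in the subsequent step.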

Note that the calculation of formulae for isotopically labelled molecules is currently only supported for deuterium (2H).

This functionality relies heavily on OpenBabel; please make sure it is installed.

References

O'Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR (2011). “Open Babel: An open chemical toolbox.” Journal of Cheminformatics, 3(1). doi:10.1186/1758-2946-3-33.

Helmus R, Bagdonaite I, de Voogt P, van Bommel MR, Schymanski EL, van Wezel AP, ter Laak TL (2025). “Comprehensive Mass Spectrometry Workflows to Systematically Elucidate Transformation Processes of Organic Micropollutants: A Case Study on the Photodegradation of Four Pharmaceuticals.” Environmental Science & Technology, 59(7), 3723–3736. ISSN 1520-5851. doi:10.1021/acs.est.4c09121. http://dx.doi.org/10.1021/acs.est.4c09121.

Wang Y, Backman TWH, Horan K, Girke T (2013). “fmcsR: mismatch tolerant maximum common substructure searching in R.” Bioinformatics, 29(21), 2792–2794. ISSN 1367-4811. doi:10.1093/bioinformatics/btt475. http://dx.doi.org/10.1093/bioinformatics/btt475.

Guha R (2007). “Chemical Informatics Functionality in R.” Journal of Statistical Software, 18(6).

Cao Y, Charisi A, Cheng L, Jiang T, Girke T (2008). “ChemmineR: a compound mining framework for R.” Bioinformatics, 24(15), 1733–1734. ISSN 1367-4803. doi:10.1093/bioinformatics/btn307. http://dx.doi.org/10.1093/bioinformatics/btn307.