8.10 Predicting toxicities and concentrations (MS2Tox and MS2Quant integration)

The MS2Tox and MS2Quant R packages predict toxicities and feature concentrations using a machine learning approach. The predictions are performed with either SMILES data or fingerprints calculated from MS/MS data with SIRIUS+CSI:FingerID. Using SMILES data is generally more accurate, whereas using MS/MS fingerprints is faster and may be more suitable for features without a known or suspected structure.

In patRoon the predictions are done in two steps:

  1. The LC50 values (toxicity prediction) or response factors (concentration prediction) are calculated for given SMILES or MS/MS fingerprint data using MS2Tox/MS2Quant. This step is performed by the predictTox()/predictRespFactors() method functions.
  2. The predicted LC50 values or response factors are then used to assign toxicities/concentrations to the feature data. This is performed by the calculateTox()/calculateConcs() method functions.

Various workflow data can be used to perform the predictions for step 1:

  1. Suspect hits that were obtained with screenSuspects (see suspect screening).
  2. Formula annotations obtained with SIRIUS+CSI:FingerID.
  3. Compound annotations obtained with SIRIUS+CSI:FingerID.
  4. Compound annotations obtained with other algorithms, e.g. MetFrag.

For options 1 and 4, SMILES data is used to perform the calculations; for option 2, MS/MS fingerprints are used; and for option 3, either can be used.

NOTE For option 2, make sure that getFingerprints=TRUE is set when running generateFormulas() in order to obtain the fingerprints.
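
For illustration, a minimal sketch of option 2 is shown below. It assumes MS peak lists were generated beforehand (named mslists here); the object names and argument values are purely illustrative, and the final assignment mirrors the compound example shown later in this section.

# Obtain formula annotations with SIRIUS, including the MS/MS fingerprints
formulas <- generateFormulas(fGroups, mslists, algorithm = "sirius",
                             getFingerprints = TRUE)
formulasTox <- predictTox(formulas)              # step 1: predict LC50 values from the fingerprints
fGroupsTox <- calculateTox(fGroups, formulasTox) # step 2: assign toxicities to the feature groups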

8.10.1 Predicting toxicities

Some example workflows are shown below:

# Calculate toxicity for suspect hits.
fGroupsSuspTox <- predictTox(fGroupsSusp)
fGroupsSuspTox <- calculateTox(fGroupsSuspTox)

# Calculate toxicity for compound hits. Limit to the top 5 to reduce calculation time.
compoundsTop5 <- filter(compounds, topMost = 5)
compoundsTox <- predictTox(compoundsTop5)
fGroupsTox <- calculateTox(fGroups, compoundsTox)

It is also possible to calculate toxicities from multiple workflow objects:

fGroupsSuspTox <- predictTox(fGroupsSusp) # as above

# Predict toxicities from compound candidates, using both SMILES and MS/MS fingerprints
# compoundsSuspSIR is an object produced with generateCompounds() with algorithm="sirius"
compoundsSuspSIRTox <- predictTox(compoundsSuspSIR, type = "both")

# Assign toxicities to feature groups from both suspect hits and SIRIUS annotations
fGroupsSuspTox <- calculateTox(fGroupsSuspTox, compoundsSuspSIRTox)

More details are in the reference manual: ?`pred-tox`.

8.10.2 Predicting concentrations

The workflow to predict concentrations is quite similar to predicting toxicities. However, before we start, we first have to specify the calibrants and the LC gradient elution program.

The calibrant data is used by MS2Quant to convert predicted ionization efficiencies to actual response factors, which are specific to the LC instrument and methodology used. For this purpose, several mixtures with known concentrations (i.e. standards) should be measured alongside your samples. The calibrants should be specified as a data.frame, for instance:

name          SMILES                              intensity conc    rt
Atrazine      CCNc1nc(nc(n1)Cl)NC(C)C                 32708    1 336.6
Atrazine      CCNc1nc(nc(n1)Cl)NC(C)C                 66880    2 336.6
Atrazine      CCNc1nc(nc(n1)Cl)NC(C)C                174087    5 336.6
Atrazine      CCNc1nc(nc(n1)Cl)NC(C)C                371192   10 336.6
Atrazine      CCNc1nc(nc(n1)Cl)NC(C)C                806749   25 336.6
Atrazine      CCNc1nc(nc(n1)Cl)NC(C)C               1852591   50 336.6
Carbamazepine c1ccc2c(c1)C=Cc3ccccc3N2C(=N)O          25231    1 349.2
Carbamazepine c1ccc2c(c1)C=Cc3ccccc3N2C(=N)O          47831    2 349.2
Carbamazepine c1ccc2c(c1)C=Cc3ccccc3N2C(=N)O         118843    5 349.2
Carbamazepine c1ccc2c(c1)C=Cc3ccccc3N2C(=N)O         211395   10 349.2
Carbamazepine c1ccc2c(c1)C=Cc3ccccc3N2C(=N)O         545192   25 349.2
Carbamazepine c1ccc2c(c1)C=Cc3ccccc3N2C(=N)O        1083568   50 349.2
DEET          CCN(CC)C(=O)c1cccc(c1)C                 45061    1 355.8
DEET          CCN(CC)C(=O)c1cccc(c1)C                 84859    2 355.8
DEET          CCN(CC)C(=O)c1cccc(c1)C                228902    5 355.8
DEET          CCN(CC)C(=O)c1cccc(c1)C                434161   10 355.8
DEET          CCN(CC)C(=O)c1cccc(c1)C               1133166   25 355.8
DEET          CCN(CC)C(=O)c1cccc(c1)C               2385472   50 355.8
Venlafaxine   CN(C)CC(C1=CC=C(C=C1)OC)C2(CCCCC2)O     41465    1 324.0
Venlafaxine   CN(C)CC(C1=CC=C(C=C1)OC)C2(CCCCC2)O     89684    2 324.0
Venlafaxine   CN(C)CC(C1=CC=C(C=C1)OC)C2(CCCCC2)O    230890    5 324.0
Venlafaxine   CN(C)CC(C1=CC=C(C=C1)OC)C2(CCCCC2)O    400385   10 324.0
Venlafaxine   CN(C)CC(C1=CC=C(C=C1)OC)C2(CCCCC2)O   1094329   25 324.0
Venlafaxine   CN(C)CC(C1=CC=C(C=C1)OC)C2(CCCCC2)O   1965139   50 324.0

The intensity column should contain either the peak intensity (height) or the peak area. Note that some feature detection algorithms may produce inaccurate peak areas, and the methodology used to determine areas often differs between algorithms. For this reason, using peak intensities may be more reliable; however, it is worth testing this with your data.
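
Such a calibrant table can also be constructed directly in R. Below is a minimal sketch using the Atrazine values from the table above (the other calibrants would be added analogously, e.g. with rbind()):

calibrants <- data.frame(
    name = "Atrazine",
    SMILES = "CCNc1nc(nc(n1)Cl)NC(C)C",
    intensity = c(32708, 66880, 174087, 371192, 806749, 1852591),
    conc = c(1, 2, 5, 10, 25, 50),
    rt = 336.6
)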

It is also possible to use the getQuantCalibFromScreening() function to automatically create the calibrant table from feature group data:

calibList <- data.frame(...) # this should be a suspect list with your calibrants
fGroups <- screenSuspects(fGroups, calibList) # screen for the calibrants
concs <- data.frame(...) # concentration data for each calibrant compound, see below
calibrants <- getQuantCalibFromScreening(fGroups, concs)
calibrants <- getQuantCalibFromScreening(fGroups, concs, areas = TRUE) # obtain feature areas instead of intensities

The first step is to perform a screening for the calibrant compounds. Please ensure that this suspect list contains SMILES data; to ensure correct feature assignment, it is also highly recommended to include retention times. The second requirement for getQuantCalibFromScreening() is a table with concentrations for each calibrant compound, e.g.:

concs <- data.frame(
    name = c("DEET", "1h-benzotriazole", "Caffeine", "Atrazine", "Carbamazepine", "Venlafaxine"),
    standard_1 = c(1.00, 1.05, 1.10, 0.99, 1.01, 1.12),
    standard_2 = c(2.00, 2.15, 2.20, 1.98, 2.02, 1.82),
    standard_5 = c(5.01, 5.05, 5.22, 5.00, 4.88, 4.65),
    standard_10 = c(10.2, 10.11, 10.23, 11.77, 11.75, 12.13),
    standard_25 = c(25.3, 25.12, 25.34, 24.89, 24.78, 24.68),
    standard_50 = c(50.34, 50.05, 50.10, 49.97, 49.71, 50.52)
)
concs
#>               name standard_1 standard_2 standard_5 standard_10 standard_25 standard_50
#> 1             DEET       1.00       2.00       5.01       10.20       25.30       50.34
#> 2 1h-benzotriazole       1.05       2.15       5.05       10.11       25.12       50.05
#> 3         Caffeine       1.10       2.20       5.22       10.23       25.34       50.10
#> 4         Atrazine       0.99       1.98       5.00       11.77       24.89       49.97
#> 5    Carbamazepine       1.01       2.02       4.88       11.75       24.78       49.71
#> 6      Venlafaxine       1.12       1.82       4.65       12.13       24.68       50.52

The concentrations are specified separately for each calibrant compound. The column names should follow the names of the replicate groups assigned to the standards. The concentration unit is µg/l by default. The next section describes how this can be changed.
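
As a quick sanity check you can verify that these column names indeed correspond to replicate groups. A minimal sketch (replicateGroups() returns the replicate group names of a feature groups object):

# all columns except the name column should be known replicate groups
setdiff(names(concs)[-1], replicateGroups(fGroups)) # character(0) if everything matches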

The gradient elution program is also specified by a data.frame, which for every time point (in seconds!) describes the percentage of ‘B’. In this case, ‘B’ represents the total amount of organic modifier.

eluent <- data.frame(
    time = c(0, 180, 600, 900, 960),
    B = c(10, 30, 100, 100, 10)
)
plot(eluent, type = "l")

eluent
#>   time   B
#> 1    0  10
#> 2  180  30
#> 3  600 100
#> 4  900 100
#> 5  960  10

The workflow to predict concentrations is then very similar to that for predicting toxicities (previous section):

# Calculate concentrations for suspect hits.
fGroupsSuspConc <- predictRespFactors(
    fGroupsSusp,
    calibrants = calibrants, eluent = eluent,
    organicModifier = "MeCN", # organic modifier: MeOH or MeCN
    pHAq = 4 # pH of the aqueous part of the mobile phase
)
# set areas to TRUE if the calibrant table contains areas
fGroupsSuspConc <- calculateConcs(fGroupsSuspConc, areas = FALSE)

As was shown for toxicities, it is possible to use different data sources (e.g. compound annotations, suspects) for the predictions, as sketched below.
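
The following sketch mirrors the multi-source toxicity example from the previous section. Note that the exact arguments of the compounds method of predictRespFactors() are an assumption here (in particular the feature groups passed as the second argument); see ?`pred-quant` for details.

# Predict response factors for SIRIUS compound candidates
# (compoundsSuspSIR as in the toxicity example above)
compoundsSuspSIRRF <- predictRespFactors(compoundsSuspSIR, fGroupsSusp,
                                         calibrants = calibrants, eluent = eluent,
                                         organicModifier = "MeCN", pHAq = 4)

# Assign concentrations to feature groups from both suspect hits and SIRIUS annotations
fGroupsSuspConc <- calculateConcs(fGroupsSuspConc, compoundsSuspSIRRF, areas = FALSE)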

More details are in the reference manual: ?`pred-quant`.

8.10.3 Toxicity and concentration units

The default unit for toxicity and concentration data is µg/l. However, this can be configured when calling the predictTox()/predictRespFactors() functions:

fGroupsSuspTox <- predictTox(fGroupsSusp) # default unit: ug/l
fGroupsSuspTox <- predictTox(fGroupsSusp, concUnit = "ug/l") # same as above
fGroupsSuspTox <- predictTox(fGroupsSusp, concUnit = "mM") # millimolar
fGroupsSuspTox <- predictTox(fGroupsSusp, concUnit = "log mM") # unit used by MS2Tox

# calculated concentrations are ng/l, calibrants are specified in ug/l
# (by default calibConcUnit=concUnit)
fGroupsSuspConc <- predictRespFactors(
    fGroupsSusp, calibrants = calibrants, eluent = eluent,
    organicModifier = "MeCN", pHAq = 4,
    concUnit = "ng/l", calibConcUnit = "ug/l"
)

See the reference manuals (?`pred-tox`/?`pred-quant`) for more details on which units can be specified.

8.10.4 Inspecting predicted values

The raw toxicity and concentration data assigned to feature groups can be retrieved with the toxicities() and concentrations() method functions, respectively.

toxicities(fGroupsSuspTox)
#>           group    type        candidate     candidate_name      LC50
#>          <char>  <char>           <char>             <char>     <num>
#> 1: M120_R268_30 suspect [nH]1nnc2ccccc12   1H-benzotriazole  59755.77
#> 2: M137_R249_53 suspect  NC(=O)Nc1ccccc1      N-Phenyl urea 101025.89
#> 3: M146_R225_70 suspect  Oc1ccc2ccccc2n1 2-Hydroxyquinoline  48833.62
#> 4: M146_R248_69 suspect  Oc1ccc2ccccc2n1 2-Hydroxyquinoline  48833.62
#> 5: M146_R309_68 suspect  Oc1ccc2ccccc2n1 2-Hydroxyquinoline  48833.62
concentrations(fGroupsSuspConc)
#>           group    type        candidate     candidate_name standard-pos-1 standard-pos-2 standard-pos-3
#>          <char>  <char>           <char>             <char>          <num>          <num>          <num>
#> 1: M120_R268_30 suspect [nH]1nnc2ccccc12   1H-benzotriazole      43.070773      39.905306       35.21956
#> 2: M137_R249_53 suspect  NC(=O)Nc1ccccc1      N-Phenyl urea      18.485430      19.864756       17.36268
#> 3: M146_R225_70 suspect  Oc1ccc2ccccc2n1 2-Hydroxyquinoline      15.700200      17.662215       18.49330
#> 4: M146_R248_69 suspect  Oc1ccc2ccccc2n1 2-Hydroxyquinoline      19.030263      20.207821       19.54181
#> 5: M146_R309_68 suspect  Oc1ccc2ccccc2n1 2-Hydroxyquinoline       7.978394       8.646156        8.67157

If there were multiple candidates for a single feature group then these are split over the table rows:

toxicities(fGroupsTox)
#>             group     type                  candidate                     candidate_name      LC50
#>            <char>   <char>                     <char>                             <char>     <num>
#>  1:  M120_R268_30 compound          C1=CC(=NC=C1N)C#N     5-aminopyridine-2-carbonitrile 132007.55
#>  2:  M120_R268_30 compound        C1=CC2=C(N=C1)N=CN2          1H-imidazo[4,5-b]pyridine 125872.29
#>  3:  M120_R268_30 compound          C1=CC2=NNN=C2C=C1                   2H-benzotriazole  65155.26
#>  4:  M120_R268_30 compound          C1=CN=CC(=C1N)C#N     4-aminopyridine-3-carbonitrile  83309.12
#>  5:  M120_R268_30 compound          C1=CNC2=CN=CN=C21        5H-pyrrolo[3,2-d]pyrimidine  88213.59
#> ---                                                                                               
#> 16: M192_R355_191 compound CCN(CC)C(=O)C1=CC=C(C=C1)C      N,N-diethyl-4-methylbenzamide  86651.19
#> 17: M192_R355_191 compound CCN(CC)C(=O)C1=CC=CC(=C1)C      N,N-diethyl-3-methylbenzamide 114064.57
#> 18: M192_R355_191 compound   CCN(CC)C(=O)C1=CC=CC=C1C      N,N-diethyl-2-methylbenzamide  81126.15
#> 19: M192_R355_191 compound   CCN(CC)C(=O)CC1=CC=CC=C1      N,N-diethyl-2-phenylacetamide  88445.02
#> 20: M192_R355_191 compound CCN(CC)C1=CC=C(C=C1)C(=O)C 1-[4-(diethylamino)phenyl]ethanone  96005.28

The as.data.table() method function, which was discussed previously, can be used to summarize toxicity and concentration values.

# NOTE: NA values are filtered and columns are subset for readability
as.data.table(fGroupsTox)[!is.na(LC50), c("group", "LC50", "LC50_types")]
#>            group      LC50 LC50_types
#>           <char>     <num>     <char>
#> 1:  M120_R268_30  98911.56   compound
#> 2:  M137_R249_53 149557.04   compound
#> 3:  M146_R309_68  41178.78   compound
#> 4: M192_R355_191  93258.44   compound
concCols <- c("group", paste0(analyses(fGroupsSuspConc), "_conc"), "conc_types")
as.data.table(fGroupsSuspConc)[!is.na(conc_types), concCols, with = FALSE]
#>           group standard-pos-1_conc standard-pos-2_conc standard-pos-3_conc conc_types
#>          <char>               <num>               <num>               <num>     <char>
#> 1: M120_R268_30           43.070773           39.905306            35.21956    suspect
#> 2: M137_R249_53           18.485430           19.864756            17.36268    suspect
#> 3: M146_R309_68            7.978394            8.646156             8.67157    suspect
#> 4: M146_R248_69           19.030263           20.207821            19.54181    suspect
#> 5: M146_R225_70           15.700200           17.662215            18.49330    suspect

The as.data.table() method function aggregates the data for a feature group in case multiple candidates were assigned to it. By default the values are averaged (mean), but this can be changed with the toxAggrParams/concAggrParams arguments, for instance:

# as above, but aggregate by taking maximum values
as.data.table(fGroupsTox, toxAggrParams = getDefPredAggrParams(max))[!is.na(LC50), c("group", "LC50", "LC50_types")]
#>            group      LC50 LC50_types
#>           <char>     <num>     <char>
#> 1:  M120_R268_30 132007.55   compound
#> 2:  M137_R249_53 314718.38   compound
#> 3:  M146_R309_68  49051.68   compound
#> 4: M192_R355_191 114064.57   compound

If the as.data.table() method is used on suspect screening results, and predictions were performed directly for the suspect hits, then predicted values can be reported for each individual suspect match instead of being aggregated per feature group:

# Reports predicted values for each suspect separately. If multiple suspects are assigned to a feature group,
# then each suspect match is split into a different row. 
as.data.table(fGroupsSuspTox, collapseSuspects = NULL)

Finally, the reporting functionality can be used to overview all predicted values, both aggregated and raw.
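
For instance, a minimal sketch (assuming the report() interface described in the reporting chapter; the generated HTML report will then include the predicted values):

report(fGroupsSuspConc)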

8.10.5 Using predicted values to prioritize data

The filter() method function that was introduced before can also be used to filter data based on predicted toxicities, response factors and concentrations. For instance, this allows you to remove annotation candidates that are unlikely to be toxic or are not sensitive enough to be detected, or to remove features with very low concentrations. Some examples are shown below.

# compoundsSuspSIRTox is an object with predicted toxicities (LC50 values) for each candidate
# we can use the common scoreLimits filter to select a range of allowed values (min/max)
compoundsSuspSIRToxF <- filter(compoundsSuspSIRTox, scoreLimits = list(LC50_SMILES = c(0, 1E4)))

# for suspects with predicted toxicities/response factors there are dedicated filters
fGroupsSuspConcF <- filter(fGroupsSuspConc, minRF = 5E4) # remove suspect hits with response factor <5E4
fGroupsSuspToxF <- filter(fGroupsSuspTox, maxLC50 = 100) # remove suspect hits with LC50 values > 100

# similarly, for feature data there are dedicated filters.
# note that these aggregate data prior to filtering (see previous section)
fGroupsConcF <- filter(fGroupsConc, absMinConc = 0.02)
# only keep features with concentrations that are at least 1% of their toxicity
# note that both concentrations/toxicity values should have been calculated with calculateConcs()/calculateTox()
fGroupsConcToxF <- filter(fGroupsConcTox, absMinConcTox = 0.01)

# also get rid of features without concentrations (these are ignored by default)
fGroupsConcF <- filter(fGroupsConc, absMinConc = 0.02, removeNA = TRUE)
# like as.data.table we can configure how values are aggregated
# here the minimum is used instead of the default mean
fGroupsToxF <- filter(fGroupsTox, absMaxTox = 5E3, predAggrParams = getDefPredAggrParams(min))

More details are found in the reference manual (?`feature-filtering`).