8.4 Chromatographic peak qualities

The algorithms used by findFeatures detect chromatographic peaks automatically to find the features. However, not all detected features have ‘proper’ chromatographic peaks, and some features may be just noise. The MetaClean R package calculates various quality measures for chromatographic peaks, including Gaussian fit, symmetry, sharpness and others. In addition, MetaClean averages all feature data for each feature group and adds a few group-specific quality measures (e.g. retention time consistency). Please see Chetnik, Petrick, and Pandey (2020) for more details. The calculations are integrated into patRoon, and are easily performed with the calculatePeakQualities() generic function.

fList <- calculatePeakQualities(fList) # calculate for all features

#> Verifying if your data is centroided... Done!
#> Calculating feature peak qualities and scores...

fGroups <- calculatePeakQualities(fGroups) # calculate for all features and groups

#> Verifying if your data is centroided... Done!
#> Calculating feature peak qualities and scores...
#> Verifying if your data is centroided... Done!
#> Calculating group peak qualities and scores...
#> ================================================================================

Most often only the featureGroups method is used, unless you want to filter features (discussed below) prior to grouping.

An extension in patRoon is that the qualities are used to calculate peak scores. The score for each quality measure is calculated by normalizing and scaling the values into a 0-1 range, where zero is the worst and one the best. Note that most scores are relative, hence, the values should only be used to compare features among each other. Finally, a totalScore is calculated which sums all individual scores and serves as a rough overall score indicator for a feature (group).
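The scoring idea described above can be illustrated with a minimal base R sketch. It is not patRoon's actual implementation: the quality values are made up, the helper scaleScore() is hypothetical, and plain min-max scaling (flipped for measures where lower is better) is an assumption about how the normalization might work. It does show why such scores are relative: the minimum and maximum depend on the set of features being scored.

```r
# Toy quality values for four features (hypothetical numbers, not patRoon output)
qualities <- data.frame(
    GaussianSimilarity = c(0.95, 0.40, 0.80, 0.10), # assumed: higher is better
    Jaggedness         = c(0.05, 0.60, 0.20, 0.90)  # assumed: lower is better
)

# Hypothetical helper: min-max scale to 0-1 and flip measures where a lower
# value means a better peak. Assumes the values are not all identical.
scaleScore <- function(x, higherIsBetter = TRUE)
{
    s <- (x - min(x)) / (max(x) - min(x))
    if (higherIsBetter) s else 1 - s
}

scores <- data.frame(
    GaussianSimilarityScore = scaleScore(qualities$GaussianSimilarity),
    JaggednessScore = scaleScore(qualities$Jaggedness, higherIsBetter = FALSE)
)

# The total score is the sum of the individual scores
scores$totalScore <- rowSums(scores)
print(scores)
```

In this toy set the first feature is the best on both measures and the fourth the worst, so their total scores are 2 and 0, respectively. Adding or removing features would change the scaling, and thus the scores of the remaining features.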

The qualities and scores are easily obtained with the as.data.table() function.

# (limit rows/columns for clarity)
as.data.table(fList)[1:5, 26:30]

#>    GaussianSimilarityScore SharpnessScore TPASRScore ZigZagScore totalScore
#>                      <num>          <num>      <num>       <num>      <num>
#> 1:               0.6314046   3.443351e-02  0.9956949   0.9103221   6.302180
#> 2:               0.9633994   9.900530e-10  0.9944988   0.3565674   6.513205
#> 3:               0.3613087   7.565147e-10  0.8006569   0.9999449   5.651379
#> 4:               0.9151027   8.600747e-03  0.9405262   0.9637153   5.892201
#> 5:               0.3676623   1.000000e+00  0.9907657   0.8435805   5.825267

# the qualities argument is necessary to include the scores.
# valid values are: "quality", "score" or "both"
as.data.table(fGroups, qualities = "both")[1:5, 25:29]

#>    TPASRScore ZigZagScore ElutionShiftScore RetentionTimeCorrelationScore totalScore
#>         <num>       <num>             <num>                         <num>      <num>
#> 1:  0.7305554   0.9962254         0.8421657                     0.9955769   7.932541
#> 2:  0.0000000   0.9744541         0.9960804                     0.7746038   6.029360
#> 3:  0.6140008   0.9171568         0.9015949                     0.9776651   7.480675
#> 4:  0.8227904   0.8907734         0.9403958                     0.9963785   8.451631
#> 5:  0.9848653   0.8667116         0.5754979                     0.9984902   8.740135
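The column positions used above depend on which columns happen to be present. As a sketch of a more robust alternative, score columns can be selected by name. The example below uses a small stand-in data.frame rather than a real patRoon object: the group names and values are hypothetical, and only the column naming pattern (score columns ending in "Score") is taken from the output shown above.

```r
# Stand-in for a table as returned by as.data.table() (hypothetical values)
tab <- data.frame(
    group = c("grpA", "grpB", "grpC"), # hypothetical feature group names
    TPASRScore = c(0.73, 0.00, 0.61),
    ZigZagScore = c(0.99, 0.97, 0.92),
    totalScore = c(7.93, 6.03, 7.48)
)

# Select all columns whose name ends with "Score" (this includes totalScore)
scoreCols <- grep("Score$", names(tab), value = TRUE)

# Show the groups ordered from highest to lowest total score
ranked <- tab[order(-tab$totalScore), c("group", scoreCols)]
print(ranked)
```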

The feature quality values can also be reviewed interactively in reports generated with report (see Reporting) and with checkFeatures (see here). The filter function can be used to filter out low-scoring features and feature groups:

# only keep features with at least 0.3 Modality score and 0.5 symmetry score
fList <- filter(fList, qualityRange = list(ModalityScore = c(0.3, Inf),
                                           SymmetryScore = c(0.5, Inf)))

# same as above
fGroups <- filter(fGroups, featQualityRange = list(ModalityScore = c(0.3, Inf),
                                                   SymmetryScore = c(0.5, Inf)))

# filter group averaged data
fGroups <- filter(fGroups, groupQualityRange = list(totalScore = c(0.5, Inf)))
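To make the meaning of these range lists concrete, the following base R sketch applies the same kind of named list of c(min, max) ranges to a toy table of scores. It only illustrates the filtering logic and is not how patRoon implements filter(); the feature names and values are hypothetical, and treating both bounds as inclusive is an assumption.

```r
# Toy feature scores (hypothetical values)
scores <- data.frame(
    feature = paste0("f", 1:4),
    ModalityScore = c(0.9, 0.2, 0.5, 0.3),
    SymmetryScore = c(0.8, 0.9, 0.4, 0.5)
)

# Same structure as the qualityRange argument used above
ranges <- list(ModalityScore = c(0.3, Inf), SymmetryScore = c(0.5, Inf))

# A feature is kept only if it falls within every range (bounds assumed inclusive)
keep <- rep(TRUE, nrow(scores))
for (col in names(ranges))
    keep <- keep & scores[[col]] >= ranges[[col]][1] & scores[[col]] <= ranges[[col]][2]

kept <- scores[keep, ]
print(kept)
```

Here f2 fails the modality range and f3 fails the symmetry range, so only f1 and f4 remain. Using Inf as the upper bound, as in the filter() calls above, effectively sets only a minimum.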

8.4.1 Applying machine learning with MetaClean

An important feature of MetaClean is to use the quality measures to train a machine learning model to automatically recognize ‘good’ and ‘bad’ features. patRoon provides a few extensions to simplify training and using a model. Furthermore, while MetaClean was primarily designed to work with XCMS, the extensions of patRoon allow the usage of data from all the algorithms supported by patRoon.

The getMCTrainData function converts data from a feature check session to training data that can be used by MetaClean. This allows you to interactively select good/bad peaks. The workflow looks like this:

# untick the 'keep' checkbox for all 'bad' feature groups
checkFeatures(fGroupsTrain, "train_session.yml")

# get training data. This gives data comparable to MetaClean::getPeakQualityMetrics()
trainData <- getMCTrainData(fGroupsTrain, "train_session.yml")

# use the training data with MetaClean::runCrossValidation(),
# MetaClean::getEvaluationMeasures(), MetaClean::trainClassifier() etc
# --> see the MetaClean vignette for details

Once you have created a model with MetaClean it can be used with the predictCheckFeaturesSession() function:

predictCheckFeaturesSession(fGroups, "model_session.yml", model)

This will generate another check session file: all the feature groups that are considered good will have a ‘keep’ state, the others will not. As described elsewhere, the checkFeatures function is used to review the results from a session and the filter function can be used to remove unwanted feature groups. Note that calculatePeakQualities() must be called before getMCTrainData/predictCheckFeaturesSession can be used.

NOTE MetaClean only predicts at the feature group level. Thus, only the kept feature groups from a feature check session will be used for training, and any individual features that were marked as removed will be ignored.

References

Chetnik, Kelsey, Lauren Petrick, and Gaurav Pandey. 2020. “MetaClean: A Machine Learning-Based Classifier for Reduced False Positive Peak Detection in Untargeted LC-MS Metabolomics Data.” Metabolomics 16 (11). https://doi.org/10.1007/s11306-020-01738-3.