5.2 Filtering

During a non-target workflow it is common that some form of data clean-up is necessary. Datasets are often highly complex, so separating the data of interest from the rest is important. Furthermore, general clean-up typically improves the quality of the dataset, for instance by removing low-scoring annotation results or features that are unlikely to be ‘correct’ (e.g. noise or features that are also present in blanks). For this reason patRoon supports many different filters that clean data produced during the workflow in a highly customizable way.

All major workflow objects (e.g. featureGroups, compounds, components etc.) support filtering operations through the filter() generic. This function takes the object to be filtered as its first argument; any remaining arguments describe the desired filter options. The filter() generic then returns the modified object. Some examples are shown below.

# remove low intensity (<500) features
features <- filter(features, absMinIntensity = 500)

# remove features with intensities lower than 5 times the blank
fGroups <- filter(fGroups, blankThreshold = 5)

# only retain compounds with at least 1 explained MS/MS peak
compounds <- filter(compounds, minExplainedPeaks = 1)

The following sections will provide a more detailed overview of available data filters.

NOTE Some other R packages (notably dplyr) also provide a filter() generic function. To use the filter() function from different packages you may need to explicitly specify which one to use in your script. This can be done by prefixing it with the package name, e.g. patRoon::filter(...), dplyr::filter(...) etc.
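The namespace prefix works the same way for any package. The sketch below is a toy illustration that only uses base R: it defines a hypothetical filter() function in the global environment (standing in for a package that masks the name) and then reaches the filter() function of the stats package explicitly. Neither function is part of patRoon.

```r
# Hypothetical stand-in for a filter() defined elsewhere (e.g. by another package).
# Once defined, it masks stats::filter() for unqualified calls.
filter <- function(x, keep) x[keep]

x <- c(1, 3, 5, 7, 9)

# Unqualified call: resolves to the function defined above
kept <- filter(x, x > 4)

# Qualified call: always uses filter() from the stats package
# (here a centred moving average over three values)
smoothed <- stats::filter(x, rep(1 / 3, 3))
```

In a patRoon script the same pattern is written as patRoon::filter(fGroups, ...).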

5.2.1 Features

There are many filters available for feature data:

Filter | Classes | Remarks
absMinIntensity, relMinIntensity | features, featureGroups | Minimum intensity
preAbsMinIntensity, preRelMinIntensity | featureGroups | Minimum intensity prior to other filtering (see below)
retentionRange, mzRange, mzDefectRange, chromWidthRange | features, featureGroups | Filter by feature properties
absMinAnalyses, relMinAnalyses | featureGroups | Minimum feature abundance in all analyses
absMinReplicates, relMinReplicates | featureGroups | Minimum feature abundance in different replicates
absMinFeatures, relMinFeatures | featureGroups | Only keep analyses with at least this amount of features
absMinReplicateAbundance, relMinReplicateAbundance | featureGroups | Minimum feature abundance in a replicate group
maxReplicateIntRSD | featureGroups | Maximum relative standard deviation of feature intensities in a replicate group
blankThreshold | featureGroups | Minimum intensity factor above blank intensity
rGroups | featureGroups | Only keep (features of) these replicate groups
results | featureGroups | Only keep feature groups with formula/compound annotations or componentization results

Applying filters to feature data is important for (environmental) non-target analysis. The blank and replicate filters in particular (i.e. blankThreshold and absMinReplicateAbundance/relMinReplicateAbundance) are highly recommended for cleaning up any dataset.
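To make the idea behind a blank filter concrete, the following toy sketch applies a threshold of five times the blank intensity to a small, made-up intensity table in base R. It is not patRoon's implementation: the matrix layout, the isBlank vector and the use of a plain mean over all blank analyses are assumptions for illustration only.

```r
# Made-up intensities: rows are analyses, columns are feature groups
ints <- matrix(
  c(100,   0,
    300,   0,
    800, 600,
   1500,   0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("blank1", "blank2", "sample1", "sample2"), c("FG1", "FG2"))
)
isBlank <- c(TRUE, TRUE, FALSE, FALSE)  # assumption: which analyses are blanks
blankThreshold <- 5

# Average blank intensity per feature group (assumption: plain mean)
blankMean <- colMeans(ints[isBlank, , drop = FALSE])

# Intensities below threshold x blank average are treated as not detected
tooLow <- sweep(ints, 2, blankThreshold * blankMean, "<")
ints[tooLow] <- 0
```

In this sketch FG1 only survives in sample2, because 800 is below five times the average blank intensity of 200, whereas FG2 is untouched because it does not occur in the blanks.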

All filters are available for feature group data, whereas only a subset is available for feature objects. The main reason is that the other filters need features to be grouped between analyses. Regardless, filtering ungrouped feature data is less important in patRoon, and is typically only needed when the number of features is extremely large and direct grouping is undesired.

As the table above shows, many filters come in both an absolute and a relative variant (prefixed with abs and rel, respectively). The value of a relative filter is expressed as a fraction between 0 and 1. For instance:

# remove features not present in at least half of the analyses within a replicate group
fGroups <- filter(fGroups, relMinReplicateAbundance = 0.5)

An advantage of relative filters is that you do not have to worry about the size of the data involved. For instance, in the above example the filter always requires half of the analyses within a replicate group, even when replicate groups contain different numbers of analyses.
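The scaling of a relative threshold with replicate group size can be illustrated with a small base R sketch. The data, the rGroup assignment and the way features are zeroed are assumptions made for this toy example and do not reproduce patRoon's internals.

```r
# Made-up intensities: rows are analyses, columns are feature groups (0 = not detected)
ints <- matrix(
  c(1000, 800,   0,
    1200,   0,   0,
     900,   0, 500,
       0,   0,   0,
       0, 700,   0),
  nrow = 5, byrow = TRUE,
  dimnames = list(c("a1", "a2", "b1", "b2", "b3"), c("FG1", "FG2", "FG3"))
)
rGroup <- c("A", "A", "B", "B", "B")  # replicate group of each analysis
relMin <- 0.5

# Fraction of analyses per replicate group in which each feature was detected
detected <- rowsum((ints > 0) * 1, rGroup)
abundance <- detected / as.vector(table(rGroup))

# Remove a feature from every analysis of a replicate group that misses the threshold
keep <- abundance >= relMin
filtered <- ints
filtered[!keep[rGroup, , drop = FALSE]] <- 0
```

With the same relative value of 0.5, group A (two analyses) requires detection in one analysis, whereas group B (three analyses) effectively requires two.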

Note that multiple filters can be specified at once. Especially for feature group data, the order of filtering may impact the final results. This is explained further in the reference manual (i.e. ?`feature-filtering`).
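Why the order matters can be seen in a minimal base R sketch with two hypothetical helper functions, minIntensity() and minAnalyses(), that mimic an intensity filter and an abundance filter for a single feature group. Both helpers and the numbers are invented for illustration.

```r
# Made-up intensities of one feature group in three analyses
ints <- c(a1 = 600, a2 = 400, a3 = 450)

# Hypothetical helper: treat intensities below thr as not detected
minIntensity <- function(x, thr) {
  x[x < thr] <- 0
  x
}

# Hypothetical helper: drop the feature group entirely if detected in fewer than n analyses
minAnalyses <- function(x, n) {
  if (sum(x > 0) < n) x[] <- 0
  x
}

# Intensity filter first: only one detection is left, so the abundance filter removes it
intensityFirst <- minAnalyses(minIntensity(ints, 500), 2)

# Abundance filter first: three detections pass, the intensity filter then keeps one
abundanceFirst <- minIntensity(minAnalyses(ints, 2), 500)
```

The same two thresholds therefore either remove the feature group completely or leave it with a single detection, depending on which filter runs first.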

Some examples are shown below.

# filter features prior to grouping: remove any features eluting within the first 2 minutes
fList <- filter(fList, retentionRange = c(120, Inf))

# common filters for feature groups
fGroups <- filter(fGroups,
                  absMinIntensity = 500, # remove features <500 intensity
                  relMinReplicateAbundance = 1, # features should be in all analyses of replicate groups
                  maxReplicateIntRSD = 0.75, # remove features with intensity RSD in replicates >75%
                  blankThreshold = 5, # remove features <5x intensity of (average) blank intensity
                  removeBlanks = TRUE) # remove blank analyses from object afterwards

# filter by feature properties
fGroups <- filter(fGroups, mzDefectRange = c(0.8, 0.9),
                  chromWidthRange = c(6, 120))

# remove features not present in at least 3 analyses
fGroups <- filter(fGroups, absMinAnalyses = 3)

# remove features not present in at least 20% of all replicate groups
fGroups <- filter(fGroups, relMinReplicates = 0.2)

# only keep data present in replicate groups "repl1" and "repl2"
# all other features and analyses will be removed
fGroups <- filter(fGroups, rGroups = c("repl1", "repl2"))

# only keep feature groups with compound annotations
fGroups <- filter(fGroups, results = compounds)
# only keep feature groups with formula or compound annotations
fGroups <- filter(fGroups, results = list(formulas, compounds))

5.2.2 Suspect screening

Several additional filters are available for feature groups obtained with screenSuspects():

Filter | Classes | Remarks
onlyHits | featureGroupsScreening | Only retain feature groups assigned to one or more suspects.
selectHitsBy | featureGroupsScreening | Select the feature group that matches best with a suspect (in case there are multiple).
selectBestFGroups | featureGroupsScreening | Select the suspect that matches best with a feature group (in case there are multiple).
maxLevel, maxFormRank, maxCompRank | featureGroupsScreening | Only retain suspect hits with identification/annotation ranks below a threshold.
minAnnSimForm, minAnnSimComp, minAnnSimBoth | featureGroupsScreening | Remove suspect hits with annotation similarity scores below this value.
absMinFragMatches, relMinFragMatches | featureGroupsScreening | Only keep suspect hits with a minimum (relative) number of fragment matches from the suspect list.

NOTE: most filters only remove suspect hit results. Set onlyHits=TRUE to also remove any feature groups that end up without suspect hits.

The selectHitsBy and selectBestFGroups filters are useful to remove duplicate hits (one suspect assigned to multiple feature groups or multiple suspects assigned to the same feature group, respectively). The former selects based on either the best identification level (selectHitsBy="level") or the highest mean intensity (selectHitsBy="intensity"). The selectBestFGroups filter can only be TRUE/FALSE and always selects by best identification level.
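The two directions of de-duplication can be sketched on a made-up hit table in base R. The data frame, its column names and the use of plain numeric identification levels are assumptions for illustration; this is not how patRoon stores or processes screening results.

```r
# Made-up suspect hits: atrazine matches two feature groups, FG2 matches two suspects
hits <- data.frame(
  suspect   = c("atrazine", "atrazine", "diuron"),
  group     = c("FG1", "FG2", "FG2"),
  level     = c(3, 2, 4),          # lower = better identification level
  intensity = c(5000, 1200, 900)
)

# Helper: pick one row per value of 'by', using 'pick' to choose the row index
bestPer <- function(df, by, pick) {
  do.call(rbind, lapply(split(df, df[[by]]), function(h) h[pick(h), ]))
}

# Analogue of selectHitsBy = "level": best feature group per suspect
byLevel <- bestPer(hits, "suspect", function(h) which.min(h$level))

# Analogue of selectHitsBy = "intensity": most intense feature group per suspect
byIntensity <- bestPer(hits, "suspect", function(h) which.max(h$intensity))

# Analogue of selectBestFGroups = TRUE: best suspect per feature group
bestSuspect <- bestPer(hits, "group", function(h) which.min(h$level))
```

Here atrazine is assigned to FG2 when selecting by level but to FG1 when selecting by intensity, which shows that the choice of criterion can change the outcome.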

Some examples are shown below.

# only keep feature groups assigned to at least one suspect
fGroupsSusp <- filter(fGroupsSusp, onlyHits = TRUE)
# remove duplicate suspect to feature group matches and keep the best
fGroupsSusp <- filter(fGroupsSusp, selectHitsBy = "level")
# remove suspect hits with ID levels >3 and make sure no feature groups
# are present without suspect hits afterwards
fGroupsSusp <- filter(fGroupsSusp, maxLevel = 3, onlyHits = TRUE)

5.2.3 Annotation

There are various filters available for handling annotation data:

Filter | Classes | Remarks
absMSIntThr, absMSMSIntThr, relMSIntThr, relMSMSIntThr | MSPeakLists | Minimum intensity of mass peaks
topMSPeaks, topMSMSPeaks | MSPeakLists | Only keep most intense mass peaks
withMSMS | MSPeakLists | Only keep results with MS/MS data
minMSMSPeaks | MSPeakLists | Only keep an MS/MS peak list if it contains a minimum number of peaks (excluding the precursor peak)
annotatedBy | MSPeakLists | Only keep MS/MS peaks that have formula or compound annotations
minExplainedPeaks | formulas, compounds | Minimum number of annotated mass peaks
elements, fragElements, lossElements | formulas, compounds | Restrain elemental composition
topMost | formulas, compounds | Only keep highest ranked candidates
minScore, minFragScore, minFormulaScore | compounds | Minimum compound scorings
scoreLimits | formulas, compounds | Minimum/Maximum scorings
OM | formulas, compounds | Only keep candidates with likely elemental composition found in organic matter

Several intensity-related filters are available to clean up MS peak list data. For instance, the topMSPeaks/topMSMSPeaks filters provide a simple way to remove noisy data by only retaining a defined number of the most intense mass peaks. Note that none of these filters will remove the precursor mass peak of the feature itself.
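A minimal base R sketch of a 'top N' peak filter that always retains the precursor is given below. The peak list is made up, and whether the precursor counts towards N is an assumption of this sketch (here it does not); consult the reference manual for the actual behaviour.

```r
# Made-up peak list; the precursor flag marks the mass peak of the feature itself
peaks <- data.frame(
  mz        = c(85.03, 112.05, 146.02, 174.05, 216.10),
  intensity = c(300, 5000, 100, 2500, 800),
  precursor = c(FALSE, FALSE, FALSE, FALSE, TRUE)
)
topN <- 2

# Row indices ordered from most to least intense
ranked <- order(peaks$intensity, decreasing = TRUE)

# Keep the topN most intense peaks, plus the precursor regardless of its intensity
isTop <- seq_len(nrow(peaks)) %in% head(ranked, topN)
filtered <- peaks[isTop | peaks$precursor, ]
```

The precursor at m/z 216.10 is only the third most intense peak, yet it is kept alongside the two most intense peaks.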

The filters applicable to formula and compound annotations generally concern minimum scores or chemical properties. The former is useful to remove unlikely candidates, whereas the latter is useful to focus on certain study-specific chemical properties (e.g. known neutral losses).

Common examples are shown below.

# intensity filtering
mslists <- filter(mslists,
                  absMSIntThr = 500, # minimum MS mass peak intensity of 500
                  relMSMSIntThr = 0.1) # minimum MS/MS mass peak intensity of 10%

# only retain the 10 most intense mass peaks
# (feature mass is always retained)
mslists <- filter(mslists, topMSPeaks = 10)

# remove MS/MS peaks without compound annotations
mslists <- filter(mslists, annotatedBy = compounds)

# remove MS/MS peaks not annotated by either a formula or compound candidate
mslists <- filter(mslists, annotatedBy = list(formulas, compounds))

# only keep formulae with 1-10 sulphur or phosphorus elements
formulas <- filter(formulas, elements = c("S1-10", "P1-10"))

# only keep candidates with MS/MS fragments that contain 1-10 carbons and 0-2 oxygens
formulas <- filter(formulas, fragElements = "C1-10O0-2")

# only keep candidates with CO2 neutral loss
formulas <- filter(formulas, lossElements = "CO2")

# only keep the 15 highest ranked candidates with at least 1 annotated MS/MS peak
compounds <- filter(compounds, minExplainedPeaks = 1, topMost = 15)

# minimum in-silico score
compounds <- filter(compounds, minFragScore = 10)

# candidate should be referenced in at least 1 patent
# (only works if database lists number of patents, e.g. PubChem)
compounds <- filter(compounds,
                    scoreLimits = list(numberPatents = c(1, Inf)))

NOTE As of patRoon 2.0 MS peak lists are not re-generated after a filtering operation (unless the reAverage parameter is explicitly set to TRUE). The reason for this change is that re-averaging invalidates any formula/compound annotation data (e.g. used for plotting and reporting) that were generated prior to the filter operation.

5.2.4 Components

Finally, several filters are available for components:

Filter | Remarks
size | Minimum component size
adducts, isotopes | Filter features by adduct/isotope annotation
rtIncrement, mzIncrement | Filter homologs by retention/mz increment range

Note that these filters are only applied if the components contain the data the filter works on. For instance, filtering by adducts will not affect components obtained from homologous series.

As before, some typical examples are shown below.

# only keep components with at least 4 features
# (the size filter listed above takes a minimum/maximum range)
componInt <- filter(componInt, size = c(4, Inf))

# remove all features from components that are not annotated as an adduct
componRC <- filter(componRC, adducts = TRUE)

# only keep protonated and sodium adducts
componRC <- filter(componRC, adducts = c("[M+H]+", "[M+Na]+"))

# remove all features recognized as isotopes (i.e. only keep features without isotope annotation)
componRC <- filter(componRC, isotopes = FALSE)

# only keep monoisotopic mass
componRC <- filter(componRC, isotopes = 0)

# min/max rt/mz increments for homologs
componNT <- filter(componNT, rtIncrement = c(10, 30),
                   mzIncrement = c(16, 50))

NOTE As mentioned before, components are still in a relatively young development phase and results should always be verified!

5.2.5 Negation

All filters support negation: if enabled, all specified filters are executed in the opposite manner. Negation is not used that often, but it allows greater flexibility, which is sometimes needed for advanced filtering steps. Furthermore, it is useful to specifically isolate the data that would otherwise have been removed. Some examples are shown below.

# keep all features/analyses _not_ present in replicate groups "repl1" and "repl2"
fGroups <- filter(fGroups, rGroups = c("repl1", "repl2"), negate = TRUE)

# only retain features with a mass defect outside 0.8-0.9
fGroups <- filter(fGroups, mzDefectRange = c(0.8, 0.9), negate = TRUE)

# remove duplicate suspect hits and only keep the _worst_ hit
fGroupsSusp <- filter(fGroupsSusp, selectHitsBy = "level", negate = TRUE)

# remove candidates with CO2 neutral loss
formulas <- filter(formulas, lossElements = "CO2", negate = TRUE)

# select 15 worst ranked candidates
compounds <- filter(compounds, topMost = 15, negate = TRUE)

# only keep components with <5 features
componInt <- filter(componInt, size = c(5, Inf), negate = TRUE)
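
Conceptually, negation can be thought of as inverting the 'keep' decision of a filter. The base R sketch below shows this for a range filter on mass defects. The helper filterByRange() is hypothetical, and taking the fractional part of the m/z as the mass defect is an assumption made for this example.

```r
# Made-up m/z values of four feature groups
mz <- c(FG1 = 150.85, FG2 = 200.10, FG3 = 301.88, FG4 = 410.95)

# Assumption for this sketch: mass defect = fractional part of the m/z
defect <- mz - floor(mz)

# Hypothetical range filter: negate simply flips which values are kept
filterByRange <- function(x, range, negate = FALSE) {
  keep <- x >= range[1] & x <= range[2]
  if (negate) keep <- !keep
  x[keep]
}

inside  <- filterByRange(defect, c(0.8, 0.9))                 # FG1 and FG3
outside <- filterByRange(defect, c(0.8, 0.9), negate = TRUE)  # FG2 and FG4
```

The regular and negated results are complementary: together they contain every feature group exactly once.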