8.13 Parallelization
Some steps in the non-target screening workflow are inherently computationally intensive. To reduce computation times, `patRoon` can parallelize most of its important functionality. This is especially useful on a modern system with multiple CPU cores and sufficient RAM.
For various technical reasons, several parallelization techniques are used. These can be categorized as parallelization of R functions and multiprocessing. The next sections describe both approaches so that you can optimize your workflow.
8.13.1 Parallelization of R functions
Several functions of `patRoon` support parallelization.
Function | Purpose | Remarks |
---|---|---|
`findFeatures` | Obtain feature data | Only `envipick` and `kpic2` algorithms. |
`generateComponents` | Generate components | Only `cliquems` algorithm. |
`report` | Reporting data | |
`generateTPs` | Obtain transformation products | Only `cts` algorithm. |
`optimizeFeatureFinding`, `optimizeFeatureGrouping` | Optimize feature finding/grouping parameters | Discussed here. |
`calculatePeakQualities` | Calculate feature (group) qualities | Discussed here. |
`predictTox` / `predictRespFactors` | Prediction of toxicities/concentrations | Only `compounds` methods. Discussed here. |
The parallelization is achieved with the future and future.apply R packages. To enable parallelization of these functions, the `parallel` argument must be set to `TRUE` and the future framework must be properly configured in advance. For example:
```r
# set up three workers to run in parallel
future::plan("multisession", workers = 3)
# find features with enviPick in parallel
fList <- findFeatures(anaInfo, "envipick", parallel = TRUE)
```
It is important to properly configure the right future plan. Please see the documentation of the future package for more details.
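As a minimal sketch of configuring a plan, the number of workers could be derived from the cores that the future package reports as available. Leaving one core free and the `nWorkers` variable are illustrative assumptions here, not a `patRoon` recommendation.

```r
# Illustrative only: derive the worker count from the available cores,
# keeping one core free for other tasks (an assumption, adjust as needed).
nWorkers <- max(1L, as.integer(future::availableCores()) - 1L)

# start background R sessions on the local machine
future::plan("multisession", workers = nWorkers)

# ... run functions with parallel = TRUE here ...

# return to sequential processing when finished
future::plan("sequential")
```

Resetting the plan at the end shuts down the background sessions, which may be useful on systems with limited RAM.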
8.13.2 Multiprocessing
`patRoon` relies on several external (command-line) tools to generate workflow data. These tools may be executed in parallel to reduce computation times (‘multiprocessing’). The table below lists the tools that are executed in parallel.
Tool | Used by | Notes |
---|---|---|
`msConvert` | `convertMSFiles(algorithm="pwiz", ...)` | |
`FileConverter` | `convertMSFiles(algorithm="openms", ...)` | |
`FeatureFinderMetabo` | `findFeatures(algorithm="openms", ...)` | |
`julia` | `findFeatures(algorithm="safd", ...)` | |
`SIRIUS` | `findFeatures(algorithm="sirius", ...)` | |
`MetaboliteAdductDecharger` | `generateComponents(algorithm="openms", ...)` | |
`GenForm` | `generateFormulas(algorithm="genform", ...)` | |
`SIRIUS` | `generateFormulas(algorithm="sirius", ...)`, `generateCompounds(algorithm="sirius", ...)` | Only if `splitBatches=TRUE` |
`MetFrag` | `generateCompounds(algorithm="metfrag", ...)` | |
`pngquant` | `reportHTML(...)` | Only if `optimizePng=TRUE` |
`BioTransformer` | `generateTPs(algorithm = "biotransformer")` | Disabled by default (see `?generateTPs` for details). |
Multiprocessing is performed either by executing processes in the background with the processx R package (classic interface) or by futures, which were introduced in the previous section. The characteristics of both parallelization techniques are summarized below.
classic | future |
---|---|
requires little or no configuration | requires configuration to set up |
works with all tools | doesn’t work with `pngquant` and is slower with `GenForm` |
only supports parallelization on the local computer | allows both local and cluster computing |
Which method is used is controlled by the `patRoon.MP.method` package option. Note that `reportHTML()` will always use the classic method for `pngquant`.
8.13.2.1 Classic multiprocessing interface
The classic interface is the ‘original’ method implemented in `patRoon`, and is therefore well tested and optimized. It is easier to set up, works well with all tools, and is therefore the default method. It is enabled by setting the `patRoon.MP.method` package option to `"classic"`.
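As a minimal sketch, and assuming that `"classic"` is the option value that corresponds to the classic interface (the counterpart of the `"future"` value used later in this section), the method could be selected explicitly like this:

```r
# select the classic (processx based) multiprocessing interface
options(patRoon.MP.method = "classic")

# check which method is currently configured
getOption("patRoon.MP.method")
```

Since the classic method is the default, this is only needed to switch back after another method was selected.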
The number of parallel processes is configured through the `patRoon.MP.maxProcs` option. By default it is set to the number of available CPU cores, which usually results in the best performance. However, you may want to lower this, for instance, to keep your computer more responsive while processing or to limit the RAM used by the data processing workflow.
Setting this option changes the parallelization for the complete workflow. However, it may be desirable to change it for only a part of the workflow. This is easily achieved with the `withOpt()` function.
```r
# do not execute more than two tools in parallel.
options(patRoon.MP.maxProcs = 2)
# ... but execute up to four GenForm processes
withOpt(MP.maxProcs = 4, {
    formulas <- generateFormulas(fGroups, "genform", ...)
})
```
The `withOpt()` function temporarily changes the given option(s) while executing a given code block and restores them afterwards (it is very similar to the `with_options()` function from the withr R package). Furthermore, notice how `withOpt()` does not require you to prefix the option names with `patRoon.`.
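To illustrate what temporarily changing and restoring an option means, the sketch below implements the same idea in base R. The `withTempOptions()` helper is hypothetical and is not how `patRoon` implements `withOpt()`; unlike `withOpt()`, it expects full option names.

```r
# Hypothetical helper, NOT patRoon's implementation: temporarily set options,
# evaluate some code and restore the previous values afterwards.
withTempOptions <- function(newOpts, code)
{
    oldOpts <- options(newOpts)            # options() returns the previous values
    on.exit(options(oldOpts), add = TRUE)  # restore them, even if 'code' fails
    force(code)                            # 'code' is evaluated lazily, i.e. here
}

options(patRoon.MP.maxProcs = 2)
inside <- withTempOptions(list(patRoon.MP.maxProcs = 4),
                          getOption("patRoon.MP.maxProcs"))
inside                            # the value that was active inside the block
getOption("patRoon.MP.maxProcs")  # the original value is active again
```

Restoring through `on.exit()` ensures that an error inside the code block does not leave the temporary value in place.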
8.13.2.2 Multiprocessing with futures
The primary goal of the “future” method is to allow parallel processing on one or more external computers. Since it uses the future R package, many approaches are supported, such as local parallelization (similar to the `classic` method), cluster computing via multiple networked computers, and more advanced HPC approaches such as `slurm` via the future.batchtools R package. This parallelization method can be activated as follows:
```r
options(patRoon.MP.method = "future")
# set a future plan
# example 1: start a local cluster with four nodes
future::plan("cluster", workers = 4)
# example 2: start a networked cluster with four nodes on PC with hostname "otherpc"
future::plan("cluster", workers = rep("otherpc", 4))
```
Please see the documentation of the respective packages (e.g. future and future.batchtools) for more details on how to configure the workers.
The `withOpt()` function introduced in the previous subsection can also be used to temporarily switch between parallelization approaches, for instance to run a single step with the classic method while the rest of the workflow uses futures.
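A minimal sketch of such a temporary switch is given below. It assumes that `patRoon` is installed, that `"classic"` is a valid value for `MP.method`, and that a future plan was already configured as shown earlier. The body of the block is only a placeholder for a real workflow step, such as formula generation with `GenForm`.

```r
# use futures by default for the workflow
options(patRoon.MP.method = "future")

if (requireNamespace("patRoon", quietly = TRUE))
{
    # temporarily switch to the classic interface for a single step
    patRoon::withOpt(MP.method = "classic", {
        # placeholder: run a workflow step here, e.g. generateFormulas()
        message("method inside block: ", getOption("patRoon.MP.method"))
    })
}

# outside the block the default set above is still in effect
getOption("patRoon.MP.method")
```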
8.13.2.3 Logging
Most tools that are executed in parallel log their output to text files. These files may contain valuable information, for instance when an error occurred. By default, the log files are stored in the `log` directory in the current working directory. However, you can change this location by setting the `patRoon.MP.logPath` option. If you set this option to `FALSE`, no logging occurs.
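Both settings could look as follows. The directory used here is an arbitrary example path, not a `patRoon` default.

```r
# store the log files in a custom directory (arbitrary example path)
options(patRoon.MP.logPath = file.path(tempdir(), "patRoon-logs"))
getOption("patRoon.MP.logPath")

# disable logging altogether
options(patRoon.MP.logPath = FALSE)
```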
8.13.3 Notes when using parallelization with futures
Some important notes when using the `future` parallelization method:
- `GenForm` currently performs less optimally with future multiprocessing than with the `classic` approach. Nevertheless, it may still be interesting to use the `future` method to move the computations to another system and free up resources on your local system.
- Behind the scenes, the future.apply package is used to schedule the tools to be executed. The `patRoon.MP.futureSched` option sets the value for the `future.scheduling` argument to the `future_lapply()` function, and therefore allows you to tweak the scheduling.
- Make sure that `patRoon` is present with the same version on all computing hosts.
- Make sure that any external dependencies used by multiprocessing, such as `MetFrag` and `SIRIUS`, and local compound databases, such as `PubChemLite`, are also present with the same version and are configured properly. See the Installation section for more details.
- If you encounter errors, it may be handy to switch to `future::plan("sequential")` and see whether it works or whether you get more descriptive error messages.
- To restart the nodes, for instance after re-configuring `patRoon`, updating `R` packages etc., simply re-execute `future::plan(...)`.
- Setting the `future.debug` package option to `TRUE` may give you more insight into what is happening and help you find problems.
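The troubleshooting notes above could be combined as in the following sketch. The scheduling value of 2 is an arbitrary illustration, and the calls assume that the future package is installed.

```r
# run everything in the current R session to obtain clearer error messages
future::plan("sequential")

# let the future package print debugging information
options(future.debug = TRUE)

# value handed to the future.scheduling argument of future_lapply()
# (2 is an arbitrary example value)
options(patRoon.MP.futureSched = 2)

# ... re-run the failing workflow step here ...

# afterwards: disable debugging again; re-executing future::plan(...) with
# your regular settings restarts the workers
options(future.debug = FALSE)
```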