8.13 Parallelization

Some steps in the non-target screening workflow are inherently computationally intensive. To reduce computation times, patRoon can parallelize most of its important functionality. This is especially useful on a modern system with multiple CPU cores and sufficient RAM.

For various technical reasons several parallelization techniques are used. These can be categorized as parallelization of R functions and multiprocessing. The next sections describe both approaches so that you can optimize the workflow.

8.13.1 Parallelization of R functions

Several functions of patRoon support parallelization.

Function | Purpose | Remarks
findFeatures | Obtain feature data | Only envipick and kpic2 algorithms.
generateComponents | Generate components | Only cliquems algorithm.
report | Reporting data |
generateTPs | Obtain transformation products | Only cts algorithm.
optimizeFeatureFinding, optimizeFeatureGrouping | Optimize feature finding/grouping parameters | Discussed elsewhere in this handbook.
calculatePeakQualities | Calculate feature (group) qualities | Discussed elsewhere in this handbook.
predictTox / predictRespFactors | Prediction of toxicities/concentrations | Only compounds methods. Discussed elsewhere in this handbook.

The parallelization is achieved with the future and future.apply R packages. To enable parallelization of these functions the parallel argument must be set to TRUE and the future framework must be properly configured in advance. For example:

# set up three workers to run in parallel
future::plan("multisession", workers = 3)

# find features with enviPick in parallel
fList <- findFeatures(anaInfo, "envipick", parallel = TRUE)

It is important to properly configure the right future plan. Please see the documentation of the future package for more details.
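As a minimal sketch of such a configuration, the snippet below derives the number of workers from the cores that the future package reports as available and restores the previous plan afterwards. Leaving one core free is an assumption made here for illustration, not a patRoon recommendation.

```r
library(future)

# leave one core free for the rest of the system (an illustrative choice)
nWorkers <- max(1, availableCores() - 1)

# plan() returns the previous plan, so it can be restored later
oldPlan <- plan("multisession", workers = nWorkers)

# ... run patRoon functions with parallel = TRUE here ...

# number of workers provided by the active plan
nbrOfWorkers()

# restore the previous plan
plan(oldPlan)
```

Storing and restoring the previous plan keeps the change local to one part of a script, which may be convenient when other code relies on a different plan.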

8.13.2 Multiprocessing

patRoon relies on several external (command-line) tools to generate workflow data. These commands may be executed in parallel to reduce computational times (‘multiprocessing’). The table below outlines the tools that are executed in parallel.

Tool | Used by | Notes
msConvert | convertMSFiles(algorithm="pwiz", ...) |
FileConverter | convertMSFiles(algorithm="openms", ...) |
FeatureFinderMetabo | findFeatures(algorithm="openms", ...) |
julia | findFeatures(algorithm="safd", ...) |
SIRIUS | findFeatures(algorithm="sirius", ...) |
MetaboliteAdductDecharger | generateComponents(algorithm="openms", ...) |
GenForm | generateFormulas(algorithm="genform", ...) |
SIRIUS | generateFormulas(algorithm="sirius", ...), generateCompounds(algorithm="sirius", ...) | Only if splitBatches=TRUE
MetFrag | generateCompounds(algorithm="metfrag", ...) |
pngquant | reportHTML(...) | Only if optimizePng=TRUE
BioTransformer | generateTPs(algorithm = "biotransformer") | Disabled by default (see ?generateTPs for details).

Multiprocessing is either performed by executing processes in the background with the processx R package (classic interface) or by futures, which were introduced in the previous section. An overview of the characteristics of both parallelization techniques is shown below.

classic | future
requires little or no configuration | configuration needed to set up
works with all tools | doesn’t work with pngquant and is slower with GenForm
only supports parallelization on the local computer | allows both local and cluster computing

Which method is used is controlled by the patRoon.MP.method package option. Note that reportHTML() will always use the classic method for pngquant.

8.13.2.1 Classic multiprocessing interface

The classic interface is the ‘original’ method implemented in patRoon and is therefore well tested and optimized. Because it is easier to set up and works well with all tools, it is the default method. It is enabled as follows:

options(patRoon.MP.method = "classic")

The number of parallel processes is configured through the patRoon.MP.maxProcs option. By default it is set to the number of available CPU cores, which usually results in the best performance. However, you may want to lower this, for instance, to keep your computer more responsive while processing or to limit the RAM used by the data processing workflow.

options(patRoon.MP.maxProcs = 2) # do not execute more than two tools in parallel. 
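One way to pick a lower value, sketched here under the assumption that leaving one core idle is enough to keep the system responsive, is to derive it from the detected core count. The parallel package ships with R, and its detectCores() function may return NA on some platforms, which the sketch handles explicitly.

```r
# number of cores detected by R; may be NA if it cannot be determined
nCores <- parallel::detectCores()
if (is.na(nCores)) {
    nCores <- 1
}

# use all but one core, and never fewer than one process
options(patRoon.MP.maxProcs = max(1, nCores - 1))

getOption("patRoon.MP.maxProcs")
```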

This will change the parallelization for the complete workflow. However, it may be desirable to change this for only a part of the workflow. This is easily achieved with the withOpt() function.

# do not execute more than two tools in parallel.
options(patRoon.MP.maxProcs = 2)

# ... but execute up to four GenForm processes
withOpt(MP.maxProcs = 4, {
    formulas <- generateFormulas(fGroups, "genform", ...)
})

The withOpt() function temporarily changes the given option(s) while executing a given code block and restores them afterwards (it is very similar to the with_options() function from the withr R package). Furthermore, notice that withOpt() does not require you to prefix the option names with "patRoon.".
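The set-and-restore behaviour described above can be illustrated with base R alone. The helper withTempOptions() below is hypothetical and only mimics the pattern. It is not patRoon's withOpt() implementation and, unlike withOpt(), it expects full option names.

```r
# hypothetical helper illustrating the set/restore pattern; NOT patRoon's withOpt()
withTempOptions <- function(new, code) {
    old <- options(new)      # options() returns the previous values of what it sets
    on.exit(options(old))    # restore on exit, also when the code throws an error
    force(code)              # evaluate the lazily passed code block
}

options(patRoon.MP.maxProcs = 2)

inside <- withTempOptions(list(patRoon.MP.maxProcs = 4), {
    getOption("patRoon.MP.maxProcs")
})

inside                              # value seen inside the block: 4
getOption("patRoon.MP.maxProcs")    # value after the block: 2
```

Because the restore step is registered with on.exit(), the original value comes back even when the code block fails halfway, which is the main reason to prefer this pattern over setting and resetting options by hand.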

8.13.2.2 Multiprocessing with futures

The primary goal of the “future” method is to allow parallel processing on one or more external computers. Since it uses the future R package, many approaches are supported, such as local parallelization (similar to the classic method), cluster computing via multiple networked computers and more advanced HPC approaches such as Slurm via the future.batchtools R package. This parallelization method can be activated as follows:

options(patRoon.MP.method = "future")

# set a future plan

# example 1: start a local cluster with four nodes
future::plan("cluster", workers = 4)

# example 2: start a networked cluster with four nodes on PC with hostname "otherpc"
future::plan("cluster", workers = rep("otherpc", 4)) 

Please see the documentation of the respective packages (e.g. future and future.batchtools) for more details on how to configure the workers.
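To check that a cluster plan really runs code in separate R processes, the following sketch may help. It assumes a purely local setup: the cluster is created explicitly with makeClusterPSOCK() from the parallelly package (a dependency of future) so that it can be stopped afterwards. For networked workers the count would be replaced by a vector of host names, which is not tested here.

```r
library(future)

# create the cluster explicitly so that it can be inspected and stopped later
cl <- parallelly::makeClusterPSOCK(2)
plan("cluster", workers = cl)

# each future runs in a separate R process: compare process IDs
workerPID <- value(future(Sys.getpid()))
workerPID != Sys.getpid()

# host name as reported by the worker itself
value(future(Sys.info()[["nodename"]]))

# clean up: switch back to sequential processing and stop the workers
plan("sequential")
parallel::stopCluster(cl)
```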

The withOpt() function introduced in the previous subsection can also be used to temporarily switch between parallelization approaches, for instance:

# default to future parallelization
options(patRoon.MP.method = "future")
future::plan("cluster", workers = 4)

# ... do workflow

# do classic parallelization for GenForm
withOpt(MP.method = "classic", {
    formulas <- generateFormulas(fGroups, "genform", ...)
})

# ... do more workflow

8.13.2.3 Logging

Most tools that are executed in parallel will log their output to text files. These files may contain valuable information, for instance, when an error occurred. By default, the log files are stored in the log directory within the current working directory. However, you can change this location by setting the patRoon.MP.logPath option. If you set this option to FALSE, no logging occurs.
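A small sketch of how these settings might be used. The directory name patRoon-logs is an arbitrary choice for this example, and since the layout of the files inside the log directory is not described here, the log files are searched recursively.

```r
# store the logs in a custom directory (arbitrary example location)
options(patRoon.MP.logPath = file.path(tempdir(), "patRoon-logs"))

# ... run the workflow here ...

# collect all files below the log directory (empty if nothing was logged)
logPath <- getOption("patRoon.MP.logPath")
logFiles <- list.files(logPath, recursive = TRUE, full.names = TRUE)

# print the last lines of the most recently modified log file, if any
if (length(logFiles) > 0) {
    newest <- logFiles[which.max(file.mtime(logFiles))]
    cat(utils::tail(readLines(newest), 20), sep = "\n")
}

# disable logging altogether
options(patRoon.MP.logPath = FALSE)
```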

8.13.3 Notes when using parallelization with futures

Some important notes when using the future parallelization method:

  • GenForm currently performs less well with future multiprocessing than with the classic approach. Nevertheless, it may still be interesting to use the future method to move the computations to another system and free up resources on your local system.
  • Behind the scenes the future.apply package is used to schedule the tools to be executed. The patRoon.MP.futureSched option sets the value for the future.scheduling argument to the future_lapply() function, and therefore allows you to tweak the scheduling.
  • Make sure that patRoon is installed with the same version on all computing hosts.
  • Make sure that any external dependencies used by multiprocessing, such as MetFrag and SIRIUS, and local compound databases, such as PubChemLite, are also present with the same version and are configured properly. See the Installation section for more details.
  • If you encounter errors, it may help to switch to future::plan("sequential") to see whether the problem disappears or whether you get more descriptive error messages.
  • In order to restart the nodes, for instance after re-configuring patRoon, updating R packages etc., simply re-execute future::plan(...).
  • Setting the future.debug package option to TRUE may give you more insight into what is happening, which can help to locate problems.
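Some of these checks can be scripted. The sketch below compares the installed version of a package between the local session and the workers, and then falls back to sequential processing with debug output. It uses a local multisession plan and checks the future package itself so that it runs anywhere; in a real workflow the package name would be "patRoon" and the plan would point to the actual hosts. Launching one future per worker usually, but not necessarily, reaches every worker.

```r
library(future)

options(patRoon.MP.method = "future")
plan("multisession", workers = 2)

# package to check; replace with "patRoon" in a real workflow
pkg <- "future"

# launch one future per worker and ask each for its installed version
localVer <- as.character(utils::packageVersion(pkg))
fs <- lapply(seq_len(nbrOfWorkers()), function(i) {
    future(as.character(utils::packageVersion(pkg)))
})
workerVers <- vapply(fs, value, character(1))
allSame <- all(workerVers == localVer)
allSame

# when errors occur: run sequentially with debug output for clearer messages
oldOpts <- options(future.debug = TRUE)
plan("sequential")
# ... re-run the failing step here ...
options(oldOpts)
```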