4.3 Sample analyses

In patRoon, a sample analysis (or simply analysis) refers to a single HRMS measurement of a sample. The raw data of an analysis can be stored in different file types and formats, which are discussed in the next section. The analysis information tells patRoon which analyses should be processed and where to find the raw data, and it stores any other metadata. The data pre-treatment section describes how to convert and prepare the raw data.

4.3.1 Analysis file types and formats and the `msdata` interface

In patRoon a distinction is made between four types of raw data files:

raw: the original raw data files from the HRMS instrument, with formats such as .raw (Thermo, Waters) or .d (Bruker, Agilent).
centroid: exported and centroided data files in the mzML or mzXML format.
profile: exported but not centroided (i.e. profile) data files in mzML or mzXML formats.
ims: exported ion mobility HRMS data files in the mzML format.

Unfortunately, algorithms within the workflow may require different file types/formats, so it is often necessary to convert raw data to one or more other file types/formats. However, for ‘classical’ (non-IMS) workflows it is often sufficient to convert raw data to centroid data in the mzML format. In patRoon the choice of file type and format is primarily based on:

The feature detection algorithm that is used. An overview of requirements is listed in the feature detection section.
The internal code of patRoon to process raw data. This is called the msdata interface.

The msdata interface is used throughout many operations within patRoon, such as loading mass spectra for feature annotation and generating extracted ion chromatograms (EICs) for plotting and reporting data. The msdata interface itself supports different backends, each of which supports different file types and formats. By default the most suitable backend is chosen automatically, depending on the available raw data and which backends are available on your system. The currently supported backends are:

Backend	Supported file types and formats
`"opentims"`	Uses OpenTIMS for highly efficient reading of raw Bruker TIMS data (only available on Windows and Linux). Requires the Bruker TDF-SDK, see the Installation chapter.
`"mzr"`	Uses mzR to read `centroid` files in the `mzML` and `mzXML` formats. This was always used before `patRoon 3.0`.
`"mstoolkit"`	Uses mstoolkit to read `ims` files (`mzML` format) and `centroid` files in the `mzML` and `mzXML` formats. Requires the Rmstoolkitlib `R` package (see the Installation chapter).
`"streamcraft"`	Uses StreamCraft to read `ims` files (`mzML` format) and `centroid` files in the `mzML` and `mzXML` formats.

NOTE The piek feature detection algorithm uses the msdata interface directly and therefore supports a wide range of raw data file types and formats.

See the reference manual for more details on the msdata interface (?msdata).

4.3.2 Analysis information

In patRoon, the analysis information describes the analyses that are to be processed, where they are located and holds any metadata such as replicate information. The analysis information should be a data.frame and is often stored in a variable called anaInfo (of course you are free to choose a different name!).

The analysis information table has a few mandatory columns:

path_raw, path_centroid, path_profile, path_ims: the directory paths to the analyses in the raw, centroid, profile and ims file types, respectively. See the previous section for details on the file types. Leave empty if the file type is not present.
analysis: the name of the analysis. This should be the file name without file extension and without directory path (e.g. C:\\MyAnalysis\\sample1.d becomes sample1). Each value in the analysis column must be unique.
replicate: the replicate to which the analysis belongs. Analyses that are replicates of each other get the same replicate name.
blank: which replicate should be used for blank subtraction. Can be left empty if no subtraction is desired.
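
The naming rule for the analysis column can be illustrated with a minimal base R sketch. The file paths below are hypothetical, and the code only uses basename() and tools::file_path_sans_ext(); it is not how patRoon itself derives the names.

```r
# Hypothetical raw data paths (not part of any patRoon example data)
rawFiles <- c("C:\\MyAnalysis\\sample1.d", "C:\\MyAnalysis\\sample2.d")

# Normalize Windows separators so that basename() behaves the same on all platforms
rawFiles <- gsub("\\", "/", rawFiles, fixed = TRUE)

# Strip the directory path and the file extension
analysisNames <- tools::file_path_sans_ext(basename(rawFiles))
analysisNames

# Each analysis name must be unique
stopifnot(!anyDuplicated(analysisNames))
```

The final check mirrors the requirement that each value in the analysis column is unique.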

If a workflow requires multiple file formats of the same file type, e.g. centroided mzML and mzXML files, then simply store both file formats in the directory specified in the path_XXX column. If data needs to be exported (discussed in the next section), simply assign its destination path to the respective path_XXX column.

The analysis information table can be constructed manually in R (e.g. through import of a CSV file), through a graphical interface with newProject() (discussed previously) or automatically by the generateAnalysisInfo() function. Here is an example of the latter for the example data in the patRoonData package:

# Take example data from patRoonData package (triplicate solvent blank + triplicate standard)
generateAnalysisInfo(fromCentroid = patRoonData::exampleDataPath(),
                     replicate = c(rep("solvent-pos", 3), rep("standard-pos", 3)),
                     blank = "solvent-pos")

#>         analysis                                         path_centroid path_raw path_profile path_ims    replicate       blank
#> 1  solvent-pos-1 /usr/local/lib/R/site-library/patRoonData/extdata/pos                                 solvent-pos solvent-pos
#> 2  solvent-pos-2 /usr/local/lib/R/site-library/patRoonData/extdata/pos                                 solvent-pos solvent-pos
#> 3  solvent-pos-3 /usr/local/lib/R/site-library/patRoonData/extdata/pos                                 solvent-pos solvent-pos
#> 4 standard-pos-1 /usr/local/lib/R/site-library/patRoonData/extdata/pos                                standard-pos solvent-pos
#> 5 standard-pos-2 /usr/local/lib/R/site-library/patRoonData/extdata/pos                                standard-pos solvent-pos
#> 6 standard-pos-3 /usr/local/lib/R/site-library/patRoonData/extdata/pos                                standard-pos solvent-pos

(Note that for the example data the patRoonData::exampleAnalysisInfo() function can also be used.)
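
Manual construction through import of a CSV file could look like the sketch below. The file contents, the centroid_files directory and the anaInfoCSV variable name are made up for illustration, and only base R functions are used.

```r
# Write a small, hypothetical analysis table to a temporary CSV file
csvFile <- tempfile(fileext = ".csv")
writeLines(c(
    "analysis,replicate,blank,path_centroid",
    "solvent-pos-1,solvent-pos,solvent-pos,centroid_files",
    "standard-pos-1,standard-pos,solvent-pos,centroid_files"
), csvFile)

# Import as a data.frame; keep text columns as character
anaInfoCSV <- read.csv(csvFile, stringsAsFactors = FALSE)

# Basic sanity checks on the mandatory columns
stopifnot(all(c("analysis", "replicate", "blank") %in% names(anaInfoCSV)),
          !anyDuplicated(anaInfoCSV$analysis))
anaInfoCSV
```

In practice the CSV file would be maintained by hand or exported from a spreadsheet, and the checks help to catch missing columns or duplicate analysis names early.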

It is possible to add more columns to the analysis information: these can be used to attach additional metadata to each sample analysis. These columns can be added later to the table, or specified directly to generateAnalysisInfo():

# As above, but add some (nonsensical) metadata: location and exposure
generateAnalysisInfo(fromCentroid = patRoonData::exampleDataPath(),
                     replicate = c(rep("solvent-pos", 3), rep("standard-pos", 3)),
                     blank = "solvent-pos",
                     location = c("NL", "NL", "NL", "DE", "DE", "DE"),
                     exposure = c(0, 0, 0, 2, 2, 2))

#>         analysis                                         path_centroid path_raw path_profile path_ims    replicate       blank location exposure
#> 1  solvent-pos-1 /usr/local/lib/R/site-library/patRoonData/extdata/pos                                 solvent-pos solvent-pos       NL        0
#> 2  solvent-pos-2 /usr/local/lib/R/site-library/patRoonData/extdata/pos                                 solvent-pos solvent-pos       NL        0
#> 3  solvent-pos-3 /usr/local/lib/R/site-library/patRoonData/extdata/pos                                 solvent-pos solvent-pos       NL        0
#> 4 standard-pos-1 /usr/local/lib/R/site-library/patRoonData/extdata/pos                                standard-pos solvent-pos       DE        2
#> 5 standard-pos-2 /usr/local/lib/R/site-library/patRoonData/extdata/pos                                standard-pos solvent-pos       DE        2
#> 6 standard-pos-3 /usr/local/lib/R/site-library/patRoonData/extdata/pos                                standard-pos solvent-pos       DE        2

The metadata (location and exposure in the example above) can be used in various ways later in the workflow to process the non-target data.
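
Since the analysis information is a regular data.frame, metadata columns can also be added afterwards with plain R. The following sketch uses a hand-built stand-in table (anaInfoMeta, with a hypothetical centroid_files path) rather than output from generateAnalysisInfo(), and the final subsetting step is ordinary data.frame indexing, not a patRoon function.

```r
# Hand-built stand-in for an analysis information table (paths are hypothetical)
anaInfoMeta <- data.frame(
    analysis = c(paste0("solvent-pos-", 1:3), paste0("standard-pos-", 1:3)),
    replicate = rep(c("solvent-pos", "standard-pos"), each = 3),
    blank = "solvent-pos",
    path_centroid = "centroid_files",
    stringsAsFactors = FALSE
)

# Add metadata columns after the table was created
anaInfoMeta$location <- rep(c("NL", "DE"), each = 3)
anaInfoMeta$exposure <- rep(c(0, 2), each = 3)

# Example use: select the analyses with a non-zero exposure
exposed <- anaInfoMeta[anaInfoMeta$exposure > 0, "analysis"]
exposed
```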

See the reference manual for more details on the analysis information and generateAnalysisInfo() (?`analysis-information`).

4.3.3 Data conversion and pre-treatment

As noted in the previous sections, analyses are typically stored in different file types and formats, and algorithms in the workflow typically only support some of these. Hence, it is often required to perform file conversion.

The convertMSFiles() function supports various algorithms to perform the necessary file conversions:

Algorithm	Usage	Input file types and formats	Output file types and formats	Remarks
ProteoWizard	`convertMSFiles(algorithm = "pwiz", ...)`	all formats and types	all except raw	Most popular and versatile converter.
OpenMS	`convertMSFiles(algorithm = "openms", ...)`	centroid and profile (`mzML` and `mzXML`)	centroid and profile (`mzML` and `mzXML`)	Does not support centroiding.
DataAnalysis	`convertMSFiles(algorithm = "bruker", ...)`	raw (Bruker `.d`)	centroid and profile (`mzML` and `mzXML`)
IMS collapse	`convertMSFiles(algorithm = "imscollapse", ...)`	raw (Bruker TIMS) and ims (mzML) (uses msdata)	centroid (`mzML` and `mzXML`)	Omits MS2 data by default.
TIMSCONVERT	`convertMSFiles(algorithm = "timsconvert", ...)`	raw (Bruker TIMS)	centroid, profile, ims (`mzML`)

NOTE For the conversion of IMS to centroided data it is highly recommended to use the IMS collapse or TIMSCONVERT algorithms, as ProteoWizard currently does not support accurate centroiding of IMS data. For the conversion of Agilent IMS data, ProteoWizard can be used to convert the raw .d files to ims (mzML) files, and IMS collapse can subsequently be used to convert these to centroided files.

The convertMSFiles() function uses the analysis information to locate the input files and the destination paths for the output. The path_XXX columns should contain the desired destination directories for those file types that should be exported. For instance:

anaInfoConv <- data.frame(
    analysis = c("sample1", "sample2"),
    replicate = "replicate",
    blank = "",
    path_raw = "raw_files", # directory containing the raw HRMS instrument files (.d, .raw, ...)
    path_centroid = "centroid_files" # destination directory where the centroided files will be placed
)
anaInfoConv

#>   analysis replicate blank  path_raw  path_centroid
#> 1  sample1 replicate       raw_files centroid_files
#> 2  sample2 replicate       raw_files centroid_files
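
Because the input files are located by combining a path_XXX directory with the analysis name, it can be useful to verify that they are present before starting a conversion. The base R sketch below does this for a stand-in table; the .raw extension and the anaInfoCheck, rawExt and expectedRaw names are assumptions for illustration, and the check is not part of convertMSFiles().

```r
# Stand-in analysis information with hypothetical directories
anaInfoCheck <- data.frame(
    analysis = c("sample1", "sample2"),
    path_raw = "raw_files",
    path_centroid = "centroid_files",
    stringsAsFactors = FALSE
)

# Assumed extension of the raw files (Thermo); adjust for other vendors
rawExt <- ".raw"

# Expected location of each input file: <path_raw>/<analysis><extension>
expectedRaw <- file.path(anaInfoCheck$path_raw, paste0(anaInfoCheck$analysis, rawExt))

# Report which of the expected input files are present
data.frame(file = expectedRaw, exists = file.exists(expectedRaw))
```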

The convertMSFiles() function takes the analysis information and performs the necessary conversions:

# Convert thermo raw files to centroided mzML files
convertMSFiles(anaInfo, typeFrom = "raw", formatFrom = "thermo", typeTo = "centroid", formatTo = "mzML",
               algorithm = "pwiz")

# convert TIMS data to LC-MS like centroided data
convertMSFiles(anaInfo, typeFrom = "raw", formatFrom = "bruker_ims", typeTo = "centroid", formatTo = "mzML",
               algorithm = "timsconvert")

# convert Agilent IMS-HRMS data to ims data in mzML format
convertMSFiles(anaInfo, typeFrom = "raw", formatFrom = "agilent_ims", typeTo = "ims", formatTo = "mzML",
               algorithm = "pwiz")
# ... and then use IMS collapse to obtain LC-MS like centroided mzML files
convertMSFiles(anaInfo, typeFrom = "ims", formatFrom = "mzML", typeTo = "centroid", formatTo = "mzML",
               algorithm = "imscollapse")

The newProject() utility can automatically generate a proper analysis information table and the required code to perform the desired file conversions.

NOTE The IMS collapse algorithm omits MS/MS data by default to save space and speed up file conversion. This algorithm is typically used in post mobility assignment IMS workflows, which do not use MS/MS data from centroided files. Set includeMSMS=TRUE to include MS/MS data.

Besides conversion, other types of data pre-treatment may also need to be performed. For instance, ProteoWizard can be used to apply various data filters, and several utility functions exist to apply mass re-calibration of Bruker data files.

# Use ProteoWizard to perform conversion and apply a filter to only keep MS1 data
# See http://proteowizard.sourceforge.net/tools/msconvert.html for supported filters
convertMSFiles(anaInfo, typeFrom = "raw", formatFrom = "thermo", typeTo = "centroid", formatTo = "mzML",
               algorithm = "pwiz", filters = "msLevel 1")

# perform m/z re-calibration of Bruker data (should be performed prior to file conversion!)
# NOTE: this requires Bruker DataAnalysis
setDAMethod(anaInfo, "path/to/DAMethod.m") # configure Bruker files with given method that has automatic calibration configured
recalibrarateDAFiles(anaInfo) # trigger re-calibration for each analysis
getDACalibrationError(anaInfo) # get calibration error for each analysis

Please see the reference manual for more details (?convertMSFiles, ?`bruker-utils`).