1 The QFeatures class

The QFeatures class stores data as a list of SummarizedExperiment objects that contain data processed at different levels. For instance, a QFeatures object may contain data at the peptide-to-spectrum-match (PSM) level, at the peptide level and at the protein level. We call each SummarizedExperiment object contained in a QFeatures object a set. Because the different sets are often related, they often share the same samples (columns). QFeatures automatically creates links between the related samples and their annotations (stored in a single colData table). Similarly, different sets often share related features (rows). For instance, proteins are composed of peptides and peptides are composed of PSMs. QFeatures automatically creates links between the related features through an AssayLinks object.

Figure 1: The QFeatures data class
The QFeatures object contains a list of SummarizedExperiment objects (see the class description at https://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html).

library("QFeatures")

2 Converting tabular data

QFeatures is designed to process and manipulate the MS-based proteomics data obtained after identification and quantification of the raw MS files. The identification and quantification steps are generally performed by dedicated software (e.g. Sage, FragPipe, Proteome Discoverer, MaxQuant, …) that returns a set of tabular data. We refer to these tables as the assayData tables. readQFeatures() converts them into a QFeatures object.

We distinguish between two use cases: the single-set case and the multi-set case.

2.1 The single-set case

The single-set case generates a QFeatures object with a single SummarizedExperiment object. This is generally the case when reading data at the peptide or protein level, or when the samples were multiplexed (e.g. using TMT) within a single MS run. There are two types of columns:

  • Quantitative columns (quantCols): 1 to n (depending on technology)
  • Feature annotations: e.g. peptide sequence, ion charge, protein name

In this case, each quantitative column contains information for a single sample. This can be schematically represented as below:

Figure 2: Schematic representation of a data table under the single-set case
Quantification columns (quantCols) are represented by different shades of red.

The hyperLOPIT data is an example data set that falls under the single-set case (see ?hlpsms for more details). The quantCols are X126, X127N, X127C, …, X130N, X130C, X131 and correspond to different TMT labels.

In this toy example, there are 3,010 rows corresponding to features (quantified PSMs) and 28 columns corresponding to different data fields generated by MaxQuant during the analysis of the raw MS spectra. The table is converted to a QFeatures object as follows:

data("hlpsms")
quantCols <- grep("^X", colnames(hlpsms))
(qfSingle <- readQFeatures(hlpsms, quantCols = quantCols))
#> Checking arguments.
#> Loading data as a 'SummarizedExperiment' object.
#> Formatting sample annotations (colData).
#> Formatting data as a 'QFeatures' object.
#> An instance of class QFeatures containing 1 set(s):
#>  [1] quants: SummarizedExperiment with 3010 rows and 10 columns
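
The grep() call above returns the positions of the matching columns. As a small base-R illustration (the column names below are a made-up subset, not the full hlpsms header), the same pattern can return either positions or names. Which forms quantCols accepts should be checked in ?readQFeatures.

```r
# Hypothetical column names standing in for a PSM table header
cols <- c("X126", "X127N", "X127C", "Sequence", "Charge", "ProteinAccession")

# Positions of the columns whose name starts with "X"
idx <- grep("^X", cols)

# The same selection expressed as column names
nms <- grep("^X", cols, value = TRUE)

print(idx)
print(nms)
```

Selecting by pattern avoids hard-coding positions, but the pattern must match only the quantitative columns. An annotation column whose name also starts with X would be picked up as well.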

The object returned by readQFeatures() is a QFeatures object containing 1 SummarizedExperiment set. The set is named quants by default, but we could name it psms by providing the name argument:

(qfSingle <- readQFeatures(hlpsms, quantCols = quantCols, name = "psms"))
#> Checking arguments.
#> Loading data as a 'SummarizedExperiment' object.
#> Formatting sample annotations (colData).
#> Formatting data as a 'QFeatures' object.
#> An instance of class QFeatures containing 1 set(s):
#>  [1] psms: SummarizedExperiment with 3010 rows and 10 columns

2.2 The multi-set case

The multi-set case will generate a QFeatures object with multiple SummarizedExperiment objects. This is generally the case when reading data at the PSM level that has been acquired as part of multiple runs. In this case, the identification and quantification software concatenates the results across MS runs in a single table. There are three types of columns:

  • Run identifier column (runCol): e.g. file name.
  • Quantification columns (quantCols): 1 to n (depending on technology).
  • Feature annotations: e.g. peptide sequence, ion charge, protein name.

Each quantitative column contains information for multiple samples. This can be schematically represented as below:

Figure 3: Schematic representation of a data table under the multi-set case
Quantification columns (quantCols) are coloured by run and shaded by label. Every sample is uniquely represented by a colour and shade. Note that every quantCol contains multiple samples.

We will again use the hyperLOPIT data and simulate that it was acquired as part of multiple runs, hence falling under the multi-set case. The MS run is often identified by the name of the file it generated.

hlpsms$FileName <- rep(
    rep(paste0("run", 1:3, ".raw"), each = 4), 
    length.out = nrow(hlpsms)
)

Note that the data set now has a column called “FileName” that identifies 3 different runs.
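
To see how the rows are distributed over the simulated runs, the assignment can be reproduced on a plain vector. This is a base-R sketch that only assumes the 3,010 rows reported for hlpsms; it does not load the data itself.

```r
# Stand-in for nrow(hlpsms); the text reports 3,010 rows
n <- 3010

# Same construction as above: runs are assigned in blocks of 4 rows
fileName <- rep(
    rep(paste0("run", 1:3, ".raw"), each = 4),
    length.out = n
)

# Number of rows assigned to each simulated run
runCounts <- table(fileName)
print(runCounts)
```

The counts (1004, 1004 and 1002) correspond to the per-run set sizes reported by readQFeatures() in this section.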

To prevent a quantification column from containing data from multiple samples, readQFeatures() splits the table into multiple sets according to the runCol column, here given as FileName:

(qfMulti <- readQFeatures(hlpsms, quantCols = quantCols, runCol = "FileName"))
#> Checking arguments.
#> Loading data as a 'SummarizedExperiment' object.
#> Splitting data in runs.
#> Formatting sample annotations (colData).
#> Formatting data as a 'QFeatures' object.
#> An instance of class QFeatures containing 3 set(s):
#>  [1] run1.raw: SummarizedExperiment with 1004 rows and 10 columns 
#>  [2] run2.raw: SummarizedExperiment with 1004 rows and 10 columns 
#>  [3] run3.raw: SummarizedExperiment with 1002 rows and 10 columns

The object returned by readQFeatures() is a QFeatures object containing 3 SummarizedExperiment sets. The sets are automatically named based on the values found in runCol.
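
The idea of splitting by run can be illustrated with base R alone. The sketch below is only an analogy on a made-up table (the values and sequences are hypothetical), not the code readQFeatures() runs internally.

```r
# Toy PSM table: two label channels (X126, X127) measured in two runs
psms <- data.frame(
    X126 = c(10, 11, 12, 13),
    X127 = c(20, 21, 22, 23),
    Sequence = c("PEPA", "PEPB", "PEPA", "PEPC"),
    FileName = c("run1.raw", "run1.raw", "run2.raw", "run2.raw")
)

# One table per run, named after the run identifier
perRun <- split(psms, psms$FileName)

print(names(perRun))
print(sapply(perRun, nrow))
```

After the split, each quantitative column of each per-run table holds values for a single sample, which is the situation of the single-set case.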

3 Including sample annotations

Data often come with sample annotations that provide information about the experimental design. These annotations are generally created by the user. To facilitate sample annotation, readQFeatures() also allows providing the annotation table as the colData argument. Depending on the use case, one or multiple columns are required.

For the single-set case, the colData table must contain a column named quantCols.

Figure 4: colData for the single-set case

Let’s simulate such a table:

(coldata <- DataFrame(
    quantCols = quantCols, 
    condition = rep(c("A", "B"), 5), 
    batch = rep(c("batch1", "batch2"), each = 5)
))
#> DataFrame with 10 rows and 3 columns
#>    quantCols   condition       batch
#>    <integer> <character> <character>
#> 1          1           A      batch1
#> 2          2           B      batch1
#> 3          3           A      batch1
#> 4          4           B      batch1
#> 5          5           A      batch1
#> 6          6           B      batch2
#> 7          7           A      batch2
#> 8          8           B      batch2
#> 9          9           A      batch2
#> 10        10           B      batch2

We can now provide the table to readQFeatures():

(qfSingle <- readQFeatures(hlpsms, quantCols = quantCols, colData = coldata))
#> Checking arguments.
#> Loading data as a 'SummarizedExperiment' object.
#> Formatting sample annotations (colData).
#> Formatting data as a 'QFeatures' object.
#> An instance of class QFeatures containing 1 set(s):
#>  [1] quants: SummarizedExperiment with 3010 rows and 10 columns

For convenience, the quantCols argument can be omitted when providing colData (quantCols are then fetched from this table):

(qfSingle <- readQFeatures(hlpsms, colData = coldata))
#> Checking arguments.
#> Loading data as a 'SummarizedExperiment' object.
#> Formatting sample annotations (colData).
#> Formatting data as a 'QFeatures' object.
#> An instance of class QFeatures containing 1 set(s):
#>  [1] quants: SummarizedExperiment with 3010 rows and 10 columns

The annotations are retrieved as follows:

colData(qfSingle)
#> DataFrame with 10 rows and 3 columns
#>       quantCols   condition       batch
#>       <integer> <character> <character>
#> X126          1           A      batch1
#> X127C         2           B      batch1
#> X127N         3           A      batch1
#> X128C         4           B      batch1
#> X128N         5           A      batch1
#> X129C         6           B      batch2
#> X129N         7           A      batch2
#> X130C         8           B      batch2
#> X130N         9           A      batch2
#> X131         10           B      batch2

For the multi-set case, the colData table must contain a column named quantCols and a column called runCol.

Figure 5: colData for the multi-set case

Let’s simulate an annotation table based on our previous example by duplicating the table for each run:

coldataMulti <- DataFrame()
for (run in paste0("run", 1:3, ".raw")) {
    coldataMulti <- rbind(coldataMulti, DataFrame(runCol = run, coldata))
}
coldataMulti
#> DataFrame with 30 rows and 4 columns
#>          runCol quantCols   condition       batch
#>     <character> <integer> <character> <character>
#> 1      run1.raw         1           A      batch1
#> 2      run1.raw         2           B      batch1
#> 3      run1.raw         3           A      batch1
#> 4      run1.raw         4           B      batch1
#> 5      run1.raw         5           A      batch1
#> ...         ...       ...         ...         ...
#> 26     run3.raw         6           B      batch2
#> 27     run3.raw         7           A      batch2
#> 28     run3.raw         8           B      batch2
#> 29     run3.raw         9           A      batch2
#> 30     run3.raw        10           B      batch2
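
The same kind of table can be built without growing an object in a loop. The sketch below uses a base-R data.frame instead of a DataFrame, which is an assumption made to keep the example self-contained. It also checks that each combination of runCol and quantCols occurs only once, since that pair is what identifies a sample.

```r
# Base-R stand-in for the single-set annotation table
coldata <- data.frame(
    quantCols = 1:10,
    condition = rep(c("A", "B"), 5),
    batch = rep(c("batch1", "batch2"), each = 5)
)
runs <- paste0("run", 1:3, ".raw")

# Build one annotated copy per run, then bind them together
coldataMulti <- do.call(rbind, lapply(runs, function(run) {
    data.frame(runCol = run, coldata)
}))

# Each (runCol, quantCols) pair should describe exactly one sample
pairIsUnique <- anyDuplicated(coldataMulti[, c("runCol", "quantCols")]) == 0
print(dim(coldataMulti))
print(pairIsUnique)
```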

We can provide the table to readQFeatures():

(qfMulti <- readQFeatures(
    hlpsms, quantCols = quantCols, colData = coldataMulti, 
    runCol = "FileName"
))
#> Checking arguments.
#> Loading data as a 'SummarizedExperiment' object.
#> Splitting data in runs.
#> Formatting sample annotations (colData).
#> Formatting data as a 'QFeatures' object.
#> An instance of class QFeatures containing 3 set(s):
#>  [1] run1.raw: SummarizedExperiment with 1004 rows and 10 columns 
#>  [2] run2.raw: SummarizedExperiment with 1004 rows and 10 columns 
#>  [3] run3.raw: SummarizedExperiment with 1002 rows and 10 columns

4 Additional information

4.1 Sample names

readQFeatures() automatically assigns names that are unique across all samples in all sets. In the single-set case, sample names are provided by quantCols.

colnames(qfSingle)
#> CharacterList of length 1
#> [["quants"]] X126 X127C X127N X128C X128N X129C X129N X130C X130N X131

In the multi-set case, sample names are the concatenation of the run name and the quantCols (separated by a _).

colnames(qfMulti)
#> CharacterList of length 3
#> [["run1.raw"]] run1.raw_X126 run1.raw_X127C ... run1.raw_X130N run1.raw_X131
#> [["run2.raw"]] run2.raw_X126 run2.raw_X127C ... run2.raw_X130N run2.raw_X131
#> [["run3.raw"]] run3.raw_X126 run3.raw_X127C ... run3.raw_X130N run3.raw_X131
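
The naming scheme can be mimicked in base R. This is a sketch of the convention described above, using a shortened, hypothetical set of three labels. It is not the package's own implementation.

```r
runs <- paste0("run", 1:3, ".raw")
labels <- c("X126", "X127C", "X127N")

# Concatenate run name and quantification column, separated by "_"
sampleNames <- lapply(runs, function(run) paste(run, labels, sep = "_"))
names(sampleNames) <- runs

# The resulting names are unique across all runs
allNames <- unlist(sampleNames, use.names = FALSE)
print(sampleNames[["run1.raw"]])
print(anyDuplicated(allNames) == 0)
```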

4.2 Special case: empty samples

In some rare cases, it can be beneficial to remove samples where all quantifications are NA. This can occur when the raw data are searched for labels that were not used during the experiment. For instance, the raw data may be quantified expecting TMT-16 labelling while the experiment used TMT-11 labels, or used only half of the TMT-16 labels. The missing label channels are then filled with NAs. When setting removeEmptyCols = TRUE, readQFeatures() automatically detects and removes columns containing only NAs.

hlpsms$X126 <- NA
(qfNoEmptyCol <- readQFeatures(
    hlpsms, quantCols = quantCols, removeEmptyCols = TRUE
))
#> Checking arguments.
#> Loading data as a 'SummarizedExperiment' object.
#> Formatting sample annotations (colData).
#> Formatting data as a 'QFeatures' object.
#> An instance of class QFeatures containing 1 set(s):
#>  [1] quants: SummarizedExperiment with 3010 rows and 9 columns

Note that we have set all values in X126 to missing. Hence, the set contains only 9 columns instead of the previous 10.
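
Detecting such columns amounts to checking that a column has no observed value. The following base-R sketch shows that check on a small, made-up matrix. It illustrates the principle only and is not the code used by removeEmptyCols.

```r
# Toy quantitative matrix in which the X126 channel was never used
quant <- cbind(
    X126 = c(NA, NA, NA),
    X127N = c(1.2, NA, 3.4),
    X127C = c(5.6, 7.8, 9.0)
)

# A column is empty when it has no non-missing value
isEmpty <- colSums(!is.na(quant)) == 0

# Keep only the columns with at least one observed value
quantNoEmpty <- quant[, !isEmpty, drop = FALSE]
print(colnames(quantNoEmpty))
```

Note that X127N is kept although it contains one NA: only columns that are entirely missing are dropped.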

4.3 Reducing verbosity

Every call to readQFeatures() prints progress messages to the console. To disable these messages, set the verbose argument to FALSE:

(qfSingle <- readQFeatures(
    hlpsms, quantCols = quantCols, verbose = FALSE
))
#> An instance of class QFeatures containing 1 set(s):
#>  [1] quants: SummarizedExperiment with 3010 rows and 10 columns

5 Under the hood

readQFeatures() proceeds as follows:

  1. The assayData table must be provided as a data.frame (or any format that can be coerced to a data.frame). readQFeatures() converts the table to a SummarizedExperiment object, using quantCols to identify the quantitative values that are stored in the assay slot. Any other column is considered a feature annotation and is stored as rowData.

Figure 6: Step 1: Convert the input table to a SummarizedExperiment object

  2. (Only for the multi-set case) The SummarizedExperiment object is split according to the acquisition run provided by the runCol column in assayData.

Figure 7: Step 2: Split by acquisition run

  3. The sample annotations are generated. If no colData is provided, the sample annotations are empty. Otherwise, readQFeatures() matches the information from assayData and colData based on quantCols (single-set case) or on quantCols and runCol (multi-set case). Sample annotations are stored in the colData slot of the QFeatures object.

Figure 8: Step 3: Adding and matching the sample annotations

  4. Finally, the SummarizedExperiment sets and the colData are converted to a QFeatures object.

Figure 9: Step 4: Converting to a QFeatures object
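
The first and third steps above can be pictured with plain data frames and matrices. The sketch below is a rough base-R analogy on a made-up table. The object names (assay, rowAnnot, sampleAnnot) are hypothetical, and the real function builds SummarizedExperiment and QFeatures objects rather than these simple structures.

```r
# Toy input table: two quantitative columns and two feature annotations
tab <- data.frame(
    X126 = c(10, 11, 12),
    X127 = c(20, 21, 22),
    Sequence = c("PEPA", "PEPB", "PEPC"),
    Charge = c(2L, 3L, 2L)
)
quantCols <- c("X126", "X127")

# Analogy for the first step: quantitative values go to a matrix,
# every other column becomes a feature (row) annotation
assay <- as.matrix(tab[, quantCols])
rowAnnot <- tab[, setdiff(colnames(tab), quantCols), drop = FALSE]

# Analogy for the third step: align a user-provided annotation table
# with the samples, whatever the order in which it was written
coldata <- data.frame(
    quantCols = c("X127", "X126"),
    condition = c("B", "A")
)
sampleAnnot <- coldata[match(colnames(assay), coldata$quantCols), ]
rownames(sampleAnnot) <- colnames(assay)
print(sampleAnnot)
```

Matching on the quantCols values rather than on row order is what makes the annotation table robust to reordering.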

6 What about other input formats?

readQFeatures() should work with any PSM quantification table that is output by pre-processing software. For instance, you can easily import the PSM tables generated by Proteome Discoverer. The run names are contained in the File ID column (which should be supplied as the runCol argument to readQFeatures()). The quantitative data are contained in the columns starting with Abundance, optionally followed by a multiplexing tag name. The names of these columns should be stored in the quantCols column of the colData table supplied to readQFeatures().
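
As an illustration of preparing such an import, the sketch below builds a sample annotation skeleton from a made-up table. Only the File ID column name and the Abundance prefix come from the text; the tag names, sequences and values are hypothetical. The table uses the runCol and quantCols column names required for the multi-set case.

```r
# Hypothetical Proteome Discoverer-like PSM table (made-up content)
pd <- data.frame(
    "File ID" = c("F1", "F1", "F2"),
    "Abundance 126" = c(100, 110, 120),
    "Abundance 127N" = c(200, 210, 220),
    "Sequence" = c("PEPA", "PEPB", "PEPA"),
    check.names = FALSE  # keep the spaces in the column names
)

# Quantification columns are identified by their common prefix
quantCols <- grep("^Abundance", colnames(pd), value = TRUE)
runs <- unique(pd[["File ID"]])

# One annotation row per run and per quantification column
coldata <- expand.grid(
    quantCols = quantCols, runCol = runs,
    stringsAsFactors = FALSE
)
print(coldata)
```

Experimental design columns (condition, batch and so on) can then be added to this skeleton before passing it as colData.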

The QFeatures package is meant for both label-free and multiplexed proteomics data. Importing LFQ data is similar to the examples above with the only difference that quantCols would have only 1 element.

The readQFeaturesFromDIANN() function is designed to import label-free and plexDIA/mTRAQ Report.tsv files generated by DIA-NN.

For more information, see the ?readQFeatures and ?readQFeaturesFromDIANN manual pages, which describe the main principles of data import and formatting.

If your input cannot be loaded using the procedure described in this vignette, you can submit a feature request (see next section).

7 Need help?

You can open an issue on the GitHub repository if you have trouble loading your data with readQFeatures(). Suggestions and feature requests about the function or the documentation are also welcome.

Session information

R Under development (unstable) (2024-10-21 r87258)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] grid      stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] DT_0.33                     ComplexHeatmap_2.23.0      
 [3] gplots_3.2.0                dplyr_1.1.4                
 [5] ggplot2_3.5.1               QFeatures_1.17.1           
 [7] MultiAssayExperiment_1.33.4 SummarizedExperiment_1.37.0
 [9] Biobase_2.67.0              GenomicRanges_1.59.1       
[11] GenomeInfoDb_1.43.2         IRanges_2.41.2             
[13] S4Vectors_0.45.2            BiocGenerics_0.53.3        
[15] generics_0.1.3              MatrixGenerics_1.19.1      
[17] matrixStats_1.5.0           BiocStyle_2.35.0           

loaded via a namespace (and not attached):
 [1] bitops_1.0-9            rlang_1.1.4             magrittr_2.0.3         
 [4] clue_0.3-66             GetoptLong_1.0.5        compiler_4.5.0         
 [7] png_0.1-8               vctrs_0.6.5             reshape2_1.4.4         
[10] stringr_1.5.1           ProtGenerics_1.39.1     pkgconfig_2.0.3        
[13] shape_1.4.6.1           crayon_1.5.3            fastmap_1.2.0          
[16] magick_2.8.5            XVector_0.47.2          labeling_0.4.3         
[19] caTools_1.18.3          rmarkdown_2.29          UCSC.utils_1.3.0       
[22] tinytex_0.54            purrr_1.0.2             xfun_0.50              
[25] cachem_1.1.0            jsonlite_1.8.9          DelayedArray_0.33.3    
[28] parallel_4.5.0          cluster_2.1.8           R6_2.5.1               
[31] bslib_0.8.0             stringi_1.8.4           RColorBrewer_1.1-3     
[34] limma_3.63.3            jquerylib_0.1.4         Rcpp_1.0.13-1          
[37] bookdown_0.42           iterators_1.0.14        knitr_1.49             
[40] BiocBaseUtils_1.9.0     Matrix_1.7-1            igraph_2.1.3           
[43] tidyselect_1.2.1        abind_1.4-8             yaml_2.3.10            
[46] doParallel_1.0.17       codetools_0.2-20        lattice_0.22-6         
[49] tibble_3.2.1            plyr_1.8.9              withr_3.0.2            
[52] evaluate_1.0.1          circlize_0.4.16         pillar_1.10.1          
[55] BiocManager_1.30.25     KernSmooth_2.23-26      foreach_1.5.2          
[58] munsell_0.5.1           scales_1.3.0            gtools_3.9.5           
[61] glue_1.8.0              lazyeval_0.2.2          tools_4.5.0            
[64] Cairo_1.6-2             tidyr_1.3.1             crosstalk_1.2.1        
[67] MsCoreUtils_1.19.0      msdata_0.47.0           colorspace_2.1-1       
[70] GenomeInfoDbData_1.2.13 cli_3.6.3               S4Arrays_1.7.1         
[73] AnnotationFilter_1.31.0 gtable_0.3.6            sass_0.4.9             
[76] digest_0.6.37           SparseArray_1.7.2       rjson_0.2.23           
[79] htmlwidgets_1.6.4       farver_2.1.2            htmltools_0.5.8.1      
[82] lifecycle_1.0.4         httr_1.4.7              GlobalOptions_0.1.2    
[85] statmod_1.5.0           MASS_7.3-64            

License

This vignette is distributed under a CC BY-SA license.

Reference