readQFeatures()
QFeatures 1.17.1
QFeatures class
The QFeatures class stores data as a list of SummarizedExperiment
objects that contain data processed at different levels. For instance,
a QFeatures object may contain data at the peptide-to-spectrum-match
(PSM) level, at the peptide level and at the protein level. We call
each SummarizedExperiment object contained in a QFeatures object a
set. Because the different sets are often related, they often share
the same samples (columns). QFeatures automatically creates links
between the related samples and their annotations (stored in a single
colData table). Similarly, different sets often share related
features (rows). For instance, proteins are composed of peptides and
peptides are composed of PSMs. QFeatures automatically creates links
between the related features through an AssayLinks object.
library("QFeatures")
QFeatures is designed to process and manipulate the MS-based
proteomics data obtained after identification and quantification of
the raw MS files. The identification and quantification steps are
generally performed by dedicated software (e.g. Sage, FragPipe,
Proteome Discoverer, MaxQuant, …) that returns a set of tabular data.
readQFeatures() converts these tabular data into a QFeatures object.
We refer to these tables as the assayData tables.
We distinguish between two use cases: the single-set case and the multi-set case.
The single-set case will generate a QFeatures object with a single
SummarizedExperiment object. This is generally the case when reading
data at the peptide or protein level, or when the samples were
multiplexed (e.g. using TMT) within a single MS run. There are two
types of columns:
- Quantitative columns (quantCols): 1 to n (depending on technology).
- Feature annotations: any other column.
In this case, each quantitative column contains information for a
single sample.
[Figure: schematic representation of the single-set case.]
The hyperLOPIT data is an example dataset that falls under the
single-set case (see ?hlpsms for more details). The quantCols are
X126, X127N, X127C, …, X130N, X130C, X131 and correspond to
different TMT labels.
In this toy example, there are 3,010 rows corresponding to features
(quantified PSMs) and 28 columns corresponding to different data
fields generated by MaxQuant during the analysis of the raw MS
spectra. The table is converted to a QFeatures object as follows:
data("hlpsms")
quantCols <- grep("^X", colnames(hlpsms))
(qfSingle <- readQFeatures(hlpsms, quantCols = quantCols))
#> Checking arguments.
#> Loading data as a 'SummarizedExperiment' object.
#> Formatting sample annotations (colData).
#> Formatting data as a 'QFeatures' object.
#> An instance of class QFeatures containing 1 set(s):
#> [1] quants: SummarizedExperiment with 3010 rows and 10 columns
The object returned by readQFeatures() is a QFeatures object
containing 1 SummarizedExperiment set. The set is named quants by
default, but we could name it psms by providing the name argument:
(qfSingle <- readQFeatures(hlpsms, quantCols = quantCols, name = "psms"))
#> Checking arguments.
#> Loading data as a 'SummarizedExperiment' object.
#> Formatting sample annotations (colData).
#> Formatting data as a 'QFeatures' object.
#> An instance of class QFeatures containing 1 set(s):
#> [1] psms: SummarizedExperiment with 3010 rows and 10 columns
The multi-set case will generate a QFeatures object with multiple
SummarizedExperiment objects. This is generally the case when
reading data at the PSM level that has been acquired as part of
multiple runs. In this case, the identification and quantification
software concatenates the results across MS runs in a single table.
There are three types of columns:
- Run identifier column (runCol): e.g. file name.
- Quantitative columns (quantCols): 1 to n (depending on technology).
- Feature annotations: any other column.
Each quantitative column contains information for multiple samples.
[Figure: schematic representation of the multi-set case.]
We will again use the hyperLOPIT data and simulate that it was
acquired as part of multiple runs, hence falling under the multi-set
case. The MS run is often identified by the name of the file it
generated.
hlpsms$FileName <- rep(
rep(paste0("run", 1:3, ".raw"), each = 4),
length.out = nrow(hlpsms)
)
Note that the data set now has a column called “FileName” with 3 different runs.
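As a quick check of the run assignment, the number of rows per run can be tabulated. The sketch below does not load the data: it rebuilds the same FileName vector for 3,010 rows (the row count of hlpsms reported above). The names n_rows, file_name and run_counts are illustrative only.

```r
# Number of rows in the hlpsms table, as reported above
n_rows <- 3010

# Same recycling pattern as used for hlpsms$FileName: blocks of 4
# consecutive rows are assigned to run1, run2, run3, run1, ...
file_name <- rep(
    rep(paste0("run", 1:3, ".raw"), each = 4),
    length.out = n_rows
)

# Count the rows assigned to each run
run_counts <- table(file_name)
print(run_counts)
```

On the actual table, table(hlpsms$FileName) gives the same information. Because 3,010 is not a multiple of 12, the last run receives slightly fewer rows than the first two.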
To prevent a quantification column from containing data from multiple
samples, readQFeatures() splits the table into multiple sets
according to the runCol column, here given as FileName:
(qfMulti <- readQFeatures(hlpsms, quantCols = quantCols, runCol = "FileName"))
#> Checking arguments.
#> Loading data as a 'SummarizedExperiment' object.
#> Splitting data in runs.
#> Formatting sample annotations (colData).
#> Formatting data as a 'QFeatures' object.
#> An instance of class QFeatures containing 3 set(s):
#> [1] run1.raw: SummarizedExperiment with 1004 rows and 10 columns
#> [2] run2.raw: SummarizedExperiment with 1004 rows and 10 columns
#> [3] run3.raw: SummarizedExperiment with 1002 rows and 10 columns
The object returned by readQFeatures() is a QFeatures object
containing 3 SummarizedExperiment sets. The sets are automatically
named based on the values found in runCol.
Data often comes with sample annotations that provide information
about the experimental design. These annotations are generally
created by the user. To facilitate sample annotation, readQFeatures()
also accepts the annotation table through the colData argument.
Depending on the use case, one or multiple columns are required.
For the single-set case, the colData table must contain a column
named quantCols.
Let’s simulate such a table:
(coldata <- DataFrame(
quantCols = quantCols,
condition = rep(c("A", "B"), 5),
batch = rep(c("batch1", "batch2"), each = 5)
))
#> DataFrame with 10 rows and 3 columns
#> quantCols condition batch
#> <integer> <character> <character>
#> 1 1 A batch1
#> 2 2 B batch1
#> 3 3 A batch1
#> 4 4 B batch1
#> 5 5 A batch1
#> 6 6 B batch2
#> 7 7 A batch2
#> 8 8 B batch2
#> 9 9 A batch2
#> 10 10 B batch2
We can now provide the table to readQFeatures():
(qfSingle <- readQFeatures(hlpsms, quantCols = quantCols, colData = coldata))
#> Checking arguments.
#> Loading data as a 'SummarizedExperiment' object.
#> Formatting sample annotations (colData).
#> Formatting data as a 'QFeatures' object.
#> An instance of class QFeatures containing 1 set(s):
#> [1] quants: SummarizedExperiment with 3010 rows and 10 columns
For convenience, the quantCols argument can be omitted when
providing colData (quantCols are then fetched from this table):
(qfSingle <- readQFeatures(hlpsms, colData = coldata))
#> Checking arguments.
#> Loading data as a 'SummarizedExperiment' object.
#> Formatting sample annotations (colData).
#> Formatting data as a 'QFeatures' object.
#> An instance of class QFeatures containing 1 set(s):
#> [1] quants: SummarizedExperiment with 3010 rows and 10 columns
The annotations are retrieved as follows:
colData(qfSingle)
#> DataFrame with 10 rows and 3 columns
#> quantCols condition batch
#> <integer> <character> <character>
#> X126 1 A batch1
#> X127C 2 B batch1
#> X127N 3 A batch1
#> X128C 4 B batch1
#> X128N 5 A batch1
#> X129C 6 B batch2
#> X129N 7 A batch2
#> X130C 8 B batch2
#> X130N 9 A batch2
#> X131 10 B batch2
For the multi-set case, the colData table must contain a column
named quantCols and a column named runCol.
Let’s simulate an annotation table based on our previous example by duplicating the table for each run:
coldataMulti <- DataFrame()
for (run in paste0("run", 1:3, ".raw")) {
coldataMulti <- rbind(coldataMulti, DataFrame(runCol = run, coldata))
}
coldataMulti
#> DataFrame with 30 rows and 4 columns
#> runCol quantCols condition batch
#> <character> <integer> <character> <character>
#> 1 run1.raw 1 A batch1
#> 2 run1.raw 2 B batch1
#> 3 run1.raw 3 A batch1
#> 4 run1.raw 4 B batch1
#> 5 run1.raw 5 A batch1
#> ... ... ... ... ...
#> 26 run3.raw 6 B batch2
#> 27 run3.raw 7 A batch2
#> 28 run3.raw 8 B batch2
#> 29 run3.raw 9 A batch2
#> 30 run3.raw 10 B batch2
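The same run-by-channel table can also be built without a loop. The following is a minimal base R sketch that uses merge() with by = NULL to form every run/channel combination; it works on a plain data.frame, and the names coldata_df, runs_df and coldata_multi_df are illustrative only.

```r
# Per-channel annotations, mirroring the single-set table above
# (a plain data.frame is used here to stay within base R)
coldata_df <- data.frame(
    quantCols = 1:10,
    condition = rep(c("A", "B"), 5),
    batch = rep(c("batch1", "batch2"), each = 5)
)

# One row per run
runs_df <- data.frame(runCol = paste0("run", 1:3, ".raw"))

# by = NULL makes merge() return every run/channel combination
coldata_multi_df <- merge(runs_df, coldata_df, by = NULL)

# Order by run, then by channel, to match the loop-based table
ord <- order(coldata_multi_df$runCol, coldata_multi_df$quantCols)
coldata_multi_df <- coldata_multi_df[ord, ]
rownames(coldata_multi_df) <- NULL

head(coldata_multi_df)
```

The result can be wrapped in DataFrame() to obtain the same kind of object as coldataMulti.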
We can provide the table to readQFeatures():
(qfMulti <- readQFeatures(
hlpsms, quantCols = quantCols, colData = coldataMulti,
runCol = "FileName"
))
#> Checking arguments.
#> Loading data as a 'SummarizedExperiment' object.
#> Splitting data in runs.
#> Formatting sample annotations (colData).
#> Formatting data as a 'QFeatures' object.
#> An instance of class QFeatures containing 3 set(s):
#> [1] run1.raw: SummarizedExperiment with 1004 rows and 10 columns
#> [2] run2.raw: SummarizedExperiment with 1004 rows and 10 columns
#> [3] run3.raw: SummarizedExperiment with 1002 rows and 10 columns
readQFeatures() automatically assigns names that are unique across
all samples in all sets. In the single-set case, sample names are
provided by quantCols.
colnames(qfSingle)
#> CharacterList of length 1
#> [["quants"]] X126 X127C X127N X128C X128N X129C X129N X130C X130N X131
In the multi-set case, sample names are the concatenation of the run
name and the quantCols (separated by a _).
colnames(qfMulti)
#> CharacterList of length 3
#> [["run1.raw"]] run1.raw_X126 run1.raw_X127C ... run1.raw_X130N run1.raw_X131
#> [["run2.raw"]] run2.raw_X126 run2.raw_X127C ... run2.raw_X130N run2.raw_X131
#> [["run3.raw"]] run3.raw_X126 run3.raw_X127C ... run3.raw_X130N run3.raw_X131
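The naming scheme shown above can be mimicked in base R. This is only a sketch of the convention (run name, underscore, quantification column name), not the code used by readQFeatures(); runs, quant_names and sample_names are illustrative names, and only three channels are used to keep it short.

```r
# Run names and quantification column names (first three channels only)
runs <- paste0("run", 1:3, ".raw")
quant_names <- c("X126", "X127C", "X127N")

# Prefix each quantification column name with its run name
sample_names <- lapply(runs, function(run) {
    paste(run, quant_names, sep = "_")
})
names(sample_names) <- runs

sample_names[["run1.raw"]]

# The combined names are unique across all sets
anyDuplicated(unlist(sample_names)) == 0
```

Prefixing with the run name is what keeps the names unique: the same channel name occurs once per run, so the channel name alone would be duplicated across sets.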
In some rare cases, it can be beneficial to remove samples where all
quantifications are NA. This can occur when the raw data are
searched for labels that were not used during the experiment. For
instance, some may quantify the raw data expecting TMT-16 labelling
while the experiment used TMT-11 labels, or used only half of the
TMT-16 labels. The missing label channels are filled with NAs. When
setting removeEmptyCols = TRUE, readQFeatures() automatically detects
and removes columns containing only NAs.
hlpsms$X126 <- NA
(qfNoEmptyCol <- readQFeatures(
hlpsms, quantCols = quantCols, removeEmptyCols = TRUE
))
#> Checking arguments.
#> Loading data as a 'SummarizedExperiment' object.
#> Formatting sample annotations (colData).
#> Formatting data as a 'QFeatures' object.
#> An instance of class QFeatures containing 1 set(s):
#> [1] quants: SummarizedExperiment with 3010 rows and 9 columns
Note that we have set all values in X126 to missing. Hence, the set
contains only 9 columns instead of the previous 10.
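To illustrate what "columns containing only NAs" means, here is a minimal base R sketch on a toy matrix. It shows one way to detect such columns and is not taken from the readQFeatures() source; quant, is_empty and quant_kept are illustrative names.

```r
# Toy quantification matrix: channel X126 was never used, so it is all NA
quant <- cbind(
    X126  = c(NA, NA, NA),
    X127C = c(1.2, NA, 0.8),
    X127N = c(0.5, 0.7, NA)
)

# A column is empty when none of its values is observed
is_empty <- colSums(!is.na(quant)) == 0

# Keep only the columns with at least one observed value
quant_kept <- quant[, !is_empty, drop = FALSE]
colnames(quant_kept)
```

Columns with some missing values (such as X127C and X127N here) are kept; only fully missing columns are dropped.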
Every call to readQFeatures() prints progress messages to the
console. To disable the console output, set the verbose argument to
FALSE:
(qfSingle <- readQFeatures(
hlpsms, quantCols = quantCols, verbose = FALSE
))
#> An instance of class QFeatures containing 1 set(s):
#> [1] quants: SummarizedExperiment with 3010 rows and 10 columns
readQFeatures() proceeds as follows:
1. The assayData table must be provided as a data.frame (or any
   format that can be coerced to a data.frame). readQFeatures()
   converts the table to a SummarizedExperiment object using
   quantCols to identify the quantitative values that are stored in
   the assay slot. Any other column is considered as feature
   annotation and will be stored as rowData.
2. (Multi-set case only) The SummarizedExperiment object is split
   according to the acquisition run provided by the runCol column in
   assayData.
3. The sample annotations are generated. If no colData is provided,
   the sample annotations are empty. Otherwise, readQFeatures()
   matches the information from assayData and colData based on
   quantCols (single-set case) or quantCols and runCol (multi-set
   case). Sample annotations are stored in the colData slot of the
   QFeatures object.
4. Finally, the SummarizedExperiment sets and the colData are
   converted to a QFeatures object.
readQFeatures()
should work with any PSM quantification table that is
output by pre-processing software. For instance, you can easily
import the PSM tables generated by Proteome Discoverer. The run names
are contained in the File ID column (which should be supplied as the
runCol argument to readQFeatures()). The quantification columns are
the columns starting with Abundance, possibly followed by a
multiplexing tag name. The names of these columns should be stored in
the quantCols column of the colData table supplied to
readQFeatures().
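As an illustration of how the run and quantification columns of such a table could be identified, here is a small base R sketch. The toy table and its exact column names ("Abundance 126", "Sequence", and so on) are hypothetical and only follow the naming pattern described above; pd, quant_cols, run_col and annot_cols are illustrative names.

```r
# Toy table with hypothetical Proteome Discoverer-style column names;
# check.names = FALSE keeps the spaces in the names
pd <- data.frame(
    "File ID" = c("F1", "F1", "F2"),
    "Sequence" = c("PEPA", "PEPB", "PEPA"),
    "Abundance 126" = c(10.1, 8.3, 9.7),
    "Abundance 127N" = c(11.4, NA, 9.9),
    check.names = FALSE
)

# Quantification columns: every column whose name starts with "Abundance"
quant_cols <- grep("^Abundance", colnames(pd), value = TRUE)

# Column that identifies the acquisition run
run_col <- "File ID"

# Any column that is not quantitative is a feature annotation
annot_cols <- setdiff(colnames(pd), quant_cols)

quant_cols
annot_cols
```

With the QFeatures package loaded, quant_cols and run_col would then be supplied as the quantCols and runCol arguments of readQFeatures(), as in the multi-set example above.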
The QFeatures package is meant for both label-free and multiplexed
proteomics data. Importing LFQ data is similar to the examples above,
with the only difference that quantCols would have only 1 element.
The readQFeaturesFromDIANN() function is adapted to import label-free
and plexDIA/mTRAQ Report.tsv files generated by DIA-NN.
For more information, see the ?readQFeatures() and
?readQFeaturesFromDIANN() manual pages, which describe the main
principles of data import and formatting.
If your input cannot be loaded using the procedure described in this vignette, you can submit a feature request (see next section).
You can open an issue on the GitHub repository if you have trouble
loading your data with readQFeatures(). Any suggestions or feature
requests about the function or the documentation are also warmly
welcome.
R Under development (unstable) (2024-10-21 r87258)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS
Matrix products: default
BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/New_York
tzcode source: system (glibc)
attached base packages:
[1] grid stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] DT_0.33 ComplexHeatmap_2.23.0
[3] gplots_3.2.0 dplyr_1.1.4
[5] ggplot2_3.5.1 QFeatures_1.17.1
[7] MultiAssayExperiment_1.33.4 SummarizedExperiment_1.37.0
[9] Biobase_2.67.0 GenomicRanges_1.59.1
[11] GenomeInfoDb_1.43.2 IRanges_2.41.2
[13] S4Vectors_0.45.2 BiocGenerics_0.53.3
[15] generics_0.1.3 MatrixGenerics_1.19.1
[17] matrixStats_1.5.0 BiocStyle_2.35.0
loaded via a namespace (and not attached):
[1] bitops_1.0-9 rlang_1.1.4 magrittr_2.0.3
[4] clue_0.3-66 GetoptLong_1.0.5 compiler_4.5.0
[7] png_0.1-8 vctrs_0.6.5 reshape2_1.4.4
[10] stringr_1.5.1 ProtGenerics_1.39.1 pkgconfig_2.0.3
[13] shape_1.4.6.1 crayon_1.5.3 fastmap_1.2.0
[16] magick_2.8.5 XVector_0.47.2 labeling_0.4.3
[19] caTools_1.18.3 rmarkdown_2.29 UCSC.utils_1.3.0
[22] tinytex_0.54 purrr_1.0.2 xfun_0.50
[25] cachem_1.1.0 jsonlite_1.8.9 DelayedArray_0.33.3
[28] parallel_4.5.0 cluster_2.1.8 R6_2.5.1
[31] bslib_0.8.0 stringi_1.8.4 RColorBrewer_1.1-3
[34] limma_3.63.3 jquerylib_0.1.4 Rcpp_1.0.13-1
[37] bookdown_0.42 iterators_1.0.14 knitr_1.49
[40] BiocBaseUtils_1.9.0 Matrix_1.7-1 igraph_2.1.3
[43] tidyselect_1.2.1 abind_1.4-8 yaml_2.3.10
[46] doParallel_1.0.17 codetools_0.2-20 lattice_0.22-6
[49] tibble_3.2.1 plyr_1.8.9 withr_3.0.2
[52] evaluate_1.0.1 circlize_0.4.16 pillar_1.10.1
[55] BiocManager_1.30.25 KernSmooth_2.23-26 foreach_1.5.2
[58] munsell_0.5.1 scales_1.3.0 gtools_3.9.5
[61] glue_1.8.0 lazyeval_0.2.2 tools_4.5.0
[64] Cairo_1.6-2 tidyr_1.3.1 crosstalk_1.2.1
[67] MsCoreUtils_1.19.0 msdata_0.47.0 colorspace_2.1-1
[70] GenomeInfoDbData_1.2.13 cli_3.6.3 S4Arrays_1.7.1
[73] AnnotationFilter_1.31.0 gtable_0.3.6 sass_0.4.9
[76] digest_0.6.37 SparseArray_1.7.2 rjson_0.2.23
[79] htmlwidgets_1.6.4 farver_2.1.2 htmltools_0.5.8.1
[82] lifecycle_1.0.4 httr_1.4.7 GlobalOptions_0.1.2
[85] statmod_1.5.0 MASS_7.3-64
This vignette is distributed under a CC BY-SA license.