Progenetix is an open data resource that provides curated individual cancer copy number variation (CNV) profiles along with associated metadata sourced from published oncogenomic studies and various data repositories. This vignette provides a comprehensive guide on accessing genomic variant data within the Progenetix database.
If your focus lies in cancer cell lines, you can access data from cancercelllines.org by setting the domain
parameter to "https://cancercelllines.org"
in pgxLoader
function. This data repository originates from CNV profiling data of cell lines initially collected as part of Progenetix and currently includes additional types of genomic mutations.
library(pgxRpi)
library(SummarizedExperiment) # for pgxmatrix data
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#>
#> Attaching package: 'MatrixGenerics'
#> The following objects are masked from 'package:matrixStats':
#>
#> colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#> colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#> colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#> colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#> colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#> colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#> colWeightedMeans, colWeightedMedians, colWeightedSds,
#> colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#> rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#> rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#> rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#> rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#> rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#> rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#> rowWeightedSds, rowWeightedVars
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> Loading required package: generics
#>
#> Attaching package: 'generics'
#> The following objects are masked from 'package:base':
#>
#> as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
#> setequal, union
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
#> mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#> rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
#> unsplit, which.max, which.min
#> Loading required package: S4Vectors
#>
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#>
#> findMatches
#> The following objects are masked from 'package:base':
#>
#> I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: Biobase
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
#>
#> Attaching package: 'Biobase'
#> The following object is masked from 'package:MatrixGenerics':
#>
#> rowMedians
#> The following objects are masked from 'package:matrixStats':
#>
#> anyMissing, rowMedians
pgxLoader
functionThis function loads various data from Progenetix
database via the Beacon v2 API with some extensions (BeaconPlus).
The parameters of this function used in this tutorial:
type
: A string specifying output data type. “g_variants” and “cnv_fraction” are used in this tutorial.output
: A string specifying output file format. The available options depend on the type
parameter. When type
is “g_variants”, available options are NULL (default), “pgxseg”, or “seg”; when type
is “cnv_fraction”, available options are NULL (default) or “pgxmatrix”.biosample_id
: Identifiers used in the query database for identifying biosamples.individual_id
: Identifiers used in the query database for identifying individuals.filters
: Identifiers used in public repositories, bio-ontology terms, or custom terms such as c(“NCIT:C7376”, “PMID:22824167”).
For more information about filters, see the documentation.codematches
: A logical value determining whether to exclude samples from child concepts
of specified filters in the ontology tree. If TRUE, only samples exactly matching the specified filters will be included.
Do not use this parameter when filters
include ontology-irrelevant filters such as PMID and cohort identifiers.
Default is FALSE.limit
: Integer to specify the number of returned profiles. Default is 0 (return all).skip
: Integer to specify the number of skipped profiles. E.g. if skip = 2, limit=500, the first 2*500 =1000 profiles are skipped and the next 500 profiles are returned. Default is NULL (no skip).dataset
: A string specifying the dataset to query from the Beacon response. Default is NULL, which includes results from all datasets.save_file
: A logical value determining whether to save variant data as a local file instead of direct return. Only used when the parameter type
is “g_variants”. Default is FALSE.filename
: A string specifying the path and name of the file to be saved. Only used if the parameter save_file
is TRUE. Default is “variants.tsv”, saved in the current working directory..num_cores
: Integer to specify the number of cores used for the variant query. Only used when the parameter type
is “g_variants”. Default is 1.domain
: A string specifying the domain of the query data resource. Default is "http://progenetix.org"
.entry_point
: A string specifying the entry point of the Beacon v2 API. Default is “beacon”, resulting in the endpoint being "http://progenetix.org/beacon"
.Due to time-out constraints, variant data in Progenetix can only be queried using biosample ids, rather than filters or individual ids. To speed up the process, you can use the num_cores
parameter to enable parallel processing. For more details about filters and retrieving biosample ids, refer to the vignette Introduction_1_load_metadata.
# get 2 samples for demonstration
biosamples <- pgxLoader(type="biosamples", filters = "PMID:20229506", limit=2)
biosample_id <- biosamples$biosample_id
biosample_id
#> [1] "pgxbs-kftviq25" "pgxbs-kftviq27"
There are three output formats.
The default output format extracts variant data from the Beacon v2 response, containing variant id and associated analysis id, biosample id and individual id. The CNV data is represented as copy number change class following the GA4GH Variation Representation Specification (VRS).
variant_1 <- pgxLoader(type="g_variants", biosample_id = biosample_id)
head(variant_1)
#> variant_id analysis_id biosample_id individual_id
#> 1 pgxvar-66577cf9b44ee5c2598e5148 pgxcs-kftwah0f pgxbs-kftviq25 pgxind-kftx4g36
#> 2 pgxvar-66577cf9b44ee5c2598e5149 pgxcs-kftwah0f pgxbs-kftviq25 pgxind-kftx4g36
#> 3 pgxvar-66577cf9b44ee5c2598e514a pgxcs-kftwah0f pgxbs-kftviq25 pgxind-kftx4g36
#> 4 pgxvar-66577cf9b44ee5c2598e514b pgxcs-kftwah0f pgxbs-kftviq25 pgxind-kftx4g36
#> 5 pgxvar-66577cf9b44ee5c2598e514c pgxcs-kftwah0f pgxbs-kftviq25 pgxind-kftx4g36
#> 6 pgxvar-66577cf9b44ee5c2598e514d pgxcs-kftwah0f pgxbs-kftviq25 pgxind-kftx4g36
#> variant_internal_id sequence_id start end
#> 1 1:1731500-12832655:EFO_0030068 refseq:NC_000001.11 1731500 12832655
#> 2 1:12849386-57712606:EFO_0030068 refseq:NC_000001.11 12849386 57712606
#> 3 1:57713043-64335282:EFO_0030068 refseq:NC_000001.11 57713043 64335282
#> 4 1:64338058-68715276:EFO_0030068 refseq:NC_000001.11 64338058 68715276
#> 5 1:68716685-72284670:EFO_0030068 refseq:NC_000001.11 68716685 72284670
#> 6 1:72303254-77320849:EFO_0030068 refseq:NC_000001.11 72303254 77320849
#> variant_copychange
#> 1 EFO:0030068
#> 2 EFO:0030068
#> 3 EFO:0030068
#> 4 EFO:0030068
#> 5 EFO:0030068
#> 6 EFO:0030068
output
= “pgxseg”)This format accesses data from Progenetix API services. The ‘.pgxseg’ file format contains segment mean values (in log2
column), which are equal to log2(copy number of measured sample/copy number of control sample (usually 2)). A few variants are point mutations represented by columns reference_bases
and alternate_bases
.
variant_2 <- pgxLoader(type="g_variants", biosample_id = biosample_id,output = "pgxseg")
head(variant_2)
#> biosample_id reference_name start end log2 variant_type
#> 1 pgxbs-kftviq25 1 1731500 12832655 -0.4922 DEL
#> 2 pgxbs-kftviq25 1 12849386 57712606 -0.4888 DEL
#> 3 pgxbs-kftviq25 1 57713043 64335282 -0.4254 DEL
#> 4 pgxbs-kftviq25 1 64338058 68715276 -0.4098 DEL
#> 5 pgxbs-kftviq25 1 68716685 72284670 -0.3219 DEL
#> 6 pgxbs-kftviq25 1 72303254 77320849 -0.3330 DEL
#> reference_bases alternate_bases variant_state_id variant_state_label
#> 1 . . EFO:0030068 low-level copy number loss
#> 2 . . EFO:0030068 low-level copy number loss
#> 3 . . EFO:0030068 low-level copy number loss
#> 4 . . EFO:0030068 low-level copy number loss
#> 5 . . EFO:0030068 low-level copy number loss
#> 6 . . EFO:0030068 low-level copy number loss
output
= “seg”)This format accesses data from Progenetix API services. This format is similar to the general ‘.seg’ file format and compatible with IGV tool for visualization. The only difference between this file format and the general ‘.seg’ file format is the fifth column. It represents variant type in this format while in the general ‘.seg’ file format, it represents number of probes or bins covered by the segment.
variant_3 <- pgxLoader(type="g_variants", biosample_id = biosample_id,output = "seg")
head(variant_3)
#> biosample_id reference_name start end variant_type log2
#> 1 pgxbs-kftviq25 1 1731500 12832655 DEL -0.4922
#> 2 pgxbs-kftviq25 1 12849386 57712606 DEL -0.4888
#> 3 pgxbs-kftviq25 1 57713043 64335282 DEL -0.4254
#> 4 pgxbs-kftviq25 1 64338058 68715276 DEL -0.4098
#> 5 pgxbs-kftviq25 1 68716685 72284670 DEL -0.3219
#> 6 pgxbs-kftviq25 1 72303254 77320849 DEL -0.3330
Setting save_file
to TRUE in the pgxLoader
function will save the retrieved variants data to a file
rather than returning it directly. By default, the data will be saved in the current working directory,
but you can specify a different path using the filename
parameter. This export functionality is
only available for variants data (when type='g_variants'
).
The following codes creates a ‘.pgxseg’ file with the name “variants.pgxseg” in “~/Downloads/” folder.
pgxLoader(type="g_variants", output="pgxseg", biosample_id=biosample_id, save_file=TRUE,
filename="~/Downloads/variants.pgxseg")
To visualize the ‘.pgxseg’ file, you can either upload it to this link or use the byconaut package for local visualization when dealing with a large number of samples.
The following codes creates a special ‘.seg’ file with the name “variants.seg” in “~/Downloads/” folder.
pgxLoader(type="g_variants", output="seg", biosample_id=biosample_id, save_file=TRUE,
filename="~/Downloads/variants.seg")
You can upload this ‘.seg’ file to IGV tool for visualization.
Because segment variants are not harmonized across samples, Progenetix provides processed CNV features, known as CNV fractions. These fractions represent the proportion of genomic regions overlapping one or more CNVs of a given type, facilitating sample-wise comparisons. The following query is based on filters, but biosample id and individual id are also available for sample-specific CNV fraction queries. For more information about filters, biosample id and individual id, as well as the use of parameters skip
, limit
, and codematches
, see the vignette Introduction_1_load_metadata.
cnv_fraction <- pgxLoader(type="cnv_fraction", filters = "NCIT:C2948")
This includes CNV fraction across chromosome arms, whole chromosomes, or the whole genome.
names(cnv_fraction)
#> [1] "arm_cnv_frac" "chr_cnv_frac" "genome_cnv_frac"
The CNV fraction across chromosomal arms looks like this
head(cnv_fraction$arm_cnv_frac)[,c(1:4, 49:52)]
#> chr1p.dup chr1q.dup chr2p.dup chr2q.dup chr1p.del chr1q.del
#> pgxcs-kftvs0ri 0 0.000 0.000 0.000 0.000 0
#> pgxcs-kftvu9w2 0 0.000 0.000 0.000 0.000 0
#> pgxcs-kftvw1kw 0 0.000 0.000 0.000 0.000 0
#> pgxcs-kftvw1ld 0 0.000 0.000 0.000 0.000 0
#> pgxcs-kftvw1lu 0 0.000 0.000 0.000 0.225 0
#> pgxcs-kftvw1mb 0 0.979 0.989 0.003 0.000 0
#> chr2p.del chr2q.del
#> pgxcs-kftvs0ri 0 0
#> pgxcs-kftvu9w2 0 0
#> pgxcs-kftvw1kw 0 0
#> pgxcs-kftvw1ld 0 0
#> pgxcs-kftvw1lu 0 0
#> pgxcs-kftvw1mb 0 0
The row names are analyses ids from samples that belong to the input filter NCIT:C2948. There are 96 columns. The first 48 columns are duplication fraction across chromosomal arms, followed by deletion fraction. CNV fraction across whole chromosomes is similar, with the only difference in columns.
The CNV fraction across the genome (hg38) looks like this
head(cnv_fraction$genome_cnv_frac)
#> cnvfraction dupfraction delfraction
#> pgxcs-kftvs0ri 0.080 0.036 0.044
#> pgxcs-kftvu9w2 0.058 0.010 0.048
#> pgxcs-kftvw1kw 0.000 0.000 0.000
#> pgxcs-kftvw1ld 0.000 0.000 0.000
#> pgxcs-kftvw1lu 0.027 0.000 0.027
#> pgxcs-kftvw1mb 0.176 0.159 0.017
The first column is the total called fraction, followed by duplication fraction and deletion fraction.
cnvfrac_matrix <- pgxLoader(type="cnv_fraction", output="pgxmatrix", filters = "NCIT:C2948")
The returned data is stored in RangedSummarizedExperiment object, which is a matrix-like container where rows represent ranges of interest (as a GRanges object) and columns represent analyses derived from biosamples. The data looks like this
cnvfrac_matrix
#> class: RangedSummarizedExperiment
#> dim: 6212 53
#> metadata(0):
#> assays(1): cnv_matrix
#> rownames(6212): 1 2 ... 6211 6212
#> rowData names(1): type
#> colnames(53): pgxcs-kftvs0ri pgxcs-kftvu9w2 ... pgxcs-m3io7ksh
#> pgxcs-m3io7kt4
#> colData names(3): analysis_id biosample_id group_id
You could get the interval information by rowRanges
function from SummarizedExperiment package.
rowRanges(cnvfrac_matrix)
#> GRanges object with 6212 ranges and 1 metadata column:
#> seqnames ranges strand | type
#> <Rle> <IRanges> <Rle> | <character>
#> 1 chr1 0-400000 * | DUP
#> 2 chr1 400000-1400000 * | DUP
#> 3 chr1 1400000-2400000 * | DUP
#> 4 chr1 2400000-3400000 * | DUP
#> 5 chr1 3400000-4400000 * | DUP
#> ... ... ... ... . ...
#> 6208 chrY 52400000-53400000 * | DEL
#> 6209 chrY 53400000-54400000 * | DEL
#> 6210 chrY 54400000-55400000 * | DEL
#> 6211 chrY 55400000-56400000 * | DEL
#> 6212 chrY 56400000-57227415 * | DEL
#> -------
#> seqinfo: 24 sequences from an unspecified genome; no seqlengths
To access the CNV fraction matrix, use assay
accesssor from SummarizedExperiment package
assay(cnvfrac_matrix)[1:3,1:3]
#> pgxcs-kftvs0ri pgxcs-kftvu9w2 pgxcs-kftvw1kw
#> 1 0 0 0
#> 2 0 0 0
#> 3 0 0 0
The matrix contains 6212 rows, corresponding to genomic regions. These rows include 3,106 intervals with “gain status” and 3,106 intervals with “loss status”. The columns represent analysis profiles derived from samples that belong to the input filter.
The value is the fraction of the binned interval overlapping with one or more CNVs of the given type (DUP/DEL). For example, if the value in the second row, the first column is 0.2, it means that one or more duplication events overlapped with 20% of the genomic bin located in chromosome 1: 400000-1400000 in the first analysis profile.
To get associated biosample id and filters for analyses, use colData
function from SummarizedExperiment package:
colData(cnvfrac_matrix)
#> DataFrame with 53 rows and 3 columns
#> analysis_id biosample_id group_id
#> <character> <character> <character>
#> pgxcs-kftvs0ri pgxcs-kftvs0ri pgxbs-kftvh262 NCIT:C2948
#> pgxcs-kftvu9w2 pgxcs-kftvu9w2 pgxbs-kftvh9fp NCIT:C2948
#> pgxcs-kftvw1kw pgxcs-kftvw1kw pgxbs-kftvhf4h NCIT:C8893
#> pgxcs-kftvw1ld pgxcs-kftvw1ld pgxbs-kftvhf4i NCIT:C8893
#> pgxcs-kftvw1lu pgxcs-kftvw1lu pgxbs-kftvhf4k NCIT:C8893
#> ... ... ... ...
#> pgxcs-m3io7jp3 pgxcs-m3io7jp3 pgxbs-kftvkxqr NCIT:C2948
#> pgxcs-m3io7jq2 pgxcs-m3io7jq2 pgxbs-kftvl062 NCIT:C2948
#> pgxcs-m3io7kkt pgxcs-m3io7kkt pgxbs-kftvl35c NCIT:C2948
#> pgxcs-m3io7ksh pgxcs-m3io7ksh pgxbs-kftvl0ut NCIT:C2948
#> pgxcs-m3io7kt4 pgxcs-m3io7kt4 pgxbs-kftvl3oa NCIT:C2948
It is noted that the number of analysis profiles does not necessarily equal the number of samples. One biosample id may correspond to multiple analysis ids. group_id
corresponds to the meaning of filters
.
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] SummarizedExperiment_1.37.0 Biobase_2.67.0
#> [3] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
#> [5] IRanges_2.41.2 S4Vectors_0.45.2
#> [7] BiocGenerics_0.53.3 generics_0.1.3
#> [9] MatrixGenerics_1.19.1 matrixStats_1.5.0
#> [11] pgxRpi_1.3.2 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 dplyr_1.1.4 farver_2.1.2
#> [4] fastmap_1.2.0 digest_0.6.37 timechange_0.3.0
#> [7] lifecycle_1.0.4 survival_3.8-3 magrittr_2.0.3
#> [10] compiler_4.5.0 rlang_1.1.5 sass_0.4.9
#> [13] tools_4.5.0 yaml_2.3.10 data.table_1.16.4
#> [16] knitr_1.49 ggsignif_0.6.4 S4Arrays_1.7.1
#> [19] labeling_0.4.3 curl_6.1.0 DelayedArray_0.33.3
#> [22] abind_1.4-8 withr_3.0.2 purrr_1.0.2
#> [25] grid_4.5.0 ggpubr_0.6.0 future_1.34.0
#> [28] xtable_1.8-4 colorspace_2.1-1 ggplot2_3.5.1
#> [31] globals_0.16.3 scales_1.3.0 tinytex_0.54
#> [34] cli_3.6.3 rmarkdown_2.29 crayon_1.5.3
#> [37] future.apply_1.11.3 km.ci_0.5-6 httr_1.4.7
#> [40] survminer_0.5.0 cachem_1.1.0 splines_4.5.0
#> [43] parallel_4.5.0 BiocManager_1.30.25 XVector_0.47.2
#> [46] survMisc_0.5.6 vctrs_0.6.5 Matrix_1.7-1
#> [49] jsonlite_1.8.9 carData_3.0-5 bookdown_0.42
#> [52] car_3.1-3 rstatix_0.7.2 Formula_1.2-5
#> [55] listenv_0.9.1 magick_2.8.5 attempt_0.3.1
#> [58] tidyr_1.3.1 jquerylib_0.1.4 parallelly_1.41.0
#> [61] glue_1.8.0 codetools_0.2-20 lubridate_1.9.4
#> [64] gtable_0.3.6 UCSC.utils_1.3.1 munsell_0.5.1
#> [67] tibble_3.2.1 pillar_1.10.1 htmltools_0.5.8.1
#> [70] GenomeInfoDbData_1.2.13 R6_2.5.1 KMsurv_0.1-5
#> [73] evaluate_1.0.3 lattice_0.22-6 backports_1.5.0
#> [76] broom_1.0.7 bslib_0.8.0 Rcpp_1.0.14
#> [79] SparseArray_1.7.4 gridExtra_2.3 xfun_0.50
#> [82] zoo_1.8-12 pkgconfig_2.0.3