Progenetix is an open data resource that provides curated individual cancer copy number variation (CNV) profiles along with associated metadata sourced from published oncogenomic studies and various data repositories. This vignette provides a comprehensive guide on accessing and utilizing metadata for samples or their corresponding individuals within the Progenetix database.

If your focus lies in cancer cell lines, you can access data from cancercelllines.org by setting the domain parameter to "https://cancercelllines.org" in the pgxLoader function. This data repository originates from CNV profiling data of cell lines that were initially collected as part of Progenetix, and it now also includes additional types of genomic mutations.

1 Load library

library(pgxRpi)

1.1 pgxLoader function

This function loads various types of data from the Progenetix database via the Beacon v2 API with some extensions (BeaconPlus).

The parameters of this function used in this tutorial:

  • type: A string specifying output data type. “individuals”, “biosamples”, “analyses”, “filtering_terms”, and “sample_count” are used in this tutorial.
  • filters: Identifiers used in public repositories, bio-ontology terms, or custom terms, such as c("NCIT:C7376", "PMID:22824167"). When multiple filters are used, they are combined with AND logic when the parameter type is “individuals”, “biosamples”, or “analyses”, and with OR logic when type is “sample_count”.
  • individual_id: Identifiers used in the query database for identifying individuals.
  • biosample_id: Identifiers used in the query database for identifying biosamples.
  • codematches: A logical value determining whether to exclude samples from child concepts of specified filters in the ontology tree. If TRUE, only samples exactly matching the specified filters will be included. Do not use this parameter when filters include ontology-irrelevant filters such as PMID and cohort identifiers. Default is FALSE.
  • limit: Integer to specify the number of returned profiles. Default is 0 (return all).
  • skip: Integer to specify the number of skipped pages of profiles, where each page holds limit profiles. For example, with skip = 2 and limit = 500, the first 2 * 500 = 1000 profiles are skipped and the next 500 profiles are returned. Default is NULL (no skip).
  • dataset: A string specifying the dataset to query from the Beacon response. Default is NULL, which includes results from all datasets.
  • domain: A string specifying the domain of the query data resource. Default is "http://progenetix.org".
  • entry_point: A string specifying the entry point of the Beacon v2 API. Default is “beacon”, resulting in the endpoint being "http://progenetix.org/beacon".
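
As a rough illustration of how domain and entry_point relate to the queried endpoint, the sketch below simply joins the two strings. beacon_endpoint is a hypothetical helper written for this illustration, not a pgxRpi function, and it is only an assumption that other domains follow the same pattern as the documented default.

```r
# Hypothetical helper (not part of pgxRpi): builds the endpoint string
# from domain and entry_point, mirroring the documented default.
beacon_endpoint <- function(domain = "http://progenetix.org",
                            entry_point = "beacon") {
  # drop any trailing slash so the result never contains "//" before the entry point
  paste0(sub("/+$", "", domain), "/", entry_point)
}

beacon_endpoint()
#> [1] "http://progenetix.org/beacon"

# Assumption: the cell line resource is reached with the same entry point
beacon_endpoint(domain = "https://cancercelllines.org/")
#> [1] "https://cancercelllines.org/beacon"
```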

2 Retrieve biosamples information

2.1 Search by filters

Filters are a significant enhancement to the Beacon query API, providing a mechanism for specifying rules to select records based on their field values. To learn more about how to utilize filters in Progenetix, please refer to the documentation.

The following example demonstrates how to access all available filters in Progenetix:

all_filters <- pgxLoader(type="filtering_terms")
head(all_filters)
#>                           id                      label scopes type
#> 1               PATO:0020000              genotypic sex     NA   NA
#> 2               PATO:0020001         male genotypic sex     NA   NA
#> 3               PATO:0020002       female genotypic sex     NA   NA
#> 4        EDAM:operation_3227        EDAM:operation_3227     NA   NA
#> 5        EDAM:operation_3961        EDAM:operation_3961     NA   NA
#> 6 labelSeg-based calibration labelSeg-based calibration     NA   NA
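
Because the filter table is an ordinary data frame, it can be searched with base R. The sketch below works on a small toy copy of the rows printed above (filters_toy is a stand-in, not the full query result) and shows two common lookups: by ontology prefix and by label.

```r
# Toy stand-in for the first rows of all_filters (values copied from the output above)
filters_toy <- data.frame(
  id = c("PATO:0020000", "PATO:0020001", "PATO:0020002", "EDAM:operation_3227"),
  label = c("genotypic sex", "male genotypic sex", "female genotypic sex",
            "EDAM:operation_3227"),
  stringsAsFactors = FALSE
)

# All terms that belong to one ontology prefix
pato_terms <- filters_toy[startsWith(filters_toy$id, "PATO:"), ]

# Case-insensitive search in the label column
female_terms <- filters_toy[grepl("female", filters_toy$label, ignore.case = TRUE), ]
female_terms$id
#> [1] "PATO:0020002"
```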

You can also query filters available in other resources via the Beacon v2 API by setting the domain and entry_point parameters accordingly.

The following query retrieves information about all retinoblastoma samples in Progenetix, utilizing a specific filter based on an NCIt code as a disease identifier.

biosamples <- pgxLoader(type="biosamples", filters = "NCIT:C7541")
# data looks like this
biosamples[1:5,]
#>     biosample_id   individual_id biosample_status_id biosample_status_label
#> 1 pgxbs-kftvh1n1 pgxind-kftx2vtw         EFO:0009656      neoplastic sample
#> 2 pgxbs-kftvh1n3 pgxind-kftx2vty         EFO:0009656      neoplastic sample
#> 3 pgxbs-kftvh1n4 pgxind-kftx2vu0         EFO:0009656      neoplastic sample
#> 4 pgxbs-kftvh1n6 pgxind-kftx2vu2         EFO:0009656      neoplastic sample
#> 5 pgxbs-kftvh1n8 pgxind-kftx2vu4         EFO:0009656      neoplastic sample
#>   sample_origin_type_id sample_origin_type_label histological_diagnosis_id
#> 1           OBI:0001479   specimen from organism                NCIT:C7541
#> 2           OBI:0001479   specimen from organism                NCIT:C7541
#> 3           OBI:0001479   specimen from organism                NCIT:C7541
#> 4           OBI:0001479   specimen from organism                NCIT:C7541
#> 5           OBI:0001479   specimen from organism                NCIT:C7541
#>   histological_diagnosis_label sampled_tissue_id sampled_tissue_label
#> 1               Retinoblastoma    UBERON:0000966               retina
#> 2               Retinoblastoma    UBERON:0000966               retina
#> 3               Retinoblastoma    UBERON:0000966               retina
#> 4               Retinoblastoma    UBERON:0000966               retina
#> 5               Retinoblastoma    UBERON:0000966               retina
#>   pathological_stage_id pathological_stage_label tnm_id tnm_label
#> 1           NCIT:C92207            Stage Unknown     NA        NA
#> 2           NCIT:C92207            Stage Unknown     NA        NA
#> 3           NCIT:C92207            Stage Unknown     NA        NA
#> 4           NCIT:C92207            Stage Unknown     NA        NA
#> 5           NCIT:C92207            Stage Unknown     NA        NA
#>   tumor_grade_id tumor_grade_label age_iso info          notes
#> 1             NA                NA    <NA>   NA Retinoblastoma
#> 2             NA                NA    <NA>   NA Retinoblastoma
#> 3             NA                NA    <NA>   NA Retinoblastoma
#> 4             NA                NA    <NA>   NA Retinoblastoma
#> 5             NA                NA    <NA>   NA Retinoblastoma
#>   icdo_morphology_id icdo_morphology_label icdo_topography_id
#> 1    pgx:icdom-95103   Retinoblastoma, NOS    pgx:icdot-C69.2
#> 2    pgx:icdom-95103   Retinoblastoma, NOS    pgx:icdot-C69.2
#> 3    pgx:icdom-95103   Retinoblastoma, NOS    pgx:icdot-C69.2
#> 4    pgx:icdom-95103   Retinoblastoma, NOS    pgx:icdot-C69.2
#> 5    pgx:icdom-95103   Retinoblastoma, NOS    pgx:icdot-C69.2
#>   icdo_topography_label     pubmed_id cellosaurus_id cbioportal_id
#> 1                Retina PMID:15834944           <NA>          <NA>
#> 2                Retina PMID:15834944           <NA>          <NA>
#> 3                Retina PMID:15834944           <NA>          <NA>
#> 4                Retina PMID:15834944           <NA>          <NA>
#> 5                Retina PMID:15834944           <NA>          <NA>
#>   tcga_project_id analysis_info_experiment_id analysis_info_series_id
#> 1              NA                        <NA>                    <NA>
#> 2              NA                        <NA>                    <NA>
#> 3              NA                        <NA>                    <NA>
#> 4              NA                        <NA>                    <NA>
#> 5              NA                        <NA>                    <NA>
#>   analysis_info_platform_id                cohort_ids geoprov_city
#> 1                      <NA> pgx:cohort-2021progenetix   Heidelberg
#> 2                      <NA> pgx:cohort-2021progenetix   Heidelberg
#> 3                      <NA> pgx:cohort-2021progenetix   Heidelberg
#> 4                      <NA> pgx:cohort-2021progenetix   Heidelberg
#> 5                      <NA> pgx:cohort-2021progenetix   Heidelberg
#>   geoprov_country geoprov_iso_alpha3 geoprov_long_latitude
#> 1         Germany                DEU                 49.41
#> 2         Germany                DEU                 49.41
#> 3         Germany                DEU                 49.41
#> 4         Germany                DEU                 49.41
#> 5         Germany                DEU                 49.41
#>   geoprov_long_longitude                    updated
#> 1                   8.69 2020-09-10 17:44:29.148000
#> 2                   8.69 2020-09-10 17:44:29.150000
#> 3                   8.69 2020-09-10 17:44:29.151000
#> 4                   8.69 2020-09-10 17:44:29.152000
#> 5                   8.69 2020-09-10 17:44:29.154000

The data contains many columns representing different aspects of sample information.
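
With this many columns, it often helps to keep only the ones needed for a given question. The sketch below uses a two-row toy copy of the table (biosamples_toy, with values taken from the output above) to select a few columns and count samples per publication. The choice of columns is only an example.

```r
# Toy stand-in for two rows of the biosamples table (values copied from the output above)
biosamples_toy <- data.frame(
  biosample_id = c("pgxbs-kftvh1n1", "pgxbs-kftvh1n3"),
  individual_id = c("pgxind-kftx2vtw", "pgxind-kftx2vty"),
  histological_diagnosis_label = c("Retinoblastoma", "Retinoblastoma"),
  pubmed_id = c("PMID:15834944", "PMID:15834944"),
  geoprov_country = c("Germany", "Germany"),
  stringsAsFactors = FALSE
)

# Keep a handful of columns of interest
core_cols <- c("biosample_id", "histological_diagnosis_label", "geoprov_country")
core <- biosamples_toy[, core_cols]

# Number of samples contributed by each publication
pub_counts <- table(biosamples_toy$pubmed_id)
```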

2.2 Search by biosample id and individual id

In Progenetix, the biosample id and individual id serve as unique identifiers for biosamples and their corresponding individuals. You can obtain these IDs through a metadata search with filters as described above, or through a query on the website interface.

biosamples_2 <- pgxLoader(type = "biosamples", biosample_id = "pgxbs-kftvki7h", individual_id = "pgxind-kftx6ltu")

biosamples_2
#>     biosample_id   individual_id biosample_status_id biosample_status_label
#> 1 pgxbs-kftvki7h pgxind-kftx6ltd         EFO:0009656      neoplastic sample
#> 2 pgxbs-kftvki7v pgxind-kftx6ltu         EFO:0009656      neoplastic sample
#>   sample_origin_type_id sample_origin_type_label histological_diagnosis_id
#> 1           OBI:0001479   specimen from organism                NCIT:C3512
#> 2           OBI:0001479   specimen from organism                NCIT:C3512
#>   histological_diagnosis_label sampled_tissue_id sampled_tissue_label
#> 1          Lung Adenocarcinoma    UBERON:0002048                 lung
#> 2          Lung Adenocarcinoma    UBERON:0002048                 lung
#>   pathological_stage_id pathological_stage_label
#> 1           NCIT:C27976                 Stage Ib
#> 2           NCIT:C27977               Stage IIIa
#>                                tnm_id
#> 1 NCIT:C48706,NCIT:C48714,NCIT:C48724
#> 2 NCIT:C48706,NCIT:C48714,NCIT:C48728
#>                                            tnm_label tumor_grade_id
#> 1 N1 Stage Finding,N3 Stage Finding,T2 Stage Finding             NA
#> 2 N1 Stage Finding,N3 Stage Finding,T3 Stage Finding             NA
#>   tumor_grade_label age_iso info                 notes icdo_morphology_id
#> 1                NA    P56Y   NA adenocarcinoma [lung]    pgx:icdom-81403
#> 2                NA    P75Y   NA adenocarcinoma [lung]    pgx:icdom-81403
#>   icdo_morphology_label icdo_topography_id icdo_topography_label     pubmed_id
#> 1   Adenocarcinoma, NOS    pgx:icdot-C34.9             Lung, NOS PMID:19607727
#> 2   Adenocarcinoma, NOS    pgx:icdot-C34.9             Lung, NOS PMID:19607727
#>   cellosaurus_id cbioportal_id tcga_project_id analysis_info_experiment_id
#> 1             NA            NA              NA               geo:GSM417055
#> 2             NA            NA              NA               geo:GSM417063
#>   analysis_info_series_id analysis_info_platform_id
#> 1            geo:GSE16597               geo:GPL8690
#> 2            geo:GSE16597               geo:GPL8690
#>                                                                              cohort_ids
#> 1 pgx:cohort-arraymap,pgx:cohort-2021progenetix,pgx:cohort-carriocordo2021heterogeneity
#> 2                                         pgx:cohort-arraymap,pgx:cohort-2021progenetix
#>    geoprov_city          geoprov_country geoprov_iso_alpha3
#> 1 New York City United States of America                USA
#> 2 New York City United States of America                USA
#>   geoprov_long_latitude geoprov_long_longitude                    updated
#> 1                 40.71                 -74.01 2020-09-10 17:46:45.105000
#> 2                 40.71                 -74.01 2020-09-10 17:46:45.115000

It’s also possible to query by a combination of filters, biosample id, and individual id.
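
The example output above contains one row for the requested biosample id and one row for the requested individual id, which suggests that each identifier contributes its own matches. This reading is inferred from the output only and is not documented behaviour. The sketch below reproduces the same selection locally on a toy table.

```r
# Toy table: the first two rows use ids from the output above, the third is an unrelated sample
ids_toy <- data.frame(
  biosample_id = c("pgxbs-kftvki7h", "pgxbs-kftvki7v", "pgxbs-kftvh1n1"),
  individual_id = c("pgxind-kftx6ltd", "pgxind-kftx6ltu", "pgxind-kftx2vtw"),
  stringsAsFactors = FALSE
)

# Assumed union semantics: keep a row if either identifier matches
keep <- ids_toy$biosample_id %in% "pgxbs-kftvki7h" |
  ids_toy$individual_id %in% "pgxind-kftx6ltu"
matched <- ids_toy[keep, ]
```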

2.3 Access a subset of samples

By default, pgxLoader returns all related samples (limit = 0). You can access a subset of them via the parameters limit and skip. For example, to access the first 10 samples, set limit = 10 and skip = 0.

biosamples_3 <- pgxLoader(type = "biosamples", filters = "NCIT:C7541", skip = 0, limit = 10)
# Dimension: Number of samples * features
print(dim(biosamples))
#> [1] 256  37
print(dim(biosamples_3))
#> [1] 10 37
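
The relation between skip and limit can be checked with a little arithmetic. page_range below is a hypothetical helper (not part of pgxRpi) that returns the 1-based positions of the first and last profile on a page, following the rule given in the parameter description.

```r
# Hypothetical helper: positions of the profiles returned for a skip/limit pair.
# skip counts whole pages of size limit, so skip * limit profiles are passed over.
page_range <- function(skip, limit) {
  stopifnot(limit > 0, skip >= 0)
  c(first = skip * limit + 1, last = (skip + 1) * limit)
}

page_range(skip = 0, limit = 10)
#> first  last 
#>     1    10

page_range(skip = 2, limit = 500)
#> first  last 
#>  1001  1500
```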

2.4 Use of the codematches parameter

Some filters, such as NCIt codes, are hierarchical. As a result, the retrieved samples may include not only those annotated with the specified term but also those annotated with its child terms.

unique(biosamples$histological_diagnosis_id)
#> [1] "NCIT:C7541" "NCIT:C8714" "NCIT:C8713"

Setting codematches to TRUE makes the function return only biosamples that exactly match the specified filter, excluding child terms.

biosamples_4 <- pgxLoader(type="biosamples", filters = "NCIT:C7541",codematches = TRUE)
unique(biosamples_4$histological_diagnosis_id)
#> [1] "NCIT:C7541"
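
If the full result has already been downloaded, a comparable restriction can be applied locally without a second query. The sketch below uses a toy table (diag_toy, with made-up sample ids and the three diagnosis codes seen above) and keeps only rows whose diagnosis equals the queried code. It illustrates the idea of exact matching and is not a description of how codematches is implemented.

```r
# Toy table: sample ids are made up, diagnosis codes are those shown above
diag_toy <- data.frame(
  biosample_id = paste0("toy-", 1:4),
  histological_diagnosis_id = c("NCIT:C7541", "NCIT:C8714", "NCIT:C8713", "NCIT:C7541"),
  stringsAsFactors = FALSE
)

# Samples per diagnosis code, parent and child terms alike
code_counts <- table(diag_toy$histological_diagnosis_id)

# Local exact match on the queried code
exact <- diag_toy[diag_toy$histological_diagnosis_id == "NCIT:C7541", ]
unique(exact$histological_diagnosis_id)
#> [1] "NCIT:C7541"
```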

2.5 Query the number of samples in Progenetix

The number of samples matching specific filters can be queried as follows:

pgxLoader(type="sample_count",filters = "NCIT:C7541")
#>      filters          label total_count exact_match_count
#> 1 NCIT:C7541 Retinoblastoma         256               215
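
Taken together with the codematches behaviour described earlier, the two counts can be read as "all samples under the term" and "samples annotated with exactly this term". Under that reading, which is an interpretation rather than a documented definition, their difference is the number of samples annotated with child terms. The numbers below are copied from the output above.

```r
# Counts copied from the sample_count output above
counts <- data.frame(
  filters = "NCIT:C7541",
  label = "Retinoblastoma",
  total_count = 256,
  exact_match_count = 215,
  stringsAsFactors = FALSE
)

# Assumed interpretation: the remainder belongs to child terms of the filter
counts$child_term_count <- counts$total_count - counts$exact_match_count
counts$child_term_count
#> [1] 41
```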

3 Retrieve individuals information

If you want to query details of the individuals (e.g. clinical data) from whom the samples of interest originate, set the parameter type to “individuals” and follow the same steps as above.

individuals <- pgxLoader(type = "individuals", individual_id = "pgxind-kftx26ml", filters = "NCIT:C7541")
# data looks like this
tail(individuals,2)
#>       individual_id      sex_id sex_label age_iso histological_diagnosis_id
#> 254 pgxind-m3io67j5 NCIT:C16576    female    <NA>                NCIT:C7541
#> 255 pgxind-kftx26ml NCIT:C20197      male    <NA>                NCIT:C3493
#>     histological_diagnosis_label followup_time followup_state_id
#> 254               Retinoblastoma          <NA>       EFO:0030039
#> 255 Squamous Cell Lung Carcinoma                     EFO:0030039
#>     followup_state_label diseases_notes       info_legacy_ids
#> 254   no followup status           <NA>                  <NA>
#> 255   no followup status           <NA> PGX_IND_AdSqLu-bjo-01
#>                        updated info_provenance info
#> 254 2024-11-19T03:41:20.977857            <NA>   NA
#> 255 2018-09-26 09:50:52.800000            <NA>   NA
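
Both the biosamples and the individuals tables carry an individual_id column, so the two can be combined with a base R merge. The sketch below uses toy tables with made-up ids and values (bs_toy and ind_toy are stand-ins, not query results) to attach individual-level fields to each sample.

```r
# Toy tables with made-up ids and values, sharing the individual_id column
bs_toy <- data.frame(
  biosample_id = c("bs-1", "bs-2", "bs-3"),
  individual_id = c("ind-A", "ind-B", "ind-A"),
  stringsAsFactors = FALSE
)
ind_toy <- data.frame(
  individual_id = c("ind-A", "ind-B"),
  sex_label = c("female", "male"),
  stringsAsFactors = FALSE
)

# Left join: keep every sample, add the fields of its individual
merged <- merge(bs_toy, ind_toy, by = "individual_id", all.x = TRUE)
```

Using all.x = TRUE keeps samples even when no matching individual row is present, in which case the added columns are NA.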

4 Retrieve analyses information

If you want to know more details about data analyses, set the parameter type to “analyses”. The other steps are the same, except that the parameter codematches is not available: analyses data do not include filter information, even though they can be searched by filters.

analyses <- pgxLoader(type="analyses",biosample_id = c("pgxbs-kftvik5i","pgxbs-kftvik96"))

analyses
#>      analysis_id   biosample_id   individual_id analysis_operation_id
#> 1 pgxcs-kftw8qme pgxbs-kftvik5i pgxind-kftx4963   EDAM:operation_3961
#> 2 pgxcs-kftw8rrh pgxbs-kftvik96 pgxind-kftx49ao   EDAM:operation_3961
#>          analysis_operation_label experiment_id   series_id calling_pipeline
#> 1 Copy number variation detection geo:GSM115217 geo:GSE5051       progenetix
#> 2 Copy number variation detection geo:GSM120460 geo:GSE5359       progenetix
#>   platform_id                            platform_label
#> 1 geo:GPL2826             VUMC MACF human 30K oligo v31
#> 2 geo:GPL3960 MPIMG Homo sapiens 44K aCGH3_MPIMG_BERLIN
#>                      updated
#> 1 2024-11-20T07:24:51.839782
#> 2 2024-11-20T07:24:53.496612
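
The experiment, series, and platform identifiers in this table carry a "geo:" prefix. If plain accession strings are needed, the prefix can be removed with base R, as in the sketch below on a toy copy of the two rows above.

```r
# Toy stand-in for the analyses table (values copied from the output above)
analyses_toy <- data.frame(
  analysis_id = c("pgxcs-kftw8qme", "pgxcs-kftw8rrh"),
  experiment_id = c("geo:GSM115217", "geo:GSM120460"),
  series_id = c("geo:GSE5051", "geo:GSE5359"),
  stringsAsFactors = FALSE
)

# Strip the leading "geo:" prefix; values without it are left unchanged
analyses_toy$geo_series <- sub("^geo:", "", analyses_toy$series_id)
analyses_toy$geo_series
#> [1] "GSE5051" "GSE5359"
```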

5 Visualization of survival data

Suppose you want to investigate whether there are survival differences associated with a particular disease, for example, between younger and older patients, or based on other variables. You can query and visualize the relevant information using the pgxMetaplot function.

5.1 pgxMetaplot function

This function generates a survival plot using metadata of individuals obtained by the pgxLoader function.

The parameters of this function:

  • data: The data frame returned by the pgxLoader function, containing survival data for individuals. The survival state is represented by Experimental Factor Ontology (EFO) terms in the “followup_state_id” column, and the survival time is represented in ISO 8601 duration format in the “followup_time” column.
  • group_id: A string specifying which column is used for grouping in the Kaplan-Meier plot.
  • condition: A string for splitting individuals into younger and older groups, following the ISO 8601 duration format. Only used if group_id is “age_iso”.
  • return_data: A logical value determining whether to return the metadata used for plotting. Default is FALSE.
  • ...: Other parameters relevant to the Kaplan-Meier plot. These include pval, pval.coord, pval.method, conf.int, linetype, and palette (see the ggsurvplot function from the survminer package).
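
To make the role of condition more concrete, the sketch below converts ISO 8601 durations such as "P56Y" to years and splits them at a threshold. iso_years is a hypothetical helper that only reads the whole-year part, and treating ages strictly above the threshold as "older" is an assumption; pgxMetaplot may parse durations and handle the boundary differently.

```r
# Hypothetical helper: whole years from ISO 8601 durations like "P56Y" or "P56Y3M".
# Anything that does not start with a year component becomes NA.
iso_years <- function(x) {
  ok <- grepl("^P[0-9]+Y", x)
  years <- rep(NA_real_, length(x))
  years[ok] <- as.numeric(sub("^P([0-9]+)Y.*$", "\\1", x[ok]))
  years
}

age_iso <- c("P56Y", "P75Y", NA)  # first two values taken from the biosamples output
age <- iso_years(age_iso)
threshold <- iso_years("P70Y")

# Assumed rule: strictly above the threshold counts as "older"
group <- ifelse(age > threshold, "older", "younger")
```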

5.1.1 Example usage

# query metadata of individuals with lung adenocarcinoma
luad_inds <- pgxLoader(type="individuals",filters="NCIT:C3512")
# use 70 years old as the splitting condition
pgxMetaplot(data=luad_inds, group_id="age_iso", condition="P70Y", pval=TRUE)

Note that not all individuals have survival data available. If you set return_data to TRUE, the function returns the metadata of the individuals used for the plot.
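
Before plotting, it can be useful to check how many individuals have a follow-up time at all. The sketch below counts non-missing, non-empty followup_time values in a toy table (ids and the "P24M" value are made up). Whether pgxMetaplot applies exactly this criterion is an assumption.

```r
# Toy table: ids and the follow-up value are made up for illustration
inds_toy <- data.frame(
  individual_id = c("ind-A", "ind-B", "ind-C"),
  followup_time = c("P24M", NA, ""),
  stringsAsFactors = FALSE
)

# A follow-up time counts as available if it is neither NA nor an empty string
has_time <- !is.na(inds_toy$followup_time) & nzchar(inds_toy$followup_time)
n_available <- sum(has_time)
n_available
#> [1] 1
```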

6 Session Info

#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] pgxRpi_1.3.2     BiocStyle_2.35.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.6        xfun_0.50           bslib_0.8.0        
#>  [4] ggplot2_3.5.1       rstatix_0.7.2       lattice_0.22-6     
#>  [7] vctrs_0.6.5         tools_4.5.0         generics_0.1.3     
#> [10] curl_6.1.0          tibble_3.2.1        pkgconfig_2.0.3    
#> [13] Matrix_1.7-1        data.table_1.16.4   lifecycle_1.0.4    
#> [16] compiler_4.5.0      farver_2.1.2        munsell_0.5.1      
#> [19] tinytex_0.54        carData_3.0-5       htmltools_0.5.8.1  
#> [22] sass_0.4.9          yaml_2.3.10         Formula_1.2-5      
#> [25] pillar_1.10.1       car_3.1-3           ggpubr_0.6.0       
#> [28] jquerylib_0.1.4     tidyr_1.3.1         cachem_1.1.0       
#> [31] survminer_0.5.0     magick_2.8.5        abind_1.4-8        
#> [34] km.ci_0.5-6         tidyselect_1.2.1    digest_0.6.37      
#> [37] dplyr_1.1.4         purrr_1.0.2         bookdown_0.42      
#> [40] labeling_0.4.3      splines_4.5.0       fastmap_1.2.0      
#> [43] grid_4.5.0          colorspace_2.1-1    cli_3.6.3          
#> [46] magrittr_2.0.3      survival_3.8-3      broom_1.0.7        
#> [49] withr_3.0.2         scales_1.3.0        backports_1.5.0    
#> [52] lubridate_1.9.4     timechange_0.3.0    rmarkdown_2.29     
#> [55] httr_1.4.7          gridExtra_2.3       ggsignif_0.6.4     
#> [58] zoo_1.8-12          evaluate_1.0.3      knitr_1.49         
#> [61] KMsurv_0.1-5        survMisc_0.5.6      rlang_1.1.5        
#> [64] Rcpp_1.0.14         xtable_1.8-4        glue_1.8.0         
#> [67] BiocManager_1.30.25 attempt_0.3.1       jsonlite_1.8.9     
#> [70] R6_2.5.1