DuplexDiscovereR implements methods for analysing data from RNA cross-linking and proximity ligation protocols such as SPLASH, PARIS, LIGR-seq and other methods which yeild information on RNA-RNA interactions as chimeric read fragments after high-throughput sequencing.
DuplexDiscovereR 1.1.1
DuplexDiscovereR
can be installed from Bioconductor:
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("DuplexDiscovereR")
library(DuplexDiscovereR)
?DuplexDiscovereR
or from Github page:
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
library(devtools)
devtools::install_github("Egors01/DuplexDiscovereR")
For calculating hybridization energies DuplexDiscovereR
uses
RNAduplex program
from the ViennaRNA software suite.
Although this step is optional and is not required for using most of the package methods, we strongly recommend installing ViennaRNA from its web-page, as predicting RNA hybrids is one of the central aims of the analysis of RNA-RNA interaction data. Please note that ViennaRNA is distributed under its own licence.
RNA cross-linking and proximity ligation methods as SPLASH, PARIS, LIGR-seq, RIC-seq and others provide infromation on the RNA-RNA interactions on a transcriptome-wide scale. These experiments generate high-throughput sequencing data containing a fraction of chimeric reads, where each chimeric part corresponds to the base-paired stretches of RNA (RNA duplexes)
DuplexDiscovereR
is designed for the bioinformatic analysis of RNA data.
It employs a workflow that allows users to identify RNA duplexes and
and provides several options for filtering and ranking the results.
Starting from the set of aligned chimeric reads, it implements Classification and filtering of non-RNA duplex alignments and their into duplex groups (DGs). Identified DGs are then annotated with transcripts or other genomic features. P-values are calculated to test the hypothesis that DGs are generated by random ligation. Finally, RNA duplex base pairing and hybridization energies are predicted.
The optimal procedures for identifying RNA duplexes may vary between
methods, as RNA-RNA interaction probing protocols are known to have
biases for certain types of RNA and differ in experimental steps.
DuplexDiscovereR
allows user to build customized analysis by utilizing
its methods for efficient chimeric read clustering, classification and
convenience functions for flexible data filtering, annotation and
visualization.
RNA duplex data is known to be sparse, with little overlap between the results of
between the results of different probing protocols. To facilitate
comparisons between different experiments and to allow straightforward
cross-checking between replicates, DuplexDiscovereR
includes
functionality to intersect multiple RNA interaction datasets.
DuplexDiscovereR
relies on the GInteractions
container from
InteractionSet package for the storing of the RNA-RNA
interaction data and uses tidyverse tibble
and dplyr
for internal
the Gviz - based DuplexTrack
track and
supports the exportof the results to the .sam file.
Key steps of RNA duplex data analysis with DuplexDiscovereR
can be run
through the single function call. It is expected that user already
aligned the raw sequencing reads which produced some chimeric or
split-read alignments. The input can be provided in several formats:
GInteractions
object with chimeric alignments.Full DuplexDiscovereR
workflow consists of the following steps:
Optionally,
Required inputs for minimal run are the following:
data
Chimeric read alignments or coordinates. Object of any
table type, convertible to tibble
or GInteractions
junctions_gr
GRanges
object with splice junction
coordinates.GInteractions
.Optional inputs for executing the full workflow:
anno_gr
a GRanges
object with genes or transcripts
annotation objectdf_counts
a 2-column table with the id
in the 1st column and
raw read count in the 2nd. id
should correspond to the entries
in anno_gr
. Providing this argument triggers p-value
calculation.fafile
path to .fasta file with the genome. Providing this
argument triggers prediction of RNA hybrids.Once installed, we can load the example data. Example dataset is based on the RNA duplex probing of SPLASH ES-cells Aw. et.al 2016 which was aligned with STAR Dobin et.al, 2013 and subset to the chromosome 22.
library(DuplexDiscovereR)
data(RNADuplexesSampleData)
Details on alignment of the reads of example dataset and fetching of the annotation data are described in the document which can be found at
system.file("extdata", "scripts", "DD_data_generation.R", package = "DuplexDiscovereR")
## [1] ""
genome_fasta <- NULL
result <- DuplexDiscovereR::runDuplexDiscoverer(
data = RNADuplexesRawChimSTAR,
junctions_gr = SampleSpliceJncGR,
anno_gr = SampleGeneAnnoGR,
df_counts = RNADuplexesGeneCounts,
sample_name = "example_run",
lib_type = "SE",
table_type = "STAR",
fafile = genome_fasta,
)
## Warning: `as_character()` is deprecated as of rlang 0.4.0
## Please use `vctrs::vec_cast()` instead.
## This warning is displayed once every 8 hours.
The result
is a DuplexDiscovererResults
object which contains several outputs of different
dimensions.
result
## DuplexDiscovereR Results
## Sample name: example_run
## Chimeric reads statistics:
## feature count
## n_reads_input 5000.00
## n_dgs 70.00
## 2arm 730.00
## 2arm_short 85.00
## 2arm_sj 1268.00
## multi_map 218.00
## multi_split 285.00
## multi_split&map 234.00
## not_chim 2173.00
## self_ovl 7.00
## mean_len_A 58.83
## median_len_A 59.00
## mean_len_B 59.19
## median_len_B 59.00
## Use class accessors to retrieve output. I.e duplex_groups(resultsObject)
The primary output are detected duplex groups.
This is a GInteractions
, where each entry represents boundaries
of the detected RNA duplex. Metadata fields as n_reads
, p_val
, score
and others
can be used for ranking and sub-setting results for downstream analysis.
gi_clusters <- dd_get_duplex_groups(result)
head(gi_clusters, 2)
## StrictGInteractions object with 2 interactions and 23 metadata columns:
## seqnames1 ranges1 seqnames2 ranges2 | n_reads
## <Rle> <IRanges> <Rle> <IRanges> | <numeric>
## [1] chr22 39318489-39318567 --- chr22 39319587-39319625 | 16
## [2] chr22 24555998-24556058 --- chr22 24557657-24557731 | 8
## dg_id score gene_id.A gene_id.B gene_name.A
## <integer> <numeric> <character> <character> <character>
## [1] 1 85.8981 ENSG00000100316.16 ENSG00000100316.16 RPL3
## [2] 2 86.3571 ENSG00000100028.12 ENSG00000100028.12 SNRPD3
## gene_name.B gene_type.A gene_type.B ambig.A ambig.B
## <character> <character> <character> <integer> <integer>
## [1] RPL3 protein_coding protein_coding 0 0
## [2] SNRPD3 protein_coding protein_coding 1 1
## ambig_list.A ambig_list.B cis p_val
## <character> <character> <numeric> <numeric>
## [1] <NA> <NA> 1 0.00000e+00
## [2] SNRPD3,ENSG00000286070 SNRPD3,ENSG00000286070 1 2.36047e-18
## p.adj Pa Pb count.chim.A.notB count.chim.notA.B
## <numeric> <numeric> <numeric> <numeric> <numeric>
## [1] 0.00000e+00 0.30084461 0.30084461 0 0
## [2] 3.43341e-18 0.00150707 0.00150707 0 0
## count.chim.notA.notB gene_count.A gene_count.B
## <numeric> <numeric> <numeric>
## [1] 157 795095 795095
## [2] 232 3983 3983
## -------
## regions: 140 ranges and 1 metadata column
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Individual read-level data can be accessed by running:
gi_reads <- dd_get_chimeric_reads(result)
head(gi_reads, 2)
## StrictGInteractions object with 2 interactions and 30 metadata columns:
## seqnames1 ranges1 seqnames2 ranges2 |
## <Rle> <IRanges> <Rle> <IRanges> |
## [1] chr22 37877934-37877970 --- chr22 37878003-37878043 |
## [2] chr22 39314697-39314721 --- chr22 39314743-39314788 |
## readname map_type score brkpt_donorA brkpt_acceptorB
## <character> <character> <numeric> <numeric> <numeric>
## [1] SRR3404943.86173415 2arm 74 37878043 37877933
## [2] SRR3404943.37708269 2arm 67 39314696 39314788
## junction_type repeat_left_lenA repeat_right_lenB cigar_alnA cigar_alnB
## <factor> <numeric> <numeric> <character> <character>
## [1] 2arm 0 1 1S40M36S 41S36M
## [2] 2arm 0 0 45S24M 45M24S
## num_chim_aln max_poss_aln_score non_chim_aln_score bestall_chim_aln_score
## <numeric> <numeric> <numeric> <numeric>
## [1] 1 77 40 74
## [2] 1 69 46 67
## PEmerged_bool readgrp ngapsA ngapsB lengapsA lengapsB n_reads
## <numeric> <logical> <integer> <integer> <numeric> <numeric> <numeric>
## [1] 0 <NA> 0 0 0 0 1
## [2] 0 <NA> 0 0 0 0 1
## read_id splicejnc splicejnc_donor splicejnc_acceptor keep
## <integer> <integer> <numeric> <numeric> <logical>
## [1] 13 0 0 0 TRUE
## [2] 14 0 0 0 TRUE
## dg_id duplex_id n_reads_dg was_clustered
## <integer> <integer> <numeric> <numeric>
## [1] <NA> 1 0 1
## [2] 7 2 8 1
## -------
## regions: 3721 ranges and 0 metadata columns
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Each entry in gi_reads
is a single read. They are tagged by DG with the
dg_id
, but not collapsed into groups. This layout is used for
visualization and may be useful for a more detailed review of the structure of
the individual DGs.
The dimensions of gi_reads
do not correspond 1:1 to the size of the
to the size of the input, since in general not every element in the input data
can be represented as a 2-arm split alignment.
To track each input alignment one can query the the read-level classification results, which have the same dimension with the input.
df_reads <- dd_get_reads_classes(result)
print(dim(df_reads))
## [1] 5000 11
print(dim(RNADuplexesRawChimSTAR))
## [1] 5000 21
df_reads
has fields such as read_type
, map_type
and junction_type
indicating the results of chimeric read classification and
filtering.
This layout is useful for assessing the statistics and highlighting potential problems and optimization points, i.e. how many reads from the input were classified as true chimeric and how many of them were assigned to duplex groups.
table(df_reads$read_type)
##
## 2arm 2arm_short 2arm_sj multi_map multi_split
## 730 85 1268 218 285
## multi_split&map not_chim self_ovl
## 234 2173 7
Among the 2-arm alignments from gi_reads
, only unambigously aligned reads
without self-ovelaps that pass splice-junctions and minimum junction length
filters are subjected to clustering into DGs. To see the determined types of
2-arm alignments, we can check the junction_type
metadata field
table(gi_reads$junction_type)
##
## 2arm 2arm_short self_ovl antisense_ovl
## 1998 85 7 0
The resulting GInteractions
with duplex groups are ready for
integration into transcriptomic analysis using other Bioconductor
packages.
There are several options available for outputting the results for other uses.
One can convert the object with duplex groups to dataframe-like object and write it to file
clusters_dt <- makeDfFromGi(gi_clusters)
write.table(clusters_dt, file = tempfile(pattern = "dgs_out", fileext = ".tab"))
Saving results to .sam file can be a useful tool for downstream visualization with interactive viewers as IGV
writeGiToSAMfile(gi_clusters, file_out = tempfile(pattern = "dgs_out", fileext = ".sam"), genome = "hg38")
Collapsed DGs can be visualized with Gviz -based DuplexTrack, allowing overlays of RNA duplexes with the genes of other features
library(Gviz)
# define plotting region
plotrange <- GRanges(
seqnames = "chr22",
ranges = IRanges(
start = c(37877619),
end = c(37878053)
),
strand = "+"
)
# make genes track
anno_track <- Gviz::GeneRegionTrack(SampleGeneAnnoGR,
name = "chr22 Genes",
range = plotrange
)
# construct DuplexTrack
duplex_track <- DuplexTrack(
gi = gi_clusters,
gr_region = plotrange,
labels.fontsize = 12,
arcConstrain = 4,
annotation.column1 = "gene_name.A",
annotation.column2 = "gene_name.B"
)
plotTracks(list(anno_track, duplex_track),
sizes = c(1, 3),
from = start(plotrange) - 100,
to = end(plotrange) + 100
)
In the example above we have omitted the base-pairing prediction and the
hybridization energies by not passing the path to file with genome as the
fafile
argument. However, the prediction of RNA hybrids is
central to the analysis of RNA-RNA interaction data.
To compute the structure formed by two stretches of RNA
DuplexDiscoverer
uses RNAduplex algorithm from ViennaRNA. Note
that ViennaRNA is distributed under its own licence. Please refer to
ViennaRNA’s license
for details.
If the RNAduplex is installed, all steps of the
analysis can be re-run by calling runDuplexDiscoverer
with the
option fafile = <path to genome fasta>
. Alternatively, one can predict RNA
hybrids by calling separate function on already existing GInteractions
object:
sequence <- paste0(
"AGCUAGCGAUAGCUAGCAUCGUAGCAUCGAUCGUAAGCUAGCUAGCUAGCAUCGAUCGUAGCUAGCAUCGAU",
"CGUAGCAUCGUAGCUAGCUAGCUAUGCGAUU")
# Save the sequence to a temp fasta file
fasta_file = tempfile(fileext = '.fa')
chrom <- "test_chrA"
writeLines(c(">test_chrA", sequence), con = fasta_file)
# Create the GInteraction object
# Define start and end positions for the base-pairing regions
regions <- data.frame(
start1 = c(1, 11, 21, 31, 41),
end1 = c(10, 20, 30, 40, 50),
start2 = c(91, 81, 71, 61, 51),
end2 = c(100, 90, 80, 70, 60)
)
# GRanges objects for the anchors
anchor1 <- GRanges(seqnames = chrom, ranges = IRanges(start = regions$start1, end = regions$end1))
anchor2 <- GRanges(seqnames = chrom, ranges = IRanges(start = regions$start2, end = regions$end2))
interaction <- GInteractions(anchor1, anchor2)
# predict hybrids
getRNAHybrids(interaction,fasta_file)
Due to the sparse nature of RNA duplex data, it is desirable to be able to compare replicates of the same experiment or to compare different data sets.
DuplexDiscovereR
provides a method for finding intersections between
multiple samples. Comparisons between ranges can result in one-to-many
relations, i.e. when a region (or pair a of regions) of one sample contains
multiple regions of other samples. DuplexDiscovereR
addresses this
problem by using the following procedure: first it computes a non-redundant
set of interactions that stacks and collapses interactions from all
input samples. This resulting super-set is then compared in a pairwise
manner with each entry from each individual sample.
To demonstrate how multiple samples can be compared, we will generate three pseudo-replicates from the example data which we clustered above.
First, we will select the clustered reads and divide them into three groups
clust_reads <- gi_reads
clust_reads <- clust_reads[!is.na(clust_reads$dg_id)]
# Create separate pseudo-sample objects for each group
set.seed(123)
group_indices <- sample(rep(1:3, length.out = length(clust_reads)))
group1 <- clust_reads[group_indices == 1]
group2 <- clust_reads[group_indices == 2]
group3 <- clust_reads[group_indices == 3]
Since individual reads have already been assigned to duplex groups, we can collapse them, ensuring that we have some overlap between reads that belong to the same ‘true’ DGs, but are spread across three artificial sets.
group1 <- collapse_duplex_groups(group1)
group2 <- collapse_duplex_groups(group2)
group3 <- collapse_duplex_groups(group3)
a <- list("sample1" = group1, "sample2" = group2, "sample3" = group3)
res_comp <- compareMultipleInteractions(a)
names(res_comp)
## [1] "gi_all" "dt_upset"
Result contains slot for interaction superset and overlap information information. We can visualize the intersections of the samples using the upset plot
# first, check if UpSetR is installed
if (!requireNamespace("UpSetR", quietly = TRUE)) {
stop("Install 'UpSetR' to use this function.")
}
library(UpSetR)
upset(res_comp$dt_upset, text.scale = 1.5)
One can customize the analysis by performing classification, clustering and subsequent annotation procedures step by step, adapting the procedures to the RNA-RNA interaction probing protocol or upstream/downstream analysis.
In this section, we show how users can use the core methods in the package to achieve more flexibility in this task.
Most of the methods in DuplexDiscovereR
work with
the RNA-RNA interaction data stored in GInteraction
object. There are
several metadata fields which are not strictly required by the package,
but utilized in procedures run classification: map_type
, read_id
and
n_reads
,score
for clustering.
The easiest way to run DuplexDiscovereR
step-by-step is to start with a
pre-processing routine. It is fairly flexible in input type: it can be either
GInteractions
, Chimeric.out.Junction file from STAR or generic
.bedpe file with pairs of regions.
data("RNADuplexesSampleData")
new_star <- runDuplexDiscoPreproc(
data = RNADuplexesRawChimSTAR,
table_type = "STAR",
library_type = "SE",
)
new_bedpe <- runDuplexDiscoPreproc(
data = RNADuplexesRawBed,
table_type = "bedpe",
library_type = "SE", return_gi = TRUE
)
By default, to make downstream read filtering slightly more convenient,
tibble
is returned after the pre-processing. In case of the
GInteractions
input, input reads already are 2-arm split by design, so
we can return GInteractions
.
# keep only readname in metadata
mcols(RNADuplexSampleGI) <- mcols(RNADuplexSampleGI)["readname"]
new_gi <- runDuplexDiscoPreproc(data = RNADuplexSampleGI, return_gi = TRUE)
head(new_gi, 1)
## StrictGInteractions object with 1 interaction and 5 metadata columns:
## seqnames1 ranges1 seqnames2 ranges2 |
## <Rle> <IRanges> <Rle> <IRanges> |
## [1] chr22 38303215-38303249 --- chr22 38336125-38336170 |
## readname map_type score n_reads read_id
## <character> <character> <numeric> <numeric> <integer>
## [1] SRR3404943.141026859 2arm 1 1 1
## -------
## regions: 3721 ranges and 0 metadata columns
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
The basic classification routines implemented in the package are comparing the chimeric junctions to the known splice junctions and determining the types of the junctions. Both can be called by a separate method that adds a corresponding metadata field to the input:
gi <- getChimericJunctionTypes(new_gi)
##
## --- filtering 2-arm duplexes by junction type ---
## Duplexes categorized by chimeric junction span: 2090
## Duplexes with normal junctions: 1941: 92.87 %
## Duplexes with self-overlap: 7: 0.33 %
## Duplexes with antisense self-overlap: 0: 0 %
## Duplexes with the junction shorter than 10nt): 142: 6.79 %
gi <- getSpliceJunctionChimeras(gi, sj_gr = SampleSpliceJncGR)
##
## --- searching for the exon-exon junctions ---
## Chimeras to test against splice junctions: 2090
## SJ entries : 1325: 63.397 % of all
## SJ entries / single chr entries: 1325/2063 : 64.23 % of single chromosome chimeras
One can inspect detected junction types
table(gi$junction_type)
##
## 2arm 2arm_short self_ovl antisense_ovl
## 1941 142 7 0
and amount of reads which are not likely to come from true RNA duplexes, because their chimeric junctions are too similar to normal exon-exon junction
table(gi$splicejnc)
##
## 0 1
## 765 1325
For determining the duplex groups, we will leave only chimeric junctions in a read which do not self-overlap:
keep <- which((gi$junction_type == "2arm") & (gi$splicejnc == 0))
gi <- gi[keep]
The most basic way to find the duplex groups is to simply call
gi <- clusterDuplexGroups(gi)
This procedure will:
dg_id
field to the input object.
NA
values are used for reads which are not a part of any read duplex
group (DG)table(is.na(gi$dg_id))
##
## FALSE TRUE
## 166 450
To collapse reads to the DGs with the boundaries, containing every read of DG, one can call
gi_dgs <- collapse_duplex_groups(gi)
head(gi_dgs, 1)
## StrictGInteractions object with 1 interaction and 3 metadata columns:
## seqnames1 ranges1 seqnames2 ranges2 | n_reads
## <Rle> <IRanges> <Rle> <IRanges> | <numeric>
## [1] chr22 39314686-39314744 --- chr22 39314738-39314827 | 7
## dg_id score
## <integer> <numeric>
## [1] 1 1
## -------
## regions: 92 ranges and 0 metadata columns
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
# number of DGs
length(gi_dgs)
## [1] 46
The above procedure can be applied to any subset of data. If the user is interested in the RNA structure in a particular region, they can select all reads overlapping with this region and repeat the clustering several times. By adjusting the parameters (minimum overlap, maximum allowed distance between alignment arms, coordinates of coordinates) one can obtain more refined or larger duplex groups for improving subsequent prediction of RNA secondary structure.
For large datasets, it is possible to identify and collapse identical reads – those that map to the same coordinates with the same score. In overlap-based clustering, these reads would belong to the same duplex groups. Collapsing them into a single entry reduces the number of nodes and vertices in the clustering graph, resulting in better time and memory performance.
res_collapse <- collapseIdenticalReads(gi)
## Duplicated : 1.95% of initial
## Initial size : 616
## New size : 604
# returns new object
gi_unique <- res_collapse$gi_collapsed
This procedure adds duplex_id
column as new id and removes read_id
,
because single entry in a new object can correspond to multiple reads.
n_reads
records the number of reads in this temporary duplex group.
head(res_collapse$stats_df, 1)
## # A tibble: 1 × 3
## read_id duplex_id n_reads_collapsed
## <int> <int> <dbl>
## 1 3 1 1
Note that collapse_duplex_groups()
method has options to handle
temporary duplex groups, so that merging identical reads before clustering
would not lose information on the read support for duplex group.
# cluster gi_unqiue
gi_unique_dg <- collapse_duplex_groups(clusterDuplexGroups(gi_unique))
# check if the get the same amount of reads as in basic clustering
sum(gi_unique_dg$n_reads) == sum(gi_dgs$n_reads)
## [1] TRUE
There is a possibility for a user to re-define the rules by which overlaps between reads are weighted before community search.
This might be a customization related to protocols or the size of the dataset. Possible scenarios are weighting the clustering graph vertices with hybridization energies calculated directly on the reads instead of duplex groups, or employing other heuristic assumptions on how RNA duplexes are translated into chimeric read fragments. e.g in RIL-seq, the second chimeric arm is more likely to be a target RNA rather than sRNA.
To demonstrate such a modification, we will first compute the overlap graph manually, using the following procedure, second, modify the graph weights, and third, call the clustering.
First, we find pair overlaps in the input GInteractions
object
graph_df <- computeGISelfOverlaps(gi_unique)
head(graph_df)
## # A tibble: 6 × 9
## duplex_id.1 duplex_id.2 A_span B_span A_ovl B_ovl ratio.A ratio.B weight
## <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 2 173 34 51 23 28 0.676 0.549 1.23
## 2 2 402 35 70 20 19 0.571 0.271 0.843
## 3 2 437 38 51 25 43 0.658 0.843 1.50
## 4 2 481 32 55 18 46 0.562 0.836 1.40
## 5 2 501 48 79 19 27 0.396 0.342 0.738
## 6 12 310 69 44 42 26 0.609 0.591 1.20
The size of this dataframe is determined by the presence of pairwise overlap
between both arms of read alignments. The first two columns represent
the element index or duplex_id
of the corresponding reads. It can be
filtered or supplemented with other relevant metrics of the pairwise
relationship between the reads.
Then we change the rule for how we handle overlap between reads. In this example we will manually change the span-to-overlap ratio per pair or overlapping alignment arms. Read pairs that overlap for less than half of their length will not be treated as reliable to be considered part of the same DG.
graph_df <- graph_df[graph_df$ratio.A > 0.5 & graph_df$ratio.B > 0.5, ]
We will use the sum of the overlap ratios as the new weights and re-scale them to {0,1}.
rescale_vec <- function(x) {
return((x - min(x)) / (max(x) - min(x)))
}
graph_df$weight_new <- graph_df$ratio.A + graph_df$ratio.B
graph_df$weight_new <- rescale_vec(graph_df$weight_new)
head(graph_df, 2)
## # A tibble: 2 × 10
## duplex_id.1 duplex_id.2 A_span B_span A_ovl B_ovl ratio.A ratio.B weight
## <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 2 173 34 51 23 28 0.676 0.549 1.23
## 2 2 437 38 51 25 43 0.658 0.843 1.50
## # ℹ 1 more variable: weight_new <dbl>
This dataframe then can be passed to the clustering function:
gi_clust_adj <- clusterDuplexGroups(gi_unique, graphdf = graph_df, weight_column = "weight_new")
gi_clust_adj_dgs <- collapse_duplex_groups(gi_clust_adj)
We get smaller number of the DGs, as some of them are not assembled after restricting the minimum required overlap
length(gi_clust_adj)
## [1] 604
sessionInfo()
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] UpSetR_1.4.0 Gviz_1.51.0
## [3] DuplexDiscovereR_1.1.1 InteractionSet_1.35.0
## [5] SummarizedExperiment_1.37.0 Biobase_2.67.0
## [7] MatrixGenerics_1.19.1 matrixStats_1.5.0
## [9] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
## [11] IRanges_2.41.2 S4Vectors_0.45.2
## [13] BiocGenerics_0.53.3 generics_0.1.3
## [15] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 rstudioapi_0.17.1 jsonlite_1.8.9
## [4] magrittr_2.0.3 magick_2.8.5 GenomicFeatures_1.59.1
## [7] farver_2.1.2 rmarkdown_2.29 BiocIO_1.17.1
## [10] zlibbioc_1.53.0 vctrs_0.6.5 memoise_2.0.1
## [13] Rsamtools_2.23.1 RCurl_1.98-1.16 base64enc_0.1-3
## [16] tinytex_0.54 htmltools_0.5.8.1 S4Arrays_1.7.1
## [19] progress_1.2.3 curl_6.1.0 SparseArray_1.7.2
## [22] Formula_1.2-5 sass_0.4.9 bslib_0.8.0
## [25] htmlwidgets_1.6.4 plyr_1.8.9 httr2_1.0.7
## [28] cachem_1.1.0 GenomicAlignments_1.43.0 igraph_2.1.3
## [31] lifecycle_1.0.4 pkgconfig_2.0.3 Matrix_1.7-1
## [34] R6_2.5.1 fastmap_1.2.0 GenomeInfoDbData_1.2.13
## [37] digest_0.6.37 colorspace_2.1-1 AnnotationDbi_1.69.0
## [40] Hmisc_5.2-1 RSQLite_2.3.9 labeling_0.4.3
## [43] filelock_1.0.3 httr_1.4.7 abind_1.4-8
## [46] compiler_4.5.0 withr_3.0.2 bit64_4.5.2
## [49] htmlTable_2.4.3 backports_1.5.0 BiocParallel_1.41.0
## [52] DBI_1.2.3 biomaRt_2.63.0 rappdirs_0.3.3
## [55] DelayedArray_0.33.3 rjson_0.2.23 ggsci_3.2.0
## [58] tools_4.5.0 foreign_0.8-87 nnet_7.3-20
## [61] glue_1.8.0 restfulr_0.0.15 checkmate_2.3.2
## [64] cluster_2.1.8 gtable_0.3.6 BSgenome_1.75.0
## [67] tidyr_1.3.1 ensembldb_2.31.0 data.table_1.16.4
## [70] hms_1.1.3 xml2_1.3.6 XVector_0.47.2
## [73] pillar_1.10.1 stringr_1.5.1 dplyr_1.1.4
## [76] BiocFileCache_2.15.0 lattice_0.22-6 rtracklayer_1.67.0
## [79] bit_4.5.0.1 deldir_2.0-4 biovizBase_1.55.0
## [82] tidyselect_1.2.1 Biostrings_2.75.3 knitr_1.49
## [85] gridExtra_2.3 bookdown_0.42 ProtGenerics_1.39.1
## [88] xfun_0.50 stringi_1.8.4 UCSC.utils_1.3.0
## [91] lazyeval_0.2.2 yaml_2.3.10 evaluate_1.0.1
## [94] codetools_0.2-20 interp_1.1-6 tibble_3.2.1
## [97] BiocManager_1.30.25 cli_3.6.3 rpart_4.1.24
## [100] munsell_0.5.1 jquerylib_0.1.4 dichromat_2.0-0.1
## [103] Rcpp_1.0.13-1 dbplyr_2.5.0 png_0.1-8
## [106] XML_3.99-0.18 parallel_4.5.0 ggplot2_3.5.1
## [109] blob_1.2.4 prettyunits_1.2.0 latticeExtra_0.6-30
## [112] jpeg_0.1-10 AnnotationFilter_1.31.0 bitops_1.0-9
## [115] VariantAnnotation_1.53.1 scales_1.3.0 purrr_1.0.2
## [118] crayon_1.5.3 rlang_1.1.4 KEGGREST_1.47.0