vignettes/preciseTAD_tutorial.Rmd
preciseTAD_tutorial.Rmd
library(knitr)
# if (!requireNamespace("BiocManager", quietly=TRUE))
# install.packages("BiocManager")
# BiocManager::install("preciseTAD")
#library(preciseTAD)
#library(devtools)
#devtools::install_github("dozmorovlab/preciseTAD")
library(preciseTAD)
#install_github("stilianoudakis/preciseTADhub")
#library(preciseTADhub)
#library(ExperimentHub)
The advent of chromosome conformation capture (3C) sequencing technologies, and its successor Hi-C, have revealed a hierarchy of the 3-dimensional (3D) structure of the human genome such as chromatin loops (Rao et al. 2014), Topologically Associating Domains (TADs) (Nora et al. 2012; Dixon et al. 2012), and A/B compartments (Lieberman-Aiden et al. 2009). At the megabase scale, TADs represent regions on the linear genome that are highly self-interacting. Emerging evidence has linked TADs to critical roles in cell dynamics and cell differentiation. Studies have shown that TADs themselves are highly conserved across species and cell lines (Dixon et al. 2012, 2015; Krefting, Andrade-Navarro, and Ibn-Salem 2018; Nora et al. 2012; Rao et al. 2014; Schmitt et al. 2016). Disruption of boundaries demarcating TADs and loops promotes cancer (Hnisz et al. 2016; Taberlay et al. 2016) and other disorders (Lupiáñez, Spielmann, and Mundlos 2016; Sun et al. 2018; Meaburn et al. 2007). Therefore, identifying the precise location of TAD boundaries remains a top priority in our goal to fully understand the functionality of the human genome.
In this workshop, we present preciseTAD, an optimally tuned machine learning framework for precise identification of domain boundaries using genome annotation data. Our method utilizes the random forest (RF) algorithm trained on high-resolution transcription factor binding site (TFBS) data to predict low-resolution boundaries. We introduce a systematic pipeline for building the optimal boundary region prediction classifier. Translated from Hi-C data resolution level to base level (annotating each base and predicting its boundary probability), preciseTAD employs density-based clustering (DBSCAN) and partitioning around medoids (PAM) to detect genome annotation-guided boundary regions and points at a base-level resolution. This approach circumvents resolution restrictions of Hi-C data, allowing for the precise detection of biologically meaningful boundaries.
In this workshop we will demonstrate the following:
preciseTAD uses pre-defined TAD boundary coordinates as ground truth. We opt to use the Arrowhead algorithm to define our ground truth TAD boundaries, a gold-standard TAD-caller developed by Aiden Lab (Durand, Shamim, et al. 2016). The .hic files are publically available on GEO here, and are provided by the same team that developed Arrowhead (Rao et al. 2014). For this workshop we will be working with the “GSE63525_GM12878_insitu_primary.hic” file. This .hic file is a highly compressed binary file that has contact matrices from multiple biological replicate HiC experiments, at different resolutions, for GM12878. You can read the full protocol used to derive the various .hic files here.
Once you have the .hic file downloaded, you are ready to implement Arrowhead to call TAD boundaries. Below is an example of how you would perform this for every chromosome and for a variety of resolutions:
for i in 5000 10000 25000 50000 100000;
do
echo $i.kb
for j in {1..22};
do
echo chr$j
java -jar juicebox_tools_7.5.jar arrowhead -c $j \ #chromosome to call TADs on
-r $i \ #HiC data resolution
~/GSE63525_GM12878_insitu_primary.hic \ #location of the .HIC file
~/arrowhead_output/GM12878/$i.b/chr$j #location to store the output
done;
done;
Note: This creates separate files for each chromosome (given a specific resolution). As an example, here is what the first few lines of CHR1 look like, performed at 5kb:
chr1 | x1 | x2 | chr2 | y1 | y2 | color | score | uVarScore | lVarScore | upSign | loSign |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 49375000 | 50805000 | 1 | 49375000 | 50805000 | 255,255,0 | 0.8635365988485998 | 0.3025175910115416 | 0.2942087461534729 | 0.4431332556332556 | 0.6401515151515151 |
1 | 16830000 | 17230000 | 1 | 16830000 | 17230000 | 255,255,0 | 0.20513320424507786 | 0.3676005813255187 | 0.22358121923097696 | 0.4121951219512195 | 0.4792682926829268 |
1 | 163355000 | 164860000 | 1 | 163355000 | 164860000 | 255,255,0 | 0.8052794339967723 | 0.27908025641958345 | 0.3017420763866275 | 0.657032586290075 | 0.5470812683654226 |
1 | 231935000 | 233400000 | 1 | 231935000 | 233400000 | 255,255,0 | 0.43336944864061144 | 0.28052310457518204 | 0.27077299003747285 | 0.451432273589708 | 0.44314868804664725 |
1 | 149035000 | 149430000 | 1 | 149035000 | 149430000 | 255,255,0 | 0.8579084086781587 | 0.3477831683616397 | 0.18791010793617813 | 0.45375 | 0.805625 |
Explanations of each field are as follows:
For the purposes of this workshop, we are only interested in the first 3 columns, the chromosome, and the start and end coordinates of each called TAD. Since this is performed per chromosome, each chromosome specific file needs to be concatenated to obtain the full list of TADs on the entire genome. For convenience, the TADs called at 5kb resolution for GM12878 using Arrowhead are provided internally in the preciseTAD package.
The next piece of data that we need are the functional genomic elements used to construct the feature space. These are processed narrowPeak ChIP-seq data in the form of .BED files with chromosome, and start and end coordinates of called peaks of signal enrichment. For this workshop we consider 43 transcription factor binding sites.
Cell type-specific genomic annotation data can be downloaded from the ENCODE data portal as BED files. For convenience, we have included the list of elements in the \inst\extdata\genomicElements
folder of the github repository here.
The training/testing data used for modeling is represented as a matrix with rows being genomic regions, columns being genomic annotations, and cells containing measures of association between them. Users have the option to concatenate genomic regions from multiple chromosomes.
To create the row-wise dimension of the data, preciseTAD
uses shifted binning, a strategy for making the dimensions of the data matrix used for modeling by segmenting the linear genome into nonoverlapping regions. This step is transparent for the user. To create shifted bins, chromosome-specific bins start at half of the resolution r, and continue in congruent intervals of length r until the end of the chromosome (mod r + r/2), using hg19 genomic coordinates. The shifted genomic bins, are then defined as boundary regions (Y = 1) if they contain a called boundary, and non-boundary regions (Y = 0) otherwise, thus establishing the binary response vector (Y) used for classification (Figure 1). Intuitively, shifted bins are centered on borders between the original bins, thus capturing potential boundaries.
The column-wise dimension is formed by genomic annotations, such as transcription factor binding sites (TFBSs). In this workshop we focus on the (\(log_{2}\)) distances, which enumerate the genomic distance from the center of each genomic bin to the center of the nearest ChIP-seq peak region of interest, although other options are available (see ?preciseTAD::createTADdata). These quantaties form the feature space \(\textbf{X} = \{X_{1}, X_{2} \cdots X_{p} \}\) (Figure 2). The customized training/testing data formation offered by preciseTAD
allows users to implement any binary classification machine learning algorithm.
preciseTAD
provides functionality to implement a random forest (RF) model, allowing tuning hyperparameters and applying feature reduction. The primary inputs are the training and testing data, a list of hyperparameter values, the number of cross-validation folds to use (if a grid of hyperparameter values is provided), and the metric used for optimization. The output includes the model object (necessary for downstream prediction of boundaries), the variable importance values, and a list of performance metrics when validating the model on the testing data. This model is then used to predict base-level precise boundary locations.
To predict the base-level location of domain boundaries, the preciseTAD
model is applied for each base annotated with the aforementioned genomic annotations. The base-level probabilities are clustered using density-based clustering and scalable data partitioning techniques, to narrow boundary regions and points. First, the probability vector, \(p_{n_{i}}\), is extracted (\(n_{i}\) representing the length of chromosome \(i\)). Next, DBSCAN (Density-based Spatial Clustering of Applications with Noise) (Ester et al. 1996; Hahsler, Piekenbrock, and Doran 2019) is applied to the matrix of pairwise genomic distances between bases with \(p_{n_{i}} \ge t\), where \(t\) is a threshold determined by the user. The resulting clusters of highly predictive bases identified by DBSCAN are termed preciseTAD boundary regions (PTBR). To precisely identify a single base among each PTBR, preciseTAD implements partitioning around medoids (PAM) within each cluster. The corresponding cluster medoid is defined as a preciseTAD boundary point (PTBP), making it the most representative and biologically meaningful base within each clustered PTBR. The output includes a list of genomic coordinates of PTBPs and PTBSs.
preciseTAD
allows us to use the pre-trained model to predict the precise location of domain boundaries in cell types without Hi-C data but with genome annotation data. Specifically, only cell-line-specific ChIP-seq data (BED format) for CTCF, RAD21, SMC3, and ZNF143 transcription factor binding sites is required. We provide models pre-trained using GM12878 and K562 genome annotation data, and Arrowhead- and Peakachu-predicted boundaries for chromosome-specific prediction of precise domain boundaries in other cell lines.
As an example of the functionality of preciseTAD, lets consider the goal of predicting TAD boundary coordinates for CHR14 on GM12878 at 5kb. First, we will build an optimized model to predict TAD boundary regions, then we will leverage said model to predict boundaries on CHR14 at base-level resolution. We will consider the heuristic of training a model on chromosomes 1-13 & 15-22, while reserving data from CHR14 as the test data.
Scripts used to generate the figures in the following sections can be found here.
Users will need to transform the TAD coordinate data into a GRanges object of unique boundary coordinates using the preciseTAD::extractBoundaries()
function. Here we extract boundaries for CHR1-CHR22 because all chromosomes will be in use (either in the training set or the testing set). We specify preprocess=FALSE
because we are only interested in boundaries and not filtering TADs by length. Lastly, we specify resolution=5000
to match the resolution used by Arrowhead (although this argument is ignored given that preprocess=FALSE
). As shown below, there were a total of 16153 unique TAD boundaries reported by Arrowhead for chromosomes 1-22.
bounds <- extractBoundaries(domains.mat = arrowhead_gm12878_5kb,
preprocess = FALSE,
CHR = paste0("CHR",1:22),
resolution = 5000)
# View unique boundaries
bounds
#> GRanges object with 16153 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr1 815000 *
#> [2] chr1 890000 *
#> [3] chr1 915000 *
#> [4] chr1 1005000 *
#> [5] chr1 1015000 *
#> ... ... ... ...
#> [16149] chr22 50815000 *
#> [16150] chr22 50935000 *
#> [16151] chr22 50945000 *
#> [16152] chr22 51060000 *
#> [16153] chr22 51235000 *
#> -------
#> seqinfo: 22 sequences from an unspecified genome; no seqlengths
Additionally, users will need to add each of the respective .BED files,representing the functional genomic elements of choice, to a single GenomicRangesList
object. This can be done via the preciseTAD::bedToGRangesList()
function. The signal
argument refers to the column in the BED files containing peak signal strength values and is used to assign metadata to the corresponding GRanges (this needs to be the same column for every .BED file; only necessary for downstream plotting).
path <- "../inst/extdata/genomicElements"
path
#> [1] "../inst/extdata/genomicElements"
tfbsList <- bedToGRangesList(filepath=path, bedList=NULL, bedNames=NULL, pattern = "*.bed", signal=4)
names(tfbsList)
#> [1] "Gm12878-Bcl3-Haib" "Gm12878-Bclaf101388-Haib"
#> [3] "Gm12878-Bhlh-Sydh" "Gm12878-Cebpbsc150-Haib"
#> [5] "Gm12878-Chd1a301218a-Sydh" "Gm12878-Chd2ab68301-Sydh"
#> [7] "Gm12878-Cmyc-Uta" "Gm12878-Ctcf-Broad"
#> [9] "Gm12878-E2f4-Sydh" "Gm12878-Egr1-Haib"
#> [11] "Gm12878-Elf1sc631-Haib" "Gm12878-Elk112771-Sydh"
#> [13] "Gm12878-Ets1-Haib" "Gm12878-Ezh239875-Broad"
#> [15] "Gm12878-Gabp-Haib" "Gm12878-Jund-Sydh"
#> [17] "Gm12878-Max-Sydh" "Gm12878-Mazab85725-Sydh"
#> [19] "Gm12878-Mef2a-Haib" "Gm12878-Mxi1-Sydh"
#> [21] "Gm12878-Nfyb-Sydh" "Gm12878-Nrf1-Sydh"
#> [23] "Gm12878-Nrsf-Haib" "Gm12878-P300-Sydh"
#> [25] "Gm12878-Pmlsc71910-Haib" "Gm12878-Pol2-Haib"
#> [27] "Gm12878-Pol24h8-Haib" "Gm12878-Pol2s2-Sydh"
#> [29] "Gm12878-Pu1-Haib" "Gm12878-Rad21-Haib"
#> [31] "Gm12878-Rfx5-Sydh" "Gm12878-Sin3a-Sydh"
#> [33] "Gm12878-Six5-Haib" "Gm12878-Smc3-Sydh"
#> [35] "Gm12878-Sp1-Haib" "Gm12878-Srf-Haib"
#> [37] "Gm12878-Stat5asc74442-Haib" "Gm12878-Taf1-Haib"
#> [39] "Gm12878-Tblr1ab24550-Sydh" "Gm12878-Tbp-Sydh"
#> [41] "Gm12878-Usf1-Haib" "Gm12878-Usf2-Sydh"
#> [43] "Gm12878-Znf143-Sydh"
Using the “ground-truth” boundaries and the annotations, we can build the data matrix that will be used for predictive modeling. The preciseTAD::createTADdata()
function can create the training and testing data simultaneously as a list object. Here, we specify to train on chromosomes 1-22 and test on chromosome 14. Additionally, we specify resolution = 5000
to construct 5kb shifted genomic bins (to match the Hi-C data resolution), featureType = "distance"
for a \(log_2(distance + 1)\)-type feature space, and resampling = "rus"
to apply random under-sampling (RUS) on the training data to balance classes of boundary vs. non-boundary regions. We also specify a seed to ensure reproducibility when performing the resampling. The result is a list containing two data frames: (1) the resampled (if specified) training data, and (2) the testing data.
set.seed(123)
tadData <- createTADdata(bounds.GR = bounds,
resolution = 5000,
genomicElements.GR = tfbsList,
featureType = "distance",
resampling = "rus",
trainCHR = paste0("CHR",c(1:8,10:13,15:22)),
predictCHR = "CHR14")
# View subset of training data
tadData[[1]][1:5,1:4]
#> y Gm12878-Bcl3-Haib Gm12878-Bclaf101388-Haib Gm12878-Bhlh-Sydh
#> 5799 No 14.25414 14.24822 14.24652
#> 32930 No 11.80090 14.88312 14.09119
#> 5767 Yes 11.10656 10.58871 10.99081
#> 8504 Yes 16.48621 18.09989 17.95111
#> 39486 No 13.57884 18.67059 13.57837
# Check it is balanced
table(tadData[[1]]$y)
#>
#> No Yes
#> 14913 14913
# View subset of testing data
tadData[[2]][1:5,1:4]
#> y Gm12878-Bcl3-Haib Gm12878-Bclaf101388-Haib Gm12878-Bhlh-Sydh
#> 1 No 24.30718 24.30793 24.30793
#> 2 No 24.30683 24.30759 24.30758
#> 3 No 24.30649 24.30724 24.30723
#> 4 No 24.30614 24.30689 24.30688
#> 5 No 24.30579 24.30654 24.30654
We can now implement our machine learning algorithm of choice to predict boundary regions. Here, we opt for the random forest algorithm for binary classification. preciseTAD
offers functionality for performing recursive feature elimination (RFE) as a form of feature reduction through the use of the preciseTAD::TADrfe()
function, which is a wrapper for the rfe
function in the caret
package (Kuhn 2012). preciseTAD::TADrfe()
implements a random forest model on the best subset of features from 2 to the maximum number of features in the data by powers of 2, using 5-fold cross-validation. We specify accuracy as the performance metric.
set.seed(1)
rfe_res <- TADrfe(trainData = tadData[[1]],
tuneParams = list(ntree = 500, nodesize = 1),
cvFolds = 5,
cvMetric = "Accuracy",
verbose = TRUE)
Results from RFE indicate that performance levels off when only considering the top 8 most predictive transcription factors (Figure 3A). We can further optimize our model by evaluating the predictive importance of these top 8 genomic elements across each cross-fold via a heatmap. After aggregating the variable importance values across each cross-fold, we see that SMC3, RAD21, CTCF, and ZNF143 stand out as the most predictive elements TAD-boundary regions (Figure 3B). These are known components of the loop-extrusion model that has been proposed as a mechanism for the 3D architecture of the human genome (Sanborn et al. 2015; Fudenberg et al. 2016; Hansen et al. 2018).
Now that we have suitably reduced our feature space, we can implement a random forest algorithm built simply on the TFBS mentioned above (SMC3, RAD21, CTCF, and ZNF143). We can take advantage of the preciseTAD::TADrandomForest()
function, which is a wrapper of the randomForest
package (Breiman 2001; Liaw, Wiener, and others 2002). We specify the training and testing data, the hyperparameter values, the number of cross-validation folds, the performance metric to consider (here, accuracy), the seed to initialize for reproducibility, an indicator for retaining the model object, an indicator for retaining variable importances, the variable importance measure to consider (here, mean decrease in accuracy (MDA)), and an indicator for retaining model performances based on the test data. The function returns a list containing: 1) a trained model from caret with model information, 2) a data.frame
of variable importance for each feature included in the model, and 3) a data.frame
of various model performance metrics. Model performances are shown in Table 1.
# Restrict the data matrix to include only SMC3, RAD21, CTCF, and ZNF143
tadDataFiltTrain <- tadData[[1]][,c(1, grep(paste0(c("Ctcf","Rad21","Smc3","Znf143"), collapse = "|"), names(tadData[[1]])))]
tadDataFiltTest <- tadData[[2]][,c(1, grep(paste0(c("Ctcf","Rad21","Smc3","Znf143"), collapse = "|"), names(tadData[[1]])))]
# Run RF
set.seed(123)
tadModel <- TADrandomForest(trainData = tadDataFiltTrain,
testData = tadDataFiltTest,
tuneParams = list(mtry = c(2,3),
ntree = 500,
nodesize = 1),
cvFolds = 3,
cvMetric = "Accuracy",
verbose = TRUE,
model = TRUE,
importances = TRUE,
impMeasure = "MDA",
performances = TRUE)
#> + Fold1: mtry=2, ntree=500, nodesize=1
#> - Fold1: mtry=2, ntree=500, nodesize=1
#> + Fold1: mtry=3, ntree=500, nodesize=1
#> - Fold1: mtry=3, ntree=500, nodesize=1
#> + Fold2: mtry=2, ntree=500, nodesize=1
#> - Fold2: mtry=2, ntree=500, nodesize=1
#> + Fold2: mtry=3, ntree=500, nodesize=1
#> - Fold2: mtry=3, ntree=500, nodesize=1
#> + Fold3: mtry=2, ntree=500, nodesize=1
#> - Fold3: mtry=2, ntree=500, nodesize=1
#> + Fold3: mtry=3, ntree=500, nodesize=1
#> - Fold3: mtry=3, ntree=500, nodesize=1
#> Aggregating results
#> Selecting tuning parameters
#> Fitting mtry = 2, ntree = 500, nodesize = 1 on full training set
# View model performances
performances <- tadModel[[3]]
performances$Performance <- round(performances$Performance, digits = 2)
rownames(performances) <- NULL #performances$Metric
kable(performances, caption = "Table 1. List of model performances.")
Metric | Performance |
---|---|
TN | 14835.00 |
FN | 132.00 |
FP | 5479.00 |
TP | 423.00 |
Total | 20869.00 |
Sensitivity | 0.76 |
Specificity | 0.73 |
Kappa | 0.09 |
Accuracy | 0.73 |
BalancedAccuracy | 0.75 |
Precision | 0.07 |
FPR | 0.27 |
FNR | 0.01 |
NPV | 0.99 |
MCC | 0.18 |
F1 | 0.13 |
AUC | 0.82 |
Youden | 0.49 |
AUPRC | 0.09 |
preciseTADhub
(must be using R verion >= 4.1 or R-devel)For convenience, we have provided an ExperimentData package to supplement preciseTAD
, called preciseTADhub
. preciseTADhub
offers users access to pre-trained random forest classification models built on CTCF, RAD21, SMC3, and ZNF143. Each of the 84 (2 cell lines \(\times\) 2 ground truth boundaries \(\times\) 21 autosomal chromosomes) files (.RDS) are stored as lists containing two objects: 1) a train object from with RF model information, and 2) a data.frame of variable importance for each genomic annotation included in the model. The file names are structured as follows:
\[i\_j\_k\_l.rds\]
where \(i\) denotes the chromosome that was used as a holdout {CHR1, CHR2, …, CHR21, CHR22} (i.e. for testing; meaning all other chromosomes were used for training), \(j\) denotes the cell line {GM12878, K562}, \(k\) denotes the resolution (size of genomic bins) {5kb, 10kb}, and \(l\) denotes the TAD/loop caller used to define ground truth {Arrowhead, Peakachu}.
Table 2 shows the file names and the corresponding ExperimentHub (EH) IDs. Since we want to make TAD boundary predictions on CHR14 for GM12878, we opt to read in the “CHR14_GM12878_5kb_Arrowhead.rds” file. This corresponds to the EH3863 EHID.
FileName | EHID |
---|---|
CHR1_GM12878_5kb_Arrowhead.rds | EH3815 |
CHR1_GM12878_10kb_Peakachu.rds | EH3816 |
CHR1_K562_5kb_Arrowhead.rds | EH3817 |
CHR1_K562_10kb_Peakachu.rds | EH3818 |
CHR2_GM12878_5kb_Arrowhead.rds | EH3819 |
CHR2_GM12878_10kb_Peakachu.rds | EH3820 |
CHR2_K562_5kb_Arrowhead.rds | EH3821 |
CHR2_K562_10kb_Peakachu.rds | EH3822 |
CHR3_GM12878_5kb_Arrowhead.rds | EH3823 |
CHR3_GM12878_10kb_Peakachu.rds | EH3824 |
CHR3_K562_5kb_Arrowhead.rds | EH3825 |
CHR3_K562_10kb_Peakachu.rds | EH3826 |
CHR4_GM12878_5kb_Arrowhead.rds | EH3827 |
CHR4_GM12878_10kb_Peakachu.rds | EH3828 |
CHR4_K562_5kb_Arrowhead.rds | EH3829 |
CHR4_K562_10kb_Peakachu.rds | EH3830 |
CHR5_GM12878_5kb_Arrowhead.rds | EH3831 |
CHR5_GM12878_10kb_Peakachu.rds | EH3832 |
CHR5_K562_5kb_Arrowhead.rds | EH3833 |
CHR5_K562_10kb_Peakachu.rds | EH3834 |
CHR6_GM12878_5kb_Arrowhead.rds | EH3835 |
CHR6_GM12878_10kb_Peakachu.rds | EH3836 |
CHR6_K562_5kb_Arrowhead.rds | EH3837 |
CHR6_K562_10kb_Peakachu.rds | EH3838 |
CHR7_GM12878_5kb_Arrowhead.rds | EH3839 |
CHR7_GM12878_10kb_Peakachu.rds | EH3840 |
CHR7_K562_5kb_Arrowhead.rds | EH3841 |
CHR7_K562_10kb_Peakachu.rds | EH3842 |
CHR8_GM12878_5kb_Arrowhead.rds | EH3843 |
CHR8_GM12878_10kb_Peakachu.rds | EH3844 |
CHR8_K562_5kb_Arrowhead.rds | EH3845 |
CHR8_K562_10kb_Peakachu.rds | EH3846 |
CHR10_GM12878_5kb_Arrowhead.rds | EH3847 |
CHR10_GM12878_10kb_Peakachu.rds | EH3848 |
CHR10_K562_5kb_Arrowhead.rds | EH3849 |
CHR10_K562_10kb_Peakachu.rds | EH3850 |
CHR11_GM12878_5kb_Arrowhead.rds | EH3851 |
CHR11_GM12878_10kb_Peakachu.rds | EH3852 |
CHR11_K562_5kb_Arrowhead.rds | EH3853 |
CHR11_K562_10kb_Peakachu.rds | EH3854 |
CHR12_GM12878_5kb_Arrowhead.rds | EH3855 |
CHR12_GM12878_10kb_Peakachu.rds | EH3856 |
CHR12_K562_5kb_Arrowhead.rds | EH3857 |
CHR12_K562_10kb_Peakachu.rds | EH3858 |
CHR13_GM12878_5kb_Arrowhead.rds | EH3859 |
CHR13_GM12878_10kb_Peakachu.rds | EH3860 |
CHR13_K562_5kb_Arrowhead.rds | EH3861 |
CHR13_K562_10kb_Peakachu.rds | EH3862 |
CHR14_GM12878_5kb_Arrowhead.rds | EH3863 |
CHR14_GM12878_10kb_Peakachu.rds | EH3864 |
CHR14_K562_5kb_Arrowhead.rds | EH3865 |
CHR14_K562_10kb_Peakachu.rds | EH3866 |
CHR15_GM12878_5kb_Arrowhead.rds | EH3867 |
CHR15_GM12878_10kb_Peakachu.rds | EH3868 |
CHR15_K562_5kb_Arrowhead.rds | EH3869 |
CHR15_K562_10kb_Peakachu.rds | EH3870 |
CHR16_GM12878_5kb_Arrowhead.rds | EH3871 |
CHR16_GM12878_10kb_Peakachu.rds | EH3872 |
CHR16_K562_5kb_Arrowhead.rds | EH3873 |
CHR16_K562_10kb_Peakachu.rds | EH3874 |
CHR17_GM12878_5kb_Arrowhead.rds | EH3875 |
CHR17_GM12878_10kb_Peakachu.rds | EH3876 |
CHR17_K562_5kb_Arrowhead.rds | EH3877 |
CHR17_K562_10kb_Peakachu.rds | EH3878 |
CHR18_GM12878_5kb_Arrowhead.rds | EH3879 |
CHR18_GM12878_10kb_Peakachu.rds | EH3880 |
CHR18_K562_5kb_Arrowhead.rds | EH3881 |
CHR18_K562_10kb_Peakachu.rds | EH3882 |
CHR19_GM12878_5kb_Arrowhead.rds | EH3883 |
CHR19_GM12878_10kb_Peakachu.rds | EH3884 |
CHR19_K562_5kb_Arrowhead.rds | EH3885 |
CHR19_K562_10kb_Peakachu.rds | EH3886 |
CHR20_GM12878_5kb_Arrowhead.rds | EH3887 |
CHR20_GM12878_10kb_Peakachu.rds | EH3888 |
CHR20_K562_5kb_Arrowhead.rds | EH3889 |
CHR20_K562_10kb_Peakachu.rds | EH3890 |
CHR21_GM12878_5kb_Arrowhead.rds | EH3891 |
CHR21_GM12878_10kb_Peakachu.rds | EH3892 |
CHR21_K562_5kb_Arrowhead.rds | EH3893 |
CHR21_K562_10kb_Peakachu.rds | EH3894 |
CHR22_GM12878_5kb_Arrowhead.rds | EH3895 |
CHR22_GM12878_10kb_Peakachu.rds | EH3896 |
CHR22_K562_5kb_Arrowhead.rds | EH3897 |
CHR22_K562_10kb_Peakachu.rds | EH3898 |
Table 2: File names and corresponding ExperimentHub (EH) IDs for all .RDS files stored in preciseTADhub
.
#Initialize ExperimentHub
hub <- ExperimentHub()
query(hub, "preciseTADhub")
myfiles <- query(hub, "preciseTADhub")
id <- "EH3863"
tadModel <- myfiles[[id]]
Recall that our model classifies boundary regions, in that each prediction refers to a genomic bin of width 5000 bases. To predict boundary coordinates at the base resolution more precisely, we can leverage our model through the use of the preciseTAD::preciseTAD()
function. Conceptually, instead of genomic bins, we annotate each base with the selected genomic annotations and featureType. We then apply our model on this annotation matrix to predict the probability of each base as a boundary given the associated genomic annotations. See the preciseTAD paper for a concise description of the algorithm and the pseudocode.
Suppose we want to precisely predict the domain boundary coordinates for the 1.3Mb section of CHR14:50,070,000-50,800,000. To do so, we specify chromCoords = list(50070000,50800000)
. Additionally, we set a threshold of 1.0 for constructing PTBRs using threshold = 1.0
. For DBSCAN, we assign 10000 and 3 for the \(\epsilon\) and MinPts parameters, respectively. The results is a list containing 3 elements including: 1) the genomic coordinates spanning each preciseTAD predicted region (PTBR), 2) the genomic coordinates of preciseTAD predicted boundaries points (PTBP), and 3) a named list including summary statistics of the following: PTBRWidth - PTBR width, PTBRCoverage - the proportion of bases within a PTBR with probabilities that equal to or exceed the threshold (t=1 by default), DistanceBetweenPTBR - the genomic distance between the end of the previous PTBR and the start of the subsequent PTBR, NumSubRegions - the number of the subregions in each PTBR cluster, SubRegionWidth - the width of the subregion forming each PTBR, DistBetweenSubRegions - the genomic distance between the end of the previous PTBR-specific subregion and the start of the subsequent PTBR-specific subregion, and the normalized enrichment of the genomic annotations used in the model around flanked PTBPs (Figure 4).
# Restrict to include only SMC3, RAD21, CTCF, and ZNF143
tfbsList_filt <- tfbsList[names(tfbsList) %in% c("Gm12878-Ctcf-Broad",
"Gm12878-Rad21-Haib",
"Gm12878-Smc3-Sydh",
"Gm12878-Znf143-Sydh")]
# Run preciseTAD
pt <- preciseTAD(genomicElements.GR = tfbsList_filt,
featureType = "distance",
CHR = "CHR14",
chromCoords = list(50070000,51360000),
tadModel = tadModel[[1]],
threshold = 1.0,
verbose = TRUE,
parallel = NULL,
DBSCAN_params = list(10000, 3),
flank = 5000)
#> [1] "Establishing bp resolution test data using a distance type feature space"
#> [1] "Establishing probability vector"
#> [1] "preciseTAD identified a total of 2918 base pairs whose predictive probability was equal to or exceeded a threshold of 1"
#> [1] "Initializing DBSCAN"
#> [1] "preciseTAD identified 8 PTBRs"
#> [1] "Establishing PTBPs"
#> [1] "Cluster 1 out of 8"
#> [1] "Cluster 2 out of 8"
#> [1] "Cluster 3 out of 8"
#> [1] "Cluster 4 out of 8"
#> [1] "Cluster 5 out of 8"
#> [1] "Cluster 6 out of 8"
#> [1] "Cluster 7 out of 8"
#> [1] "Cluster 8 out of 8"
# View the results
pt
#> $PTBR
#> GRanges object with 8 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr14 50101047-50101136 *
#> [2] chr14 50312215-50334062 *
#> [3] chr14 50563705-50583007 *
#> [4] chr14 50784355-50796085 *
#> [5] chr14 50997740-51001756 *
#> [6] chr14 51059855-51093997 *
#> [7] chr14 51231563-51247722 *
#> [8] chr14 51319223-51334404 *
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
#>
#> $PTBP
#> GRanges object with 8 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr14 50101086 *
#> [2] chr14 50328921 *
#> [3] chr14 50571870 *
#> [4] chr14 50787930 *
#> [5] chr14 50999455 *
#> [6] chr14 51066958 *
#> [7] chr14 51241489 *
#> [8] chr14 51327392 *
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
#>
#> $Summaries
#> $Summaries$PTBRWidth
#> min max median iqr mean sd
#> 1 90 34143 15671 10136.75 15309.25 10597.17
#>
#> $Summaries$PTBRCoverage
#> min max median iqr mean sd
#> 1 0.01098322 0.8666667 0.0278572 0.0212158 0.1339193 0.2964553
#>
#> $Summaries$DistanceBetweenPTBR
#> min max median iqr mean sd
#> 1 58099 229643 201348 101833.5 158698.7 70251.13
#>
#> $Summaries$NumSubRegions
#> min max median iqr mean sd
#> 1 2 50 32 8.75 30.125 13.98405
#>
#> $Summaries$SubRegionWidth
#> min max median iqr mean sd
#> 1 1 193 4 7 12.11203 30.7296
#>
#> $Summaries$DistBetweenSubRegions
#> min max median iqr mean sd
#> 1 2 7529 40 389 514.1116 1149.644
#>
#> $Summaries$NormilizedEnrichment
#> Gm12878-Ctcf-Broad Gm12878-Rad21-Haib Gm12878-Smc3-Sydh Gm12878-Znf143-Sydh
#> 1.25 1.25 1.25 1.25
Juicebox is an interface provided by Aiden Lab that allows for superimposing boundary coordinates onto Hi-C contact maps. You can download a desktop version of the application here, or use the online version here. To visualize domains demarcated by the predicted boundaries, you must first select a Hi-C map. Using the desktop version, you can import the contact matrix for GM12878 derived by Rao et al. 2014 by choosing File -> Open... -> Rao and Huntley et al. -> GM12878 -> in situ Mbol -> primary
. To format preciseTAD
results to use in Juicebox, users can take advantage of preciseTAD::juicer_func()
, which transforms a GRanges object into a data frame as shown below.
# Transform
pt_juice <- juicer_func(pt$PTBP)
You will then need to save the PTBPs as a BED file using write.table
as shown below. Once saved, import them into Juicebox using Show Annotation Panel -> 2D Annotations -> Add Local -> pt_juice.bed
. An example of using Juicebox to plot 3D domain boundary coordinates is shown in Figure 5.
filepath = "~/path/to/store/ptbps"
write.table(pt_juice,
file.path(filepath, "pt_juice.bed"),
quote = FALSE,
col.names = TRUE,
row.names = FALSE,
sep = "\t")
The biological significance of TAD boundaries can be assessed by evaluating their association with the signal of known molecular drivers of 3D chromatin, namely CTCF, RAD21, SMC3, and ZNF143. Two popular visualization techniques include enriched heatmaps and profile plots, both of which can be accomplished with deepTools (Ramirez et al. 2016). Briefly, for an enriched heatmap, a matrix \(M_{r \times c}\), is created where the rows, \(r\), are given by the number of boundaries, either called or predicted. The column dimension, \(c\), is created by flanking each boundary by a given amount (5 kb). The flanking is then broken up into 50 bp segments, for 100 windows on both sides of a given boundary, a total of 200 columns. The cells of the matrix are calculated as a mean coverage value for each window with respect to the signal from the respective ChIP-seq annotation via bigwig files. For a profile plot, the matrix is then summarized by row-wise averages and plotted as a density curve, where the center represents the boundary, and the curve represents the average ChIP-seq peak signal around a flanked region. You can read more about each technique here.
Furthermore, we can assess the conservation of TAD boundaries by visualizing the amount of overlap between flanked boundaries across cell lines. Overlaps can be quantified using the Jaccard index defined as
\[J_{(A,B)} = \dfrac{A \cap B}{A \cup B}\]
where A and B represent genomic regions created by flanked called and predicted boundaries. That is, between cell lines, the number of overlapping flanked boundaries divided by the total number of flanked boundaries. Wilcoxon Rank-Sum tests can then be used to compare chromosome-specific Jaccard indices across cell lines.
Below is an example script of how to create an enriched heatmap and profile plot for a set of boundaries using deepTools (version 2.0). Users will first need to create the matrix of signal values, described above, using the computeMatrix tool. To do so, you need to save the resulting boundaries as a .BED file (without a header), where the first column is the chromosome, second column is the boundary coordinate, and third column is (boundary coordinate + 1). The TSS option is used such that the 2nd column of the .BED file is chosen as the center (i.e. the boundary coordinate). Users will next need to download bigwig files for each cell-line specific annotation of interest, which can be done through ENCODE here. The annotation (.bigwig) along with the boundaries (.BED) are provided next in the python code. We next specify by how much to flank around the boundary. Here, we opt for 5 kb (5000 bases) on either side. Additionally, we specify to divide the flanked region into 50 base segments (i.e. (5000+5000)/50=1000/50=200 segments). Lastly, we provide the name of the resulting matrix (computeMatrix_output.gz).
# Create the matrix of signal using computeMatrix
computeMatrix --referencePoint TSS \ #uses the start coordinate (i.e. the boundary)
-S annotation.bigWig \ #cell-line specific annotation of interest
-R boundaries.bed \ #TAD boundaries (either predicted or called)
-a 5000 -b 5000 \ #size of flank on left side (-a) and right side (-b) in bases
--binSize 50 \ #number of bases to divide the flanked region by
-o computeMatrix_output.gz #name of the output
Users can then use the computeMatrix_output.gz to create both the enriched heatmap and profile plot as shown below.
# Creating Enriched Heatmap
plotHeatmap -m computeMatrix_output.gz \ #name of the output of computeMatrix
-out EnrichedHeatmap.png \ #name of heatmap image
--whatToShow "heatmap and colorbar" \ #what to include in the image
--regionsLabel " " \ #row labels
--xAxisLabel " " \ #x-axis label
--refPointLabel "Boundary" \ #name of the center point
--samplesLabel "TAD-caller" \ #column label
--heatmapHeight 20 \ #height dimension in cm
--heatmapWidth 5 \ #width dimension in cm
--dpi 300 \ #dpi of saved image
# Creating profile plot
plotProfile -m computeMatrix_output.gz \ #name of the output of computeMatrix
-out ProfilePlot.png \ #name of profile plot image
--colors blue \ #color of signal curve
--regionsLabel " " \ #row labels
--samplesLabel "TAD-caller" \ #column label
--refPointLabel "Boundary" \ #name of the center point
--plotHeight 7 \ #height dimension in cm
--plotWidth 7 \ #width dimension in cm
--dpi 300 \ #dpi of saved image
We extend Section 6.4 by using preciseTAD to predict TAD boundaries on all of chromosome 14 for both GM12878 and K562 cell lines. This can be done by simply replacing the chromCoords
parameter with NULL
. We have provided the boundaries predicted by preciseTAD here and the Arrowhead called boundaries here. When trained using Arrowhead ground truth boundaries at 5 kb, preciseTAD predicted a total of 440 and 243 TAD boundaries on CHR14 for GM12878 and K562 respectively (Table 3). For comparison, Arrowhead identified a total of 555 and 240 TAD boundaries on CHR14 for GM12878 and K562 respectively (Table 3). We see that even though preciseTAD predicted less boundaries than Arrowhead, CTCF signal is much more colocalized around preciseTAD-predicted TAD boundaries compared to Arrowhead-called TAD boundaries (Figure 6A). This can be attributed to Arrowhead’s inflation of called TADs at 5 kb. Additionally, preciseTAD boundaries are statistically significantly closer to CTCF sites (Figure 6B). These results indicate that preciseTAD-predicted boundaries better reflected the known biology of boundary formation.
Arrowhead | preciseTAD | |
---|---|---|
GM12878 | 555 | 440 |
K562 | 240 | 243 |
Table 3. TAD boundaries either called (Arrowhead) or predicted (preciseTAD) on CHR14.
Previous studies suggest that TAD boundaries are conserved across cell lines (Dixon et al. 2012; Krefting, Andrade-Navarro, and Ibn-Salem 2018; Nora et al. 2012; Sexton et al. 2012). To assess the level of cross-cell-line conservation, we can evaluate the overlap between cell line-specific boundaries detected by preciseTAD and Arrowhead. To do so, preciseTAD-predicted and Arrowhead-called TAD boundaries on CHR14 were first flanked on both sides by 5 kb. Only 21% of boundaries were conserved between cell lines for Arrowhead (J=0.16) (Figure 7A), while 37% of preciseTAD-predicted boundaries were conserved between GM12878 and K562 cell lines (J=0.32) (Figure 7B). The higher level of conservation for preciseTAD-predicted boundaries further supports the notion of their higher biological relevance.
The preciseTAD algorithm can be extended to predict boundaries on cell lines that do not currently have high resolution HiC data using publicly available cell-line specific annotations. We highlight this technique by evaluating the following scenario (for CHR14): training on GM12878 and predicting boundaries on GM12878 (GM on GM) vs. training on K562 and predicting on GM12878 (K on GM). Using the preciseTAD framework, 74% (J=0.68) of predicted boundaries overlapped in the cross-cell-line prediction scenario (Figure 8A). Furthermore, boundaries predicted on unseen annotation data exhibit a similar level of enrichment for CTCF as did those trained and predicted on the same cell line (Figure 7B). These results indicate that preciseTAD pre-trained models can be successfully used to predict domain boundaries for cell lines lacking Hi-C data but for which genome annotation data is available.
In this workshop we demonstrate the usefulness of preciseTAD, an R package that provides users with a data-driven approach toward TAD boundary prediction. preciseTAD leverages a random forest (RF) classification model built on low-resolution domain boundaries obtained from domain calling tools, and high-resolution genomic annotations as the feature space. preciseTAD predicts the probability of each base being a boundary, and identifies the precise location of boundary regions and the most likely boundary points among them. We benchmarked preciseTAD against Arrowhead (Durand, Robinson, et al. 2016), an established TAD-caller. We showed that, unlike Arrowhead, preciseTAD does not inflate the number of predicted boundaries, providing only the most biologically meaningful boundaries that demarcate regions of high inter-chromosomal interactions. preciseTAD boundaries were more enriched for known architectural transcription factors, such as CTCF. Likewise, preciseTAD boundaries were more conserved between GM12878 and K562 cell lines, a known feature among the 3D architecture of the human genome. Lastly, we show that cell-line-specific functional genomic annotation data (ChIP-seq) can be used to precisely predict domain boundaries across cell lines. Pre-trained models are publicly available on Bioconductor in the ExperimentHub package preciseTADhub, here. This creates new opportunities for users to predict domain boundaries on cell lines without relying on high-resolution Hi-C data for which there is none available.
Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32.
Dixon, Jesse R, Inkyung Jung, Siddarth Selvaraj, Yin Shen, Jessica E Antosiewicz-Bourget, Ah Young Lee, Zhen Ye, et al. 2015. “Chromatin Architecture Reorganization During Stem Cell Differentiation.” Nature 518 (7539): 331.
Dixon, Jesse R, Siddarth Selvaraj, Feng Yue, Audrey Kim, Yan Li, Yin Shen, Ming Hu, Jun S Liu, and Bing Ren. 2012. “Topological Domains in Mammalian Genomes Identified by Analysis of Chromatin Interactions.” Nature 485 (7398): 376.
Durand, Neva C, James T Robinson, Muhammad S Shamim, Ido Machol, Jill P Mesirov, Eric S Lander, and Erez Lieberman Aiden. 2016. “Juicebox Provides a Visualization System for Hi-c Contact Maps with Unlimited Zoom.” Cell Systems 3 (1): 99–101.
Durand, Neva C, Muhammad S Shamim, Ido Machol, Suhas SP Rao, Miriam H Huntley, Eric S Lander, and Erez Lieberman Aiden. 2016. “Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-c Experiments.” Cell Systems 3 (1): 95–98.
Ester, Martin, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, and others. 1996. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.” In Kdd, 96:226–31. 34.
Fudenberg, Geoffrey, Maxim Imakaev, Carolyn Lu, Anton Goloborodko, Nezar Abdennur, and Leonid A Mirny. 2016. “Formation of Chromosomal Domains by Loop Extrusion.” Cell Reports 15 (9): 2038–49.
Hahsler, Michael, Matthew Piekenbrock, and Derek Doran. 2019. “Dbscan: Fast Density-Based Clustering with R.” Journal of Statistical Software 25: 409–16.
Hansen, Anders S, Claudia Cattoglio, Xavier Darzacq, and Robert Tjian. 2018. “Recent Evidence That Tads and Chromatin Loops Are Dynamic Structures.” Nucleus 9 (1): 20–32.
Hnisz, Denes, Abraham S Weintraub, Daniel S Day, Anne-Laure Valton, Rasmus O Bak, Charles H Li, Johanna Goldmann, et al. 2016. “Activation of Proto-Oncogenes by Disruption of Chromosome Neighborhoods.” Science 351 (6280): 1454–8.
Krefting, Jan, Miguel A Andrade-Navarro, and Jonas Ibn-Salem. 2018. “Evolutionary Stability of Topologically Associating Domains Is Associated with Conserved Gene Regulation.” BMC Biology 16 (1): 1–12.
Kuhn, Max. 2012. “The Caret Package.” R Foundation for Statistical Computing, Vienna, Austria. URL Https://Cran. R-Project. Org/Package= Caret.
Liaw, Andy, Matthew Wiener, and others. 2002. “Classification and Regression by randomForest.” R News 2 (3): 18–22.
Lieberman-Aiden, Erez, Nynke L Van Berkum, Louise Williams, Maxim Imakaev, Tobias Ragoczy, Agnes Telling, Ido Amit, et al. 2009. “Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome.” Science 326 (5950): 289–93.
Lupiáñez, Darı́o G, Malte Spielmann, and Stefan Mundlos. 2016. “Breaking Tads: How Alterations of Chromatin Domains Result in Disease.” Trends in Genetics 32 (4): 225–37.
Meaburn, Karen J, Erik Cabuy, Gisele Bonne, Nicolas Levy, Glenn E Morris, Giuseppe Novelli, Ian R Kill, and Joanna M Bridger. 2007. “Primary Laminopathy Fibroblasts Display Altered Genome Organization and Apoptosis.” Aging Cell 6 (2): 139–53. https://doi.org/10.1111/j.1474-9726.2007.00270.x.
Nora, Elphège P, Bryan R Lajoie, Edda G Schulz, Luca Giorgetti, Ikuhiro Okamoto, Nicolas Servant, Tristan Piolot, et al. 2012. “Spatial Partitioning of the Regulatory Landscape of the X-Inactivation Centre.” Nature 485 (7398): 381.
Ramirez, Fidel, Devon P Ryan, Bjorn Gruning, Vivek Bhardwaj, Fabian Kilpert, Andreas S Richter, Steffen Heyne, Friederike Dundar, and Thomas Manke. 2016. “DeepTools2: A Next Generation Web Server for Deep-Sequencing Data Analysis.” Nucleic Acids Research 44 (W1): W160–W165.
Rao, Suhas SP, Miriam H Huntley, Neva C Durand, Elena K Stamenova, Ivan D Bochkov, James T Robinson, Adrian L Sanborn, et al. 2014. “A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping.” Cell 159 (7): 1665–80.
Sanborn, Adrian L, Suhas SP Rao, Su-Chen Huang, Neva C Durand, Miriam H Huntley, Andrew I Jewett, Ivan D Bochkov, et al. 2015. “Chromatin Extrusion Explains Key Features of Loop and Domain Formation in Wild-Type and Engineered Genomes.” Proceedings of the National Academy of Sciences 112 (47): E6456–E6465.
Schmitt, Anthony D, Ming Hu, Inkyung Jung, Zheng Xu, Yunjiang Qiu, Catherine L Tan, Yun Li, et al. 2016. “A Compendium of Chromatin Contact Maps Reveals Spatially Active Regions in the Human Genome.” Cell Rep 17 (8): 2042–59. https://doi.org/10.1016/j.celrep.2016.10.061.
Sexton, Tom, Eitan Yaffe, Ephraim Kenigsberg, Frédéric Bantignies, Benjamin Leblanc, Michael Hoichman, Hugues Parrinello, Amos Tanay, and Giacomo Cavalli. 2012. “Three-Dimensional Folding and Functional Organization Principles of the Drosophila Genome.” Cell 148 (3): 458–72.
Sun, James H, Linda Zhou, Daniel J Emerson, Sai A Phyo, Katelyn R Titus, Wanfeng Gong, Thomas G Gilgenast, et al. 2018. “Disease-Associated Short Tandem Repeats Co-Localize with Chromatin Domain Boundaries.” Cell 175 (1): 224–238.e15. https://doi.org/10.1016/j.cell.2018.08.005.
Taberlay, Phillippa C, Joanna Achinger-Kawecka, Aaron TL Lun, Fabian A Buske, Kenneth Sabir, Cathryn M Gould, Elena Zotenko, et al. 2016. “Three-Dimensional Disorganization of the Cancer Genome Occurs Coincident with Long-Range Genetic and Epigenetic Alterations.” Genome Research 26 (6): 719–31.