Installation

library(knitr)

# if (!requireNamespace("BiocManager", quietly=TRUE))
#     install.packages("BiocManager")
# BiocManager::install("preciseTAD")
#library(preciseTAD)

#library(devtools)
#devtools::install_github("dozmorovlab/preciseTAD")
library(preciseTAD)

#install_github("stilianoudakis/preciseTADhub")
#library(preciseTADhub)
#library(ExperimentHub)

Introduction

The advent of chromosome conformation capture (3C) sequencing technologies, and its successor Hi-C, have revealed a hierarchy of the 3-dimensional (3D) structure of the human genome such as chromatin loops (Rao et al. 2014), Topologically Associating Domains (TADs) (Nora et al. 2012; Dixon et al. 2012), and A/B compartments (Lieberman-Aiden et al. 2009). At the megabase scale, TADs represent regions on the linear genome that are highly self-interacting. Emerging evidence has linked TADs to critical roles in cell dynamics and cell differentiation. Studies have shown that TADs themselves are highly conserved across species and cell lines (Dixon et al. 2012, 2015; Krefting, Andrade-Navarro, and Ibn-Salem 2018; Nora et al. 2012; Rao et al. 2014; Schmitt et al. 2016). Disruption of boundaries demarcating TADs and loops promotes cancer (Hnisz et al. 2016; Taberlay et al. 2016) and other disorders (Lupiáñez, Spielmann, and Mundlos 2016; Sun et al. 2018; Meaburn et al. 2007). Therefore, identifying the precise location of TAD boundaries remains a top priority in our goal to fully understand the functionality of the human genome.

In this workshop, we present preciseTAD, an optimally tuned machine learning framework for precise identification of domain boundaries using genome annotation data. Our method utilizes the random forest (RF) algorithm trained on high-resolution transcription factor binding site (TFBS) data to predict low-resolution boundaries. We introduce a systematic pipeline for building the optimal boundary region prediction classifier. Translated from Hi-C data resolution level to base level (annotating each base and predicting its boundary probability), preciseTAD employs density-based clustering (DBSCAN) and partitioning around medoids (PAM) to detect genome annotation-guided boundary regions and points at a base-level resolution. This approach circumvents resolution restrictions of Hi-C data, allowing for the precise detection of biologically meaningful boundaries.

In this workshop we will demonstrate the following:

preciseTAD predictions are more enriched for known molecular drivers of 3D chromatin, including CTCF, RAD21, SMC3, and ZNF143
preciseTAD-predicted boundaries are more conserved across cell lines
Using the methods developed, a model trained in one cell line can accurately predict boundaries in another cell line

Obtaining ground truth TAD boundaries

preciseTAD uses pre-defined TAD boundary coordinates as ground truth. We opt to use the Arrowhead algorithm to define our ground truth TAD boundaries, a gold-standard TAD-caller developed by Aiden Lab (Durand, Shamim, et al. 2016). The .hic files are publically available on GEO here, and are provided by the same team that developed Arrowhead (Rao et al. 2014). For this workshop we will be working with the “GSE63525_GM12878_insitu_primary.hic” file. This .hic file is a highly compressed binary file that has contact matrices from multiple biological replicate HiC experiments, at different resolutions, for GM12878. You can read the full protocol used to derive the various .hic files here.

Once you have the .hic file downloaded, you are ready to implement Arrowhead to call TAD boundaries. Below is an example of how you would perform this for every chromosome and for a variety of resolutions:

for i in 5000 10000 25000 50000 100000;
    do 
        echo $i.kb
            for j in {1..22};
                do
                echo chr$j
                java -jar juicebox_tools_7.5.jar arrowhead -c $j \ #chromosome to call TADs on
                                                           -r $i \ #HiC data resolution
                                                           ~/GSE63525_GM12878_insitu_primary.hic \ #location of the .HIC file
                                                           ~/arrowhead_output/GM12878/$i.b/chr$j #location to store the output
            done;
    done;

Note: This creates separate files for each chromosome (given a specific resolution). As an example, here is what the first few lines of CHR1 look like, performed at 5kb:

chr1	x1	x2	chr2	y1	y2	color	score	uVarScore	lVarScore	upSign	loSign
1	49375000	50805000	1	49375000	50805000	255,255,0	0.8635365988485998	0.3025175910115416	0.2942087461534729	0.4431332556332556	0.6401515151515151
1	16830000	17230000	1	16830000	17230000	255,255,0	0.20513320424507786	0.3676005813255187	0.22358121923097696	0.4121951219512195	0.4792682926829268
1	163355000	164860000	1	163355000	164860000	255,255,0	0.8052794339967723	0.27908025641958345	0.3017420763866275	0.657032586290075	0.5470812683654226
1	231935000	233400000	1	231935000	233400000	255,255,0	0.43336944864061144	0.28052310457518204	0.27077299003747285	0.451432273589708	0.44314868804664725
1	149035000	149430000	1	149035000	149430000	255,255,0	0.8579084086781587	0.3477831683616397	0.18791010793617813	0.45375	0.805625

Explanations of each field are as follows:

chromosome = the chromosome that the domain is located on
x1,x2/y1,y2 = the interval spanned by the domain (contact domains manifest as squares on the diagonal of a Hi-C matrix and as such: x1=y1, x2=y2)
color = the color that the feature will be rendered as if loaded in Juicebox
corner_score = the corner score, a score indicating the likelihood that a pixel is at the corner of a contact domain. Higher values indicate a greater likelihood of being at the corner of a domain
Uvar = the variance of the upper triangle
Lvar = the variance of the lower triangle
Usign = -1*(sum of the sign of the entries in the upper triangle)
Lsign = sum of the sign of the entries in the lower triangle

For the purposes of this workshop, we are only interested in the first 3 columns, the chromosome, and the start and end coordinates of each called TAD. Since this is performed per chromosome, each chromosome specific file needs to be concatenated to obtain the full list of TADs on the entire genome. For convenience, the TADs called at 5kb resolution for GM12878 using Arrowhead are provided internally in the preciseTAD package.

data("arrowhead_gm12878_5kb")
head(arrowhead_gm12878_5kb)
#>   V1        V2        V3
#> 1  1  49375000  50805000
#> 2  1  16830000  17230000
#> 3  1 163355000 164860000
#> 4  1 231935000 233400000
#> 5  1 149035000 149430000
#> 6  1   3995000   5505000

Obtaining functional genomic elements

The next piece of data that we need are the functional genomic elements used to construct the feature space. These are processed narrowPeak ChIP-seq data in the form of .BED files with chromosome, and start and end coordinates of called peaks of signal enrichment. For this workshop we consider 43 transcription factor binding sites.

Cell type-specific genomic annotation data can be downloaded from the ENCODE data portal as BED files. For convenience, we have included the list of elements in the \inst\extdata\genomicElements folder of the github repository here.

Describing preciseTAD functionality

The training/testing data used for modeling is represented as a matrix with rows being genomic regions, columns being genomic annotations, and cells containing measures of association between them. Users have the option to concatenate genomic regions from multiple chromosomes.

To create the row-wise dimension of the data, preciseTAD uses shifted binning, a strategy for making the dimensions of the data matrix used for modeling by segmenting the linear genome into nonoverlapping regions. This step is transparent for the user. To create shifted bins, chromosome-specific bins start at half of the resolution r, and continue in congruent intervals of length r until the end of the chromosome (mod r + r/2), using hg19 genomic coordinates. The shifted genomic bins, are then defined as boundary regions (Y = 1) if they contain a called boundary, and non-boundary regions (Y = 0) otherwise, thus establishing the binary response vector (Y) used for classification (Figure 1). Intuitively, shifted bins are centered on borders between the original bins, thus capturing potential boundaries.

Figure 1. Creating the response vector for binary classification.

The column-wise dimension is formed by genomic annotations, such as transcription factor binding sites (TFBSs). In this workshop we focus on the (\(log_{2}\)) distances, which enumerate the genomic distance from the center of each genomic bin to the center of the nearest ChIP-seq peak region of interest, although other options are available (see ?preciseTAD::createTADdata). These quantaties form the feature space \(\textbf{X} = \{X_{1}, X_{2} \cdots X_{p} \}\) (Figure 2). The customized training/testing data formation offered by preciseTAD allows users to implement any binary classification machine learning algorithm.

Figure 2. Creating the feature space for binary classification.

preciseTAD provides functionality to implement a random forest (RF) model, allowing tuning hyperparameters and applying feature reduction. The primary inputs are the training and testing data, a list of hyperparameter values, the number of cross-validation folds to use (if a grid of hyperparameter values is provided), and the metric used for optimization. The output includes the model object (necessary for downstream prediction of boundaries), the variable importance values, and a list of performance metrics when validating the model on the testing data. This model is then used to predict base-level precise boundary locations.

To predict the base-level location of domain boundaries, the preciseTAD model is applied for each base annotated with the aforementioned genomic annotations. The base-level probabilities are clustered using density-based clustering and scalable data partitioning techniques, to narrow boundary regions and points. First, the probability vector, \(p_{n_{i}}\), is extracted (\(n_{i}\) representing the length of chromosome \(i\)). Next, DBSCAN (Density-based Spatial Clustering of Applications with Noise) (Ester et al. 1996; Hahsler, Piekenbrock, and Doran 2019) is applied to the matrix of pairwise genomic distances between bases with \(p_{n_{i}} \ge t\), where \(t\) is a threshold determined by the user. The resulting clusters of highly predictive bases identified by DBSCAN are termed preciseTAD boundary regions (PTBR). To precisely identify a single base among each PTBR, preciseTAD implements partitioning around medoids (PAM) within each cluster. The corresponding cluster medoid is defined as a preciseTAD boundary point (PTBP), making it the most representative and biologically meaningful base within each clustered PTBR. The output includes a list of genomic coordinates of PTBPs and PTBSs.

preciseTAD allows us to use the pre-trained model to predict the precise location of domain boundaries in cell types without Hi-C data but with genome annotation data. Specifically, only cell-line-specific ChIP-seq data (BED format) for CTCF, RAD21, SMC3, and ZNF143 transcription factor binding sites is required. We provide models pre-trained using GM12878 and K562 genome annotation data, and Arrowhead- and Peakachu-predicted boundaries for chromosome-specific prediction of precise domain boundaries in other cell lines.

The preciseTAD R package

As an example of the functionality of preciseTAD, lets consider the goal of predicting TAD boundary coordinates for CHR14 on GM12878 at 5kb. First, we will build an optimized model to predict TAD boundary regions, then we will leverage said model to predict boundaries on CHR14 at base-level resolution. We will consider the heuristic of training a model on chromosomes 1-13 & 15-22, while reserving data from CHR14 as the test data.

Scripts used to generate the figures in the following sections can be found here.

Model building

Users will need to transform the TAD coordinate data into a GRanges object of unique boundary coordinates using the preciseTAD::extractBoundaries() function. Here we extract boundaries for CHR1-CHR22 because all chromosomes will be in use (either in the training set or the testing set). We specify preprocess=FALSE because we are only interested in boundaries and not filtering TADs by length. Lastly, we specify resolution=5000 to match the resolution used by Arrowhead (although this argument is ignored given that preprocess=FALSE). As shown below, there were a total of 16153 unique TAD boundaries reported by Arrowhead for chromosomes 1-22.

bounds <- extractBoundaries(domains.mat = arrowhead_gm12878_5kb, 
                            preprocess = FALSE, 
                            CHR = paste0("CHR",1:22), 
                            resolution = 5000)
# View unique boundaries
bounds
#> GRanges object with 16153 ranges and 0 metadata columns:
#>           seqnames    ranges strand
#>              <Rle> <IRanges>  <Rle>
#>       [1]     chr1    815000      *
#>       [2]     chr1    890000      *
#>       [3]     chr1    915000      *
#>       [4]     chr1   1005000      *
#>       [5]     chr1   1015000      *
#>       ...      ...       ...    ...
#>   [16149]    chr22  50815000      *
#>   [16150]    chr22  50935000      *
#>   [16151]    chr22  50945000      *
#>   [16152]    chr22  51060000      *
#>   [16153]    chr22  51235000      *
#>   -------
#>   seqinfo: 22 sequences from an unspecified genome; no seqlengths

Additionally, users will need to add each of the respective .BED files,representing the functional genomic elements of choice, to a single GenomicRangesList object. This can be done via the preciseTAD::bedToGRangesList() function. The signal argument refers to the column in the BED files containing peak signal strength values and is used to assign metadata to the corresponding GRanges (this needs to be the same column for every .BED file; only necessary for downstream plotting).

path <- "../inst/extdata/genomicElements"
path
#> [1] "../inst/extdata/genomicElements"
tfbsList <- bedToGRangesList(filepath=path, bedList=NULL, bedNames=NULL, pattern = "*.bed", signal=4)

names(tfbsList)
#>  [1] "Gm12878-Bcl3-Haib"          "Gm12878-Bclaf101388-Haib"  
#>  [3] "Gm12878-Bhlh-Sydh"          "Gm12878-Cebpbsc150-Haib"   
#>  [5] "Gm12878-Chd1a301218a-Sydh"  "Gm12878-Chd2ab68301-Sydh"  
#>  [7] "Gm12878-Cmyc-Uta"           "Gm12878-Ctcf-Broad"        
#>  [9] "Gm12878-E2f4-Sydh"          "Gm12878-Egr1-Haib"         
#> [11] "Gm12878-Elf1sc631-Haib"     "Gm12878-Elk112771-Sydh"    
#> [13] "Gm12878-Ets1-Haib"          "Gm12878-Ezh239875-Broad"   
#> [15] "Gm12878-Gabp-Haib"          "Gm12878-Jund-Sydh"         
#> [17] "Gm12878-Max-Sydh"           "Gm12878-Mazab85725-Sydh"   
#> [19] "Gm12878-Mef2a-Haib"         "Gm12878-Mxi1-Sydh"         
#> [21] "Gm12878-Nfyb-Sydh"          "Gm12878-Nrf1-Sydh"         
#> [23] "Gm12878-Nrsf-Haib"          "Gm12878-P300-Sydh"         
#> [25] "Gm12878-Pmlsc71910-Haib"    "Gm12878-Pol2-Haib"         
#> [27] "Gm12878-Pol24h8-Haib"       "Gm12878-Pol2s2-Sydh"       
#> [29] "Gm12878-Pu1-Haib"           "Gm12878-Rad21-Haib"        
#> [31] "Gm12878-Rfx5-Sydh"          "Gm12878-Sin3a-Sydh"        
#> [33] "Gm12878-Six5-Haib"          "Gm12878-Smc3-Sydh"         
#> [35] "Gm12878-Sp1-Haib"           "Gm12878-Srf-Haib"          
#> [37] "Gm12878-Stat5asc74442-Haib" "Gm12878-Taf1-Haib"         
#> [39] "Gm12878-Tblr1ab24550-Sydh"  "Gm12878-Tbp-Sydh"          
#> [41] "Gm12878-Usf1-Haib"          "Gm12878-Usf2-Sydh"         
#> [43] "Gm12878-Znf143-Sydh"

Using the “ground-truth” boundaries and the annotations, we can build the data matrix that will be used for predictive modeling. The preciseTAD::createTADdata() function can create the training and testing data simultaneously as a list object. Here, we specify to train on chromosomes 1-22 and test on chromosome 14. Additionally, we specify resolution = 5000 to construct 5kb shifted genomic bins (to match the Hi-C data resolution), featureType = "distance" for a \(log_2(distance + 1)\)-type feature space, and resampling = "rus" to apply random under-sampling (RUS) on the training data to balance classes of boundary vs. non-boundary regions. We also specify a seed to ensure reproducibility when performing the resampling. The result is a list containing two data frames: (1) the resampled (if specified) training data, and (2) the testing data.

set.seed(123)
tadData <- createTADdata(bounds.GR = bounds,
                         resolution = 5000,
                         genomicElements.GR = tfbsList,
                         featureType = "distance",
                         resampling = "rus",
                         trainCHR = paste0("CHR",c(1:8,10:13,15:22)),
                         predictCHR = "CHR14")

# View subset of training data
tadData[[1]][1:5,1:4]
#>         y Gm12878-Bcl3-Haib Gm12878-Bclaf101388-Haib Gm12878-Bhlh-Sydh
#> 5799   No          14.25414                 14.24822          14.24652
#> 32930  No          11.80090                 14.88312          14.09119
#> 5767  Yes          11.10656                 10.58871          10.99081
#> 8504  Yes          16.48621                 18.09989          17.95111
#> 39486  No          13.57884                 18.67059          13.57837
# Check it is balanced
table(tadData[[1]]$y)
#> 
#>    No   Yes 
#> 14913 14913

# View subset of testing data
tadData[[2]][1:5,1:4]
#>    y Gm12878-Bcl3-Haib Gm12878-Bclaf101388-Haib Gm12878-Bhlh-Sydh
#> 1 No          24.30718                 24.30793          24.30793
#> 2 No          24.30683                 24.30759          24.30758
#> 3 No          24.30649                 24.30724          24.30723
#> 4 No          24.30614                 24.30689          24.30688
#> 5 No          24.30579                 24.30654          24.30654

Optimization and model evaluation

We can now implement our machine learning algorithm of choice to predict boundary regions. Here, we opt for the random forest algorithm for binary classification. preciseTAD offers functionality for performing recursive feature elimination (RFE) as a form of feature reduction through the use of the preciseTAD::TADrfe() function, which is a wrapper for the rfe function in the caret package (Kuhn 2012). preciseTAD::TADrfe() implements a random forest model on the best subset of features from 2 to the maximum number of features in the data by powers of 2, using 5-fold cross-validation. We specify accuracy as the performance metric.

set.seed(1)
rfe_res <- TADrfe(trainData = tadData[[1]],
                 tuneParams = list(ntree = 500, nodesize = 1),
                 cvFolds = 5,
                 cvMetric = "Accuracy",
                 verbose = TRUE)

Results from RFE indicate that performance levels off when only considering the top 8 most predictive transcription factors (Figure 3A). We can further optimize our model by evaluating the predictive importance of these top 8 genomic elements across each cross-fold via a heatmap. After aggregating the variable importance values across each cross-fold, we see that SMC3, RAD21, CTCF, and ZNF143 stand out as the most predictive elements TAD-boundary regions (Figure 3B). These are known components of the loop-extrusion model that has been proposed as a mechanism for the 3D architecture of the human genome (Sanborn et al. 2015; Fudenberg et al. 2016; Hansen et al. 2018).

Figure 3. Optimizing the TAD boundary region prediction model. (A) Recursive feature elimination. (B) Variable importance heatmap across each of the 5 cross-validation folds.

Now that we have suitably reduced our feature space, we can implement a random forest algorithm built simply on the TFBS mentioned above (SMC3, RAD21, CTCF, and ZNF143). We can take advantage of the preciseTAD::TADrandomForest() function, which is a wrapper of the randomForest package (Breiman 2001; Liaw, Wiener, and others 2002). We specify the training and testing data, the hyperparameter values, the number of cross-validation folds, the performance metric to consider (here, accuracy), the seed to initialize for reproducibility, an indicator for retaining the model object, an indicator for retaining variable importances, the variable importance measure to consider (here, mean decrease in accuracy (MDA)), and an indicator for retaining model performances based on the test data. The function returns a list containing: 1) a trained model from caret with model information, 2) a data.frame of variable importance for each feature included in the model, and 3) a data.frame of various model performance metrics. Model performances are shown in Table 1.

# Restrict the data matrix to include only SMC3, RAD21, CTCF, and ZNF143
tadDataFiltTrain <- tadData[[1]][,c(1, grep(paste0(c("Ctcf","Rad21","Smc3","Znf143"), collapse = "|"), names(tadData[[1]])))]
tadDataFiltTest <- tadData[[2]][,c(1, grep(paste0(c("Ctcf","Rad21","Smc3","Znf143"), collapse = "|"), names(tadData[[1]])))]

# Run RF
set.seed(123)
tadModel <- TADrandomForest(trainData  = tadDataFiltTrain,
                            testData   = tadDataFiltTest,
                            tuneParams = list(mtry     = c(2,3),
                                              ntree    = 500,
                                              nodesize = 1),
                            cvFolds      = 3,
                            cvMetric     = "Accuracy",
                            verbose      = TRUE,
                            model        = TRUE,
                            importances  = TRUE,
                            impMeasure   = "MDA",
                            performances = TRUE)
#> + Fold1: mtry=2, ntree=500, nodesize=1 
#> - Fold1: mtry=2, ntree=500, nodesize=1 
#> + Fold1: mtry=3, ntree=500, nodesize=1 
#> - Fold1: mtry=3, ntree=500, nodesize=1 
#> + Fold2: mtry=2, ntree=500, nodesize=1 
#> - Fold2: mtry=2, ntree=500, nodesize=1 
#> + Fold2: mtry=3, ntree=500, nodesize=1 
#> - Fold2: mtry=3, ntree=500, nodesize=1 
#> + Fold3: mtry=2, ntree=500, nodesize=1 
#> - Fold3: mtry=2, ntree=500, nodesize=1 
#> + Fold3: mtry=3, ntree=500, nodesize=1 
#> - Fold3: mtry=3, ntree=500, nodesize=1 
#> Aggregating results
#> Selecting tuning parameters
#> Fitting mtry = 2, ntree = 500, nodesize = 1 on full training set

# View model performances
performances <- tadModel[[3]]
performances$Performance <- round(performances$Performance, digits = 2)
rownames(performances) <- NULL #performances$Metric
kable(performances, caption = "Table 1. List of model performances.")

Table 1. List of model performances.
Metric	Performance
TN	14835.00
FN	132.00
FP	5479.00
TP	423.00
Total	20869.00
Sensitivity	0.76
Specificity	0.73
Kappa	0.09
Accuracy	0.73
BalancedAccuracy	0.75
Precision	0.07
FPR	0.27
FNR	0.01
NPV	0.99
MCC	0.18
F1	0.13
AUC	0.82
Youden	0.49
AUPRC	0.09

Reading in model objects using `preciseTADhub` (must be using R verion >= 4.1 or R-devel)

For convenience, we have provided an ExperimentData package to supplement preciseTAD, called preciseTADhub. preciseTADhub offers users access to pre-trained random forest classification models built on CTCF, RAD21, SMC3, and ZNF143. Each of the 84 (2 cell lines \(\times\) 2 ground truth boundaries \(\times\) 21 autosomal chromosomes) files (.RDS) are stored as lists containing two objects: 1) a train object from with RF model information, and 2) a data.frame of variable importance for each genomic annotation included in the model. The file names are structured as follows:

\[i\_j\_k\_l.rds\]

where \(i\) denotes the chromosome that was used as a holdout {CHR1, CHR2, …, CHR21, CHR22} (i.e. for testing; meaning all other chromosomes were used for training), \(j\) denotes the cell line {GM12878, K562}, \(k\) denotes the resolution (size of genomic bins) {5kb, 10kb}, and \(l\) denotes the TAD/loop caller used to define ground truth {Arrowhead, Peakachu}.

Table 2 shows the file names and the corresponding ExperimentHub (EH) IDs. Since we want to make TAD boundary predictions on CHR14 for GM12878, we opt to read in the “CHR14_GM12878_5kb_Arrowhead.rds” file. This corresponds to the EH3863 EHID.

FileName	EHID
CHR1_GM12878_5kb_Arrowhead.rds	EH3815
CHR1_GM12878_10kb_Peakachu.rds	EH3816
CHR1_K562_5kb_Arrowhead.rds	EH3817
CHR1_K562_10kb_Peakachu.rds	EH3818
CHR2_GM12878_5kb_Arrowhead.rds	EH3819
CHR2_GM12878_10kb_Peakachu.rds	EH3820
CHR2_K562_5kb_Arrowhead.rds	EH3821
CHR2_K562_10kb_Peakachu.rds	EH3822
CHR3_GM12878_5kb_Arrowhead.rds	EH3823
CHR3_GM12878_10kb_Peakachu.rds	EH3824
CHR3_K562_5kb_Arrowhead.rds	EH3825
CHR3_K562_10kb_Peakachu.rds	EH3826
CHR4_GM12878_5kb_Arrowhead.rds	EH3827
CHR4_GM12878_10kb_Peakachu.rds	EH3828
CHR4_K562_5kb_Arrowhead.rds	EH3829
CHR4_K562_10kb_Peakachu.rds	EH3830
CHR5_GM12878_5kb_Arrowhead.rds	EH3831
CHR5_GM12878_10kb_Peakachu.rds	EH3832
CHR5_K562_5kb_Arrowhead.rds	EH3833
CHR5_K562_10kb_Peakachu.rds	EH3834
CHR6_GM12878_5kb_Arrowhead.rds	EH3835
CHR6_GM12878_10kb_Peakachu.rds	EH3836
CHR6_K562_5kb_Arrowhead.rds	EH3837
CHR6_K562_10kb_Peakachu.rds	EH3838
CHR7_GM12878_5kb_Arrowhead.rds	EH3839
CHR7_GM12878_10kb_Peakachu.rds	EH3840
CHR7_K562_5kb_Arrowhead.rds	EH3841
CHR7_K562_10kb_Peakachu.rds	EH3842
CHR8_GM12878_5kb_Arrowhead.rds	EH3843
CHR8_GM12878_10kb_Peakachu.rds	EH3844
CHR8_K562_5kb_Arrowhead.rds	EH3845
CHR8_K562_10kb_Peakachu.rds	EH3846
CHR10_GM12878_5kb_Arrowhead.rds	EH3847
CHR10_GM12878_10kb_Peakachu.rds	EH3848
CHR10_K562_5kb_Arrowhead.rds	EH3849
CHR10_K562_10kb_Peakachu.rds	EH3850
CHR11_GM12878_5kb_Arrowhead.rds	EH3851
CHR11_GM12878_10kb_Peakachu.rds	EH3852
CHR11_K562_5kb_Arrowhead.rds	EH3853
CHR11_K562_10kb_Peakachu.rds	EH3854
CHR12_GM12878_5kb_Arrowhead.rds	EH3855
CHR12_GM12878_10kb_Peakachu.rds	EH3856
CHR12_K562_5kb_Arrowhead.rds	EH3857
CHR12_K562_10kb_Peakachu.rds	EH3858
CHR13_GM12878_5kb_Arrowhead.rds	EH3859
CHR13_GM12878_10kb_Peakachu.rds	EH3860
CHR13_K562_5kb_Arrowhead.rds	EH3861
CHR13_K562_10kb_Peakachu.rds	EH3862
CHR14_GM12878_5kb_Arrowhead.rds	EH3863
CHR14_GM12878_10kb_Peakachu.rds	EH3864
CHR14_K562_5kb_Arrowhead.rds	EH3865
CHR14_K562_10kb_Peakachu.rds	EH3866
CHR15_GM12878_5kb_Arrowhead.rds	EH3867
CHR15_GM12878_10kb_Peakachu.rds	EH3868
CHR15_K562_5kb_Arrowhead.rds	EH3869
CHR15_K562_10kb_Peakachu.rds	EH3870
CHR16_GM12878_5kb_Arrowhead.rds	EH3871
CHR16_GM12878_10kb_Peakachu.rds	EH3872
CHR16_K562_5kb_Arrowhead.rds	EH3873
CHR16_K562_10kb_Peakachu.rds	EH3874
CHR17_GM12878_5kb_Arrowhead.rds	EH3875
CHR17_GM12878_10kb_Peakachu.rds	EH3876
CHR17_K562_5kb_Arrowhead.rds	EH3877
CHR17_K562_10kb_Peakachu.rds	EH3878
CHR18_GM12878_5kb_Arrowhead.rds	EH3879
CHR18_GM12878_10kb_Peakachu.rds	EH3880
CHR18_K562_5kb_Arrowhead.rds	EH3881
CHR18_K562_10kb_Peakachu.rds	EH3882
CHR19_GM12878_5kb_Arrowhead.rds	EH3883
CHR19_GM12878_10kb_Peakachu.rds	EH3884
CHR19_K562_5kb_Arrowhead.rds	EH3885
CHR19_K562_10kb_Peakachu.rds	EH3886
CHR20_GM12878_5kb_Arrowhead.rds	EH3887
CHR20_GM12878_10kb_Peakachu.rds	EH3888
CHR20_K562_5kb_Arrowhead.rds	EH3889
CHR20_K562_10kb_Peakachu.rds	EH3890
CHR21_GM12878_5kb_Arrowhead.rds	EH3891
CHR21_GM12878_10kb_Peakachu.rds	EH3892
CHR21_K562_5kb_Arrowhead.rds	EH3893
CHR21_K562_10kb_Peakachu.rds	EH3894
CHR22_GM12878_5kb_Arrowhead.rds	EH3895
CHR22_GM12878_10kb_Peakachu.rds	EH3896
CHR22_K562_5kb_Arrowhead.rds	EH3897
CHR22_K562_10kb_Peakachu.rds	EH3898

Table 2: File names and corresponding ExperimentHub (EH) IDs for all .RDS files stored in preciseTADhub.

#Initialize ExperimentHub
hub <- ExperimentHub()
query(hub, "preciseTADhub")
myfiles <- query(hub, "preciseTADhub")

id <- "EH3863"
tadModel <- myfiles[[id]]

Boundary prediction with preciseTAD

Recall that our model classifies boundary regions, in that each prediction refers to a genomic bin of width 5000 bases. To predict boundary coordinates at the base resolution more precisely, we can leverage our model through the use of the preciseTAD::preciseTAD() function. Conceptually, instead of genomic bins, we annotate each base with the selected genomic annotations and featureType. We then apply our model on this annotation matrix to predict the probability of each base as a boundary given the associated genomic annotations. See the preciseTAD paper for a concise description of the algorithm and the pseudocode.

Suppose we want to precisely predict the domain boundary coordinates for the 1.3Mb section of CHR14:50,070,000-50,800,000. To do so, we specify chromCoords = list(50070000,50800000). Additionally, we set a threshold of 1.0 for constructing PTBRs using threshold = 1.0. For DBSCAN, we assign 10000 and 3 for the \(\epsilon\) and MinPts parameters, respectively. The results is a list containing 3 elements including: 1) the genomic coordinates spanning each preciseTAD predicted region (PTBR), 2) the genomic coordinates of preciseTAD predicted boundaries points (PTBP), and 3) a named list including summary statistics of the following: PTBRWidth - PTBR width, PTBRCoverage - the proportion of bases within a PTBR with probabilities that equal to or exceed the threshold (t=1 by default), DistanceBetweenPTBR - the genomic distance between the end of the previous PTBR and the start of the subsequent PTBR, NumSubRegions - the number of the subregions in each PTBR cluster, SubRegionWidth - the width of the subregion forming each PTBR, DistBetweenSubRegions - the genomic distance between the end of the previous PTBR-specific subregion and the start of the subsequent PTBR-specific subregion, and the normalized enrichment of the genomic annotations used in the model around flanked PTBPs (Figure 4).

Figure 4. preciseTAD summaries.

# Restrict to include only SMC3, RAD21, CTCF, and ZNF143
tfbsList_filt <- tfbsList[names(tfbsList) %in% c("Gm12878-Ctcf-Broad", 
                                            "Gm12878-Rad21-Haib",
                                            "Gm12878-Smc3-Sydh",
                                            "Gm12878-Znf143-Sydh")]

# Run preciseTAD
pt <- preciseTAD(genomicElements.GR = tfbsList_filt,
                featureType         = "distance",
                CHR                 = "CHR14",
                chromCoords         = list(50070000,51360000),
                tadModel            = tadModel[[1]],
                threshold           = 1.0,
                verbose             = TRUE,
                parallel            = NULL,
                DBSCAN_params       = list(10000, 3),
                flank               = 5000)
#> [1] "Establishing bp resolution test data using a distance type feature space"
#> [1] "Establishing probability vector"
#> [1] "preciseTAD identified a total of 2918 base pairs whose predictive probability was equal to or exceeded a threshold of 1"
#> [1] "Initializing DBSCAN"
#> [1] "preciseTAD identified 8 PTBRs"
#> [1] "Establishing PTBPs"
#> [1] "Cluster 1 out of 8"
#> [1] "Cluster 2 out of 8"
#> [1] "Cluster 3 out of 8"
#> [1] "Cluster 4 out of 8"
#> [1] "Cluster 5 out of 8"
#> [1] "Cluster 6 out of 8"
#> [1] "Cluster 7 out of 8"
#> [1] "Cluster 8 out of 8"

# View the results
pt
#> $PTBR
#> GRanges object with 8 ranges and 0 metadata columns:
#>       seqnames            ranges strand
#>          <Rle>         <IRanges>  <Rle>
#>   [1]    chr14 50101047-50101136      *
#>   [2]    chr14 50312215-50334062      *
#>   [3]    chr14 50563705-50583007      *
#>   [4]    chr14 50784355-50796085      *
#>   [5]    chr14 50997740-51001756      *
#>   [6]    chr14 51059855-51093997      *
#>   [7]    chr14 51231563-51247722      *
#>   [8]    chr14 51319223-51334404      *
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#> 
#> $PTBP
#> GRanges object with 8 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]    chr14  50101086      *
#>   [2]    chr14  50328921      *
#>   [3]    chr14  50571870      *
#>   [4]    chr14  50787930      *
#>   [5]    chr14  50999455      *
#>   [6]    chr14  51066958      *
#>   [7]    chr14  51241489      *
#>   [8]    chr14  51327392      *
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#> 
#> $Summaries
#> $Summaries$PTBRWidth
#>   min   max median      iqr     mean       sd
#> 1  90 34143  15671 10136.75 15309.25 10597.17
#> 
#> $Summaries$PTBRCoverage
#>          min       max    median       iqr      mean        sd
#> 1 0.01098322 0.8666667 0.0278572 0.0212158 0.1339193 0.2964553
#> 
#> $Summaries$DistanceBetweenPTBR
#>     min    max median      iqr     mean       sd
#> 1 58099 229643 201348 101833.5 158698.7 70251.13
#> 
#> $Summaries$NumSubRegions
#>   min max median  iqr   mean       sd
#> 1   2  50     32 8.75 30.125 13.98405
#> 
#> $Summaries$SubRegionWidth
#>   min max median iqr     mean      sd
#> 1   1 193      4   7 12.11203 30.7296
#> 
#> $Summaries$DistBetweenSubRegions
#>   min  max median iqr     mean       sd
#> 1   2 7529     40 389 514.1116 1149.644
#> 
#> $Summaries$NormilizedEnrichment
#>  Gm12878-Ctcf-Broad  Gm12878-Rad21-Haib   Gm12878-Smc3-Sydh Gm12878-Znf143-Sydh 
#>                1.25                1.25                1.25                1.25

Using preciseTAD with Juicebox

Juicebox is an interface provided by Aiden Lab that allows for superimposing boundary coordinates onto Hi-C contact maps. You can download a desktop version of the application here, or use the online version here. To visualize domains demarcated by the predicted boundaries, you must first select a Hi-C map. Using the desktop version, you can import the contact matrix for GM12878 derived by Rao et al. 2014 by choosing File -> Open... -> Rao and Huntley et al. -> GM12878 -> in situ Mbol -> primary. To format preciseTAD results to use in Juicebox, users can take advantage of preciseTAD::juicer_func(), which transforms a GRanges object into a data frame as shown below.

# Transform
pt_juice <- juicer_func(pt$PTBP)

You will then need to save the PTBPs as a BED file using write.table as shown below. Once saved, import them into Juicebox using Show Annotation Panel -> 2D Annotations -> Add Local -> pt_juice.bed. An example of using Juicebox to plot 3D domain boundary coordinates is shown in Figure 5.

filepath = "~/path/to/store/ptbps"
write.table(pt_juice, 
            file.path(filepath, "pt_juice.bed"),
            quote = FALSE,
            col.names = TRUE,
            row.names = FALSE,
            sep = "\t")

Figure 5. Juicebox visualization of TAD boundaries called by Arrowhead (blue) vs predicted by preciseTAD (green). The region presented are the genomic coordinates spanning 50,055,000-51,360,000 on chromosome 14 of GM12878.

Techniques for visualizing TAD boundaries

The biological significance of TAD boundaries can be assessed by evaluating their association with the signal of known molecular drivers of 3D chromatin, namely CTCF, RAD21, SMC3, and ZNF143. Two popular visualization techniques include enriched heatmaps and profile plots, both of which can be accomplished with deepTools (Ramirez et al. 2016). Briefly, for an enriched heatmap, a matrix \(M_{r \times c}\), is created where the rows, \(r\), are given by the number of boundaries, either called or predicted. The column dimension, \(c\), is created by flanking each boundary by a given amount (5 kb). The flanking is then broken up into 50 bp segments, for 100 windows on both sides of a given boundary, a total of 200 columns. The cells of the matrix are calculated as a mean coverage value for each window with respect to the signal from the respective ChIP-seq annotation via bigwig files. For a profile plot, the matrix is then summarized by row-wise averages and plotted as a density curve, where the center represents the boundary, and the curve represents the average ChIP-seq peak signal around a flanked region. You can read more about each technique here.

Furthermore, we can assess the conservation of TAD boundaries by visualizing the amount of overlap between flanked boundaries across cell lines. Overlaps can be quantified using the Jaccard index defined as

\[J_{(A,B)} = \dfrac{A \cap B}{A \cup B}\]

where A and B represent genomic regions created by flanked called and predicted boundaries. That is, between cell lines, the number of overlapping flanked boundaries divided by the total number of flanked boundaries. Wilcoxon Rank-Sum tests can then be used to compare chromosome-specific Jaccard indices across cell lines.

Below is an example script of how to create an enriched heatmap and profile plot for a set of boundaries using deepTools (version 2.0). Users will first need to create the matrix of signal values, described above, using the computeMatrix tool. To do so, you need to save the resulting boundaries as a .BED file (without a header), where the first column is the chromosome, second column is the boundary coordinate, and third column is (boundary coordinate + 1). The TSS option is used such that the 2nd column of the .BED file is chosen as the center (i.e. the boundary coordinate). Users will next need to download bigwig files for each cell-line specific annotation of interest, which can be done through ENCODE here. The annotation (.bigwig) along with the boundaries (.BED) are provided next in the python code. We next specify by how much to flank around the boundary. Here, we opt for 5 kb (5000 bases) on either side. Additionally, we specify to divide the flanked region into 50 base segments (i.e. (5000+5000)/50=1000/50=200 segments). Lastly, we provide the name of the resulting matrix (computeMatrix_output.gz).

# Create the matrix of signal using computeMatrix
computeMatrix --referencePoint TSS \ #uses the start coordinate (i.e. the boundary)
              -S annotation.bigWig \ #cell-line specific annotation of interest
              -R boundaries.bed \ #TAD boundaries (either predicted or called)
              -a 5000 -b 5000 \ #size of flank on left side (-a) and right side (-b) in bases
              --binSize 50 \ #number of bases to divide the flanked region by
              -o computeMatrix_output.gz #name of the output

Users can then use the computeMatrix_output.gz to create both the enriched heatmap and profile plot as shown below.

# Creating Enriched Heatmap 
plotHeatmap -m computeMatrix_output.gz \ #name of the output of computeMatrix 
            -out EnrichedHeatmap.png \ #name of heatmap image
            --whatToShow "heatmap and colorbar" \ #what to include in the image
            --regionsLabel " " \ #row labels
            --xAxisLabel " " \ #x-axis label
            --refPointLabel "Boundary" \ #name of the center point
            --samplesLabel "TAD-caller" \ #column label
            --heatmapHeight 20 \ #height dimension in cm
            --heatmapWidth 5 \ #width dimension in cm
            --dpi 300 \ #dpi of saved image
            
# Creating profile plot
plotProfile -m computeMatrix_output.gz \ #name of the output of computeMatrix 
            -out ProfilePlot.png \ #name of profile plot image
            --colors blue \ #color of signal curve
            --regionsLabel " " \ #row labels
            --samplesLabel "TAD-caller" \ #column label
            --refPointLabel "Boundary" \ #name of the center point
            --plotHeight 7 \ #height dimension in cm
            --plotWidth 7 \ #width dimension in cm
            --dpi 300 \ #dpi of saved image

Comparing preciseTAD with Arrowhead

We extend Section 6.4 by using preciseTAD to predict TAD boundaries on all of chromosome 14 for both GM12878 and K562 cell lines. This can be done by simply replacing the chromCoords parameter with NULL. We have provided the boundaries predicted by preciseTAD here and the Arrowhead called boundaries here. When trained using Arrowhead ground truth boundaries at 5 kb, preciseTAD predicted a total of 440 and 243 TAD boundaries on CHR14 for GM12878 and K562 respectively (Table 3). For comparison, Arrowhead identified a total of 555 and 240 TAD boundaries on CHR14 for GM12878 and K562 respectively (Table 3). We see that even though preciseTAD predicted less boundaries than Arrowhead, CTCF signal is much more colocalized around preciseTAD-predicted TAD boundaries compared to Arrowhead-called TAD boundaries (Figure 6A). This can be attributed to Arrowhead’s inflation of called TADs at 5 kb. Additionally, preciseTAD boundaries are statistically significantly closer to CTCF sites (Figure 6B). These results indicate that preciseTAD-predicted boundaries better reflected the known biology of boundary formation.

	Arrowhead	preciseTAD
GM12878	555	440
K562	240	243

Table 3. TAD boundaries either called (Arrowhead) or predicted (preciseTAD) on CHR14.

Figure 6. Levels of CTCF enrichment around preciseTAD-predicted boundaries are greater than Arrowhead-called boundaries.

Previous studies suggest that TAD boundaries are conserved across cell lines (Dixon et al. 2012; Krefting, Andrade-Navarro, and Ibn-Salem 2018; Nora et al. 2012; Sexton et al. 2012). To assess the level of cross-cell-line conservation, we can evaluate the overlap between cell line-specific boundaries detected by preciseTAD and Arrowhead. To do so, preciseTAD-predicted and Arrowhead-called TAD boundaries on CHR14 were first flanked on both sides by 5 kb. Only 21% of boundaries were conserved between cell lines for Arrowhead (J=0.16) (Figure 7A), while 37% of preciseTAD-predicted boundaries were conserved between GM12878 and K562 cell lines (J=0.32) (Figure 7B). The higher level of conservation for preciseTAD-predicted boundaries further supports the notion of their higher biological relevance.

Figure 7. preciseTAD-predicted TAD boundaries are more conserved across cell lines.

Cross cell-line prediction

The preciseTAD algorithm can be extended to predict boundaries on cell lines that do not currently have high resolution HiC data using publicly available cell-line specific annotations. We highlight this technique by evaluating the following scenario (for CHR14): training on GM12878 and predicting boundaries on GM12878 (GM on GM) vs. training on K562 and predicting on GM12878 (K on GM). Using the preciseTAD framework, 74% (J=0.68) of predicted boundaries overlapped in the cross-cell-line prediction scenario (Figure 8A). Furthermore, boundaries predicted on unseen annotation data exhibit a similar level of enrichment for CTCF as did those trained and predicted on the same cell line (Figure 7B). These results indicate that preciseTAD pre-trained models can be successfully used to predict domain boundaries for cell lines lacking Hi-C data but for which genome annotation data is available.

Figure 8. Cross cell-line predicted boundaries strongly overlap with same-cell-line predicted boundaries.

Conclusion

In this workshop we demonstrate the usefulness of preciseTAD, an R package that provides users with a data-driven approach toward TAD boundary prediction. preciseTAD leverages a random forest (RF) classification model built on low-resolution domain boundaries obtained from domain calling tools, and high-resolution genomic annotations as the feature space. preciseTAD predicts the probability of each base being a boundary, and identifies the precise location of boundary regions and the most likely boundary points among them. We benchmarked preciseTAD against Arrowhead (Durand, Robinson, et al. 2016), an established TAD-caller. We showed that, unlike Arrowhead, preciseTAD does not inflate the number of predicted boundaries, providing only the most biologically meaningful boundaries that demarcate regions of high inter-chromosomal interactions. preciseTAD boundaries were more enriched for known architectural transcription factors, such as CTCF. Likewise, preciseTAD boundaries were more conserved between GM12878 and K562 cell lines, a known feature among the 3D architecture of the human genome. Lastly, we show that cell-line-specific functional genomic annotation data (ChIP-seq) can be used to precisely predict domain boundaries across cell lines. Pre-trained models are publicly available on Bioconductor in the ExperimentHub package preciseTADhub, here. This creates new opportunities for users to predict domain boundaries on cell lines without relying on high-resolution Hi-C data for which there is none available.

References

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32.

Dixon, Jesse R, Inkyung Jung, Siddarth Selvaraj, Yin Shen, Jessica E Antosiewicz-Bourget, Ah Young Lee, Zhen Ye, et al. 2015. “Chromatin Architecture Reorganization During Stem Cell Differentiation.” Nature 518 (7539): 331.

Dixon, Jesse R, Siddarth Selvaraj, Feng Yue, Audrey Kim, Yan Li, Yin Shen, Ming Hu, Jun S Liu, and Bing Ren. 2012. “Topological Domains in Mammalian Genomes Identified by Analysis of Chromatin Interactions.” Nature 485 (7398): 376.

Durand, Neva C, James T Robinson, Muhammad S Shamim, Ido Machol, Jill P Mesirov, Eric S Lander, and Erez Lieberman Aiden. 2016. “Juicebox Provides a Visualization System for Hi-c Contact Maps with Unlimited Zoom.” Cell Systems 3 (1): 99–101.

Durand, Neva C, Muhammad S Shamim, Ido Machol, Suhas SP Rao, Miriam H Huntley, Eric S Lander, and Erez Lieberman Aiden. 2016. “Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-c Experiments.” Cell Systems 3 (1): 95–98.

Ester, Martin, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, and others. 1996. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.” In Kdd, 96:226–31. 34.

Fudenberg, Geoffrey, Maxim Imakaev, Carolyn Lu, Anton Goloborodko, Nezar Abdennur, and Leonid A Mirny. 2016. “Formation of Chromosomal Domains by Loop Extrusion.” Cell Reports 15 (9): 2038–49.

Hahsler, Michael, Matthew Piekenbrock, and Derek Doran. 2019. “Dbscan: Fast Density-Based Clustering with R.” Journal of Statistical Software 25: 409–16.

Hansen, Anders S, Claudia Cattoglio, Xavier Darzacq, and Robert Tjian. 2018. “Recent Evidence That Tads and Chromatin Loops Are Dynamic Structures.” Nucleus 9 (1): 20–32.

Hnisz, Denes, Abraham S Weintraub, Daniel S Day, Anne-Laure Valton, Rasmus O Bak, Charles H Li, Johanna Goldmann, et al. 2016. “Activation of Proto-Oncogenes by Disruption of Chromosome Neighborhoods.” Science 351 (6280): 1454–8.

Krefting, Jan, Miguel A Andrade-Navarro, and Jonas Ibn-Salem. 2018. “Evolutionary Stability of Topologically Associating Domains Is Associated with Conserved Gene Regulation.” BMC Biology 16 (1): 1–12.

Kuhn, Max. 2012. “The Caret Package.” R Foundation for Statistical Computing, Vienna, Austria. URL Https://Cran. R-Project. Org/Package= Caret.

Liaw, Andy, Matthew Wiener, and others. 2002. “Classification and Regression by randomForest.” R News 2 (3): 18–22.

Lieberman-Aiden, Erez, Nynke L Van Berkum, Louise Williams, Maxim Imakaev, Tobias Ragoczy, Agnes Telling, Ido Amit, et al. 2009. “Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome.” Science 326 (5950): 289–93.

Lupiáñez, Darı́o G, Malte Spielmann, and Stefan Mundlos. 2016. “Breaking Tads: How Alterations of Chromatin Domains Result in Disease.” Trends in Genetics 32 (4): 225–37.

Meaburn, Karen J, Erik Cabuy, Gisele Bonne, Nicolas Levy, Glenn E Morris, Giuseppe Novelli, Ian R Kill, and Joanna M Bridger. 2007. “Primary Laminopathy Fibroblasts Display Altered Genome Organization and Apoptosis.” Aging Cell 6 (2): 139–53. https://doi.org/10.1111/j.1474-9726.2007.00270.x.

Nora, Elphège P, Bryan R Lajoie, Edda G Schulz, Luca Giorgetti, Ikuhiro Okamoto, Nicolas Servant, Tristan Piolot, et al. 2012. “Spatial Partitioning of the Regulatory Landscape of the X-Inactivation Centre.” Nature 485 (7398): 381.

Ramirez, Fidel, Devon P Ryan, Bjorn Gruning, Vivek Bhardwaj, Fabian Kilpert, Andreas S Richter, Steffen Heyne, Friederike Dundar, and Thomas Manke. 2016. “DeepTools2: A Next Generation Web Server for Deep-Sequencing Data Analysis.” Nucleic Acids Research 44 (W1): W160–W165.

Rao, Suhas SP, Miriam H Huntley, Neva C Durand, Elena K Stamenova, Ivan D Bochkov, James T Robinson, Adrian L Sanborn, et al. 2014. “A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping.” Cell 159 (7): 1665–80.

Sanborn, Adrian L, Suhas SP Rao, Su-Chen Huang, Neva C Durand, Miriam H Huntley, Andrew I Jewett, Ivan D Bochkov, et al. 2015. “Chromatin Extrusion Explains Key Features of Loop and Domain Formation in Wild-Type and Engineered Genomes.” Proceedings of the National Academy of Sciences 112 (47): E6456–E6465.

Schmitt, Anthony D, Ming Hu, Inkyung Jung, Zheng Xu, Yunjiang Qiu, Catherine L Tan, Yun Li, et al. 2016. “A Compendium of Chromatin Contact Maps Reveals Spatially Active Regions in the Human Genome.” Cell Rep 17 (8): 2042–59. https://doi.org/10.1016/j.celrep.2016.10.061.

Sexton, Tom, Eitan Yaffe, Ephraim Kenigsberg, Frédéric Bantignies, Benjamin Leblanc, Michael Hoichman, Hugues Parrinello, Amos Tanay, and Giacomo Cavalli. 2012. “Three-Dimensional Folding and Functional Organization Principles of the Drosophila Genome.” Cell 148 (3): 458–72.

Sun, James H, Linda Zhou, Daniel J Emerson, Sai A Phyo, Katelyn R Titus, Wanfeng Gong, Thomas G Gilgenast, et al. 2018. “Disease-Associated Short Tandem Repeats Co-Localize with Chromatin Domain Boundaries.” Cell 175 (1): 224–238.e15. https://doi.org/10.1016/j.cell.2018.08.005.

Taberlay, Phillippa C, Joanna Achinger-Kawecka, Aaron TL Lun, Fabian A Buske, Kenneth Sabir, Cathryn M Gould, Elena Zotenko, et al. 2016. “Three-Dimensional Disorganization of the Cancer Genome Occurs Coincident with Long-Range Genetic and Epigenetic Alterations.” Genome Research 26 (6): 719–31.

preciseTAD: A machine learning framework for precise 3D domain boundary prediction at base-level resolution

Spiro Stilianoudakis

Mikhail Dozmorov

December 8, 2020

Installation

Introduction

Obtaining ground truth TAD boundaries

Obtaining functional genomic elements

Describing preciseTAD functionality

The preciseTAD R package

Model building

Optimization and model evaluation

Reading in model objects using `preciseTADhub` (must be using R verion >= 4.1 or R-devel)

Boundary prediction with preciseTAD

Using preciseTAD with Juicebox

Techniques for visualizing TAD boundaries

Comparing preciseTAD with Arrowhead

Cross cell-line prediction

Conclusion

References

preciseTAD: A machine learning framework for precise 3D domain boundary prediction at base-level resolution

Spiro Stilianoudakis

Mikhail Dozmorov

December 8, 2020

Installation

Introduction

Obtaining ground truth TAD boundaries

Obtaining functional genomic elements

Describing preciseTAD functionality

The preciseTAD R package

Model building

Optimization and model evaluation

Reading in model objects using preciseTADhub (must be using R verion >= 4.1 or R-devel)

Boundary prediction with preciseTAD

Using preciseTAD with Juicebox

Techniques for visualizing TAD boundaries

Comparing preciseTAD with Arrowhead

Cross cell-line prediction

Conclusion

References

Reading in model objects using `preciseTADhub` (must be using R verion >= 4.1 or R-devel)