A wrapper function around caret::train that applies a random forest classification algorithm, built and tested on user-defined binned domain data from createTADdata.
trainData: Data frame, the binned data matrix used to build the random forest classifier (can be obtained using createTADdata). Required.
testData: Data frame, the binned data matrix used to test the random forest classifier (can be obtained using createTADdata). The first column must be a factor with positive class "Yes". Default is NULL, in which case no performance metrics are evaluated.
tuneParams: List, providing the mtry, ntree, and nodesize parameters to feed into randomForest. Default is list(mtry = ceiling(sqrt(ncol(trainData) - 1)), ntree = 500, nodesize = 1). If multiple values are provided, a grid search is performed to tune the model. Required.
cvFolds: Numeric, the number of folds (k) to use in k-fold cross-validation when tuning the hyperparameters. Required.
cvMetric: Character, the performance metric used to choose the optimal tuning parameters (one of "Kappa", "Accuracy", "MCC", "ROC", "Sens", "Spec", "Pos Pred Value", "Neg Pred Value"). Default is "Accuracy".
verbose: Logical, controls whether details of the modeling process are printed. Default is TRUE.
model: Logical, whether to keep the model object. Default is TRUE.
importances: Logical, whether to extract variable importances. Default is TRUE.
impMeasure: Character, the variable importance measure to use (either "MDA", mean decrease in accuracy, or "MDG", mean decrease in Gini). Ignored if importances = FALSE.
performances: Logical, whether performance metrics should be extracted when validating the model on the test data. Ignored if testData = NULL.
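To make the tuneParams behaviour concrete, the following base R sketch computes the documented default for mtry and expands a multi-valued parameter list into a grid of candidate models. The toy data frame and the use of expand.grid are illustrative assumptions; the grid is not necessarily built this way inside TADrandomForest.

```r
# Toy stand-in for a binned data matrix (hypothetical data):
# the response sits in column 1, followed by 26 predictors
trainData <- data.frame(
  y = factor(rep(c("Yes", "No"), 5), levels = c("Yes", "No")),
  matrix(0, nrow = 10, ncol = 26)
)

# Documented default: square root of the number of predictors, rounded up
defaultParams <- list(
  mtry     = ceiling(sqrt(ncol(trainData) - 1)),
  ntree    = 500,
  nodesize = 1
)
defaultParams$mtry  # ceiling(sqrt(26)) = 6

# Supplying several values for a parameter implies a grid search:
# every combination of the supplied values is one candidate model
tuneParams <- list(mtry = c(2, 5, 8), ntree = 500, nodesize = 1)
tuneGrid <- expand.grid(tuneParams)
nrow(tuneGrid)  # 3 candidate combinations
```

With a single value per parameter the grid has one row, so no tuning takes place and cross-validation only estimates the performance of that one model.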
A list containing: 1) a train object from caret with model information, 2) a data.frame of variable importance for each feature included in the model, and 3) a data.frame of various performance metrics.
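As a rough guide to what the threshold-based metrics named under cvMetric measure, the sketch below computes them from a 2x2 confusion matrix in base R, with "Yes" as the positive class. The label vectors are made up for illustration, and the exact contents and layout of the returned performance data.frame are not reproduced here. ROC is left out because it needs class probabilities rather than hard labels.

```r
# Hypothetical observed and predicted labels; "Yes" is the positive class
obs  <- factor(c("Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes"),
               levels = c("Yes", "No"))
pred <- factor(c("Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes"),
               levels = c("Yes", "No"))

# Confusion matrix cells, converted to numeric to avoid integer overflow
tab <- table(Predicted = pred, Observed = obs)
TP <- as.numeric(tab["Yes", "Yes"])
FP <- as.numeric(tab["Yes", "No"])
FN <- as.numeric(tab["No", "Yes"])
TN <- as.numeric(tab["No", "No"])
n  <- TP + FP + FN + TN

accuracy <- (TP + TN) / n
# Agreement expected by chance, used by Cohen's kappa
pe <- ((TP + FP) * (TP + FN) + (FN + TN) * (FP + TN)) / n^2

metrics <- c(
  Accuracy         = accuracy,
  Kappa            = (accuracy - pe) / (1 - pe),
  MCC              = (TP * TN - FP * FN) /
                       sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)),
  Sens             = TP / (TP + FN),
  Spec             = TN / (TN + FP),
  "Pos Pred Value" = TP / (TP + FP),
  "Neg Pred Value" = TN / (TN + FN)
)
round(metrics, 3)
```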
# Read in ARROWHEAD-called TADs at 5kb
data(arrowhead_gm12878_5kb)
# Extract unique boundaries
bounds.GR <- extractBoundaries(domains.mat = arrowhead_gm12878_5kb,
                               filter = FALSE,
                               CHR = c("CHR21", "CHR22"),
                               resolution = 5000)
# Read in GRangesList of 26 TFBS
data(tfbsList)
# Create the binned data matrix for CHR21 (training) and CHR22 (testing)
# using 5 kb binning, distance-type predictors from 26 different TFBS from
# the GM12878 cell line, and random under-sampling
tadData <- createTADdata(bounds.GR = bounds.GR,
                         resolution = 5000,
                         genomicElements.GR = tfbsList,
                         featureType = "distance",
                         resampling = "rus",
                         trainCHR = "CHR21",
                         predictCHR = "CHR22")
# Perform random forest using TADrandomForest by tuning mtry over 10 values
# using 3-fold CV
tadModel <- TADrandomForest(trainData = tadData[[1]],
                            testData = tadData[[2]],
                            tuneParams = list(mtry = c(2, 5, 8, 10, 13, 16, 18, 21, 24, 26),
                                              ntree = 500,
                                              nodesize = 1),
                            cvFolds = 3,
                            cvMetric = "Accuracy",
                            verbose = TRUE,
                            model = TRUE,
                            importances = TRUE,
                            impMeasure = "MDA",
                            performances = TRUE)
#> Loading required package: ggplot2
#> Loading required package: lattice
#> + Fold1: mtry= 2, ntree=500, nodesize=1
#> - Fold1: mtry= 2, ntree=500, nodesize=1
#> + Fold1: mtry= 5, ntree=500, nodesize=1
#> - Fold1: mtry= 5, ntree=500, nodesize=1
#> + Fold1: mtry= 8, ntree=500, nodesize=1
#> - Fold1: mtry= 8, ntree=500, nodesize=1
#> + Fold1: mtry=10, ntree=500, nodesize=1
#> - Fold1: mtry=10, ntree=500, nodesize=1
#> + Fold1: mtry=13, ntree=500, nodesize=1
#> - Fold1: mtry=13, ntree=500, nodesize=1
#> + Fold1: mtry=16, ntree=500, nodesize=1
#> - Fold1: mtry=16, ntree=500, nodesize=1
#> + Fold1: mtry=18, ntree=500, nodesize=1
#> - Fold1: mtry=18, ntree=500, nodesize=1
#> + Fold1: mtry=21, ntree=500, nodesize=1
#> - Fold1: mtry=21, ntree=500, nodesize=1
#> + Fold1: mtry=24, ntree=500, nodesize=1
#> - Fold1: mtry=24, ntree=500, nodesize=1
#> + Fold1: mtry=26, ntree=500, nodesize=1
#> - Fold1: mtry=26, ntree=500, nodesize=1
#> + Fold2: mtry= 2, ntree=500, nodesize=1
#> - Fold2: mtry= 2, ntree=500, nodesize=1
#> + Fold2: mtry= 5, ntree=500, nodesize=1
#> - Fold2: mtry= 5, ntree=500, nodesize=1
#> + Fold2: mtry= 8, ntree=500, nodesize=1
#> - Fold2: mtry= 8, ntree=500, nodesize=1
#> + Fold2: mtry=10, ntree=500, nodesize=1
#> - Fold2: mtry=10, ntree=500, nodesize=1
#> + Fold2: mtry=13, ntree=500, nodesize=1
#> - Fold2: mtry=13, ntree=500, nodesize=1
#> + Fold2: mtry=16, ntree=500, nodesize=1
#> - Fold2: mtry=16, ntree=500, nodesize=1
#> + Fold2: mtry=18, ntree=500, nodesize=1
#> - Fold2: mtry=18, ntree=500, nodesize=1
#> + Fold2: mtry=21, ntree=500, nodesize=1
#> - Fold2: mtry=21, ntree=500, nodesize=1
#> + Fold2: mtry=24, ntree=500, nodesize=1
#> - Fold2: mtry=24, ntree=500, nodesize=1
#> + Fold2: mtry=26, ntree=500, nodesize=1
#> - Fold2: mtry=26, ntree=500, nodesize=1
#> + Fold3: mtry= 2, ntree=500, nodesize=1
#> - Fold3: mtry= 2, ntree=500, nodesize=1
#> + Fold3: mtry= 5, ntree=500, nodesize=1
#> - Fold3: mtry= 5, ntree=500, nodesize=1
#> + Fold3: mtry= 8, ntree=500, nodesize=1
#> - Fold3: mtry= 8, ntree=500, nodesize=1
#> + Fold3: mtry=10, ntree=500, nodesize=1
#> - Fold3: mtry=10, ntree=500, nodesize=1
#> + Fold3: mtry=13, ntree=500, nodesize=1
#> - Fold3: mtry=13, ntree=500, nodesize=1
#> + Fold3: mtry=16, ntree=500, nodesize=1
#> - Fold3: mtry=16, ntree=500, nodesize=1
#> + Fold3: mtry=18, ntree=500, nodesize=1
#> - Fold3: mtry=18, ntree=500, nodesize=1
#> + Fold3: mtry=21, ntree=500, nodesize=1
#> - Fold3: mtry=21, ntree=500, nodesize=1
#> + Fold3: mtry=24, ntree=500, nodesize=1
#> - Fold3: mtry=24, ntree=500, nodesize=1
#> + Fold3: mtry=26, ntree=500, nodesize=1
#> - Fold3: mtry=26, ntree=500, nodesize=1
#> Aggregating results
#> Selecting tuning parameters
#> Fitting mtry = 16, ntree = 500, nodesize = 1 on full training set
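The verbose log above shows one fit per combination of fold and mtry value, followed by a single refit with the selected parameters. The base R sketch below reproduces that bookkeeping for the settings used in the example. The fold assignment at the end is a simplified random split over a made-up number of rows; caret's own fold creation may differ, for example by stratifying on the class label.

```r
# Settings taken from the example call above
mtryValues <- c(2, 5, 8, 10, 13, 16, 18, 21, 24, 26)
cvFolds <- 3

# One resampling fit per (fold, mtry) pair; expand.grid varies the first
# column fastest, which matches the order of the log (all mtry values
# within Fold1, then Fold2, then Fold3)
fits <- expand.grid(mtry = mtryValues, fold = seq_len(cvFolds))
nrow(fits)      # 30 resampling fits
nrow(fits) + 1  # 31 fits including the final refit on the full training set

# Simplified fold assignment for n training rows (n is hypothetical)
n <- 12
set.seed(1)
foldId <- sample(rep(seq_len(cvFolds), length.out = n))
table(foldId)   # each fold holds out n / cvFolds = 4 rows
```

Because the number of fits grows with the product of the grid size and cvFolds, a smaller grid or fewer folds shortens tuning time at the cost of a coarser search.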