A wrapper function around caret::train that fits a random forest classifier, built and tested on user-defined binned domain data from createTADdata.

TADrandomForest(
  trainData,
  testData = NULL,
  tuneParams = list(mtry = ceiling(sqrt(ncol(trainData) - 1)), ntree = 500,
    nodesize = 1),
  cvFolds = 3,
  cvMetric = "Accuracy",
  verbose = FALSE,
  model = TRUE,
  importances = TRUE,
  impMeasure = "MDA",
  performances = FALSE
)

Arguments

trainData: Data frame, the binned data matrix used to build the random forest classifier (can be obtained using createTADdata). Required.
testData: Data frame, the binned data matrix used to test the random forest classifier (can be obtained using createTADdata). The first column must be a factor with positive class "Yes". Default is NULL, in which case performance is not evaluated.
tuneParams: List, providing mtry, ntree, and nodesize parameters to feed into randomForest. Default is list(mtry = ceiling(sqrt(ncol(trainData) - 1)), ntree = 500, nodesize = 1). If multiple values are provided, then a grid search is performed to tune the model. Required.
cvFolds: Numeric, the number of folds (k) to use in the k-fold cross-validation that tunes the hyperparameters. Default is 3. Required.
cvMetric: Character, performance metric used to choose the optimal tuning parameters (one of "Kappa", "Accuracy", "MCC", "ROC", "Sens", "Spec", "Pos Pred Value", "Neg Pred Value"). Default is "Accuracy".
verbose: Logical, controls whether details about the modeling should be printed. Default is FALSE.
model: Logical, whether to keep the model object. Default is TRUE.
importances: Logical, whether to extract variable importances. Default is TRUE.
impMeasure: Character, the variable importance measure to use (either "MDA" (mean decrease in accuracy) or "MDG" (mean decrease in Gini)). Default is "MDA". Ignored if importances = FALSE.
performances: Logical, whether various performance metrics should be extracted when validating the model on the test data. Default is FALSE. Ignored if testData = NULL.
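
To make the interaction between tuneParams and cvFolds concrete, the following base R sketch works out the default mtry and the size of the tuning grid for the values used in the example on this page. It is an illustration only: it assumes 26 predictors plus one response column, and it assumes the grid is the Cartesian product of the supplied values. How TADrandomForest builds the grid internally is not shown here. The names nPredictors, defaultMtry, tuneGrid, and nCvFits are hypothetical.

```r
# Assumption: 26 predictor columns plus one response column in trainData
nPredictors <- 26

# Default mtry, equivalent to ceiling(sqrt(ncol(trainData) - 1))
defaultMtry <- ceiling(sqrt(nPredictors))

# Candidate values taken from the example call on this page
tuneParams <- list(mtry = c(2, 5, 8, 10, 13, 16, 18, 21, 24, 26),
                   ntree = 500,
                   nodesize = 1)

# Assumption: the search grid is the Cartesian product of the supplied values
tuneGrid <- expand.grid(tuneParams)

# Each grid row is fitted once per fold during cross-validation
cvFolds <- 3
nCvFits <- nrow(tuneGrid) * cvFolds

print(defaultMtry)
print(nrow(tuneGrid))
print(nCvFits)
```

Under these assumptions the grid has 10 rows and cross-validation performs 30 fits, which agrees with the 30 "+ Fold" lines printed when verbose = TRUE in the example; one further fit on the full training set follows once the best combination is selected.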

Value

A list containing: 1) a train object from caret with model information, 2) a data.frame of variable importance for each feature included in the model, and 3) a data.frame of various performance metrics.

Examples

# Read in ARROWHEAD-called TADs at 5kb
data(arrowhead_gm12878_5kb)

# Extract unique boundaries
bounds.GR <- extractBoundaries(domains.mat = arrowhead_gm12878_5kb,
                               filter = FALSE,
                               CHR = c("CHR21", "CHR22"),
                               resolution = 5000)

# Read in GRangesList of 26 TFBS
data(tfbsList)

# Create the binned data matrix for CHR21 (training) and CHR22 (testing)
# using 5 kb binning, distance-type predictors from 26 different TFBS from
# the GM12878 cell line, and random under-sampling
tadData <- createTADdata(bounds.GR = bounds.GR,
                         resolution = 5000,
                         genomicElements.GR = tfbsList,
                         featureType = "distance",
                         resampling = "rus",
                         trainCHR = "CHR21",
                         predictCHR = "CHR22")

# Perform random forest using TADrandomForest by tuning mtry over 10 values
# using 3-fold CV
tadModel <- TADrandomForest(trainData = tadData[[1]],
                            testData = tadData[[2]],
                            tuneParams = list(mtry = c(2,5,8,10,13,16,18,21,24,26),
                                            ntree = 500,
                                            nodesize = 1),
                            cvFolds = 3,
                            cvMetric = "Accuracy",
                            verbose = TRUE,
                            model = TRUE,
                            importances = TRUE,
                            impMeasure = "MDA",
                            performances = TRUE)
#> Loading required package: ggplot2
#> Loading required package: lattice
#> + Fold1: mtry= 2, ntree=500, nodesize=1 
#> - Fold1: mtry= 2, ntree=500, nodesize=1 
#> + Fold1: mtry= 5, ntree=500, nodesize=1 
#> - Fold1: mtry= 5, ntree=500, nodesize=1 
#> + Fold1: mtry= 8, ntree=500, nodesize=1 
#> - Fold1: mtry= 8, ntree=500, nodesize=1 
#> + Fold1: mtry=10, ntree=500, nodesize=1 
#> - Fold1: mtry=10, ntree=500, nodesize=1 
#> + Fold1: mtry=13, ntree=500, nodesize=1 
#> - Fold1: mtry=13, ntree=500, nodesize=1 
#> + Fold1: mtry=16, ntree=500, nodesize=1 
#> - Fold1: mtry=16, ntree=500, nodesize=1 
#> + Fold1: mtry=18, ntree=500, nodesize=1 
#> - Fold1: mtry=18, ntree=500, nodesize=1 
#> + Fold1: mtry=21, ntree=500, nodesize=1 
#> - Fold1: mtry=21, ntree=500, nodesize=1 
#> + Fold1: mtry=24, ntree=500, nodesize=1 
#> - Fold1: mtry=24, ntree=500, nodesize=1 
#> + Fold1: mtry=26, ntree=500, nodesize=1 
#> - Fold1: mtry=26, ntree=500, nodesize=1 
#> + Fold2: mtry= 2, ntree=500, nodesize=1 
#> - Fold2: mtry= 2, ntree=500, nodesize=1 
#> + Fold2: mtry= 5, ntree=500, nodesize=1 
#> - Fold2: mtry= 5, ntree=500, nodesize=1 
#> + Fold2: mtry= 8, ntree=500, nodesize=1 
#> - Fold2: mtry= 8, ntree=500, nodesize=1 
#> + Fold2: mtry=10, ntree=500, nodesize=1 
#> - Fold2: mtry=10, ntree=500, nodesize=1 
#> + Fold2: mtry=13, ntree=500, nodesize=1 
#> - Fold2: mtry=13, ntree=500, nodesize=1 
#> + Fold2: mtry=16, ntree=500, nodesize=1 
#> - Fold2: mtry=16, ntree=500, nodesize=1 
#> + Fold2: mtry=18, ntree=500, nodesize=1 
#> - Fold2: mtry=18, ntree=500, nodesize=1 
#> + Fold2: mtry=21, ntree=500, nodesize=1 
#> - Fold2: mtry=21, ntree=500, nodesize=1 
#> + Fold2: mtry=24, ntree=500, nodesize=1 
#> - Fold2: mtry=24, ntree=500, nodesize=1 
#> + Fold2: mtry=26, ntree=500, nodesize=1 
#> - Fold2: mtry=26, ntree=500, nodesize=1 
#> + Fold3: mtry= 2, ntree=500, nodesize=1 
#> - Fold3: mtry= 2, ntree=500, nodesize=1 
#> + Fold3: mtry= 5, ntree=500, nodesize=1 
#> - Fold3: mtry= 5, ntree=500, nodesize=1 
#> + Fold3: mtry= 8, ntree=500, nodesize=1 
#> - Fold3: mtry= 8, ntree=500, nodesize=1 
#> + Fold3: mtry=10, ntree=500, nodesize=1 
#> - Fold3: mtry=10, ntree=500, nodesize=1 
#> + Fold3: mtry=13, ntree=500, nodesize=1 
#> - Fold3: mtry=13, ntree=500, nodesize=1 
#> + Fold3: mtry=16, ntree=500, nodesize=1 
#> - Fold3: mtry=16, ntree=500, nodesize=1 
#> + Fold3: mtry=18, ntree=500, nodesize=1 
#> - Fold3: mtry=18, ntree=500, nodesize=1 
#> + Fold3: mtry=21, ntree=500, nodesize=1 
#> - Fold3: mtry=21, ntree=500, nodesize=1 
#> + Fold3: mtry=24, ntree=500, nodesize=1 
#> - Fold3: mtry=24, ntree=500, nodesize=1 
#> + Fold3: mtry=26, ntree=500, nodesize=1 
#> - Fold3: mtry=26, ntree=500, nodesize=1 
#> Aggregating results
#> Selecting tuning parameters
#> Fitting mtry = 16, ntree = 500, nodesize = 1 on full training set