A wrapper around caret::train that fits a random forest classifier, built and tested on user-defined binned domain data from createTADdata.

TADrandomForest(
  trainData,
  testData = NULL,
  tuneParams = list(mtry = ceiling(sqrt(ncol(trainData) - 1)),
                    ntree = 500,
                    nodesize = 1),
  cvFolds = 3,
  cvMetric = "Accuracy",
  verbose = FALSE,
  model = TRUE,
  importances = TRUE,
  impMeasure = "MDA",
  performances = FALSE
)

Arguments

trainData

Data frame, the binned data matrix used to build the random forest classifier (can be obtained using createTADdata). Required.

testData

Data frame, the binned data matrix used to test the random forest classifier (can be obtained using createTADdata). The first column must be a factor with positive class "Yes". Default is NULL, in which case no performance metrics are evaluated.
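The layout requirement above can be checked before calling the function. The following is a minimal sketch on a toy data frame; the column names `y` and `CTCF_dist` are hypothetical stand-ins, not names guaranteed by createTADdata.

```r
# Toy stand-in for a createTADdata test set; column names are hypothetical.
testData <- data.frame(
  y = factor(c("Yes", "No", "No", "Yes"), levels = c("No", "Yes")),
  CTCF_dist = c(1200, 56000, 87000, 300)
)

# Check the layout described above: the first column must be a factor
# whose levels include the positive class "Yes".
response <- testData[[1]]
validResponse <- is.factor(response) && "Yes" %in% levels(response)
stopifnot(validResponse)
```

If the check fails, converting the first column with factor() and making sure the positive label is spelled "Yes" is the likely fix.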

tuneParams

List providing the mtry, ntree, and nodesize parameters passed to randomForest. If multiple values are provided for any parameter, a grid search over all combinations is performed to tune the model. Default is list(mtry = ceiling(sqrt(ncol(trainData) - 1)), ntree = 500, nodesize = 1). Required.
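To make the default and the grid search concrete, here is a small base R sketch. It assumes a toy training set with one response column and 26 predictors, and it uses expand.grid only to illustrate what "all combinations" means; it does not claim to reproduce how TADrandomForest builds its grid internally.

```r
# Toy training set: 1 response column + 26 predictor columns (hypothetical data).
trainData <- data.frame(y = factor(c("Yes", "No")),
                        matrix(0, nrow = 2, ncol = 26))

# Default mtry from the usage: square root of the number of predictors, rounded up.
defaultMtry <- ceiling(sqrt(ncol(trainData) - 1))  # 6 for 26 predictors

# Supplying several mtry values yields one candidate model per combination.
tuneParams <- list(mtry = c(2, defaultMtry, 10), ntree = 500, nodesize = 1)
tuneGrid <- expand.grid(tuneParams)
print(tuneGrid)
```

Here the grid has three rows because only mtry varies; adding a second ntree value would double it.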

cvFolds

Numeric, the number of folds (k) used in k-fold cross-validation to tune the hyperparameters. Default is 3. Required.
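As a rough illustration of what k-fold cross-validation does with cvFolds = 3, the sketch below assigns 12 toy rows to 3 folds in base R. The actual fold assignment is handled by caret and may differ (for example, by stratifying on the response); the variable names here are hypothetical.

```r
set.seed(123)
n <- 12  # number of rows in a toy training set
k <- 3   # plays the role of cvFolds

# Randomly assign each row to one of k equally sized folds.
foldId <- sample(rep(seq_len(k), length.out = n))

heldOutSizes <- integer(k)
for (fold in seq_len(k)) {
  holdOut <- which(foldId == fold)           # rows used for validation
  trainIdx <- setdiff(seq_len(n), holdOut)   # remaining rows used for fitting
  heldOutSizes[fold] <- length(holdOut)
  cat("Fold", fold, ": fit on", length(trainIdx),
      "rows, validate on", length(holdOut), "rows\n")
}
```

Each candidate parameter combination is fitted once per fold, which is why the verbose log in the Examples shows every mtry value repeated for Fold1, Fold2, and Fold3.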

cvMetric

Character, the performance metric used to choose the optimal tuning parameters (one of "Kappa", "Accuracy", "MCC", "ROC", "Sens", "Spec", "Pos Pred Value", or "Neg Pred Value"). Default is "Accuracy".
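For readers unfamiliar with some of these metrics, the sketch below computes four of them from a toy confusion matrix using their standard textbook definitions. The counts are made up for illustration, and this is not the code the package or caret uses to compute them.

```r
# Toy confusion-matrix counts with "Yes" as the positive class (made-up numbers).
TP <- 40; FN <- 10   # true boundaries predicted Yes / No
FP <- 5;  TN <- 45   # non-boundaries predicted Yes / No

accuracy <- (TP + TN) / (TP + TN + FP + FN)
sens <- TP / (TP + FN)   # "Sens": true positive rate
spec <- TN / (TN + FP)   # "Spec": true negative rate

# Matthews correlation coefficient ("MCC")
mcc <- (TP * TN - FP * FN) /
  sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print(c(Accuracy = accuracy, Sens = sens, Spec = spec, MCC = mcc))
```

With imbalanced boundary data, metrics such as MCC or "ROC" can be more informative than "Accuracy", although resampling in createTADdata (for example "rus") also addresses imbalance.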

verbose

Logical, controls whether details about the modeling process are printed. Default is FALSE.

model

Logical, whether to keep the model object. Default is TRUE.

importances

Logical, whether to extract variable importances. Default is TRUE.

impMeasure

Character, the variable importance measure to use: either "MDA" (mean decrease in accuracy) or "MDG" (mean decrease in Gini). Default is "MDA". Ignored if importances = FALSE.

performances

Logical, indicates whether various performance metrics should be extracted when validating the model on the test data. Default is FALSE. Ignored if testData = NULL.

Value

A list containing: 1) a train object from caret with model information, 2) a data.frame of variable importance for each feature included in the model, and 3) a data.frame of various performance metrics.

Examples

# Read in ARROWHEAD-called TADs at 5kb
data(arrowhead_gm12878_5kb)

# Extract unique boundaries
bounds.GR <- extractBoundaries(domains.mat = arrowhead_gm12878_5kb,
                               filter = FALSE,
                               CHR = c("CHR21", "CHR22"),
                               resolution = 5000)

# Read in GRangesList of 26 TFBS
data(tfbsList)

# Create the binned data matrix for CHR21 (training) and CHR22 (testing)
# using 5 kb binning, distance-type predictors from 26 different TFBS from
# the GM12878 cell line, and random under-sampling
tadData <- createTADdata(bounds.GR = bounds.GR,
                         resolution = 5000,
                         genomicElements.GR = tfbsList,
                         featureType = "distance",
                         resampling = "rus",
                         trainCHR = "CHR21",
                         predictCHR = "CHR22")

# Perform random forest using TADrandomForest by tuning mtry over 10 values
# using 3-fold CV
tadModel <- TADrandomForest(trainData = tadData[[1]],
                            testData = tadData[[2]],
                            tuneParams = list(mtry = c(2,5,8,10,13,16,18,21,24,26),
                                            ntree = 500,
                                            nodesize = 1),
                            cvFolds = 3,
                            cvMetric = "Accuracy",
                            verbose = TRUE,
                            model = TRUE,
                            importances = TRUE,
                            impMeasure = "MDA",
                            performances = TRUE)
#> Loading required package: ggplot2
#> Loading required package: lattice
#> + Fold1: mtry= 2, ntree=500, nodesize=1 
#> - Fold1: mtry= 2, ntree=500, nodesize=1 
#> + Fold1: mtry= 5, ntree=500, nodesize=1 
#> - Fold1: mtry= 5, ntree=500, nodesize=1 
#> + Fold1: mtry= 8, ntree=500, nodesize=1 
#> - Fold1: mtry= 8, ntree=500, nodesize=1 
#> + Fold1: mtry=10, ntree=500, nodesize=1 
#> - Fold1: mtry=10, ntree=500, nodesize=1 
#> + Fold1: mtry=13, ntree=500, nodesize=1 
#> - Fold1: mtry=13, ntree=500, nodesize=1 
#> + Fold1: mtry=16, ntree=500, nodesize=1 
#> - Fold1: mtry=16, ntree=500, nodesize=1 
#> + Fold1: mtry=18, ntree=500, nodesize=1 
#> - Fold1: mtry=18, ntree=500, nodesize=1 
#> + Fold1: mtry=21, ntree=500, nodesize=1 
#> - Fold1: mtry=21, ntree=500, nodesize=1 
#> + Fold1: mtry=24, ntree=500, nodesize=1 
#> - Fold1: mtry=24, ntree=500, nodesize=1 
#> + Fold1: mtry=26, ntree=500, nodesize=1 
#> - Fold1: mtry=26, ntree=500, nodesize=1 
#> + Fold2: mtry= 2, ntree=500, nodesize=1 
#> - Fold2: mtry= 2, ntree=500, nodesize=1 
#> + Fold2: mtry= 5, ntree=500, nodesize=1 
#> - Fold2: mtry= 5, ntree=500, nodesize=1 
#> + Fold2: mtry= 8, ntree=500, nodesize=1 
#> - Fold2: mtry= 8, ntree=500, nodesize=1 
#> + Fold2: mtry=10, ntree=500, nodesize=1 
#> - Fold2: mtry=10, ntree=500, nodesize=1 
#> + Fold2: mtry=13, ntree=500, nodesize=1 
#> - Fold2: mtry=13, ntree=500, nodesize=1 
#> + Fold2: mtry=16, ntree=500, nodesize=1 
#> - Fold2: mtry=16, ntree=500, nodesize=1 
#> + Fold2: mtry=18, ntree=500, nodesize=1 
#> - Fold2: mtry=18, ntree=500, nodesize=1 
#> + Fold2: mtry=21, ntree=500, nodesize=1 
#> - Fold2: mtry=21, ntree=500, nodesize=1 
#> + Fold2: mtry=24, ntree=500, nodesize=1 
#> - Fold2: mtry=24, ntree=500, nodesize=1 
#> + Fold2: mtry=26, ntree=500, nodesize=1 
#> - Fold2: mtry=26, ntree=500, nodesize=1 
#> + Fold3: mtry= 2, ntree=500, nodesize=1 
#> - Fold3: mtry= 2, ntree=500, nodesize=1 
#> + Fold3: mtry= 5, ntree=500, nodesize=1 
#> - Fold3: mtry= 5, ntree=500, nodesize=1 
#> + Fold3: mtry= 8, ntree=500, nodesize=1 
#> - Fold3: mtry= 8, ntree=500, nodesize=1 
#> + Fold3: mtry=10, ntree=500, nodesize=1 
#> - Fold3: mtry=10, ntree=500, nodesize=1 
#> + Fold3: mtry=13, ntree=500, nodesize=1 
#> - Fold3: mtry=13, ntree=500, nodesize=1 
#> + Fold3: mtry=16, ntree=500, nodesize=1 
#> - Fold3: mtry=16, ntree=500, nodesize=1 
#> + Fold3: mtry=18, ntree=500, nodesize=1 
#> - Fold3: mtry=18, ntree=500, nodesize=1 
#> + Fold3: mtry=21, ntree=500, nodesize=1 
#> - Fold3: mtry=21, ntree=500, nodesize=1 
#> + Fold3: mtry=24, ntree=500, nodesize=1 
#> - Fold3: mtry=24, ntree=500, nodesize=1 
#> + Fold3: mtry=26, ntree=500, nodesize=1 
#> - Fold3: mtry=26, ntree=500, nodesize=1 
#> Aggregating results
#> Selecting tuning parameters
#> Fitting mtry = 16, ntree = 500, nodesize = 1 on full training set