Before feeding variables to an ML algorithm, you often need a rigorous way to determine which ones are actually important.

A good way to select the important features is recursive feature elimination (RFE).

RFE works in 3 broad steps:

Step 1: Build an ML model on the training dataset and estimate the feature importances on the test dataset.

Step 2: Keeping the most important variables, iterate through models of the given subset sizes, recalculating the predictor ranking in each iteration.

Step 3: Compare model performance across the different subset sizes to arrive at the optimal number and list of final predictors.
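The three steps above can be sketched in plain R. This is a conceptual illustration on synthetic data (the data frame `dat` and response `y` are made up here), not caret's internal implementation; it assumes the randomForest package is installed:

```r
library(randomForest)

# Hypothetical data: 5 predictors, binary response driven by X1
set.seed(1)
dat <- data.frame(matrix(rnorm(200 * 5), ncol = 5))
y   <- factor(ifelse(dat$X1 + rnorm(200) > 0, "yes", "no"))

sizes <- c(1, 2, 3, 5)
acc   <- numeric(length(sizes))

# Step 1: fit a full model and rank predictors by importance
full   <- randomForest(dat, y, importance = TRUE)
ranked <- names(sort(importance(full)[, "MeanDecreaseAccuracy"],
                     decreasing = TRUE))

# Step 2: refit a model on the top-k predictors for each candidate size
for (i in seq_along(sizes)) {
  keep   <- ranked[seq_len(sizes[i])]
  fit    <- randomForest(dat[, keep, drop = FALSE], y)
  acc[i] <- 1 - tail(fit$err.rate[, "OOB"], 1)  # out-of-bag accuracy
}

# Step 3: pick the subset size with the best performance
best_size <- sizes[which.max(acc)]
```

caret's `rfe()` wraps this loop in resampling (e.g. repeated cross-validation) so the accuracy estimates are less optimistic than a single split.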

Load Package And Data

# Load Package And Data
load("../../data/craet_4.Rdata")
library(tidyverse)
library(caret)
#Set Parallel Processing - Decrease computation time
if (!require("doMC")) install.packages("doMC")
library(doMC)
registerDoMC(cores = 4)

Feature Selection

set.seed(100)
options(warn=-1)

subsets <- c(1:5, 10, 15, 18)

# Step 1: Build an ML model on the training dataset and estimate the feature importances on the test dataset.
ctrl <- rfeControl(functions = rfFuncs,
                   method = "repeatedcv", # repeated K-fold cross-validation
                   number = 10,           # 10 folds
                   repeats = 5,           # five separate 10-fold cross-validations
                   verbose = FALSE)
# Step 2: Keeping the most important variables, iterate through models of the given subset sizes; the predictor ranking is recalculated in each iteration.
lmProfile <- rfe(x=trainData[, 1:18], y=trainData$Purchase,
                 sizes = subsets,
                 rfeControl = ctrl)

# Step 3: Compare model performance across subset sizes to arrive at the optimal number and list of final predictors.
lmProfile
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy  Kappa AccuracySD KappaSD Selected
##          1   0.7442 0.4569    0.04125 0.08753         
##          2   0.8124 0.6031    0.04002 0.08505         
##          3   0.8182 0.6136    0.04170 0.08790        *
##          4   0.8047 0.5879    0.04314 0.08993         
##          5   0.8000 0.5770    0.04215 0.08861         
##         10   0.8035 0.5826    0.04112 0.08815         
##         15   0.8089 0.5918    0.04209 0.09076         
##         18   0.8084 0.5918    0.04118 0.08894         
## 
## The top 3 variables (out of 3):
##    LoyalCH, PriceDiff, StoreID
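Beyond printing the object, caret provides helpers to pull out the winning predictors and to visualize accuracy against subset size (shown here as typical usage of the fitted `lmProfile` object):

```r
# Names of the variables in the selected subset
predictors(lmProfile)

# Accuracy profile across subset sizes (grid + points-and-lines)
plot(lmProfile, type = c("g", "o"))
```

The plot makes the peak at 3 variables easy to spot, and `predictors()` returns the same LoyalCH, PriceDiff, StoreID trio listed above.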

Input

  • sizes: the model sizes (numbers of most important features) that rfe should consider

  • rfeControl():
    • functions: the type of algorithm used to fit and rank features; rfFuncs is random-forest based
    • method: "repeatedcv" selects repeated K-fold cross-validation
    • number: 10 folds per cross-validation
    • repeats: the 10-fold cross-validation is repeated five times
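The `functions` argument is not limited to random forests; caret ships several built-in helper sets (e.g. lmFuncs, treebagFuncs, nbFuncs). As a sketch, swapping in bagged trees only changes that one argument (`ctrl_tb` is a name chosen here for illustration):

```r
library(caret)

# Same resampling scheme as before, but ranking/fitting with bagged trees
ctrl_tb <- rfeControl(functions = treebagFuncs,
                      method  = "repeatedcv",
                      number  = 10,
                      repeats = 5,
                      verbose = FALSE)
```

This object would then be passed to `rfe()` via `rfeControl = ctrl_tb` exactly as in the code above.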

Output

The output shows:

  • Accuracy and Kappa (with their standard deviations) for each model size we provided
  • The final selected subset size, marked with a * in the rightmost Selected column

save.image("../../data/craet_5.Rdata")
sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  methods   stats     graphics  grDevices utils     datasets 
## [8] base     
## 
## other attached packages:
##  [1] doMC_1.3.5      iterators_1.0.9 foreach_1.4.4   caret_6.0-78   
##  [5] lattice_0.20-35 forcats_0.3.0   stringr_1.3.0   dplyr_0.7.4    
##  [9] purrr_0.2.4     readr_1.1.1     tidyr_0.8.0     tibble_1.4.2   
## [13] ggplot2_2.2.1   tidyverse_1.2.1
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.3.1          ddalpha_1.3.1.1     sfsmisc_1.1-2      
##  [4] jsonlite_1.5        splines_3.4.3       prodlim_1.6.1      
##  [7] modelr_0.1.1        assertthat_0.2.0    stats4_3.4.3       
## [10] DRR_0.0.3           cellranger_1.1.0    yaml_2.1.18        
## [13] robustbase_0.92-8   ipred_0.9-6         pillar_1.2.1       
## [16] backports_1.1.2     glue_1.2.0          digest_0.6.15      
## [19] randomForest_4.6-12 rvest_0.3.2         colorspace_1.3-2   
## [22] recipes_0.1.2       htmltools_0.3.6     Matrix_1.2-12      
## [25] plyr_1.8.4          psych_1.7.8         timeDate_3043.102  
## [28] pkgconfig_2.0.1     CVST_0.2-1          broom_0.4.3        
## [31] haven_1.1.1         bookdown_0.7        scales_0.5.0.9000  
## [34] gower_0.1.2         lava_1.6            withr_2.1.1.9000   
## [37] nnet_7.3-12         lazyeval_0.2.1      cli_1.0.0          
## [40] mnormt_1.5-5        survival_2.41-3     magrittr_1.5       
## [43] crayon_1.3.4        readxl_1.0.0        evaluate_0.10.1    
## [46] nlme_3.1-131.1      MASS_7.3-49         xml2_1.2.0         
## [49] dimRed_0.1.0        foreign_0.8-69      class_7.3-14       
## [52] blogdown_0.5        tools_3.4.3         hms_0.4.2          
## [55] kernlab_0.9-25      munsell_0.4.3       bindrcpp_0.2       
## [58] e1071_1.6-8         compiler_3.4.3      RcppRoll_0.2.2     
## [61] rlang_0.2.0.9000    grid_3.4.3          rmarkdown_1.9      
## [64] gtable_0.2.0        ModelMetrics_1.1.0  codetools_0.2-15   
## [67] reshape2_1.4.3      R6_2.2.2            lubridate_1.7.3    
## [70] knitr_1.20          bindr_0.1.1         rprojroot_1.3-2    
## [73] stringi_1.1.7       Rcpp_0.12.16        rpart_4.1-13       
## [76] tidyselect_0.2.4    DEoptimR_1.0-8      xfun_0.1