5 How to do feature selection using recursive feature elimination
Before feeding variables to an ML algorithm, you may want a rigorous way to determine which of them are actually important. Dropping uninformative predictors can simplify the model and reduce training time.
Recursive feature elimination (RFE) is a good choice for selecting the important features.
RFE works in 3 broad steps:
Step 1: Build an ML model on a training dataset and estimate the feature importances on the test dataset. (That is, with the number of variables fixed, evaluate how important each variable is on the test data.)
Step 2: Keeping priority to the most important variables, iterate through by building models of given sizes. The ranking of the predictors is recalculated in each iteration. (That is, repeat the previous step for each model size.)
Step 3: The model performances are compared across different subset sizes to arrive at the optimal number and list of final predictors. (That is, compare the test error rates across model sizes and choose the best-sized model.)
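The three steps above can be sketched in base R. This is a minimal illustration and not caret's implementation: it assumes simulated data, a linear model ranked by absolute t statistics in place of a random forest, and a single hold-out split in place of repeated cross-validation. All object names (`x`, `y`, `sizes`, `subsets_kept`, `best_size`) are hypothetical.

```r
# Minimal RFE sketch in base R (illustrative; not how caret implements it).
set.seed(1)
n <- 200
x <- as.data.frame(matrix(rnorm(n * 6), ncol = 6))
names(x) <- paste0("v", 1:6)
y <- 3 * x$v1 - 2 * x$v2 + rnorm(n)   # only v1 and v2 carry signal

train_idx <- sample(n, 150)           # single hold-out split (assumption)
sizes <- c(6, 4, 2, 1)                # subset sizes, largest first

keep <- names(x)
results <- data.frame(size = integer(), rmse = numeric())
subsets_kept <- list()

for (s in sizes) {
  # Step 1: fit on the current variables and rank them by |t statistic|.
  train <- data.frame(y = y[train_idx], x[train_idx, keep, drop = FALSE])
  tv <- coef(summary(lm(y ~ ., data = train)))[, "t value"]
  tv <- abs(tv[names(tv) != "(Intercept)"])

  # Step 2: keep the top s variables and refit a model of that size.
  keep <- names(sort(tv, decreasing = TRUE))[seq_len(s)]
  train <- data.frame(y = y[train_idx], x[train_idx, keep, drop = FALSE])
  fit <- lm(y ~ ., data = train)

  # Score the size-s model on the held-out rows.
  pred <- predict(fit, newdata = x[-train_idx, keep, drop = FALSE])
  rmse <- sqrt(mean((y[-train_idx] - pred)^2))
  results <- rbind(results, data.frame(size = s, rmse = rmse))
  subsets_kept[[as.character(s)]] <- keep
}

# Step 3: compare performance across sizes and pick the best one.
best_size <- results$size[which.min(results$rmse)]
results
subsets_kept[[as.character(best_size)]]
```

Because the ranking is recomputed inside the loop, a variable that looks weak in the full model can still move up once its competitors have been removed.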
Load Package And Data
# Load Package And Data
load("../../data/craet_4.Rdata")
library(tidyverse)
library(caret)
#Set Parallel Processing - Decrease computation time
if (!require("doMC")) install.packages("doMC")
library(doMC)
registerDoMC(cores = 4)
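The hard-coded `cores = 4` assumes a machine with at least four cores. As a hedged alternative, the sketch below derives the worker count from `parallel::detectCores()` and registers doMC only when it is installed. The name `n_cores` is an assumption of this sketch. Note that doMC relies on forking, so it is not available on Windows.

```r
library(parallel)  # ships with base R

# detectCores() can return NA on some platforms, so fall back to 1.
n_detected <- detectCores()
if (is.na(n_detected)) n_detected <- 1L

# Leave one core free for the rest of the system, but use at least one.
n_cores <- max(1L, n_detected - 1L)

# Register the doMC backend only if the package is available.
if (requireNamespace("doMC", quietly = TRUE)) {
  doMC::registerDoMC(cores = n_cores)
}
n_cores
```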
Feature selection
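set.seed(100)                  # make the resampling reproducible
options(warn = -1)             # suppress warnings for the rest of the session
subsets <- c(1:5, 10, 15, 18)  # candidate numbers of predictors to evaluate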
# Step 1: Build an ML model on a training dataset and estimate the feature importances on the test dataset
# (with the number of variables fixed, evaluate each variable's importance on the test data).
ctrl <- rfeControl(functions = rfFuncs,    # random-forest-based helper functions
                   method = "repeatedcv",  # repeated K-fold cross-validation
                   number = 10,            # 10-fold cross-validation
                   repeats = 5,            # five separate 10-fold cross-validations are used
                   verbose = FALSE)
# Step 2: Keeping priority to the most important variables, iterate through by building models of given sizes.
# The ranking of the predictors is recalculated in each iteration (repeat the process above for each subset size).
lmProfile <- rfe(x = trainData[, 1:18], y = trainData$Purchase,
                 sizes = subsets,
                 rfeControl = ctrl)
# Step 3: The model performances are compared across different subset sizes to arrive at the optimal number and list of final predictors
# (compare the test error rates across subset sizes and choose the best-sized model).
lmProfile
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
##
## Resampling performance over subset size:
##
##  Variables Accuracy  Kappa AccuracySD KappaSD Selected
##          1   0.7442 0.4569    0.04125 0.08753
##          2   0.8124 0.6031    0.04002 0.08505
##          3   0.8182 0.6136    0.04170 0.08790        *
##          4   0.8047 0.5879    0.04314 0.08993
##          5   0.8000 0.5770    0.04215 0.08861
##         10   0.8035 0.5826    0.04112 0.08815
##         15   0.8089 0.5918    0.04209 0.09076
##         18   0.8084 0.5918    0.04118 0.08894
##
## The top 3 variables (out of 3):
## LoyalCH, PriceDiff, StoreID
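Beyond the printed summary, a fitted `rfe` object can be queried directly. The sketch below is a self-contained illustration on simulated regression data with `lmFuncs` and 5-fold cross-validation. Both are assumptions chosen to keep the example fast, and they differ from the random-forest setup used above. The names `sim_x`, `sim_y`, `sim_ctrl`, and `sim_profile` are hypothetical. The same accessors should apply to `lmProfile`.

```r
library(caret)

set.seed(100)
# Simulated data (assumption): 5 predictors, only v1 and v2 are informative.
sim_x <- as.data.frame(matrix(rnorm(100 * 5), ncol = 5))
names(sim_x) <- paste0("v", 1:5)
sim_y <- 2 * sim_x$v1 - sim_x$v2 + rnorm(100)

sim_ctrl <- rfeControl(functions = lmFuncs, method = "cv", number = 5)
sim_profile <- rfe(x = sim_x, y = sim_y, sizes = 1:4, rfeControl = sim_ctrl)

predictors(sim_profile)   # names of the selected variables
sim_profile$optsize       # selected subset size
sim_profile$results       # resampled performance per subset size
# plot(sim_profile, type = c("g", "o"))  # performance profile across sizes
```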
Input
- sizes: the model sizes (the number of most important features) that rfe should consider.
- rfeControl():
  - functions: the set of helper functions, and hence the type of algorithm, to use; rfFuncs is random forest based.
  - method: the resampling method; "repeatedcv" is repeated K-fold cross-validation.
  - number: the number of folds (10-fold cross-validation here).
  - repeats: the number of repetitions (five separate 10-fold cross-validations are used here).
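To see how these arguments map onto the control object, the sketch below rebuilds it and inspects its elements. The names `ctrl_check` and `ctrl_tb` are hypothetical. `treebagFuncs` (bagged trees) is another helper set shipped with caret, shown here only as one possible swap for `rfFuncs`.

```r
library(caret)

ctrl_check <- rfeControl(functions = rfFuncs,
                         method = "repeatedcv",
                         number = 10,
                         repeats = 5,
                         verbose = FALSE)

# The resampling settings are stored as plain list elements.
ctrl_check$method
ctrl_check$number
ctrl_check$repeats

# rfFuncs is itself a list of functions that rfe calls at each stage
# (fitting, predicting, ranking, and choosing the subset size).
names(ctrl_check$functions)

# Swapping the algorithm means swapping the helper set, e.g. bagged trees.
ctrl_tb <- rfeControl(functions = treebagFuncs,
                      method = "repeatedcv",
                      number = 10,
                      repeats = 5)
```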
Output
The output shows:
- accuracy and kappa (and their standard deviations) for the different model sizes we provided.
- The final selected model subset size, marked with a * in the rightmost Selected column.
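The starred row is the size with the best resampled accuracy. As a hedged illustration, the sketch below copies the Variables and Accuracy columns from the printed output into a data frame (`perf`, a hypothetical name) and applies two of caret's size-selection helpers: `pickSizeBest()`, which takes the best value, and `pickSizeTolerance()`, which accepts the smallest subset within a percentage tolerance of the best. The 1.5% tolerance is an arbitrary choice for this example.

```r
library(caret)

# Values copied from the resampling table printed above.
perf <- data.frame(
  Variables = c(1, 2, 3, 4, 5, 10, 15, 18),
  Accuracy  = c(0.7442, 0.8124, 0.8182, 0.8047, 0.8000, 0.8035, 0.8089, 0.8084)
)

# Size with the highest accuracy (the starred row).
best <- pickSizeBest(perf, metric = "Accuracy", maximize = TRUE)

# Smallest size whose accuracy is within 1.5% of the best.
within_tol <- pickSizeTolerance(perf, metric = "Accuracy", tol = 1.5, maximize = TRUE)

c(best = best, within_tol = within_tol)
```

With these numbers the best-accuracy rule picks three variables, while the tolerance rule would settle on two, since 0.8124 is within about 0.7% of 0.8182. Whether the simpler model is worth the small loss in accuracy is a judgment call.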
save.image("../../data/craet_5.Rdata")
sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel methods stats graphics grDevices utils datasets
## [8] base
##
## other attached packages:
## [1] doMC_1.3.5 iterators_1.0.9 foreach_1.4.4 caret_6.0-78
## [5] lattice_0.20-35 forcats_0.3.0 stringr_1.3.0 dplyr_0.7.4
## [9] purrr_0.2.4 readr_1.1.1 tidyr_0.8.0 tibble_1.4.2
## [13] ggplot2_2.2.1 tidyverse_1.2.1
##
## loaded via a namespace (and not attached):
## [1] httr_1.3.1 ddalpha_1.3.1.1 sfsmisc_1.1-2
## [4] jsonlite_1.5 splines_3.4.3 prodlim_1.6.1
## [7] modelr_0.1.1 assertthat_0.2.0 stats4_3.4.3
## [10] DRR_0.0.3 cellranger_1.1.0 yaml_2.1.18
## [13] robustbase_0.92-8 ipred_0.9-6 pillar_1.2.1
## [16] backports_1.1.2 glue_1.2.0 digest_0.6.15
## [19] randomForest_4.6-12 rvest_0.3.2 colorspace_1.3-2
## [22] recipes_0.1.2 htmltools_0.3.6 Matrix_1.2-12
## [25] plyr_1.8.4 psych_1.7.8 timeDate_3043.102
## [28] pkgconfig_2.0.1 CVST_0.2-1 broom_0.4.3
## [31] haven_1.1.1 bookdown_0.7 scales_0.5.0.9000
## [34] gower_0.1.2 lava_1.6 withr_2.1.1.9000
## [37] nnet_7.3-12 lazyeval_0.2.1 cli_1.0.0
## [40] mnormt_1.5-5 survival_2.41-3 magrittr_1.5
## [43] crayon_1.3.4 readxl_1.0.0 evaluate_0.10.1
## [46] nlme_3.1-131.1 MASS_7.3-49 xml2_1.2.0
## [49] dimRed_0.1.0 foreign_0.8-69 class_7.3-14
## [52] blogdown_0.5 tools_3.4.3 hms_0.4.2
## [55] kernlab_0.9-25 munsell_0.4.3 bindrcpp_0.2
## [58] e1071_1.6-8 compiler_3.4.3 RcppRoll_0.2.2
## [61] rlang_0.2.0.9000 grid_3.4.3 rmarkdown_1.9
## [64] gtable_0.2.0 ModelMetrics_1.1.0 codetools_0.2-15
## [67] reshape2_1.4.3 R6_2.2.2 lubridate_1.7.3
## [70] knitr_1.20 bindr_0.1.1 rprojroot_1.3-2
## [73] stringi_1.1.7 Rcpp_0.12.16 rpart_4.1-13
## [76] tidyselect_0.2.4 DEoptimR_1.0-8 xfun_0.1