Load Package And Data

load("../../data/craet_3-2.Rdata")
library(tidyverse)
library(caret)

Why Dummy Variables

对于字符型的因子变量,我们需要把它转变为有序的数值,一般转为 0,1 的二变量, 这样0 就代表基础水平, 1代表比较组

How

# One-Hot Encoding
# Creating dummy variables is converting a categorical variable to as many binary variables as here are categories.
dummies_model <- dummyVars(Purchase ~ ., data=trainData)

# Create the dummy variables using predict. The Y variable (Purchase) will not be present in trainData_mat.
trainData_mat <- predict(dummies_model, newdata = trainData)

# # Convert to dataframe
trainData <- data.frame(trainData_mat)

# # See the structure of the new dataset
str(trainData)
## 'data.frame':    857 obs. of  18 variables:
##  $ WeekofPurchase: num  -1.1 -1.74 -1.68 -1.29 -1.04 ...
##  $ StoreID       : num  -1.29 -1.29 1.33 1.33 1.33 ...
##  $ PriceCH       : num  -1.14 -1.73 -1.73 -1.14 -1.14 ...
##  $ PriceMM       : num  -0.688 -2.898 -2.898 -0.688 -0.688 ...
##  $ DiscCH        : num  -0.452 -0.452 -0.452 -0.452 -0.452 ...
##  $ DiscMM        : num  -0.582 -0.582 -0.582 1.341 1.341 ...
##  $ SpecialCH     : num  -0.429 -0.429 -0.429 2.329 -0.429 ...
##  $ SpecialMM     : num  -0.42 -0.42 -0.42 -0.42 -0.42 ...
##  $ LoyalCH       : num  -0.205 -0.525 1.256 1.324 1.35 ...
##  $ SalePriceMM   : num  0.113 -1.101 -1.101 -1.506 -1.506 ...
##  $ SalePriceCH   : num  -0.431 -0.844 -0.844 -0.431 -0.431 ...
##  $ PriceDiff     : num  0.341 -0.563 -0.563 -1.165 -1.165 ...
##  $ Store7.No     : num  1 1 0 0 0 0 0 0 0 1 ...
##  $ Store7.Yes    : num  0 0 1 1 1 1 1 1 1 0 ...
##  $ PctDiscMM     : num  -0.588 -0.588 -0.588 1.447 1.447 ...
##  $ PctDiscCH     : num  -0.448 -0.448 -0.448 -0.448 -0.448 ...
##  $ ListPriceDiff : num  0.211 -1.988 -1.988 0.211 0.211 ...
##  $ STORE         : num  -0.457 -0.457 -1.15 -1.15 -1.15 ...

Why Normalization

为了消除不同变量由于单位造成的权重影响,我们对数据进行数据标准化

How

  1. range: Normalize values so it ranges between 0 and 1
  2. center: Subtract Mean
  3. scale: Divide by standard deviation
  4. BoxCox: Remove skewness leading to normality. Values must be > 0
  5. YeoJohnson: Like BoxCox, but works for negative values.
  6. expoTrans: Exponential transformation, works for negative values.
  7. pca: Replace with principal components
  8. ica: Replace with independent components
  9. spatialSign: Project the data to a unit circle
preProcess_range_model <- preProcess(trainData, method='range')
trainData <- predict(preProcess_range_model, newdata = trainData)

# Append the Y variable
trainData$Purchase <- y

apply(trainData[, 1:10], 2, FUN=function(x){c('min'=min(x), 'max'=max(x))})
##     WeekofPurchase StoreID PriceCH PriceMM DiscCH DiscMM SpecialCH
## min              0       0       0       0      0      0         0
## max              1       1       1       1      1      1         1
##     SpecialMM LoyalCH SalePriceMM
## min         0       0           0
## max         1       1           1
save.image(file = "../../data/craet_3-3.Rdata")