Once the data is ready, the first step is to split the dataset into training data and test data, typically in an 8:2 ratio.

Why split the data?

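When we build a machine learning model, the real goal is to predict real-world data. The model relies on an algorithm that learns the relationship between Y and X from the training data. How well that relationship was learned can only be judged on data that took no part in training, by measuring the gap between the model's predictions and the actual values in that held-out data.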
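To make the idea concrete, here is a minimal sketch in base R. It is not part of the orange juice workflow: the data are simulated, and the names `dat`, `fit`, and `rmse` are hypothetical, chosen only for illustration. It fits a model on 80% of the rows and then measures the prediction error separately on the rows the model saw and on the rows it did not.

```r
set.seed(100)

# Simulated data (an assumption for illustration): y depends on x plus noise
n <- 100
x <- runif(n, 0, 10)
y <- 2 * x + rnorm(n, sd = 3)
dat <- data.frame(x = x, y = y)

# 8:2 split using randomly chosen row indices
train_idx <- sample(seq_len(n), size = round(0.8 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Learn the Y ~ X relationship from the training rows only
fit <- lm(y ~ x, data = train)

# Root mean squared error between actual and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

train_rmse <- rmse(train$y, predict(fit, newdata = train))
test_rmse  <- rmse(test$y,  predict(fit, newdata = test))

print(c(train = train_rmse, test = test_rmse))
```

The test error is the number that speaks to real-world performance, because those rows played no role in fitting the model.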

# Load the caret package
library(caret)

# Import dataset
orange <- read.csv('../../data/orange_juice_withmissing.csv')
# Create the training and test datasets
set.seed(100)

# Step 1: Get row numbers for the training data
trainRowNumbers <- createDataPartition(orange$Purchase, p=0.8, list=FALSE)

# Step 2: Create the training dataset
trainData <- orange[trainRowNumbers,]

# Step 3: Create the test dataset
testData <- orange[-trainRowNumbers,]

# Store the predictors (columns 2 to 18) and the outcome (Purchase) for later use.
x <- trainData[, 2:18]
y <- trainData$Purchase

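createDataPartition: takes the outcome Y and a proportion p (the share of rows that goes to the training data) and returns the row indices of the training data. When Y is a factor, the sampling is done within each class, so the class balance of Y is roughly preserved in both pieces.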
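A quick way to see what createDataPartition returns is to run it on a small built-in dataset. The sketch below uses `iris` and treats `Species` as the outcome; both are stand-ins chosen for illustration and are not part of the orange juice example.

```r
library(caret)

set.seed(100)

# iris ships with R; Species plays the role of Y here (an assumption for illustration)
idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)

# idx holds the training row numbers; the negative index selects the remaining rows
train <- iris[idx, ]
test  <- iris[-idx, ]

# Row counts of the two pieces
print(c(train = nrow(train), test = nrow(test)))

# Class shares in each piece, to compare against the full data
print(prop.table(table(iris$Species)))
print(prop.table(table(train$Species)))
print(prop.table(table(test$Species)))
```

With `list = FALSE` the result is a one-column matrix of row numbers, which is why it can be used directly as a row index in the same way as `trainRowNumbers`.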

Save the workspace image for the next post

save.image(file = "../../data/caret.Rdata")

3.1 How to split the dataset into training and validation?

Jixing Liu
