Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that simply memorized the labels of the test cases would achieve a near-perfect score, yet it would fail to predict anything useful on data it has never seen. This situation is called overfitting. To avoid it, common practice in (supervised) machine-learning experiments is to split the dataset into a training set and a test set: the model is trained on the training set and evaluated on the test set.
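As a minimal sketch of that workflow (the k-nearest-neighbors classifier here is an arbitrary choice, purely for illustration), the model is fit only on the training portion and scored only on the held-out portion:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

clf = KNeighborsClassifier()
clf.fit(X_train, y_train)          # learn only from the training set
print(clf.score(X_test, y_test))   # evaluate on data the model never saw
```

Scoring on the held-out test set gives an honest estimate of how the model generalizes, which scoring on the training data cannot.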

Depending on how the split is drawn, there are two approaches: purely random sampling and stratified sampling.

Purely random sampling: train_test_split()

Example

from sklearn.model_selection import train_test_split
from sklearn import datasets

# Load the iris dataset
iris = datasets.load_iris()
print(iris.DESCR)
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica

:Summary Statistics:

============== ==== ==== ======= ===== ====================
Min Max Mean SD Class Correlation
============== ==== ==== ======= ===== ====================
sepal length: 4.3 7.9 5.84 0.83 0.7826
sepal width: 2.0 4.4 3.05 0.43 -0.4194
petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)
petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)
============== ==== ==== ======= ===== ====================

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. topic:: References

- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...
print(iris.data.shape, iris.target.shape)
(150, 4) (150,)
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)

print(X_train.shape, y_train.shape)
print("------------")
print(X_test.shape, y_test.shape)
(90, 4) (90,)
------------
(60, 4) (60,)

X_train, X_test, y_train, y_test = train_test_split(train_data, train_target, test_size=0.4, random_state=0)

Parameter descriptions:

  • train_data: the feature matrix to split (e.g., the 4 features of the iris dataset)
  • train_target: the corresponding labels (e.g., the 3 iris species)
  • test_size: the proportion of samples assigned to the test set (in the example above, test_size=0.4 puts 40% of the samples in the test set); if given as an integer, it is interpreted as an absolute number of test samples
  • random_state: an integer, a RandomState instance, or None (the default). It controls the shuffling so that repeated experiments are reproducible: with a fixed integer (e.g., random_state=0), the same split is produced on every call as long as the other parameters are unchanged; with None, a different split is produced on each run
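The last two parameters can be demonstrated directly (a small sketch; the array values are arbitrary): a fixed integer random_state yields identical splits across calls, and an integer test_size selects an absolute sample count rather than a proportion.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Same integer seed -> identical splits on repeated calls
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.4, random_state=0)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.4, random_state=0)
print(np.array_equal(a_test, b_test))   # True: the split is reproducible

# Integer test_size -> absolute number of test samples, not a fraction
_, c_test, _, _ = train_test_split(X, y, test_size=3, random_state=0)
print(len(c_test))                      # 3 test samples
```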

Stratified sampling: StratifiedShuffleSplit()

If the dataset is imbalanced (for example, by gender or by region), it is better to use stratified splitting so that both the training set and the test set preserve the class proportions of the full dataset. For instance, if a dataset contains 40 male and 60 female samples, a training set drawn from it will keep that same 40:60 ratio.

from sklearn.model_selection import StratifiedShuffleSplit
StratifiedShuffleSplit(n_splits=10, test_size=None, train_size=None, random_state=None)

Parameter descriptions:

  • n_splits: the number of re-shuffled train/test pairs to generate; set as needed, defaults to 10
  • test_size / train_size: the proportion (or absolute number) of samples assigned to the test and training sides of each pair, e.g. train_size=0.8, test_size=0.2
  • random_state: the random seed

Example

from sklearn.model_selection import StratifiedShuffleSplit
import numpy as np

X = np.array([[10, 20], [30, 40], [50, 60], [70, 80], [90, 100]])
Y = np.array([0, 1, 1, 0, 0])
ss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=1)

for train_index, test_index in ss.split(X, Y):
    print(train_index, test_index)
[2 0] [4 1 3]
[2 0] [1 4 3]
[4 2] [3 0 1]
[2 0] [4 1 3]
[2 4] [1 0 3]

Notes:

  1. Each output line is one train/test pair. With test_size=0.5 on 5 samples, the split rounds to 2 training samples and 3 test samples.
  2. The stratification is visible in the results: Y contains three "0"s and two "1"s, and every pair preserves that ratio as closely as possible (each test set holds two 0s and one 1; each training set holds one of each).
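That second point can be checked mechanically (a small verification sketch, reusing the same X and Y as above) by counting the class labels in each fold with np.bincount:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.array([[10, 20], [30, 40], [50, 60], [70, 80], [90, 100]])
Y = np.array([0, 1, 1, 0, 0])   # three 0s, two 1s
ss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=1)

for train_index, test_index in ss.split(X, Y):
    # Class counts per fold: every test set gets two 0s and one 1,
    # every training set gets one of each, mirroring the 3:2 overall ratio
    print(np.bincount(Y[train_index]), np.bincount(Y[test_index]))
```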