What is KNN
KNN (k-nearest neighbors) is the k-nearest-neighbor algorithm. Its working principle is very simple.
For example, suppose we have the following training set, in which each movie is classified as a romance film or an action film according to the number of fight scenes and kiss scenes it contains:

    Fight scenes   Kiss scenes   Class
    3              104           romance ('A')
    2              100           romance ('A')
    1              81            romance ('A')
    101            10            action ('D')
    99             5             action ('D')
    98             2             action ('D')
KNN compares the features of a new data point against every sample in the training set and computes the distance between them, typically the Euclidean distance d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ...). It then selects the k samples with the smallest distances; the class that occurs most often among those k samples is the predicted class of the new data point.
For example, we compute the distance between the movie "?" and every sample in the set above. If k = 3, the three nearest neighbors of "?" all turn out to be romance films, so "?" is also classified as a romance film.
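This computation can be sketched directly with NumPy. The fight/kiss counts for "?" below (18 and 90) are hypothetical values chosen for illustration:

```python
import numpy as np

# Training samples: [fight scenes, kiss scenes]; 'A' = romance, 'D' = action
samples = np.array([[3, 104], [2, 100], [1, 81], [101, 10], [99, 5], [98, 2]])
labels = ['A', 'A', 'A', 'D', 'D', 'D']

query = np.array([18, 90])                             # the unknown movie "?"
dists = np.sqrt(((samples - query) ** 2).sum(axis=1))  # Euclidean distances
nearest = np.argsort(dists)[:3]                        # indices of the 3 closest samples
print([labels[i] for i in nearest])                    # ['A', 'A', 'A'] -> romance
```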
The rest of this article implements that procedure. We will write the KNN classification function, plus a normalization step: in practice, KNN needs the feature data to be normalized, usually with min-max scaling, newValue = (oldValue - min) / (max - min), so that features measured on large numeric scales do not dominate the distance. The full algorithm is listed below.
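Min-max scaling itself is a one-liner in NumPy; a minimal sketch (the two-sample matrix here is illustrative only):

```python
import numpy as np

X = np.array([[3.0, 104.0], [101.0, 10.0]])
mins, maxs = X.min(axis=0), X.max(axis=0)
X_norm = (X - mins) / (maxs - mins)  # each feature scaled to [0, 1]
print(X_norm)                        # [[0. 1.] [1. 0.]]
```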
Algorithm walkthrough
We implement KNN in Python to solve the classification problem posed above.
First, create the data set the algorithm uses:
def createDataSet():
    '''
    Create the training data set.
    '''
    group = np.array([[3, 104], [2, 100], [1, 81], [101, 10], [99, 5], [98, 2]])
    labels = ['A', 'A', 'A', 'D', 'D', 'D']
    return group, labels
Next, the classification function itself:
def classify0(inX, dataSet, labels, k):
    '''
    KNN classification.
    :param inX: feature vector to classify
    :param dataSet: matrix of training samples
    :param labels: labels of the training samples
    :param k: number of nearest neighbors that vote
    :return: predicted label
    '''
    dataSetSize = dataSet.shape[0]
    # Euclidean distance from inX to every training sample
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    # Majority vote among the k nearest neighbors
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
Before classifying, the features are min-max normalized:
def autoNorm(dataSet):
    '''
    Min-max normalization: scale every feature to [0, 1].
    :param dataSet: matrix of training samples
    :return: normalized matrix, per-feature ranges, per-feature minimums
    '''
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - np.tile(minVals, (m, 1))
    normDataSet = normDataSet / np.tile(ranges, (m, 1))
    return normDataSet, ranges, minVals
def plot(dataSet):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(dataSet[:, 0], dataSet[:, 1])
    plt.show()
The complete code:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import numpy as np
import operator
import matplotlib.pyplot as plt


def createDataSet():
    '''
    Create the training data set.
    '''
    group = np.array([[3, 104], [2, 100], [1, 81], [101, 10], [99, 5], [98, 2]])
    labels = ['A', 'A', 'A', 'D', 'D', 'D']
    return group, labels


def classify0(inX, dataSet, labels, k):
    '''
    KNN classification.
    :param inX: feature vector to classify
    :param dataSet: matrix of training samples
    :param labels: labels of the training samples
    :param k: number of nearest neighbors that vote
    :return: predicted label
    '''
    dataSetSize = dataSet.shape[0]
    # Euclidean distance from inX to every training sample
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    # Majority vote among the k nearest neighbors
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


def autoNorm(dataSet):
    '''
    Min-max normalization: scale every feature to [0, 1].
    :param dataSet: matrix of training samples
    :return: normalized matrix, per-feature ranges, per-feature minimums
    '''
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - np.tile(minVals, (m, 1))
    normDataSet = normDataSet / np.tile(ranges, (m, 1))
    return normDataSet, ranges, minVals


def plot(dataSet):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(dataSet[:, 0], dataSet[:, 1])
    plt.show()


if __name__ == '__main__':
    dataSet, labels = createDataSet()
    dataSet, ranges, minVals = autoNorm(dataSet)
    print(dataSet)
    plot(dataSet)
    # The query point must be normalized with the same parameters as the training set
    inX = (np.array([18, 90]) - minVals) / ranges
    # k = 3: an odd k avoids tied votes in a two-class problem
    print(classify0(inX, dataSet, labels, 3))
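One detail worth emphasizing: the query point must be scaled with the same minVals and ranges as the training data, otherwise its raw feature values dwarf the normalized coordinates. A standalone check of the scaling, independent of the functions above:

```python
import numpy as np

group = np.array([[3, 104], [2, 100], [1, 81], [101, 10], [99, 5], [98, 2]])
mins = group.min(axis=0)            # per-feature minimums: [1, 2]
ranges = group.max(axis=0) - mins   # per-feature ranges:   [100, 102]
norm_query = (np.array([18, 90]) - mins) / ranges
print(norm_query)                   # [0.17 0.8627...] -- lies inside [0, 1]^2
```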
Problems with KNN
KNN is an instance-based learning method: it keeps the entire training set and scans all of it for every prediction. When the data set is large, KNN therefore consumes a lot of memory and storage, and each classification is computationally expensive (roughly O(n·d) per query for n samples with d features).
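If this cost becomes a problem, tree-based neighbor indexes help for low-dimensional data. As one option (scikit-learn is not used elsewhere in this article, so treat its availability as an assumption), its KNeighborsClassifier can build a KD-tree so that queries no longer scan every sample:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

samples = np.array([[3, 104], [2, 100], [1, 81], [101, 10], [99, 5], [98, 2]])
labels = ['A', 'A', 'A', 'D', 'D', 'D']

# algorithm='kd_tree' builds a spatial index instead of brute-force scanning
clf = KNeighborsClassifier(n_neighbors=3, algorithm='kd_tree')
clf.fit(samples, labels)
print(clf.predict([[18, 90]]))  # ['A']
```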