[cs231n] assignment1: the kNN part

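1. Preface

I have recently been working through CS231n, the course led by Fei-Fei Li. The teaching quality is high. This post covers part of the first assignment, which is mainly meant to deepen understanding of the following topics: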

  • The CIFAR-10 image dataset
  • Implementing kNN
  • Cross-validation
  • Vectorization with numpy

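2. Implementation

The kNN algorithm is fairly intuitive: for each test example, find the k nearest training examples, let their labels vote, and output the label with the most votes as the prediction for that test example. The notebook text is pasted below; most of it comes with comments: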

k-Nearest Neighbor (kNN) exercise

Complete and hand in this completed worksheet (including its outputs and any supporting code outside of the worksheet) with your assignment submission. For more details see the assignments page on the course website.

The kNN classifier consists of two stages:

  • During training, the classifier takes the training data and simply remembers it
  • During testing, kNN classifies every test image by comparing to all training images and transferring the labels of the k most similar training examples
  • The value of k is cross-validated

In this exercise you will implement these steps and understand the basic Image Classification pipeline, cross-validation, and gain proficiency in writing efficient, vectorized code.


We would now like to classify the test data with the kNN classifier. Recall that we can break down this process into two steps:

  1. First we must compute the distances between all test examples and all train examples.
  2. Given these distances, for each test example we find the k nearest examples and have them vote for the label
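The voting in step 2 can be sketched as follows. This is only a minimal illustration, not the assignment's reference solution: the function name predict_labels_sketch and the tie-breaking rule (the smaller label wins) are assumptions made here.

```python
from collections import Counter

import numpy as np


def predict_labels_sketch(dists, y_train, k=1):
    """Hypothetical helper: majority vote among the k nearest training labels.

    dists   -- (num_test, num_train) array of distances
    y_train -- (num_train,) array of integer labels
    """
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test, dtype=y_train.dtype)
    for i in range(num_test):
        # Indices of the k smallest distances in row i.
        nearest = np.argsort(dists[i])[:k]
        closest_y = y_train[nearest].flatten().tolist()
        # Count the votes; ties go to the smaller label (an assumption).
        counts = Counter(closest_y)
        best = max(counts.values())
        y_pred[i] = min(label for label, c in counts.items() if c == best)
    return y_pred
```

Counter does the vote counting in one call, which keeps the loop body short.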

Let's begin with computing the distance matrix between all training and test examples. For example, if there are Ntr training examples and Nte test examples, this stage should result in an Nte x Ntr matrix where each element (i,j) is the distance between the i-th test and j-th train example.

First, open cs231n/classifiers/k_nearest_neighbor.py and implement the function compute_distances_two_loops that uses a (very inefficient) double loop over all pairs of (test, train) examples and computes the distance matrix one element at a time.
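A double-loop version might look roughly like the sketch below. It assumes Euclidean (L2) distance and is written as a standalone function with a hypothetical name, whereas the assignment's version lives inside the classifier class; inputs are assumed to be float arrays (uint8 image data should be cast first to avoid overflow in the subtraction).

```python
import numpy as np


def compute_distances_two_loops_sketch(X_test, X_train):
    """Sketch: L2 distance matrix filled in one element at a time."""
    num_test = X_test.shape[0]
    num_train = X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        for j in range(num_train):
            # Distance between the i-th test row and the j-th training row.
            diff = X_test[i] - X_train[j]
            dists[i, j] = np.sqrt(np.sum(diff ** 2))
    return dists
```

Each entry costs one pass over the feature dimension, so the whole matrix takes num_test x num_train such passes in pure Python loops, which is why this version is slow.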


Inline Question #1: Notice the structured patterns in the distance matrix, where some rows or columns are visibly brighter. (Note that with the default color scheme black indicates low distances while white indicates high distances.)

  • What in the data is the cause behind the distinctly bright rows?
  • What causes the columns?

Your Answer: fill this in.

You should expect to see approximately 27% accuracy. Now let's try out a larger k, say k = 5:

You should expect to see a slightly better performance than with k = 1.

Inline Question 2
We can also use other distance metrics such as L1 distance.
The performance of a Nearest Neighbor classifier that uses L1 distance will not change if (Select all that apply.):
1. The data is preprocessed by subtracting the mean.
2. The data is preprocessed by subtracting the mean and dividing by the standard deviation.
3. The coordinate axes for the data are rotated.
4. None of the above.

Your Answer:

Your explanation:

Cross-validation

We have implemented the k-Nearest Neighbor classifier but we set the value k = 5 arbitrarily. We will now determine the best value of this hyperparameter with cross-validation.


Inline Question 3
Which of the following statements about $k$-Nearest Neighbor ($k$-NN) are true in a classification setting, and for all $k$? Select all that apply.
1. The training error of a 1-NN will always be better than that of 5-NN.
2. The test error of a 1-NN will always be better than that of a 5-NN.
3. The decision boundary of the k-NN classifier is linear.
4. The time needed to classify a test example with the k-NN classifier grows with the size of the training set.
5. None of the above.

Your Answer:

Your explanation:


k_nearest_neighbor.py

Summary

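Expressing matrix computations in vectorized form is hard. In this assignment the difficulty lies in computing the distances between the test set and the training set, going from two loops to one loop to no loops. For the last step, the main idea is to expand the distance expression and then combine the resulting terms separately. Knowing a few Python APIs also simplifies things a lot: to pick the label with the most votes in kNN, I used collections.Counter, which I had used before for word-frequency counting, together with flatten to turn a 2-D numpy array into a 1-D one, and that made the code very short.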
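The one-loop and no-loop steps can be sketched as below, assuming Euclidean (L2) distance. The expansion in question is ||x - y||^2 = ||x||^2 - 2 x·y + ||y||^2, so the three terms can be computed separately for all pairs and combined by broadcasting. The function names are hypothetical and this is not necessarily how the original solution was written.

```python
import numpy as np


def compute_distances_one_loop_sketch(X_test, X_train):
    """Sketch: loop over test rows only; each row is handled by broadcasting."""
    num_test = X_test.shape[0]
    dists = np.zeros((num_test, X_train.shape[0]))
    for i in range(num_test):
        # Broadcasting subtracts X_test[i] from every training row at once.
        dists[i, :] = np.sqrt(np.sum((X_train - X_test[i]) ** 2, axis=1))
    return dists


def compute_distances_no_loops_sketch(X_test, X_train):
    """Sketch: fully vectorized version based on the expanded square."""
    test_sq = np.sum(X_test ** 2, axis=1, keepdims=True)  # (num_test, 1)
    train_sq = np.sum(X_train ** 2, axis=1)                # (num_train,)
    cross = X_test @ X_train.T                             # (num_test, num_train)
    sq = test_sq - 2.0 * cross + train_sq
    # Rounding can leave tiny negative values; clip before the square root.
    return np.sqrt(np.maximum(sq, 0.0))
```

The no-loop version replaces all pairwise subtractions with a single matrix product, which is where most of the speed-up comes from.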

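The cross-validation part confused me a little when I first wrote it. It really just splits the training set into folds: in our code there are 5 folds, of which 4 serve as the training set and 1 as the validation set. The procedure is then repeated so that each fold is used once for validation.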
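The fold rotation can be sketched as follows. This is a self-contained illustration under stated assumptions: knn_predict_sketch and cross_validate_k_sketch are hypothetical names, labels are assumed to be non-negative integers, and the small kNN helper uses L2 distance with ties going to the smaller label.

```python
import numpy as np


def knn_predict_sketch(X_train, y_train, X_val, k):
    """Hypothetical helper: plain kNN prediction with L2 distance."""
    sq = (np.sum(X_val ** 2, axis=1, keepdims=True)
          - 2.0 * X_val @ X_train.T
          + np.sum(X_train ** 2, axis=1))
    nearest = np.argsort(sq, axis=1)[:, :k]   # (num_val, k) neighbour indices
    votes = y_train[nearest]                   # (num_val, k) neighbour labels
    # bincount(...).argmax() returns the most common label (smallest on ties).
    return np.array([np.bincount(row).argmax() for row in votes])


def cross_validate_k_sketch(X, y, k_choices, num_folds=5):
    """Return {k: [accuracy on each validation fold]}."""
    X_folds = np.array_split(X, num_folds)
    y_folds = np.array_split(y, num_folds)
    k_to_accuracies = {}
    for k in k_choices:
        accuracies = []
        for fold in range(num_folds):
            # One fold is held out for validation; the rest form the training set.
            X_val, y_val = X_folds[fold], y_folds[fold]
            X_tr = np.concatenate(X_folds[:fold] + X_folds[fold + 1:])
            y_tr = np.concatenate(y_folds[:fold] + y_folds[fold + 1:])
            y_pred = knn_predict_sketch(X_tr, y_tr, X_val, k)
            accuracies.append(float(np.mean(y_pred == y_val)))
        k_to_accuracies[k] = accuracies
    return k_to_accuracies
```

Averaging the per-fold accuracies for each k and picking the k with the highest mean is one common way to choose the hyperparameter.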


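Finally, on the question of which k makes kNN work best: the final plot shows that in our example the best result comes at about k = 10. Even then the accuracy is only around 30%, so kNN performs poorly as an image classifier.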

