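## Preface

• Machine learning has been very popular in recent years, but all I knew about it was that it is a kind of program that can learn, trained on large datasets to reach a goal. I had no idea how it works internally.
• So I decided to study it through Stanford University's free machine-learning course on Coursera and to write up what I learn as a series of blog posts.
• This first post mainly covers the definition of machine learning, supervised learning, unsupervised learning, linear regression, and gradient descent.
• Along the way I will also compile a glossary of technical terms.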

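## 1. Definition of Machine Learning

Arthur Samuel's definition

Arthur Samuel described machine learning as “the field of study that gives computers the ability to learn without being explicitly programmed.” This is an older, informal definition.

Tom Mitchell's definition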

Tom Mitchell provides a more modern definition: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

Example: playing checkers.

E = the experience of playing many games of checkers

T = the task of playing checkers.

P = the probability that the program will win the next game.

In general, any machine learning problem can be assigned to one of two broad classifications:

Supervised learning and Unsupervised learning.


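• Supervised learning: the training data comes with known correct outputs, so we already know what the result should look like.
• Unsupervised learning: there are no known outputs, so we have little or no idea what the result should look like.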

## 2. Supervised Learning

Supervised learning problems are categorized into “regression” and “classification” problems. In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.

Example 1:

Given data about the size of houses on the real estate market, try to predict their price. Price as a function of size is a continuous output, so this is a regression problem.

We could turn this example into a classification problem by instead making our output about whether the house “sells for more or less than the asking price.” Here we are classifying the houses based on price into two discrete categories.
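The difference between the two framings in Example 1 can be sketched in a few lines of Python. Everything here is an assumption made for illustration: `predict_price` is a hypothetical stand-in for a learned model, and its coefficients and the asking price are made-up numbers, not values from the course.

```python
# Hypothetical stand-in for a learned model: price grows linearly with size.
# The coefficients are made up for illustration only.
def predict_price(size_sqft: float) -> float:
    """Regression: the output is a continuous value (a price)."""
    return 50_000.0 + 120.0 * size_sqft


def sells_above_asking(size_sqft: float, asking_price: float) -> bool:
    """Classification: the output is one of two discrete categories."""
    return predict_price(size_sqft) > asking_price


price = predict_price(1500)                 # continuous output
label = sells_above_asking(1500, 200_000)   # discrete output (True / False)
print(price, label)
```

The same input (house size) feeds both functions; only the kind of output changes, which is what separates regression from classification.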

Example 2:

(a) Regression - Given a picture of a person, we have to predict their age on the basis of the given picture

(b) Classification - Given a patient with a tumor, we have to predict whether the tumor is malignant or benign.


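## 3. Unsupervised Learning

Unsupervised learning lets us approach problems with little or no idea what our results should look like. We can derive structure from such data by clustering it based on relationships among the variables in the data.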

With unsupervised learning there is no feedback based on the prediction results.

Example:

Clustering: Take a collection of 1,000,000 different genes, and find a way to automatically group these genes into groups that are somehow similar or related by different variables, such as lifespan, location, roles, and so on.

Non-clustering: The “Cocktail Party Algorithm” allows you to find structure in a chaotic environment (i.e., identifying individual voices and music from a mesh of sounds at a cocktail party).
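To make the clustering idea concrete, here is a minimal sketch that groups unlabeled numbers into two clusters. It uses a tiny one-dimensional k-means, which I picked only because it is a familiar clustering method; the course text above does not name a specific algorithm, and the data and starting centers are made up.

```python
def k_means_1d(values, centers, iterations=10):
    """Tiny 1-D k-means: assign each value to the nearest center, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [
            sum(c) / len(c) if c else centers[k]
            for k, c in enumerate(clusters)
        ]
    return centers, clusters


# Made-up, unlabeled measurements that happen to form two groups.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
centers, clusters = k_means_1d(values, centers=[0.0, 6.0])
print(centers)
print(clusters)
```

No labels are given and nothing tells the algorithm whether a grouping is "correct"; the two groups come purely from how close the values are to each other.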

## 4. Exercise 1

1. A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E. Suppose we feed a learning algorithm a lot of historical weather data, and have it learn to predict weather. In this setting, what is T?
• a. The process of the algorithm examining a large amount of historical weather data.
• b. The probability of it correctly predicting a future date’s weather.
• c. The weather prediction task.
• d. None of these.

2. The amount of rain that falls in a day is usually measured in either millimeters (mm) or inches. Suppose you use a learning algorithm to predict how much rain will fall tomorrow. Would you treat this as a classification or a regression problem?
• a. Regression
• b. Classification

3. Suppose you are working on stock market prediction. You would like to predict whether or not a certain company will win a patent infringement lawsuit (by training on data of companies that had to defend against similar lawsuits). Would you treat this as a classification or a regression problem?
• a. Classification
• b. Regression

4. Some of the problems below are best addressed using a supervised learning algorithm, and the others with an unsupervised learning algorithm. Which of the following would you apply supervised learning to? (Select all that apply.) In each case, assume some appropriate dataset is available for your algorithm to learn from.
• a. Given data on how 1000 medical patients respond to an experimental drug (such as effectiveness of the treatment, side effects, etc.), discover whether there are different categories or “types” of patients in terms of how they respond to the drug, and if so what these categories are.
• b. Given genetic (DNA) data from a person, predict the odds of him/her developing diabetes over the next 10 years.
• c. Given a large dataset of medical records from patients suffering from heart disease, try to learn whether there might be different clusters of such patients for which we might tailor separate treatments.
• d. Have a computer examine an audio clip of a piece of music, and classify whether or not there are vocals (i.e., a human voice singing) in that audio clip, or if it is a clip of only musical instruments (and no vocals).

5. Which of these is a reasonable definition of machine learning?
• a. Machine learning is the field of allowing robots to act intelligently.
• b. Machine learning is the science of programming computers.
• c. Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.
• d. Machine learning learns from labeled data.


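1. c. By the definition of machine learning, T is the task to be performed, here predicting the weather.
2. a. The amount of rain is a continuous value, predicted by fitting a curve or a line to past data, so this is a regression problem.
3. a. The target is explicitly whether or not the lawsuit is won, which is a discrete outcome, so this is a classification problem.
4. b, d. Both predict a known, labeled outcome (the odds of developing diabetes; vocals versus instruments only). Options a and c ask us to discover groups that are not defined in advance, which is unsupervised learning. The trap is in the wording: an option that talks about categories is not automatically unsupervised, and one that talks about learning from records is not automatically supervised.
5. c. Nothing much to add here; this is Arthur Samuel's definition.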

## 5. Linear Regression

To establish notation for future use, we’ll use $x^{(i)}$ to denote the input variables (living area in this example), also called input features, and $y^{(i)}$ to denote the output or target variable that we are trying to predict (price). A pair $(x^{(i)}, y^{(i)})$ is called a training example, and the dataset that we’ll be using to learn, a list of $m$ training examples $(x^{(i)}, y^{(i)});\ i = 1, \dots, m$, is called a training set. Note that the superscript $(i)$ in the notation is simply an index into the training set, and has nothing to do with exponentiation. We will also use $X$ to denote the space of input values, and $Y$ to denote the space of output values. In this example, $X = Y = \mathbb{R}$.

To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function $h : X \to Y$ so that $h(x)$ is a good predictor for the corresponding value of $y$. For historical reasons, this function $h$ is called a hypothesis. Seen pictorially, the process is therefore like this: the training set is fed to a learning algorithm, which outputs $h$; $h$ then takes an input $x$ and returns a predicted $y$.

When the target variable that we’re trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict if a dwelling is a house or an apartment, say), we call it a classification problem.

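We can measure the accuracy of our hypothesis function by using a cost function, which takes an average of the squared differences between the hypothesis outputs and the actual outputs:

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right)^2$$

To break it apart, it is $\frac{1}{2}\bar{x}$ where $\bar{x}$ is the mean of the squares of $h_\theta(x_i) - y_i$, or the difference between the predicted value and the actual value.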

This function is otherwise called the squared error function, or mean squared error. The mean is halved ($\frac{1}{2}$) as a convenience for the computation of the gradient descent, as the derivative term of the square function will cancel out the $\frac{1}{2}$ term. The following image summarizes what the cost function does:

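```python
import math

# Toy training set: three points on the line y = x.
data = [
    {'x': 1, 'y': 1},
    {'x': 2, 'y': 2},
    {'x': 3, 'y': 3},
]


def θ(θ0, θ1):
    """Return the hypothesis h and its cost function for the given parameters."""

    def h(x):
        # Hypothesis: h(x) = θ0 + θ1 * x
        return θ0 + θ1 * x

    def cost():
        # Squared error cost: J = (1 / 2m) * Σ (h(x_i) - y_i)^2
        total = 0
        for point in data:
            total += math.pow(h(point['x']) - point['y'], 2)
        return total / (len(data) * 2)

    return h, cost


h, cost = θ(0, 1)
```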
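To get a feel for how the cost changes with the parameters, the sketch below evaluates the squared error cost on the same three toy points for a few values of θ_1, with θ_0 fixed at 0. The standalone `cost(theta0, theta1)` helper is my own restatement for illustration, not code from the course.

```python
# Same toy training set as above: three points on the line y = x.
xs = [1, 2, 3]
ys = [1, 2, 3]


def cost(theta0, theta1):
    """Squared error cost J = (1 / 2m) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Fix theta0 = 0 and vary only the slope theta1.
for theta1 in (0.0, 0.5, 1.0):
    print(theta1, cost(0.0, theta1))
```

With this data the cost is 0 at θ_1 = 1 (a perfect fit) and grows as θ_1 moves away from 1: roughly 0.58 at θ_1 = 0.5 and roughly 2.33 at θ_1 = 0.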

## 8. Cost Function - Intuition 2

A contour plot is a graph that contains many contour lines. A contour line of a two variable function has a constant value at all points of the same line. An example of such a graph is the one to the right below.

Taking any color and going along the circle, one would expect to get the same value of the cost function. For example, the three green points found on the green line above have the same value for J(θ_0, θ_1) and as a result, they are found along the same line. The circled x displays the value of the cost function for the graph on the left when θ_0 = 800 and θ_1 = -0.15. Taking another h(x) and plotting its contour plot, one gets the following graphs:

When θ_0 = 360 and θ_1 = 0, the value of J(θ_0, θ_1) in the contour plot gets closer to the center thus reducing the cost function error. Now giving our hypothesis function a slightly positive slope results in a better fit of the data.

The graph above minimizes the cost function as much as possible and consequently, the result of θ_1 and θ_0 tend to be around 0.12 and 250 respectively. Plotting those values on our graph to the right seems to put our point in the center of the innermost circle.
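The contour idea can also be checked numerically. The sketch below is an illustration under my own assumptions: it uses three made-up points on the line y = x instead of the housing data behind the plots, and a coarse hand-picked grid. It shows that two different parameter pairs can share the same cost (they lie on the same contour line) and that the lowest cost on the grid sits at the center.

```python
# Made-up training set for illustration: three points on the line y = x.
xs = [1, 2, 3]
ys = [1, 2, 3]


def cost(theta0, theta1):
    """Squared error cost J = (1 / 2m) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Two different parameter pairs with the same cost: same contour line.
same_contour = cost(0.0, 0.5) == cost(0.0, 1.5)

# Evaluate J on a coarse grid and pick the cell with the lowest cost.
theta0_values = [i * 0.5 for i in range(-4, 5)]   # -2.0 ... 2.0
theta1_values = [i * 0.25 for i in range(0, 9)]   # 0.0 ... 2.0
grid = {(t0, t1): cost(t0, t1) for t0 in theta0_values for t1 in theta1_values}
best = min(grid, key=grid.get)

print(same_contour)
print(best, grid[best])
```

A real contour plot draws these equal-cost curves over a much finer grid; scanning a grid like this only approximates the minimum, which is why gradient descent is used instead.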


## 9. Gradient Descent

Imagine that we graph our hypothesis function based on its fields θ_0 and θ_1 (actually we are graphing the cost function as a function of the parameter estimates). We are not graphing x and y itself, but the parameter range of our hypothesis function and the cost resulting from selecting a particular set of parameters.

We put θ_0 on the x axis and θ_1 on the y axis, with the cost function on the vertical z axis. The points on our graph will be the result of the cost function using our hypothesis with those specific theta parameters. The graph below depicts such a setup.

We will know that we have succeeded when our cost function is at the very bottom of the pits in our graph, i.e. when its value is the minimum. The red arrows show the minimum points in the graph.

The way we do this is by taking the derivative (the tangential line to a function) of our cost function. The slope of the tangent is the derivative at that point and it will give us a direction to move towards. We make steps down the cost function in the direction with the steepest descent. The size of each step is determined by the parameter $\alpha$, which is called the learning rate.

For example, the distance between each star in the graph above represents a step determined by our parameter $\alpha$. A smaller $\alpha$ would result in a smaller step and a larger $\alpha$ results in a larger step. The direction in which the step is taken is determined by the partial derivative of $J(\theta_0, \theta_1)$. Depending on where one starts on the graph, one could end up at different points. The image above shows us two different starting points that end up in two different places.
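The dependence on the starting point is easy to reproduce on a small example. The function below is not the cost function from the course; it is a made-up one-variable function, f(x) = x^4 - 2x^2, chosen only because it has two separate minima (at x = -1 and x = 1). The learning rate and step count are also arbitrary choices of mine.

```python
def descend(x, alpha=0.05, steps=200):
    """Plain gradient descent on f(x) = x**4 - 2 * x**2 (derivative: 4x^3 - 4x)."""
    for _ in range(steps):
        gradient = 4 * x ** 3 - 4 * x
        x = x - alpha * gradient   # step against the slope, scaled by alpha
    return x


right = descend(0.5)    # starts on the right slope, ends near x = 1
left = descend(-0.5)    # starts on the left slope, ends near x = -1
print(right, left)
```

Both runs use the same update rule and the same learning rate; only the starting point differs, and each run settles into the minimum nearest to where it began.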

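The gradient descent algorithm is: repeat until convergence

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$

where $j = 0, 1$ represents the feature index number. At each iteration, one should simultaneously update the parameters $\theta_0$ and $\theta_1$ (in general, $\theta_0, \theta_1, \dots, \theta_n$). Updating one parameter before calculating the update for another within the same iteration would yield a wrong implementation.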
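Here is a minimal sketch of gradient descent with a simultaneous update, applied to the linear hypothesis h(x) = θ_0 + θ_1 x and the squared error cost. The partial derivatives in the code are the standard ones for that cost (the 1/2 cancels against the exponent); they are not spelled out in the text above. The toy data, learning rate, and iteration count are my own assumptions.

```python
# Made-up training set for illustration: three points on the line y = x.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]


def gradient_descent(alpha=0.1, iterations=2000):
    """Fit h(x) = theta0 + theta1 * x by gradient descent on the squared error cost."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        # Prediction errors under the CURRENT parameters.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of J with respect to theta0 and theta1.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed before either
        # parameter changes.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


theta0, theta1 = gradient_descent()
print(theta0, theta1)
```

The wrong version would assign `theta0` first and then compute `grad1` from the already-updated `theta0`; computing both gradients from the same set of errors avoids that. On this data the result approaches θ_0 = 0 and θ_1 = 1.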