2019-09-15

# Machine Learning: Week 2

## Preface

• Continuing from the previous post, though my pace is clearly nowhere near fast enough; procrastination has gotten the better of me again.
• This post covers the second week of the course.

## 1. Multiple Features

Note: [7:25 - $θ^T$ is a 1 by (n+1) matrix and not an (n+1) by 1 matrix]

Linear regression with multiple variables is also known as multivariate linear regression.

We now introduce notation for equations where we can have any number of input variables.

$x_j^{(i)} = \text{value of feature } j \text{ in the }i^{th}\text{ training example} \\ x^{(i)} = \text{the input (features) of the }i^{th}\text{ training example} \\ m = \text{the number of training examples} \\ n = \text{the number of features}$

The multivariable form of the hypothesis function accommodating these multiple features is as follows:

$h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_n x_n$

In order to develop intuition about this function, we can think about $θ_0$ as the basic price of a house, $θ_1$ as the price per square meter, $θ_2$ as the price per floor, etc. $x_1$ will be the number of square meters in the house, $x_2$ the number of floors, etc.

Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as:

$h_\theta(x) =\begin{bmatrix}\theta_0 \hspace{1em} \theta_1 \hspace{1em} ... \hspace{1em} \theta_n\end{bmatrix}\begin{bmatrix}x_0 \\ x_1 \\ \vdots \\ x_n\end{bmatrix}= \theta^T x$

This is a vectorization of our hypothesis function for one training example; see the lessons on vectorization to learn more.
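
As a quick illustration (not part of the course notes), here is a minimal NumPy sketch of this vectorized hypothesis for a single training example; the values of `theta` and `x` below are made up, and `x[0]` is set to 1 by convention:

```python
import numpy as np

# made-up example values: theta and x each have n+1 = 3 entries
theta = np.array([1.0, 0.5, 2.0])   # [theta_0, theta_1, theta_2]
x = np.array([1.0, 3.0, 4.0])       # x_0 = 1 by convention, then x_1, x_2

# h_theta(x) = theta^T x
h = theta @ x
print(h)  # 1*1 + 0.5*3 + 2*4 = 10.5
```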

Remark: Note that for convenience reasons in this course we assume $x_{0}^{(i)} = 1 \text{ for } (i \in \{1,\dots, m\})$. This allows us to do matrix operations with theta and x, making the two vectors '$\theta$' and $x^{(i)}$ match each other element-wise (that is, have the same number of elements: n+1).


## 2. Gradient Descent for Multiple Variables

The gradient descent equation itself is generally the same form; we just have to repeat it for our n features:

$\text{repeat until convergence:} \; \lbrace \\ \; \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)}\\ \; \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_1^{(i)} \\ \; \theta_2 := \theta_2 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_2^{(i)} \\ \vdots \\ \rbrace$

In other words:

$\text{repeat until convergence:} \; \lbrace \\ \; \theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} \; \text{for j := 0...n}\\ \rbrace$
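
A minimal NumPy sketch of this vectorized update (my own illustration, not the course's Octave code), assuming a design matrix `X` whose first column is all ones and a target vector `y`:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for multivariate linear regression.

    X: (m, n+1) design matrix whose first column is all ones
    y: (m,) vector of targets
    """
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        errors = X @ theta - y                        # h_theta(x^(i)) - y^(i) for all i
        theta = theta - alpha * (X.T @ errors) / m    # simultaneous update of every theta_j
    return theta
```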

The course notes include an image here comparing gradient descent with one variable to gradient descent with multiple variables; the multivariable update rules are the ones shown above.

## 3. Gradient Descent in Practice 1 - Feature Scaling

Note: The average size of a house is 1000 but 100 is accidentally written instead

We can speed up gradient descent by having each of our input values in roughly the same range. This is because θ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.

The way to prevent this is to modify the ranges of our input variables so that they are all roughly the same. Ideally:

$-1 ≤ x_{(i)} ≤ 1\\ \text{or}\\ -0.5 ≤ x_{(i)} ≤ 0.5$

These aren't exact requirements; we are only trying to speed things up. The goal is to get all input variables into roughly one of these ranges, give or take a few.

Two techniques to help with this are feature scaling and mean normalization. Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1. Mean normalization involves subtracting the average value for an input variable from the values for that input variable resulting in a new average value for the input variable of just zero. To implement both of these techniques, adjust your input values as shown in this formula:

$x_i := \dfrac{x_i - \mu_i}{s_i}$

Where $μ_i$ is the average of all the values for feature $i$ and $s_i$ is the range of values (max - min), or $s_i$ is the standard deviation.

Note that dividing by the range, or dividing by the standard deviation, give different results. The quizzes in this course use range - the programming exercises use standard deviation.

For example, if $x_i$ represents housing prices with a range of 100 to 2000 and a mean value of 1000, then $x_i := \dfrac{price - 1000}{1900}$.
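
A small NumPy sketch of mean normalization with range scaling, matching the formula above; the data values are hypothetical and chosen to reproduce the housing-price example (range 100 to 2000, mean 1000):

```python
import numpy as np

def normalize_features(X):
    """Mean normalization with range scaling: (x - mean) / (max - min), per column."""
    mu = X.mean(axis=0)
    s = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / s, mu, s

# hypothetical prices with range 100..2000 and mean 1000, as in the example above
prices = np.array([[100.0], [900.0], [1000.0], [2000.0]])
prices_norm, mu, s = normalize_features(prices)
print(mu, s)  # [1000.] [1900.]
```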


## 4. Gradient Descent in Practice 2 - Learning Rate

Note: the x-axis label in the right graph should be $θ$ rather than No. of iterations

Debugging gradient descent. Make a plot with number of iterations on the x-axis. Now plot the cost function, $J(θ)$ over the number of iterations of gradient descent. If J(θ) ever increases, then you probably need to decrease $α$.

Automatic convergence test. Declare convergence if $J(θ)$ decreases by less than E in one iteration, where E is some small value such as $10^{-3}$. However in practice it's difficult to choose this threshold value.

It has been proven that if learning rate α is sufficiently small, then J(θ) will decrease on every iteration.

To summarize:

If $α$ is too small: slow convergence.

If $α$ is too large: may not decrease on every iteration and thus may not converge.
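
As a rough illustration of the debugging plot described above (a sketch using NumPy and matplotlib with a tiny made-up dataset, not code from the course):

```python
import numpy as np
import matplotlib.pyplot as plt

def cost(X, y, theta):
    """J(theta): half the mean squared error."""
    r = X @ theta - y
    return (r @ r) / (2 * len(y))

# tiny made-up dataset: y = 1 + 2*x exactly
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

alpha, num_iters = 0.3, 15
theta = np.zeros(2)
history = []
for _ in range(num_iters):
    theta -= alpha * (X.T @ (X @ theta - y)) / len(y)
    history.append(cost(X, y, theta))

# J(theta) should fall on every iteration when alpha is small enough
plt.plot(range(1, num_iters + 1), history)
plt.xlabel("No. of iterations")
plt.ylabel("J(theta)")
plt.show()
```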


## 5. Features and Polynomial Regression

We can combine multiple features into one. For example, we can combine $x_1$ and $x_2$ into a new feature $x_3$ by taking $x_1 \cdot x_2$.

Polynomial Regression

Our hypothesis function need not be linear (a straight line) if that does not fit the data well.

We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).

For example, if our hypothesis function is $h_\theta(x) = \theta_0 + \theta_1 x_1$ then we can create additional features based on $x_1$, to get the quadratic function $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2$ or the cubic function $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3$

In the cubic version, we have created new features $x_2$ and $x_3$ where $x_2=x_1^2$ and $x_3=x_1^3$.

To make it a square root function, we could do: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 \sqrt{x_1}$

One important thing to keep in mind is, if you choose your features this way then feature scaling becomes very important.

e.g. if $x_1$ has range 1 - 1000, then the range of $x_1^2$ becomes 1 - 1000000 and that of $x_1^3$ becomes 1 - 1000000000.
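
A short NumPy sketch of building and scaling polynomial features; the sizes in `x1` are made up for illustration:

```python
import numpy as np

x1 = np.array([1.0, 250.0, 500.0, 1000.0])   # hypothetical house sizes

# new features x2 = x1^2 and x3 = x1^3
X = np.column_stack([x1, x1 ** 2, x1 ** 3])

# feature scaling matters a lot here: the three columns span wildly different ranges
X = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# prepend the column of ones (x_0 = 1) after scaling
X = np.column_stack([np.ones(len(x1)), X])
```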

## 6. Normal Equation

Note: the design matrix X (in the bottom right side of the slide) given in the example should have elements x with subscript 1 and superscripts varying from 1 to m, because for all m training examples there are only 2 features $x_0$ and $x_1$; the X matrix is m by (n+1) and NOT n by n.

Gradient descent gives one way of minimizing J. Let's discuss a second way of doing so, this time performing the minimization explicitly and without resorting to an iterative algorithm. In the "Normal Equation" method, we will minimize J by explicitly taking its derivatives with respect to the $\theta_j$'s, and setting them to zero. This allows us to find the optimum theta without iteration. The normal equation formula is given below:

$\theta = (X^T X)^{-1}X^T y$

In Octave:

```octave
pinv(X' * X) * X' * y
```
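
For comparison, a rough NumPy equivalent of the Octave one-liner (a sketch; `X` is the design matrix with the leading column of ones and `y` the target vector):

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^(-1) X^T y, computed with the pseudo-inverse."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y
```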

There is no need to do feature scaling with the normal equation.

The following is a comparison of gradient descent and the normal equation:

| Gradient Descent | Normal Equation |
| --- | --- |
| Need to choose alpha | No need to choose alpha |
| Needs many iterations | No need to iterate |
| $O(kn^2)$ | $O(n^3)$, need to calculate inverse of $X^T X$ |
| Works well when n is large | Slow if n is very large |

With the normal equation, computing the inversion has complexity $\mathcal{O}(n^3)$. So if we have a very large number of features, the normal equation will be slow. In practice, when n exceeds 10,000 it might be a good time to go from a normal solution to an iterative process.


## 7. Quiz

1. Suppose m=4 students have taken some class, and the class had a midterm exam and a final exam. You have collected a dataset of their scores on the two exams, which is as follows:

| midterm exam | (midterm exam)^2 | final exam |
| --- | --- | --- |
| 89 | 7921 | 96 |
| 72 | 5184 | 74 |
| 94 | 8836 | 87 |
| 69 | 4761 | 78 |

You'd like to use polynomial regression to predict a student's final exam score from their midterm exam score. Concretely, suppose you want to fit a model of the form $h_θ(x)=θ_0 + θ_1 x_1 + θ_2 x_2$, where $x_1$ is the midterm score and $x_2$ is (midterm score)^2. Further, you plan to use both feature scaling (dividing by the "max-min", or range, of a feature) and mean normalization. What is the normalized feature $x_2^{(4)}$? (Hint: midterm = 69, final = 78 is training example 4.) Please round off your answer to two decimal places and enter it in the text box below.

2. You run gradient descent for 15 iterations with $α = 0.3$ and compute $J(θ)$ after each iteration. You find that the value of $J(θ)$ decreases quickly, then levels off. Based on this, which of the following conclusions seems most plausible?
• a. $α = 0.3$ is an effective choice of learning rate.
• b. Rather than use the current value of $α$, it'd be more promising to try a smaller value of $α$ (say $α = 0.1$).
• c. Rather than use the current value of $α$, it'd be more promising to try a larger value of $α$ (say $α = 1.0$).

3. Suppose you have $m = 23$ training examples with $n = 5$ features (excluding the additional all-ones feature for the intercept term, which you should add). The normal equation is $\theta = (X^TX)^{-1}X^Ty$. For the given values of $m$ and $n$, what are the dimensions of $θ$, $X$, and $y$ in this equation?
• a. $X$ is $23\times6$, $y$ is $23\times1$, $θ$ is $6\times1$
• b. $X$ is $23\times5$, $y$ is $23\times1$, $θ$ is $5\times5$
• c. $X$ is $23\times6$, $y$ is $23\times6$, $θ$ is $6\times6$
• d. $X$ is $23\times5$, $y$ is $23\times1$, $θ$ is $5\times1$

4. Suppose you have a dataset with $m=50$ examples and $n=15$ features for each example. You want to use multivariate linear regression to fit the parameters $θ$ to your data. Should you prefer gradient descent or the normal equation?
• a. Gradient descent, since it will always converge to the optimal $α$.
• b. Gradient descent, since $(X^TX)^{-1}$ will be very slow to compute in the normal equation.
• c. The normal equation, since it provides an efficient way to directly find the solution.
• d. The normal equation, since gradient descent might be unable to find the optimal $θ$.

5. Which of the following are reasons for using feature scaling? (Select all that apply.)
• a. It prevents the matrix $X^TX$ (used in the normal equation) from being non-invertible (singular/degenerate).
• b. It speeds up gradient descent by making it require fewer iterations to get to a good solution.
• c. It speeds up solving for $\theta$ using the normal equation.
• d. It is necessary to prevent gradient descent from getting stuck in local optima.

Answers:

1. -0.47: $(7921 + 5184 + 8836 + 4761) / 4 = 6675.5$, $8836 - 4761 = 4075$, $\dfrac{4761 - 6675.5}{4075} = -0.4698159509202454 \approx -0.47$.
2. a. A well-chosen learning rate converges quickly to the correct values; if $\alpha$ is too large, $J(\theta)$ may increase, and if it is too small, $J(\theta)$ decreases only slowly.
3. a
4. c
5. b. (a: feature scaling only helps gradient descent and has no effect on the invertibility of $X^TX$; c: it does not speed up the normal equation; d: feature scaling does nothing about local optima.)
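
A quick NumPy check of answer 1 (a sketch using the numbers from the quiz table):

```python
import numpy as np

x2 = np.array([7921.0, 5184.0, 8836.0, 4761.0])   # the (midterm exam)^2 column
mu = x2.mean()              # 6675.5
s = x2.max() - x2.min()     # 4075.0
print((x2[3] - mu) / s)     # -0.46981..., which rounds to -0.47
```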

## 8. Octave Usage

In a Python environment, you need to import the following libraries:

```python
import numpy as np
import matplotlib.pyplot as plt
```

| Octave | Python |
| --- | --- |
| `1 + 2` | `1 + 2` |
| `1 - 2` | `1 - 2` |
| `1 * 2` | `1 * 2` |
| `1 / 2` | `1 / 2` |
| `1 ^ 2` | `1 ** 2` |

## 9. References

1. c-printf: notes on C string formatting.
2. NumPy manual
3. matplotlib manual
4. SciPy manual