Week 3 - Logistic Regression & Regularization #
Classification and Representation #
Classification #
- to determine what class a new input should fall into, $y \in \lbrace 0, 1, \dots, n \rbrace$
- Classification problems
- Email -> spam/not spam?
- Online transactions -> fraudulent?
- Tumor -> Malignant/benign
- Sometimes linear regression may work, as in the sample below:
- But most of the time the training set won't be so perfect
- Here we use logistic regression, whose output always lies between 0 and 1
Hypothesis Representation #
- We want our hypothesis $h_\theta(x)$ to satisfy $0 \leq h_\theta (x) \leq 1$ .
- Our new form uses the “Logistic Function” , also called the “Sigmoid Function”:
- $\begin{aligned}& h_\theta (x) = g ( \theta^T x )\end{aligned}$
- $\begin{aligned} z = \theta^T x \end{aligned}$
- $\begin{aligned} g(z) = \dfrac{1}{1 + e^{-z}}\end{aligned}$
- $e$ : Euler's number, the base of the natural exponential
- The following image shows us what the logistic function looks like:
- Interpreting hypothesis output, we can use:
- $h_\theta(x) = P(y=1 | x ; \theta)$ , it gives us the probability that the output is 1.
- i.e. the probability that y = 1, given x, parameterized by $\theta$
- $P(y=1 | x ; \theta) + P(y=0 | x ; \theta) = 1$
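As a quick reference, here is a minimal Octave sketch of the logistic function (the helper name sigmoid is my own choice; later snippets in these notes assume it is available):

```
function g = sigmoid(z)
  % Logistic (sigmoid) function, applied element-wise so it works
  % on scalars, vectors and matrices alike.
  g = 1 ./ (1 + exp(-z));
end
```

For example, sigmoid(0) returns 0.5, while sigmoid(10) and sigmoid(-10) are very close to 1 and 0 respectively.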
Decision Boundary #
In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows: $$\begin{aligned}& h_\theta(x) \geq 0.5 \rightarrow y = 1 \\ & h_\theta(x) < 0.5 \rightarrow y = 0 \end{aligned}$$
- so $\begin{aligned}& g(z) \geq 0.5 \text{, when } z \geq 0\end{aligned}$
- $\begin{aligned}z=0, e^{0}=1 &\Rightarrow g(z)=1/2\\ z \to \infty, e^{-\infty} \to 0 &\Rightarrow g(z)=1 \\ z \to -\infty, e^{\infty}\to \infty &\Rightarrow g(z)=0 \end{aligned}$
- so if our input to $g$ is $\theta^T x$ , then that means: $\begin{aligned}& h_\theta(x) = g(\theta^T x) \geq 0.5 \text{, when } \theta^T x \geq 0\end{aligned}$
Sample:
- $h_\theta(x) = g(\theta_0 + \theta_1x_1 + \theta_2x_2)$ :
- $\theta_0 = -3, \theta_1 = 1, \theta_2 = 1$
So our parameter vector is a column vector with the above values: $$\theta = \begin{bmatrix} -3 \\ 1 \\ 1 \end{bmatrix}$$
Then $z$ becomes $\theta^T x$
We predict " $y=1$ " if
- $$\begin{aligned}-3x_0 + 1x_1 + 1x_2 &\geq 0 \\ -3 + x_1 + x_2 &\geq 0 \\ x_1 + x_2 &\geq 3\end{aligned}$$
So $x_1 + x_2 = 3$ is our decision boundary, which we can plot graphically:
- In the plot:
- Blue = false, Magenta = true
- Line = decision boundary
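A small Octave check of this boundary (a sketch, reusing the sigmoid helper from above; the inline predict function is my own illustration):

```
theta = [-3; 1; 1];                                % parameters from the example
predict = @(x) double(sigmoid(theta' * x) >= 0.5); % same as testing theta' * x >= 0

predict([1; 1; 1])   % x1 + x2 = 2 < 3  -> predicts 0
predict([1; 2; 2])   % x1 + x2 = 4 >= 3 -> predicts 1
```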
Non-linear decision boundaries #
- Get logistic regression to fit a complex non-linear data set
- $h_\theta(x) = g(\theta_0 +\theta_1x_1 + \theta_2x_2 + \theta_3x_1^2 + \theta_4x_2^2)$
- Say $\theta^T = \begin{bmatrix}-1 & 0 & 0 & 1 & 1\end{bmatrix}$ :
- Predict that " $y = 1$ " if $x_1^2 + x_2^2 \geq 1$
- If we plot $x_1^2 + x_2^2 = 1$ , then this gives us a circle with a radius of 1 centered at the origin:
- This means we can build more complex decision boundaries by fitting complex parameters to this (relatively) simple hypothesis (see the sketch below)
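A corresponding Octave sketch for the circular boundary (the feature-building helper below is my own illustration, not from the course):

```
theta = [-1; 0; 0; 1; 1];                     % parameters from the example
features = @(x1, x2) [1; x1; x2; x1^2; x2^2]; % build the polynomial feature vector

double(theta' * features(0.5, 0.5) >= 0)  % inside the unit circle  -> 0
double(theta' * features(1.5, 0.0) >= 0)  % outside the unit circle -> 1
```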
Logistic Regression Model #
Cost function #
- Define the optimization objective, i.e. the cost function we use to fit the parameters
- Training set: $\lbrace (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)}) \rbrace$
- $m$ examples, each $x = \begin{bmatrix}x_0\\ x_1\\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^{n+1}$ with $x_0 = 1$, and $y \in \lbrace 0, 1 \rbrace$
- $h_\theta(x) = \dfrac{1}{1 + e^{-\theta^Tx}}$
- Each example is a feature vector which is $n+1$ dimensional
- Linear regression uses the following function to determine $\theta$
- $J(\theta) = \dfrac {1}{m} \displaystyle \sum_{i=1}^m \dfrac{1}{2}\left (h_\theta (x^{(i)}) - y^{(i)} \right)^2$
- Define a $\mathrm{Cost}$ function to simplify the expression:
- $\mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)}) = \dfrac{1}{2}(h_\theta(x^{(i)}) - y^{(i)})^2$
- Then we get:
- $J(\theta) = \dfrac{1}{m} \displaystyle \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)})$
- to further simplify it, we can get rid of the superscripts:
- $J(\theta) = \dfrac{1}{m} \displaystyle \sum_{i=1}^m \mathrm{Cost}(h_\theta(x),y)$
- If we use this function for logistic regression, it will be a non-convex function with many local optima. Like:
- So we come up with a new, convex logistic regression cost function:
- $\begin{aligned} & \mathrm{Cost}(h_\theta(x),y) = -\log(h_\theta(x)) ; & \text{if y = 1} \\ & \mathrm{Cost}(h_\theta(x),y) = -\log(1-h_\theta(x)) ; & \text{if y = 0}\end{aligned}$
- Since $0 \le h_\theta(x) \le 1$, this cost has the following properties:
- $\begin{aligned}& \mathrm{Cost}(h_\theta(x),y) = 0 \text{ if } h_\theta(x) = y \\ & \mathrm{Cost}(h_\theta(x),y) \rightarrow \infty \text{ if } y = 0 \text{ and } h_\theta(x) \rightarrow 1 \\ & \mathrm{Cost}(h_\theta(x),y) \rightarrow \infty \text{ if } y = 1 \text{ and } h_\theta(x) \rightarrow 0 \\ \end{aligned}$
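To see these properties numerically, here is a tiny Octave sketch of the two cases (the value of h is chosen arbitrarily for illustration):

```
h = 0.99;      % a prediction very confident that y = 1
-log(h)        % cost if y = 1: close to 0
-log(1 - h)    % cost if y = 0: very large (tends to infinity as h -> 1)
```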
Simplified cost function and gradient descent #
- Compress cost function’s two conditional cases into one case:
- $\mathrm{Cost}(h_\theta(x),y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))$
- Because $y$ is always either 0 or 1:
- if y = 1, then $(1 - y) \log(1 - h_\theta(x)) = 0$
- if y = 0, then $y \log(h_\theta(x)) = 0$
- This cost function can be derived from statistics using the principle of maximum likelihood estimation, an idea for how to efficiently find parameters for different models.
- Then we can fully write out our entire cost function as follows:
- $J(\theta) = - \frac{1}{m} \displaystyle \sum_{i=1}^m [y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)}))]$
- and a vectorized implementation is:
- $\begin{aligned} & h = g(X\theta)\\ & J(\theta) = \frac{1}{m} \cdot \left(-y^{T}\log(h)-(1-y)^{T}\log(1-h)\right) \end{aligned}$
- Gradient Descent
- the general form is:
- $\begin{aligned}& \text{Repeat} \; \lbrace \\ & \quad \theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j}J(\theta) \\ & \rbrace\end{aligned}$
- We can work out the derivative part using calculus to get:
- $\begin{aligned} & \text{Repeat} \; \lbrace \\ & \quad \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \\ & \rbrace \end{aligned}$
- A vectorized implementation is:
- $\theta := \theta - \frac{\alpha}{m} X^{T} (g(X \theta ) - \vec{y})$
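Putting the pieces together, here is a hedged Octave sketch of the vectorized cost, gradient, and a simple descent loop (the names logisticCostGrad and gradientDescent are my own; assumes X is m×(n+1) with a leading column of ones, y is m×1, and sigmoid is defined as earlier):

```
function [J, grad] = logisticCostGrad(theta, X, y)
  % Vectorized logistic-regression cost and gradient.
  m = length(y);
  h = sigmoid(X * theta);
  J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));
  grad = (1 / m) * X' * (h - y);
end

function theta = gradientDescent(X, y, theta, alpha, num_iters)
  % Simple batch gradient-descent loop using the gradient above.
  for iter = 1:num_iters
    [~, grad] = logisticCostGrad(theta, X, y);
    theta = theta - alpha * grad;
  end
end
```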
Advanced Optimization #
Alternatively, instead of gradient descent, we could minimize the cost function using:
- Conjugate gradient
- BFGS (Broyden-Fletcher-Goldfarb-Shanno)
- L-BFGS (Limited memory - BFGS)
Some properties
- Advantages
- No need to manually pick alpha (learning rate)
- Have a clever inner loop (line search algorithm) which tries a bunch of alpha values and picks a good one
- Often faster than gradient descent
- Do more than just pick a good learning rate
- Can be used successfully without understanding their complexity
- Disadvantages
- Could make debugging more difficult
- Shouldn't be implemented yourself; use an existing library
- Different libraries may use different implementations, which may affect performance
Using advanced cost minimization algorithms
Example:
- $\theta = \begin{bmatrix}\theta_1\\ \theta_2\end{bmatrix}$
- $J(\theta) = (\theta_1 - 5)^2 + (\theta_2 - 5)^2$
- $\dfrac{\partial}{\partial \theta_1}J(\theta) = 2(\theta_1 - 5)$
- $\dfrac{\partial}{\partial \theta_2}J(\theta) = 2(\theta_2 - 5)$
- In the example above:
- $θ_1$ and $θ_2$ (two parameters)
- Cost function here is $J(\theta) = (\theta_1 - 5)^2 + (\theta_2 - 5)^2$
- The derivative of $J(θ)$ with respect to each of $θ_1$ and $θ_2$ turns out to be $2(θ_i - 5)$
First, define our cost function:
```
function [jVal, gradient] = costFunction(theta)
  jVal = [...code to compute J(theta)...];
  gradient = [...code to compute derivative of J(theta)...];
end
```
- jVal $= (\theta_1 - 5)^2 + (\theta_2 - 5)^2$
- gradient is a 2 by 1 vector; its two elements are the two partial derivative terms
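A filled-in version of costFunction for this particular example might look like the following sketch, where $J(\theta) = (\theta_1 - 5)^2 + (\theta_2 - 5)^2$:

```
function [jVal, gradient] = costFunction(theta)
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;   % J(theta)
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);             % dJ/dtheta_1
  gradient(2) = 2 * (theta(2) - 5);             % dJ/dtheta_2
end
```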
Then,
```
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
```
- Here,
- options is a data structure giving options for the algorithm
- fminunc
- a function that minimizes the cost function (finds the minimum of an unconstrained multivariable function)
- @costFunction is a pointer to the costFunction function to be used
- For the octave implementation
- initialTheta must be a vector with at least two elements (fminunc does not work with a one-dimensional $\theta$)
Multiclass Classification: One-vs-all #
- Divide our problem into n+1 (+1 because the index starts at 0) binary classification problems; in each one, we predict the probability that $y$ is a member of one of our classes.
$$\begin{aligned} y & \in \lbrace0, 1, \dots, n\rbrace \\ h_\theta^{(0)}(x) & = P(y = 0 | x ; \theta) \\ h_\theta^{(1)}(x) & = P(y = 1 | x ; \theta) \\ & \vdots \\ h_\theta^{(n)}(x) & = P(y = n | x ; \theta) \\ \mathrm{prediction} & = \max_i( h_\theta ^{(i)}(x) ) \end{aligned}$$
- The following image shows how one could classify 3 classes:
- Overall
- Train a logistic regression classifier $h_{θ}^{(i)}(x)$ for each class i to predict the probability that $y = i$
- On a new input $x$ , to make a prediction, pick the class $i$ that maximizes $h_θ^{(i)}(x)$ , i.e. the classifier most confident that $y = 1$
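As an illustration, here is a hedged Octave sketch of the one-vs-all prediction step (the name predictOneVsAll and the all_theta layout, one row of trained parameters per class, are my own assumptions):

```
function p = predictOneVsAll(all_theta, X)
  % all_theta : K x (n+1), one trained parameter row per class
  % X         : m x (n+1), examples with a leading column of ones
  probs = sigmoid(X * all_theta');   % m x K matrix of class probabilities
  [~, p] = max(probs, [], 2);        % index of the most probable class per example
end
```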
The Problem of Overfitting #
Problems #
- Three figures show underfitting, a good fit, and overfitting (taking housing prices as the sample):
- under-fitting or high bias: leftmost, $y = θ_0 + θ_1x$ ; the data doesn't really lie on a straight line, so the fit is poor.
- overfitting: rightmost, $y = \sum_{j=0} ^5 \theta_j x^j$ ; it fits the training data closely but is not a good predictor of new examples.
- good fit: middle, $y = \theta_0 + \theta_1x + \theta_2x^2$ ; obtains a slightly better fit to the data.
Addressing overfitting #
- Reduce the number of features:
- Manually select which features to keep.
- Use a model selection algorithm.
- Regularization
- Keep all the features, but reduce the magnitude of parameters $\theta_j$ .
- Regularization works well when we have a lot of slightly useful features.
Cost Function #
- If we have overfitting from our hypothesis function, we can reduce the weight that some of the terms in our function carry by increasing their cost.
- Say we wanted to make the following function more quadratic:
- $\theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3 + \theta_4x^4$
- We'll want to eliminate the influence of $\theta_3x^3$ and $\theta_4x^4$ . Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function:
- $\min_\theta \; \dfrac{1}{2m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + 1000\cdot\theta_3^2 + 1000\cdot\theta_4^2$
- Add two extra terms at the end to inflate the cost of $\theta_3$ and $\theta_4$ . This will in turn greatly reduce the values of $\theta_3x^3$ and $\theta_4x^4$ in our hypothesis function.
- We could also regularize all of our theta parameters in a single summation as:
- $\min_\theta \; \dfrac{1}{2m} \left[ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^n \theta_j^2 \right]$
- $\lambda$ (lambda), is the regularization parameter. It determines how much the costs of our theta parameters are inflated.
- But if lambda is chosen to be too large, it may smooth out the function too much and cause under-fitting.
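A hedged Octave sketch of this regularized cost for linear regression (the function name regLinearCost is my own; theta(1) corresponds to $\theta_0$ and is left out of the penalty):

```
function J = regLinearCost(theta, X, y, lambda)
  m = length(y);
  h = X * theta;                                       % linear-regression hypothesis
  J = (1 / (2 * m)) * sum((h - y) .^ 2) ...
      + (lambda / (2 * m)) * sum(theta(2:end) .^ 2);   % skip theta_0 in the penalty
end
```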
Regularized Linear Regression #
- Gradient Descent
Repeat $$\begin{aligned} \begin{cases} \theta_0 &:= \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_0^{(i)} \\ \theta_j &:= \theta_j - \alpha \left[\left(\frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}\right) + \frac{\lambda}{m}\theta_j\right] &\quad j \in \lbrace 1,2,\ldots,n \rbrace \end{cases} \end{aligned}$$
- The term $\frac{\lambda}{m}\theta_j$ performs our regularization. With some manipulation our update rule can also be represented as:
- $\theta_j := \theta_j(1 - \alpha\frac{\lambda}{m}) - \alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$
- The first term in the above equation, $1 - \alpha\frac{\lambda}{m}$ will always be less than 1. Intuitively you can see it as reducing the value of $θ_j$ by some amount on every update. Notice that the second term is now exactly the same as it was before.
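One regularized update step, written as a hedged Octave sketch of the rule above (assumes X, y, theta, alpha, lambda and m are already defined; theta(1) is theta_0 and is not regularized):

```
grad = (1 / m) * X' * (X * theta - y);                    % unregularized gradient
grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);  % add regularization, skipping theta_0
theta = theta - alpha * grad;                             % one gradient-descent step
```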
Normal Equation #
- To add in regularization, the equation is the same as our original, except that we add another term inside the parentheses:
- $\begin{aligned}& \theta = \left( X^TX + \lambda \cdot L \right)^{-1} X^Ty \\ & \text{where } L = \begin{bmatrix} 0 & & & & \\ & 1 & & & \\ & & 1 & & \\ & & & \ddots & \\ & & & & 1\end{bmatrix}\end{aligned}$
- L is a matrix with 0 at the top left and 1’s down the diagonal, with 0’s everywhere else. It should have dimension (n+1)×(n+1). Intuitively, this is the identity matrix (though we are not including $x_0$ ), multiplied with a single real number $\lambda$ .
- Recall that if $m < n$ , then $X^TX$ is non-invertible (and if $m = n$ it may be non-invertible). However, when we add the term $\lambda \cdot L$ , then $X^TX + \lambda \cdot L$ becomes invertible.
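In Octave this regularized normal equation might be sketched as follows (assuming n features plus the bias, with X of size m×(n+1) and y of size m×1):

```
L = eye(n + 1);        % (n+1) x (n+1) identity ...
L(1, 1) = 0;           % ... with the top-left entry zeroed so theta_0 is not penalized
theta = pinv(X' * X + lambda * L) * X' * y;
```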
Regularized Logistic Regression #
- Cost function:
- $J(\theta) = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log (h_\theta (x^{(i)})) + (1 - y^{(i)}) \log (1 - h_\theta(x^{(i)})) \right]$
- Regularize this equation by adding a term to the end:
- $J(\theta) = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log (h_\theta (x^{(i)})) + (1 - y^{(i)}) \log (1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2$
- The second sum, $\sum_{j=1}^n \theta_j^2$ , explicitly excludes the bias term $\theta_0$ : the $\theta$ vector is indexed from 0 to n (holding n+1 values, $\theta_0$ through $\theta_n$ ), and this sum runs from 1 to n, skipping 0 (for regularization we don't penalize $θ_0$ , so it is treated slightly differently). Thus, when computing the equation, we should continuously update the two following equations:
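For example, a hedged Octave sketch of the regularized logistic cost and gradient in one function suitable for fminunc (the name regLogisticCost is my own; assumes the sigmoid helper from earlier and that theta(1) is theta_0):

```
function [J, grad] = regLogisticCost(theta, X, y, lambda)
  m = length(y);
  h = sigmoid(X * theta);
  J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h)) ...
      + (lambda / (2 * m)) * sum(theta(2:end) .^ 2);      % theta_0 not penalized
  grad = (1 / m) * X' * (h - y);                          % unregularized gradient
  grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);% regularize theta_1..theta_n
end
```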
Words #
logistic [ləu'dʒistik] adj. (math) of or relating to symbolic logic; logistic
fraudulent ['frɔ:djulənt] adj. deceptive; dishonest
malignant [mə'liɡnənt] adj. (medicine) malignant
benign [bi'nain] adj. (medicine) benign
polynomial [,pɔli'nəumiəl] n. (math) polynomial
quadratic [kwɔ'drætik] adj. (math) quadratic, of the second degree; n. quadratic equation
penalize ['pi:nəlaiz] vt. to punish; to penalize; to put at a disadvantage
sigmoid function ['sigmɔid] S-shaped (doubly curved) function