📊 Logistic Regression for Classification
In classification problems, the output variable $y$ takes discrete values.
Classification Types
1. Binary Classification
Two classes:
$$y \in \{0,1\}$$
We usually call:
$0$ → Negative class: 0 represents the absence of something.
$1$ → Positive class: 1 represents the presence of something (e.g., a disease).
2. Multi-class Classification
More than two classes:
$$y \in \{0,1,2,3,\ldots\}$$
The Sigmoid Function $\sigma(z)$
The sigmoid function (also called the logistic function) maps any real number into the open interval $(0, 1)$.
It is commonly used in logistic regression to model probabilities.
The sigmoid function is defined as:
$$\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{e^z + 1} = 1 - \frac{1}{1 + e^z} = \frac{1}{2}\left(1 + \tanh\left(\frac{z}{2}\right)\right) = 1 - \sigma(-z)$$
Where:
$z$ is the input to the function (can be any real number)
$e$ is the base of the natural logarithm (approximately 2.71828)
Output:
$\sigma(z)$ is always between 0 and 1, making it suitable for modeling probabilities.
When $z$ is large and positive, $\sigma(z) \approx 1$.
$z \to +\infty$, $\sigma(z) \to 1$
When $z$ is large and negative, $\sigma(z) \approx 0$.
$z \to -\infty$, $\sigma(z) \to 0$
When $z = 0$, $\sigma(z) = 0.5$.
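The limiting behaviour above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `sigmoid` and the branch on the sign of the input are illustrative choices, not something the notes prescribe. The branch uses the equivalent form $e^z / (e^z + 1)$ for negative inputs so that `math.exp` is never called with a large positive argument.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^(-z)), evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use the equivalent form e^z / (e^z + 1), so that
    # math.exp never receives a large positive argument.
    ez = math.exp(z)
    return ez / (ez + 1.0)

# Print the value at a large negative input, at zero, and at a large positive input.
for z in (-10.0, 0.0, 10.0):
    print(z, sigmoid(z))
```

The printed values should approach 0 on the left, equal 0.5 at zero, and approach 1 on the right.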
💡 Logistic Regression Hypothesis $h_\theta(x)$
Logistic regression ensures:
$$0 \le h_\theta(x) \le 1$$
Where
Input: any real number: $(-\infty, +\infty)$
Output: always between: $(0,1)$
Instead of the linear hypothesis $h_\theta(x) = \theta^T x$,
we apply a transformation that squashes outputs into the probability range $(0,1)$:
$$h_\theta(x) = g(\theta^T x)$$
So the output becomes a probability: $h_\theta(x) = P(y=1 \mid x)$
This can be simplified to:
$$h_\theta(x) = g(z)$$
Where
$$z = \theta^T x$$
and $g(z)$ is the sigmoid function:
$$g(z) = \frac{1}{1 + e^{-z}}$$
Final Hypothesis
$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$
This ensures:
$$0 \le h_\theta(x) \le 1$$
$$h_\theta(x) = P(y = 1 \mid x; \theta)$$
So:
If $h_\theta(x) = 0.7$
→ There is a 70% probability that $y = 1$
Since probabilities must sum to 1:
$$P(y=0 \mid x; \theta) = 1 - P(y=1 \mid x; \theta)$$
So if $h_\theta(x) = 0.7$, then:
$$P(y = 1 \mid x; \theta) = 0.7$$
$$P(y = 0 \mid x; \theta) = 1 - 0.7 = 0.3$$
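To make the probabilistic reading concrete, here is a small sketch that evaluates the hypothesis for one input and reports both class probabilities. The parameter values and the input are made up for illustration, and the helper names `sigmoid` and `hypothesis` are assumptions rather than names from the notes. The feature vector is assumed to already contain the constant $x_0 = 1$ for the bias term.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x must already include x_0 = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

theta = [-1.0, 2.0]  # hypothetical parameters (bias, one weight)
x = [1.0, 1.0]       # x_0 = 1 for the bias, x_1 = 1

p1 = hypothesis(theta, x)  # P(y = 1 | x; theta), here g(1)
p0 = 1.0 - p1              # P(y = 0 | x; theta)
print("P(y=1):", p1)
print("P(y=0):", p0)
```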
🔑 Decision Boundary
The decision boundary is the line that separates the area where y = 0 and where y = 1.
It is created by our hypothesis function / model.
Decision Boundary is a Property of the Model
The decision boundary depends only on:
The hypothesis form
The parameters $\theta$
It does not depend on the training data once $\theta$ is fixed.
The training set is used only to learn $\theta$.
Logistic regression decision boundary
For logistic regression, the decision boundary is where:
$$h_\theta(x) = 0.5$$
To get a discrete 0 or 1 classification, we threshold the output of the hypothesis function as follows:
➕ If $h_\theta(x) \ge 0.5$, we predict $y = 1$
➖ If $h_\theta(x) < 0.5$, we predict $y = 0$
When Is $h_\theta(x) \ge 0.5$?
Since:
$$g(z) \ge 0.5 \quad \text{when} \quad z \ge 0$$
and
$$h_\theta(x) = g(\theta^T x),$$
we predict:
$$y = 1 \quad \text{when} \quad \theta^T x \ge 0$$
and
$$y = 0 \quad \text{when} \quad \theta^T x < 0$$
Example
Given:
$$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$$
and since the sigmoid satisfies $g(z) = 0.5$ when $z = 0$, the boundary is:
$$\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0$$
Given:
$\theta_0 = -6$
$\theta_1 = 0$
$\theta_2 = 1$
So:
$$-6 + 0 \cdot x_1 + 1 \cdot x_2 = 0$$
Simplifies to:
$$x_2 = 6$$
This is a horizontal line, independent of $x_1$, located at $x_2 = 6$.
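The example above can be sketched in code. The snippet assumes the thresholding rule "predict 1 when $\theta^T x \ge 0$" and uses the parameters $\theta = (-6, 0, 1)$ from the example. The function name `predict` and the test points are illustrative assumptions.

```python
def predict(theta, x):
    """Return 1 when theta^T x >= 0 (equivalently h_theta(x) >= 0.5), else 0."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

theta = [-6.0, 0.0, 1.0]  # parameters from the example: boundary at x_2 = 6

# Changing x_1 never changes the prediction; only x_2 relative to 6 matters.
for x1, x2 in [(0.0, 5.0), (100.0, 5.0), (0.0, 6.0), (-3.0, 7.5)]:
    print((x1, x2), predict(theta, [1.0, x1, x2]))
```

Points with $x_2 < 6$ fall in the $y = 0$ region whatever the value of $x_1$, which is what "horizontal line" means here.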
Linear Decision Boundary
Suppose:
$$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$$
Let:
$$\theta_0 = -3, \quad \theta_1 = 1, \quad \theta_2 = 1$$
Then:
$$\theta^T x = -3 + x_1 + x_2$$
We predict $y = 1$ when:
$$-3 + x_1 + x_2 \ge 0$$
Rewriting:
$$x_1 + x_2 \ge 3$$
Decision Boundary
The decision boundary occurs when:
$$x_1 + x_2 = 3$$
This is a straight line.
It separates the plane into:
Region where $y = 1$
Region where $y = 0$
The decision boundary corresponds to:
$$h_\theta(x) = 0.5$$
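As a quick worked check of this boundary, take three illustrative points (chosen here, not taken from the notes) and evaluate $\theta^T x = -3 + x_1 + x_2$ for each:

$$
\begin{aligned}
(x_1, x_2) = (1, 1): &\quad -3 + 1 + 1 = -1 < 0 &&\Rightarrow\ \text{predict } y = 0 \\
(x_1, x_2) = (2, 2): &\quad -3 + 2 + 2 = 1 \ge 0 &&\Rightarrow\ \text{predict } y = 1 \\
(x_1, x_2) = (1, 2): &\quad -3 + 1 + 2 = 0 &&\Rightarrow\ h_\theta(x) = g(0) = 0.5,\ \text{predict } y = 1
\end{aligned}
$$

The third point lies exactly on the line $x_1 + x_2 = 3$, where the rule $h_\theta(x) \ge 0.5$ assigns the positive class.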
Nonlinear Decision Boundaries
We can add polynomial features.
Example:
$$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)$$
Suppose:
$$\theta_0 = -1, \quad \theta_1 = 0, \quad \theta_2 = 0, \quad \theta_3 = 1, \quad \theta_4 = 1$$
Then:
$$\theta^T x = -1 + x_1^2 + x_2^2$$
We predict $y = 1$ when:
$$-1 + x_1^2 + x_2^2 \ge 0$$
Rewriting:
$$x_1^2 + x_2^2 \ge 1$$
Decision Boundary
The boundary is:
$$x_1^2 + x_2^2 = 1$$
This is a circle of radius 1 centered at the origin; we predict $y = 1$ on or outside it and $y = 0$ inside it.
So logistic regression can produce nonlinear boundaries using polynomial features.
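The circular boundary can be reproduced with a short sketch. It assumes the feature order $[1, x_1, x_2, x_1^2, x_2^2]$ to match $\theta_0, \ldots, \theta_4$ in the example. The helper names `poly_features` and `predict` are hypothetical.

```python
def poly_features(x1, x2):
    """Map (x1, x2) to [1, x1, x2, x1^2, x2^2], matching theta_0..theta_4."""
    return [1.0, x1, x2, x1 ** 2, x2 ** 2]

def predict(theta, features):
    """Return 1 when theta^T x >= 0, else 0."""
    z = sum(t * f for t, f in zip(theta, features))
    return 1 if z >= 0 else 0

theta = [-1.0, 0.0, 0.0, 1.0, 1.0]  # parameters from the example above

print(predict(theta, poly_features(0.0, 0.0)))  # centre of the circle
print(predict(theta, poly_features(0.5, 0.5)))  # still inside the unit circle
print(predict(theta, poly_features(1.0, 1.0)))  # outside the unit circle
```

The model is still linear in the parameters $\theta$; only the features are nonlinear in $x_1$ and $x_2$.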
More Complex Boundaries
By adding higher-order terms such as:
$x_1^3$
$x_1 x_2$
$x_1^2 x_2$
etc.
Logistic regression can represent:
Ellipses
Complex curves
Highly nonlinear shapes
💰 Cost Function / Optimal Objective
The overall cost is:
$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \text{Cost}\big(h_\theta(x^{(i)}), y^{(i)}\big)$$
where:
$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$
Why Not Use Squared Error Cost?
In linear regression, we use:
$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)^2$$
If we use the same squared error with the sigmoid hypothesis:
The cost function becomes non-convex
Optimization may get stuck in local minima
Training may fail to find the best parameters
So we need a better cost function.
We define cost separately for the two classes.
The cost function is defined as:
$$\text{Cost}(h_\theta(x), y) =
\begin{cases}
-\log(h_\theta(x)) & \text{if } y = 1 \\
-\log(1 - h_\theta(x)) & \text{if } y = 0
\end{cases}$$
Why This Cost Function Is Better
It is convex
No local minima
Guarantees only one global minimum
Optimization is reliable
Smooth and well-behaved
Because $J(\theta)$ is convex:
Gradient descent with a suitably small learning rate converges toward the global minimum
We do not get stuck in bad local optima
Training is stable
It penalizes wrong predictions heavily
Cost → 0 as the prediction approaches the correct label
Cost → ∞ as the prediction approaches the wrong label with full confidence
Encourages the model to be confident and correct
Case 1: When $y = 1$
We want $h_\theta(x)$ to be close to 1.
Cost:
$$-\log(h_\theta(x))$$
If prediction is close to 1 → cost is small
If $h_\theta(x) = 1$ → cost $= 0$
If prediction is close to 0 → cost is very large
If $h_\theta(x) \to 0$ → cost $\to \infty$
So:
Correct confident prediction → small cost
Wrong confident prediction → very large cost
Case 2: When $y = 0$
We want $h_\theta(x)$ to be close to 0.
Cost:
$$-\log(1 - h_\theta(x))$$
If prediction is close to 1 → cost is very large
If $h_\theta(x) \to 1$ → cost $\to \infty$
If prediction is close to 0 → cost is small
If $h_\theta(x) = 0$ → cost $= 0$
Again:
Correct prediction → small cost
Wrong confident prediction → large penalty
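A few numbers make the asymmetry visible. Assuming $\log$ denotes the natural logarithm (the usual convention for this loss) and taking a true label of $y = 1$, the cost for some illustrative predictions is approximately:

$$
\begin{aligned}
h_\theta(x) = 0.99: &\quad -\log(0.99) \approx 0.010 \\
h_\theta(x) = 0.9: &\quad -\log(0.9) \approx 0.105 \\
h_\theta(x) = 0.5: &\quad -\log(0.5) \approx 0.693 \\
h_\theta(x) = 0.1: &\quad -\log(0.1) \approx 2.303 \\
h_\theta(x) = 0.01: &\quad -\log(0.01) \approx 4.605
\end{aligned}
$$

For $y = 0$ the same values apply with $h_\theta(x)$ replaced by $1 - h_\theta(x)$, so predicting $0.9$ when the label is $0$ costs about $2.303$.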
Unified Logistic Cost Function
Simplified Cost Function (Single Formula)
We can combine the two cases into one equation:
$$\text{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y)\log(1 - h_\theta(x))$$
If $y = 1$:
The second term becomes 0
Cost reduces to:
$$-\log(h_\theta(x))$$
If $y = 0$:
The first term becomes 0
Cost reduces to:
$$-\log(1 - h_\theta(x))$$
So this single formula covers both cases.
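The equivalence of the two-case definition and the single formula can be checked directly. This sketch compares them for a few predictions strictly between 0 and 1. The function names `cost_piecewise` and `cost_unified` are illustrative.

```python
import math

def cost_piecewise(h, y):
    """Two-case definition: -log(h) if y == 1, else -log(1 - h)."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

def cost_unified(h, y):
    """Single formula: -y*log(h) - (1 - y)*log(1 - h)."""
    return -y * math.log(h) - (1 - y) * math.log(1.0 - h)

# The two columns of costs printed on each row should agree.
for h in (0.1, 0.5, 0.9):
    for y in (0, 1):
        print(h, y, cost_piecewise(h, y), cost_unified(h, y))
```

Both functions require $0 < h < 1$; at exactly 0 or 1 the logarithm is undefined, which is why practical implementations often clip predictions slightly away from the endpoints.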
Full Cost Function
Full Cost Function Over Dataset
For $m$ training examples:
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)}))\right]$$
This is called:
Log loss
Cross-entropy loss
Logistic loss
Where:
$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$
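The full cost can be sketched in plain Python as follows. The tiny dataset, the function name `cost`, and the clipping constant `eps` are all assumptions made for illustration. Clipping is a common practical safeguard against $\log(0)$ and is not part of the formula itself.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, X, y, eps=1e-12):
    """Mean cross-entropy J(theta); each row of X includes the leading 1 for the bias."""
    m = len(y)
    total = 0.0
    for row, label in zip(X, y):
        h = sigmoid(sum(t * xi for t, xi in zip(theta, row)))
        h = min(max(h, eps), 1.0 - eps)  # keep log() finite; eps is an assumed safeguard
        total += -label * math.log(h) - (1 - label) * math.log(1.0 - h)
    return total / m

# Hypothetical one-feature dataset: small x -> class 0, large x -> class 1.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]

print(cost([0.0, 0.0], X, y))   # every h is 0.5 here, so J equals log(2)
print(cost([-4.5, 2.0], X, y))  # parameters that separate the classes give a lower cost
```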
Vectorized Cost Function
Let:
$X$ = design matrix
$y$ = vector of labels
Then:
$$h = g(X\theta)$$
and
$$J(\theta) = \frac{1}{m}\left(-y^T \log(h) - (1 - y)^T \log(1 - h)\right)$$
where $g$ and $\log$ are applied elementwise.
🧠 Key Takeaways: Cost Function
Logistic regression uses a convex cost function
The simplified cost formula works for both $y=0$ and $y=1$
Gradient descent update looks the same as linear regression
Vectorization makes implementation efficient
Always include the $\frac{1}{m}$ factor in the gradient update
🎢 Gradient Descent
The goal is to minimize the cost function $J(\theta)$ by repeatedly updating the parameters.
For $m$ training examples:
Repeat until convergence:
for $j = 0, 1, \ldots, n$
Update each parameter $\theta_j$ simultaneously using the rule:
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
$\theta_0$ is the bias term,
$\theta_1, \ldots, \theta_n$ are the feature weights.
where:
$\alpha$ is the learning rate
$J(\theta)$ is the cost function
$\frac{\partial}{\partial \theta_j}J(\theta)$ is the partial derivative of the cost function with respect to $\theta_j$
For logistic regression, the cost function is:
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)} \log(h_\theta(x^{(i)})) + (1-y^{(i)}) \log(1-h_\theta(x^{(i)}))\right]$$
Where:
$$h_\theta(x) = g(\theta^T x)$$
and
$$g(z) = \frac{1}{1+e^{-z}}, \quad z = \theta^T x$$
The partial derivative of this cost with respect to $\theta_j$ is:
$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$
Substituting this gradient into the update rule gives:
$$\theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$
for $j = 0, 1, \ldots, n$
Logistic Regression Gradient
After computing the derivative, we get:
$$\theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$
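For readers who want the intermediate steps, here is a brief sketch of the derivative for a single example, writing $h = h_\theta(x) = g(\theta^T x)$. It relies on the standard identity $g'(z) = g(z)\,(1 - g(z))$ for the sigmoid, which the notes do not state explicitly.

$$
\begin{aligned}
\frac{\partial h}{\partial \theta_j} &= g'(\theta^T x)\, x_j = h\,(1 - h)\, x_j \\
\frac{\partial}{\partial \theta_j}\Big[-y \log h - (1 - y)\log(1 - h)\Big]
&= -\left(\frac{y}{h} - \frac{1 - y}{1 - h}\right) h\,(1 - h)\, x_j \\
&= -\big(y\,(1 - h) - (1 - y)\, h\big)\, x_j \\
&= (h - y)\, x_j
\end{aligned}
$$

Averaging over the $m$ training examples gives $\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)\, x_j^{(i)}$, which is the term multiplied by $\alpha$ in the update rule.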
Important Notes
This is identical in form to linear regression gradient descent.
We must update all $\theta_j$ simultaneously.
The difference lies in the hypothesis function:
$$h_\theta(x) = g(\theta^T x)$$
where
$$g(z) = \frac{1}{1 + e^{-z}}$$
Vectorized Gradient Descent
Let:
$X$ = design matrix
$\vec{y}$ = vector of labels
$h = g(X\theta)$, with $g$ applied elementwise
Then the update rule becomes:
$$\theta := \theta - \frac{\alpha}{m} X^T \left(g(X\theta) - \vec{y}\right)$$
Where:
$\vec{y}$ is the vector of labels
$X^T$ is the transpose of the design matrix
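The update rule can be sketched without any matrix library by spelling out $X^T (g(X\theta) - \vec{y})$ with list comprehensions. The dataset, the learning rate, the iteration count, and the name `gradient_step` are assumptions chosen for illustration. A numerical library would normally replace the explicit loops.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(theta, X, y, alpha):
    """One simultaneous update: theta := theta - (alpha/m) * X^T (g(X theta) - y)."""
    m = len(y)
    # errors[i] = h_theta(x_i) - y_i, i.e. the vector g(X theta) - y
    errors = [sigmoid(sum(t * xi for t, xi in zip(theta, row))) - label
              for row, label in zip(X, y)]
    # grad[j] = (1/m) * sum_i errors[i] * X[i][j], i.e. (1/m) * X^T errors
    grad = [sum(e * row[j] for e, row in zip(errors, X)) / m
            for j in range(len(theta))]
    # Build the new list from the old theta so all parameters update together.
    return [t - alpha * g for t, g in zip(theta, grad)]

# Hypothetical one-feature dataset; the first column is x_0 = 1.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]

theta = [0.0, 0.0]
for _ in range(1000):
    theta = gradient_step(theta, X, y, alpha=0.1)
print(theta)
```

Returning a new list, rather than overwriting entries in place, is what makes the update simultaneous: every gradient component is computed from the old $\theta$.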
⚖️ Regularized Logistic Regression
Regularization helps prevent overfitting by penalizing large weights.
Compared to the non-regularized model, the regularized version produces smoother decision boundaries.
Cost Function (Without Regularization)
Recall the logistic regression cost function:
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)}))\right]$$
Cost Function With Regularization
We add an L2 penalty term:
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$
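A minimal sketch of the regularized cost is shown below. The penalty sum starts at $j = 1$, so the code skips `theta[0]`. The dataset, the clipping constant `eps`, and the function name `regularized_cost` are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def regularized_cost(theta, X, y, lam, eps=1e-12):
    """Cross-entropy plus (lam / 2m) * sum of theta_j^2 for j >= 1."""
    m = len(y)
    total = 0.0
    for row, label in zip(X, y):
        h = sigmoid(sum(t * xi for t, xi in zip(theta, row)))
        h = min(max(h, eps), 1.0 - eps)  # assumed safeguard against log(0)
        total += -label * math.log(h) - (1 - label) * math.log(1.0 - h)
    penalty = lam / (2.0 * m) * sum(t ** 2 for t in theta[1:])  # bias excluded
    return total / m + penalty

X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]  # hypothetical data
y = [0, 0, 1, 1]

theta = [0.0, 2.0]
print(regularized_cost(theta, X, y, lam=0.0))  # unregularized cost
print(regularized_cost(theta, X, y, lam=1.0))  # adds (1 / 8) * 2^2 = 0.5
```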
Gradient Descent With Regularization
repeat until convergence:
{
For $j = 0$ (bias):
No regularization term here for $\theta_0$:
$$\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_0^{(i)}$$
For $j \ge 1$:
Update for $j = 1, 2, \dots, n$:
$$\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} + \frac{\lambda}{m}\theta_j\right]$$
}
where:
$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$
This has the same form as the regularized linear regression update; the only difference is that $h_\theta(x)$ is the sigmoid hypothesis rather than $\theta^T x$.
Simplified Update Rule
You can also rewrite it as:
For $j \ge 1$:
$$\theta_j := \theta_j\left(1 - \alpha \frac{\lambda}{m}\right) - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$
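The two ways of writing the regularized update should produce identical parameters. The sketch below implements both and lets you compare them on a made-up dataset. All names (`data_gradient`, `regularized_step`, `regularized_step_shrink`) and all numeric values are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def data_gradient(theta, X, y):
    """(1/m) * sum_i (h_theta(x_i) - y_i) * x_ij for every j."""
    m = len(y)
    errors = [sigmoid(sum(t * xi for t, xi in zip(theta, row))) - label
              for row, label in zip(X, y)]
    return [sum(e * row[j] for e, row in zip(errors, X)) / m
            for j in range(len(theta))]

def regularized_step(theta, X, y, alpha, lam):
    """Additive form: subtract alpha * (gradient_j + (lam/m) * theta_j) for j >= 1."""
    m = len(y)
    grad = data_gradient(theta, X, y)
    new_theta = [theta[0] - alpha * grad[0]]  # bias term: no penalty
    for j in range(1, len(theta)):
        new_theta.append(theta[j] - alpha * (grad[j] + lam / m * theta[j]))
    return new_theta

def regularized_step_shrink(theta, X, y, alpha, lam):
    """Equivalent form: scale theta_j by (1 - alpha*lam/m), then take the usual step."""
    m = len(y)
    grad = data_gradient(theta, X, y)
    new_theta = [theta[0] - alpha * grad[0]]  # bias term: no shrinkage
    for j in range(1, len(theta)):
        new_theta.append(theta[j] * (1.0 - alpha * lam / m) - alpha * grad[j])
    return new_theta

X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]  # hypothetical data
y = [0, 0, 1, 1]
theta = [0.5, -1.0]

print(regularized_step(theta, X, y, alpha=0.1, lam=2.0))
print(regularized_step_shrink(theta, X, y, alpha=0.1, lam=2.0))
```

The shrink form makes the effect of regularization easy to read: when $\alpha \lambda / m$ is a small positive number, each weight is multiplied by a factor slightly below 1 before the ordinary gradient step is applied.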
🧠 Key Takeaways: Gradient Descent
Logistic regression uses gradient descent just like linear regression.
The update formula is structurally the same.
The cost function is different.
The cost function is convex, so gradient descent with a suitable learning rate converges toward the global minimum.
Vectorized form makes implementation efficient and clean.