
Logistic Regression for Classification: Concept, Sigmoid Function, Cost Function, and Implementation

Complete guide to logistic regression for binary classification, including the sigmoid function, hypothesis model, cost function, decision boundary, gradient descent, and practical machine learning implementation.

Logistic Regression

Classification

Machine Learning

Binary Classification

Supervised Learning

Sigmoid Function

📊 Logistic Regression for Classification

In classification problems, the output variable $y$ takes discrete values.

Classification Types

1. Binary Classification

Two classes:

y \in \{0,1\}

We usually call:

$0$ → Negative class: represents the absence of something.
$1$ → Positive class: represents the presence of something (e.g., a disease).

2. Multi-class Classification

More than two classes:

y \in \{0,1,2,3,...\}

The Sigmoid Function $\sigma(z)$

The sigmoid function (also called the logistic function) maps any real number into the open interval $(0, 1)$.

It is commonly used in logistic regression to model probabilities.

The sigmoid function is defined as:

\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{e^z + 1} = 1 - \frac{1}{1 + e^z} = \frac{1}{2} \left(1 + \tanh\left(\frac{z}{2}\right)\right) = 1 - \sigma(-z)

Where:

$z$ is the input to the function (can be any real number)
$e$ is the base of the natural logarithm (approximately 2.71828)

Output:

$\sigma(z)$ is always strictly between 0 and 1, making it suitable for modeling probabilities.

When $z$ is large and positive, $\sigma(z) \approx 1$: as $z \to +\infty$, $\sigma(z) \to 1$.
When $z$ is large and negative, $\sigma(z) \approx 0$: as $z \to -\infty$, $\sigma(z) \to 0$.
When $z = 0$, $\sigma(z) = 0.5$.
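The properties above can be checked numerically. The following is a minimal sketch in plain Python; the function name `sigmoid` and the branch on the sign of the input are choices made here for illustration (the branch only avoids overflow in `math.exp` for large negative inputs) and are not part of the original article.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^{-z}), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use the equivalent form e^z / (1 + e^z),
    # so math.exp never receives a large positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Print a few values to see the S-shaped squashing behaviour.
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"sigmoid({z:+.1f}) = {sigmoid(z):.6f}")
```

The value at 0 is exactly 0.5, and the outputs for `z` and `-z` sum to 1, which matches the identity that the sigmoid of a number equals 1 minus the sigmoid of its negation.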

💡 Logistic Regression Hypothesis $h_\theta(x)$

Logistic regression ensures:

0 < h_\theta(x) < 1

Where

Input: any real number in $(-\infty, +\infty)$
Output: always in $(0,1)$

Instead of the linear hypothesis $h_\theta(x) = \theta^T x$, we apply a transformation that squashes outputs into the probability range $(0,1)$:

h_\theta(x) = g(\theta^T x)

So the output becomes a probability: $h_\theta(x) = P(y=1 \mid x)$

This can be simplified to:

h_\theta(x) = g(z)

Where

z = \theta^T x

and $g(z)$ is the sigmoid function:

g(z) = \frac{1}{1 + e^{-z}}

Final Hypothesis

h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}

This ensures:

0 < h_\theta(x) < 1

h_\theta(x) = P(y = 1 \mid x; \theta)

That is, $h_\theta(x)$ is the estimated probability that $y = 1$ given the input $x$, parameterized by $\theta$.

Since the probabilities of the two classes must sum to 1:

P(y=0 \mid x; \theta) = 1 - P(y=1 \mid x; \theta)

So if:

h_\theta(x) = 0.7

Then:

$P(y = 1 \mid x; \theta) = 0.7$
$P(y = 0 \mid x; \theta) = 1 - 0.7 = 0.3$
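As a small illustration of reading the hypothesis as a probability, here is a sketch in plain Python. The parameter values in `theta` and the input `x` are made up for this example, and `hypothesis` is a hypothetical helper name; the input vector is assumed to already contain the bias feature $x_0 = 1$.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x must include the bias feature x_0 = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

theta = [-1.0, 0.5, 0.25]  # hypothetical parameters (theta_0, theta_1, theta_2)
x = [1.0, 2.0, 4.0]        # x_0 = 1 (bias), x_1 = 2, x_2 = 4

p1 = hypothesis(theta, x)  # here z = -1 + 1 + 1 = 1
p0 = 1.0 - p1              # the two class probabilities sum to 1

print(f"P(y=1 | x; theta) = {p1:.4f}")
print(f"P(y=0 | x; theta) = {p0:.4f}")
```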

🔑 Decision Boundary

The decision boundary is the line (or, in higher dimensions, the surface) that separates the region where we predict $y = 0$ from the region where we predict $y = 1$.

It is created by our hypothesis function / model.

Decision Boundary is a Property of the Model

The decision boundary depends only on:

The hypothesis form
The parameters $\theta$

It does not depend on the training data once $\theta$ is fixed.

The training set is used only to learn $\theta$ .

Logistic regression decision boundary

For logistic regression, the decision boundary is where:

h_\theta(x) = 0.5

In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:

➕ If $h_\theta(x) \ge 0.5$, we predict $y = 1$
➖ If $h_\theta(x) < 0.5$, we predict $y = 0$

When Is $h_\theta(x) \ge 0.5$ ?

Since:

g(z) \ge 0.5 \quad \text{when} \quad z \ge 0

and

h_\theta(x) = g(\theta^T x),

we predict:

y = 1 \quad \text{when} \quad \theta^T x \ge 0

and

y = 0 \quad \text{when} \quad \theta^T x < 0

Example

Given:

h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)

and sigmoid $g(z) = 0.5$ when $z = 0$ , the boundary is:

\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0

Given:

$\theta_0 = -6$
$\theta_1 = 0$
$\theta_2 = 1$

So:

-6 + 0 \cdot x_1 + 1 \cdot x_2 = 0

Simplifies to:

x_2 = 6

This is a horizontal line at $x_2 = 6$, independent of $x_1$.
We predict $y = 1$ when $x_2 \ge 6$ and $y = 0$ when $x_2 < 6$.

Linear Decision Boundary

Suppose:

h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)

Let:

\theta_0 = -3, \quad \theta_1 = 1, \quad \theta_2 = 1

Then:

\theta^T x = -3 + x_1 + x_2

We predict $y = 1$ when:

-3 + x_1 + x_2 \ge 0

Rewriting:

x_1 + x_2 \ge 3

Decision Boundary

The decision boundary occurs when:

x_1 + x_2 = 3

This is a straight line.

It separates the plane into:

Region where $y = 1$
Region where $y = 0$

The decision boundary corresponds to:

h_\theta(x) = 0.5
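The linear boundary above can be turned into a tiny classifier. This is a sketch in plain Python using the same parameters $\theta = (-3, 1, 1)$; the helper name `predict` and the test points are chosen here for illustration only.

```python
def predict(theta, x):
    """Return 1 if theta^T x >= 0 (equivalently h_theta(x) >= 0.5), else 0."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

theta = [-3.0, 1.0, 1.0]  # boundary: x_1 + x_2 = 3

# Illustrative points (x_1, x_2); the bias feature x_0 = 1 is prepended below.
points = [(1.0, 1.0), (2.0, 1.0), (3.0, 3.0), (0.0, 2.5)]
labels = [predict(theta, [1.0, x1, x2]) for x1, x2 in points]

for (x1, x2), label in zip(points, labels):
    print(f"x_1 + x_2 = {x1 + x2:.1f} -> predict y = {label}")
# labels == [0, 1, 1, 0]; the point (2, 1) lies exactly on the boundary
```

Note that the sigmoid never needs to be evaluated for prediction: checking the sign of $\theta^T x$ is enough.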

Nonlinear Decision Boundaries

We can add polynomial features.

Example:

h_\theta(x) = g(\theta_0 + \theta_1 x_1+ \theta_2 x_2+ \theta_3 x_1^2+ \theta_4 x_2^2)

Suppose:

\theta_0 = -1, \quad \theta_1 = 0, \quad \theta_2 = 0, \quad \theta_3 = 1, \quad \theta_4 = 1

Then:

\theta^T x = -1 + x_1^2 + x_2^2

We predict $y = 1$ when:

-1 + x_1^2 + x_2^2 \ge 0

Rewriting:

x_1^2 + x_2^2 \ge 1

Decision Boundary

The boundary is:

x_1^2 + x_2^2 = 1

This is a circle of radius 1.

So logistic regression can produce nonlinear boundaries using polynomial features.
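The circular boundary can be checked the same way. The sketch below is plain Python; `polynomial_features` is a hypothetical helper that builds the feature vector $(1, x_1, x_2, x_1^2, x_2^2)$ used in this example, and the sample points are made up.

```python
def polynomial_features(x1, x2):
    """Map (x_1, x_2) to (1, x_1, x_2, x_1^2, x_2^2)."""
    return [1.0, x1, x2, x1 * x1, x2 * x2]

def predict(theta, features):
    """Predict 1 when theta^T features >= 0, else 0."""
    z = sum(t * f for t, f in zip(theta, features))
    return 1 if z >= 0 else 0

theta = [-1.0, 0.0, 0.0, 1.0, 1.0]  # boundary: x_1^2 + x_2^2 = 1

inside = predict(theta, polynomial_features(0.5, 0.5))   # 0.25 + 0.25 < 1
on_circle = predict(theta, polynomial_features(1.0, 0.0))  # exactly on the circle
outside = predict(theta, polynomial_features(1.0, 1.0))  # 1 + 1 > 1

print(inside, on_circle, outside)  # 0 1 1
```

The model is still linear in the transformed features; the boundary is only nonlinear when viewed in the original $(x_1, x_2)$ plane.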

More Complex Boundaries

By adding higher-order terms such as:

$x_1^3$
$x_1 x_2$
$x_1^2 x_2$
etc.

Logistic regression can represent:

Ellipses
Complex curves
Highly nonlinear shapes

💰 Cost Function / Optimal Objective

The overall cost is:

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \text{Cost}\big(h_\theta(x^{(i)}), y^{(i)}\big)

where:

h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}

Why Not Use Squared Error Cost?

In linear regression, we use:

J(\theta) = \frac{1}{2m}\sum (h_\theta(x) - y)^2

If we use the same squared error with the sigmoid hypothesis:

The cost function becomes non-convex
Optimization may get stuck in local minima
Training may fail to find the best parameters

So we need a better cost function.
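The non-convexity claim can be illustrated with a one-parameter toy case. The sketch below (plain Python) uses a single training example with $x = 1$, $y = 1$, which is an assumption made purely for illustration, and applies the midpoint test: a convex function must satisfy $f\left(\frac{a+b}{2}\right) \le \frac{f(a)+f(b)}{2}$ for every pair $a, b$. The log loss used for comparison is the $y = 1$ case of the cost commonly paired with logistic regression.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def squared_error(theta: float) -> float:
    """0.5 * (h - y)^2 for the single example x = 1, y = 1."""
    return 0.5 * (sigmoid(theta) - 1.0) ** 2

def log_loss(theta: float) -> float:
    """-log(h) for the same example (the y = 1 case of the log loss)."""
    return -math.log(sigmoid(theta))

a, b = -10.0, -2.0
mid = (a + b) / 2.0

sq_mid = squared_error(mid)
sq_avg = (squared_error(a) + squared_error(b)) / 2.0
ll_mid = log_loss(mid)
ll_avg = (log_loss(a) + log_loss(b)) / 2.0

print(f"squared error: f(mid) = {sq_mid:.4f}, average of endpoints = {sq_avg:.4f}")
print(f"log loss:      f(mid) = {ll_mid:.4f}, average of endpoints = {ll_avg:.4f}")
print("squared error violates the midpoint test:", sq_mid > sq_avg)
print("log loss passes the midpoint test here:  ", ll_mid <= ll_avg)
```

A single failed midpoint test is enough to show that squared error with a sigmoid is not convex. Passing the test at one pair of points does not prove convexity of the log loss; it is only consistent with it.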

We define cost separately for the two classes.

The cost function is defined as:

\text{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}

Why This Cost Function Is Better

It is convex

Every local minimum is a global minimum
Optimization is reliable
Smooth and well-behaved

Because $J(\theta)$ is convex:

With a suitable learning rate, gradient descent moves toward the global minimum
We do not get stuck in bad local optima
Training is stable

It penalizes wrong predictions heavily

Cost → 0 as the predicted probability approaches the true label
Cost → ∞ as the prediction becomes confidently wrong
Encourages the model to be confident only when it is correct

Case 1: When $y = 1$

We want $h_\theta(x)$ to be close to 1.

Cost:

-\log(h_\theta(x))

If prediction is close to 1 → cost is small
If $h_\theta(x) \to 1$ → cost $\to 0$
If prediction is close to 0 → cost is very large
If $h_\theta(x) \to 0$ → cost $\to \infty$

So:

Correct confident prediction → small cost
Wrong confident prediction → very large cost

Case 2: When $y = 0$

We want $h_\theta(x)$ to be close to 0.

Cost:

-\log(1 - h_\theta(x))

If prediction is close to 1 → cost is very large
If $h_\theta(x) \to 1$ → cost $\to \infty$
If prediction is close to 0 → cost is small
If $h_\theta(x) \to 0$ → cost $\to 0$

Again:

Correct prediction → small cost
Wrong confident prediction → large penalty

Unified Logistic Cost Function

Simplified Cost Function (Single Formula)

We can combine the two cases into one equation:

\text{Cost}(h_\theta(x), y) = - y \log(h_\theta(x))- (1 - y)\log(1 - h_\theta(x))

If $y = 1$:
- The second term becomes 0
- Cost reduces to:

-\log(h_\theta(x))

If $y = 0$:
- The first term becomes 0
- Cost reduces to:

-\log(1 - h_\theta(x))

So this single formula covers both cases.
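The single-formula cost is short enough to write directly. The following is a sketch in plain Python; `cost` is a hypothetical helper name, and it assumes the prediction `h` lies strictly between 0 and 1 so that both logarithms are defined.

```python
import math

def cost(h: float, y: int) -> float:
    """Per-example log loss: -y*log(h) - (1 - y)*log(1 - h), for 0 < h < 1."""
    return -y * math.log(h) - (1 - y) * math.log(1.0 - h)

# y = 1: only the first term is active.
print(f"y=1, h=0.9 -> cost = {cost(0.9, 1):.4f}")  # confident and correct: small
print(f"y=1, h=0.1 -> cost = {cost(0.1, 1):.4f}")  # confident and wrong: large

# y = 0: only the second term is active.
print(f"y=0, h=0.1 -> cost = {cost(0.1, 0):.4f}")  # confident and correct: small
print(f"y=0, h=0.9 -> cost = {cost(0.9, 0):.4f}")  # confident and wrong: large
```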

Full Cost Function

Full Cost Function Over Dataset

For $m$ training examples:

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]

This is called:

Log loss
Cross-entropy loss
Logistic loss

Where:

h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}
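A direct, loop-based version of $J(\theta)$ might look like the sketch below (plain Python). The tiny dataset, the function name `cost_function`, and the clipping constant `eps` are assumptions made for illustration; the clipping is only a numerical safeguard so that `math.log` never receives 0.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def cost_function(theta, X, y, eps=1e-15):
    """Average log loss J(theta) over m examples; each row of X includes x_0 = 1."""
    m = len(y)
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        h = min(max(h, eps), 1.0 - eps)  # keep h inside (0, 1) for the logs
        total += -yi * math.log(h) - (1 - yi) * math.log(1.0 - h)
    return total / m

# Made-up data: one feature plus the bias column.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]

J_zero = cost_function([0.0, 0.0], X, y)   # every h is 0.5, so J = log(2)
J_good = cost_function([-4.5, 2.0], X, y)  # boundary at x_1 = 2.25

print(f"J(theta = 0)         = {J_zero:.4f}")
print(f"J(theta = [-4.5, 2]) = {J_good:.4f}")
```

With all parameters at zero the model predicts 0.5 for every example, so the cost equals $\log 2$; parameters that place the boundary between the two classes give a lower cost.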

Vectorized Cost Function

Let:

$X$ = design matrix
$y$ = vector of labels
$h = g(X\theta)$ = vector of predictions (the sigmoid is applied element-wise)

Then:

J(\theta) = \frac{1}{m} \left(- y^T \log(h)- (1 - y)^T \log(1 - h) \right)

where $\log$ is applied element-wise.

🧠 Key Takeaways: Cost Function

Logistic regression uses a convex cost function
The simplified cost formula works for both $y=0$ and $y=1$
Gradient descent update looks the same as linear regression
Vectorization makes implementation efficient
Always include the $\frac{1}{m}$ factor in the gradient update

🎢 Gradient Descent

The goal is to minimize the cost function $J(\theta)$ by repeatedly updating the parameters.

For $m$ training examples:

Repeat until convergence:

Update every parameter $\theta_j$, for $j = 0,1,\ldots,n$, simultaneously using the rule:

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)

$\theta_0$ is the bias term,
$\theta_1, \ldots, \theta_n$ are the feature weights.

where:

$\alpha$ is the learning rate
$J(\theta)$ is the cost function
$\frac{\partial}{\partial \theta_j}J(\theta)$ is the partial derivative of the cost function with respect to $\theta_j$

For logistic regression, the cost function is:

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)}))+ (1-y^{(i)}) \log(1-h_\theta(x^{(i)})) \right]

Where:

h_\theta(x) = g(\theta^T x)

and

g(z) = \frac{1}{1+e^{-z}}

$z = \theta^T x$

Substituting the partial derivative of $J(\theta)$ with respect to $\theta_j$ into the update rule gives:

\theta_j := \theta_j- \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)})- y^{(i)} \right) x_j^{(i)}

for

j = 0,1,\ldots,n
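The update rule can be sketched as a short batch gradient descent loop in plain Python. The dataset, the learning rate `alpha=0.1`, and the iteration count are assumptions chosen for this illustration, and a fixed number of iterations stands in for a real convergence test.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def gradient_descent(X, y, alpha=0.1, iterations=5000):
    """Batch gradient descent for logistic regression; rows of X include x_0 = 1."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        # Prediction error h_theta(x^(i)) - y^(i) for every example.
        errors = [
            sigmoid(sum(t * v for t, v in zip(theta, xi))) - yi
            for xi, yi in zip(X, y)
        ]
        # Partial derivative of J with respect to each theta_j.
        grad = [sum(e * xi[j] for e, xi in zip(errors, X)) / m for j in range(n)]
        # Simultaneous update: the new theta is built from the old one in one step.
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Made-up, slightly overlapping data: bias column plus one feature.
X = [[1.0, v] for v in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0)]
y = [0, 0, 0, 1, 0, 1, 1, 1]

theta = gradient_descent(X, y)
p_low = sigmoid(theta[0] + theta[1] * 0.5)
p_high = sigmoid(theta[0] + theta[1] * 4.0)
print("theta =", [round(t, 3) for t in theta])
print(f"P(y=1 | x_1=0.5) = {p_low:.3f}, P(y=1 | x_1=4.0) = {p_high:.3f}")
```

Building the whole `grad` list before touching `theta` is what makes the update simultaneous: no parameter is updated using an already-modified neighbour.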

Logistic Regression Gradient

After computing the derivative, we get:

\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
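For readers who want the intermediate steps, the derivative behind this update can be sketched as follows. It relies on the standard sigmoid identity $g'(z) = g(z)(1 - g(z))$ and the chain rule; the notation follows the formulas above.

```latex
g'(z) = g(z)\,\big(1 - g(z)\big)

\frac{\partial}{\partial \theta_j} h_\theta(x)
  = h_\theta(x)\,\big(1 - h_\theta(x)\big)\, x_j

\frac{\partial}{\partial \theta_j} J(\theta)
  = -\frac{1}{m} \sum_{i=1}^{m}
    \left[ \frac{y^{(i)}}{h_\theta(x^{(i)})}
         - \frac{1 - y^{(i)}}{1 - h_\theta(x^{(i)})} \right]
    h_\theta(x^{(i)})\,\big(1 - h_\theta(x^{(i)})\big)\, x_j^{(i)}

  = -\frac{1}{m} \sum_{i=1}^{m}
    \left[ y^{(i)} \big(1 - h_\theta(x^{(i)})\big)
         - \big(1 - y^{(i)}\big)\, h_\theta(x^{(i)}) \right] x_j^{(i)}

  = \frac{1}{m} \sum_{i=1}^{m}
    \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
```

The last step uses $y(1 - h) - (1 - y)h = y - h$, which is why the final expression has the same shape as the linear regression gradient.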

Important Notes

This is identical in form to linear regression gradient descent.
We must update all $\theta_j$ simultaneously.
The difference lies in the hypothesis function:

h_\theta(x) = g(\theta^T x)

where

g(z) = \frac{1}{1 + e^{-z}}

Vectorized Gradient Descent

Let:

$X$ = design matrix
$\vec{y}$ = vector of labels
$h = g(X\theta)$

Then the update rule becomes:

\theta := \theta- \frac{\alpha}{m} X^T \left( g(X\theta) - \vec{y} \right)

Where:

$\vec{y}$ is the vector of labels
$X^T$ is the transpose of the design matrix
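To see that the vectorized rule agrees with the per-parameter rule, it can help to write out one component. The sketch below assumes the usual layout in which row $i$ of $X$ is the example $x^{(i)}$ (including $x_0^{(i)} = 1$), so that $X_{ij} = x_j^{(i)}$ and $(X\theta)_i = \theta^T x^{(i)}$.

```latex
X \in \mathbb{R}^{m \times (n+1)}, \quad
\theta \in \mathbb{R}^{n+1}, \quad
\vec{y} \in \mathbb{R}^{m}

\left[ X^T \left( g(X\theta) - \vec{y} \right) \right]_j
  = \sum_{i=1}^{m} X_{ij} \left( g\big((X\theta)_i\big) - y^{(i)} \right)
  = \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
```

Multiplying by $\frac{\alpha}{m}$ and subtracting from $\theta_j$ recovers the element-wise update, now for all $j$ at once.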

⚖️ Regularized Logistic Regression

Regularization helps prevent overfitting by penalizing large weights.

Compared to the non-regularized model, the regularized version produces smoother decision boundaries.

Cost Function (Without Regularization)

Recall the logistic regression cost function:

J(\theta) = - \frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]

Cost Function With Regularization

We add an L2 penalty term (note that the sum starts at $j = 1$, so the bias $\theta_0$ is not penalized):

J(\theta) = - \frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2
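The regularized cost can be sketched in plain Python as follows. The function name `regularized_cost`, the two-example dataset, and the value of `lam` (standing for $\lambda$) are assumptions for illustration. The penalty deliberately skips `theta[0]`, matching the sum from $j = 1$.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def regularized_cost(theta, X, y, lam):
    """Log loss plus (lam / 2m) * sum of theta_j^2 for j >= 1."""
    m = len(y)
    data_term = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        data_term += -yi * math.log(h) - (1 - yi) * math.log(1.0 - h)
    penalty = (lam / (2.0 * m)) * sum(t * t for t in theta[1:])  # skip theta_0
    return data_term / m + penalty

# Made-up data: bias column plus one feature.
X = [[1.0, 1.0], [1.0, -1.0]]
y = [1, 0]
theta = [0.0, 2.0]

J_plain = regularized_cost(theta, X, y, lam=0.0)
J_reg = regularized_cost(theta, X, y, lam=1.0)

# With m = 2, lam = 1 and theta_1 = 2 the penalty is (1 / 4) * 4 = 1.
print(f"unregularized J = {J_plain:.4f}")
print(f"regularized J   = {J_reg:.4f}")
```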

Gradient Descent With Regularization

Repeat until convergence:

For $j = 0$ (bias), there is no regularization term:

\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}

For $j = 1, 2, \dots, n$:

\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m}\theta_j \right]

where:

h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}

This update has the same form as regularized linear regression; the only difference is that $h_\theta(x)$ is the sigmoid hypothesis rather than $\theta^T x$.

Simplified Update Rule

You can also rewrite it as:

For $j \ge 1$ :

\theta_j := \theta_j \left(1 - \alpha \frac{\lambda}{m}\right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
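The two ways of writing the regularized update are algebraically identical, which can be confirmed numerically. The sketch below is plain Python; the data, the starting `theta`, `alpha`, and `lam` (standing for $\lambda$) are made-up values, and `step_explicit` / `step_shrink` are hypothetical names for the two forms.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def gradient(theta, X, y):
    """Unregularized gradient: (1/m) * sum((h - y) * x_j) for each j."""
    m = len(y)
    errors = [
        sigmoid(sum(t * v for t, v in zip(theta, xi))) - yi
        for xi, yi in zip(X, y)
    ]
    return [sum(e * xi[j] for e, xi in zip(errors, X)) / m for j in range(len(theta))]

def step_explicit(theta, X, y, alpha, lam):
    """One update with the penalty written as an extra (lam/m) * theta_j term."""
    m, g = len(y), gradient(theta, X, y)
    new_theta = [theta[0] - alpha * g[0]]  # bias: no regularization
    new_theta += [t - alpha * (gj + lam / m * t) for t, gj in zip(theta[1:], g[1:])]
    return new_theta

def step_shrink(theta, X, y, alpha, lam):
    """The same update written as 'shrink theta_j, then take a gradient step'."""
    m, g = len(y), gradient(theta, X, y)
    shrink = 1.0 - alpha * lam / m
    new_theta = [theta[0] - alpha * g[0]]  # bias: no shrinkage
    new_theta += [t * shrink - alpha * gj for t, gj in zip(theta[1:], g[1:])]
    return new_theta

X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
theta = [0.2, -0.4]

a = step_explicit(theta, X, y, alpha=0.1, lam=1.0)
b = step_shrink(theta, X, y, alpha=0.1, lam=1.0)
print("explicit form:", a)
print("shrink form:  ", b)
```

The second form makes the effect of regularization visible: when $\alpha \frac{\lambda}{m}$ is a small positive number, each weight is multiplied by a factor slightly below 1 before the usual gradient step is applied.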

🧠 Key Takeaways: Gradient Descent

Logistic regression uses gradient descent just like linear regression.
The update formula is structurally the same.
The cost function is different.
The cost function is convex, so with a suitable learning rate gradient descent converges toward the global minimum.
Vectorized form makes implementation efficient and clean.

Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026

