Linear Regression Explained: Single Variable and Multivariate Models with Gradient Descent
Learn linear regression in machine learning, including single-variable and multivariate models, hypothesis function, cost function (MSE), gradient descent optimization, feature scaling, assumptions, and real-world implementation examples.
📐 Linear Regression
Linear regression is a supervised learning algorithm used to predict a continuous output variable based on one or more input features.
- It is widely used for prediction, forecasting, and as a baseline model.
- It assumes that the relationship between inputs and output is linear.
When to Use Linear Regression
Linear regression is ideal for:
- Price prediction
- Trend analysis
- Baseline modeling
- Interpretable relationships
- Fast and simple forecasting
It is often used as a baseline before trying more complex models.
Key Assumptions
Linear regression works best when:
- The relationship is approximately linear
- Errors are independent
- Variance of errors is constant
- Residuals are normally distributed
Understanding these assumptions is important for reliable modeling.
🧮 Training Set
The training set is the data that is fed to the learning algorithm.
The learning algorithm outputs a hypothesis function $h$.
- $x$ = inputs (features)
- $y$ = expected output (target)
- $(x, y)$ = a row in the training set: a single training example
- $(x^{(i)}, y^{(i)})$ = the i-th training example
- $m$ = number of training examples (rows in the training set)
- $n$ = number of features (columns in the training set)
Example: 🏠 House price
| Size (x₁) | Rooms (x₂) | Price (y) |
|---|---|---|
| 50 | 1 | 150 |
| 80 | 2 | 230 |
| 120 | 3 | 310 |
- 📏 $x_1$ → size of the house
- 🛏 $x_2$ → number of rooms
- $m$ = 3 training examples
- $n$ = 2 features
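The notation maps directly onto plain data structures. A minimal sketch, assuming the table is stored as a list of rows (the variable names are illustrative and not part of the original example):

```python
# House-price training set from the table above.
# Each row is one training example: (size, rooms, price).
training_set = [
    (50, 1, 150),
    (80, 2, 230),
    (120, 3, 310),
]

# Split each row into the inputs x^(i) and the expected output y^(i).
X = [[size, rooms] for size, rooms, _ in training_set]
y = [price for _, _, price in training_set]

m = len(X)      # number of training examples (rows)
n = len(X[0])   # number of features (columns)

print(m, n)          # 3 2
print(X[1], y[1])    # second training example: [80, 2] 230
```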
💡 Hypothesis
The function that maps inputs to outputs is called the hypothesis function.
Supervised learning works like this:
Training Set → Learning Algorithm → Hypothesis Function
The algorithm outputs a function called $h$ (the hypothesis).
- $h$ = hypothesis: the trained function that maps $x$ to $y$
Finding $\theta$
Our goal is to find the values of the parameters $\theta$ that minimize the prediction error.
1. Single-Variable Linear Regression
Single-variable linear regression finds a linear relationship between a continuous output $y$ and one input $x$.
When there is only one feature, the model is:
$$h_\theta(x) = \theta_0 + \theta_1 x$$
This represents a straight line, where:
- $h_\theta(x)$ is the predicted value
- $x$ is the input feature
- $\theta_0$ is the y-intercept
- $\theta_1$ is the slope of the line
Example:
- $\theta_0 = 0$ → the line passes through the origin
- $\theta_1 = 0$ → a horizontal line
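A short sketch of the straight-line hypothesis, with the intercept and slope passed as parameters. The function name `h` and the numbers are illustrative; the two calls reproduce the "through the origin" and "horizontal line" cases:

```python
def h(x, theta0, theta1):
    """Single-variable hypothesis: a line with intercept theta0 and slope theta1."""
    return theta0 + theta1 * x

# Intercept of 0: the line passes through the origin.
print(h(0, theta0=0, theta1=2))   # 0

# Slope of 0: a horizontal line, so the prediction ignores x.
print(h(5, theta0=3, theta1=0), h(50, theta0=3, theta1=0))   # 3 3
```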

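2. Multivariate Linear Regression
Linear regression with multiple variables.
For multiple features:
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$

Matrix form, where $X$ is the $m \times (n+1)$ matrix whose rows are the training examples and whose first column is all ones ($x_0 = 1$):
$$h_\theta(X) = X\theta$$
Parameter Vector
$$\theta = [\theta_0, \theta_1, \dots, \theta_n]^T$$
Target Vector
$$y = [y^{(1)}, y^{(2)}, \dots, y^{(m)}]^T$$
This form is computationally efficient because all predictions come from a single matrix-vector product.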
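The matrix form amounts to one dot product per training example once a constant $x_0 = 1$ is prepended to every row. A dependency-free sketch follows; in practice a numerical library would usually do this product. The helper names and the parameter values are hypothetical, chosen only to show the mechanics:

```python
def add_bias(rows):
    """Prepend x0 = 1 to every example so theta0 acts as the intercept."""
    return [[1.0] + list(row) for row in rows]

def predict(X, theta):
    """Compute X @ theta: one dot product per training example."""
    return [sum(t * x for t, x in zip(theta, row)) for row in X]

rows = [[50, 1], [80, 2], [120, 3]]   # size, rooms from the table
theta = [10.0, 2.0, 5.0]              # hypothetical values, for illustration only

X = add_bias(rows)
print(predict(X, theta))   # [115.0, 180.0, 265.0]
```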
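Calculating Hypothesis Function
Using the closed-form solution (the normal equation):
$$\theta = (X^T X)^{-1} X^T y$$
- $\theta_0$: Base price of house
- $\theta_1$: Price increase per unit size
- $\theta_2$: Price increase per additional room
Final Hypothesis Function:
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$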
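The closed-form solution, $\theta = (X^T X)^{-1} X^T y$, can be sketched without any external library by solving the linear system $(X^T X)\theta = X^T y$ directly. The helpers below (`transpose`, `matmul`, `solve`, `normal_equation`) are illustrative names, not a library API. Because the table has three rows and the model has three parameters, the fitted function should reproduce the three training prices:

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y instead of forming the inverse explicitly."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(a * b for a, b in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)

# Table rows with x0 = 1 prepended: [1, size, rooms]
X = [[1, 50, 1], [1, 80, 2], [1, 120, 3]]
y = [150, 230, 310]

theta = normal_equation(X, y)
fitted = [sum(t * x for t, x in zip(theta, row)) for row in X]
print([round(p, 3) for p in fitted])   # should match y up to rounding
```

Solving the system rather than inverting $X^T X$ is a design choice: it is the usual way to evaluate the normal equation numerically.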
Model Inference
Test For:
- Size = 100
- Rooms = 2
Predicted price = 280
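💰 Cost Function ($J(\theta)$)
How bad are our guesses?
The goal of the algorithm is to choose $\theta_0$ and $\theta_1$ such that $h_\theta(x)$ comes close to $y$.
- Minimize $J(\theta)$ and thereby minimize the error
- Used to measure how well the model performs
For a given hypothesis $h_\theta$, the goal is to find the $\theta$ that minimizes $J(\theta)$.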
Mean Squared Error Cost Function (MSE)
The squared error works well for regression problems because it:
- Penalizes large errors
- Is mathematically convenient
- Produces a convex function
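The cost function is defined as:
$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
Where:
- $m$ = number of training examples
- $h_\theta(x^{(i)})$ = prediction for the given input $x^{(i)}$
- $y^{(i)}$ = actual value for the given input $x^{(i)}$
The factor $\frac{1}{2}$ is a common convention that cancels when the square is differentiated; it does not change which $\theta$ minimizes the cost.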
The objective is to minimize this function.
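A minimal sketch of the squared-error cost for the single-variable model. It assumes the $\frac{1}{2m}$ normalization, which is a convention; plain MSE divides by $m$ instead. The function name and the toy data are hypothetical:

```python
def compute_cost(theta0, theta1, xs, ys):
    """Squared-error cost J(theta), using the 1/(2m) convention (an assumption)."""
    m = len(xs)
    squared_errors = [(theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)]
    return sum(squared_errors) / (2 * m)

# Hypothetical toy data lying exactly on the line y = x.
xs = [1, 2, 3]
ys = [1, 2, 3]

print(compute_cost(0, 1, xs, ys))   # 0.0, a perfect fit
print(compute_cost(0, 0, xs, ys))   # (1 + 4 + 9) / 6, about 2.333
```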
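Plotting Cost Function
One Feature ($x$) and Two Parameters ($\theta_0$, $\theta_1$)
Parabola (holding $\theta_0$ fixed)
- $J(\theta_1)$ as Y-Axis
- $\theta_1$ as X-Axis
3D Parabola (a bowl-shaped surface)
- $J(\theta_0, \theta_1)$ as Z-Axis
- $\theta_0$ as X-Axis
- $\theta_1$ as Y-Axis

Two Features ($x_1$, $x_2$) and Three Parameters ($\theta_0$, $\theta_1$, $\theta_2$)
$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$
The cost surface lives in 4 dimensions, which is impossible to visualize directly.
- $J(\theta_0, \theta_1, \theta_2)$ as W-Axis
- $\theta_0$ as X-Axis
- $\theta_1$ as Y-Axis
- $\theta_2$ as Z-Axis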
Contour Figure ⛰️
A contour plot shows the cost surface as seen from above, sliced by horizontal planes at different heights.
Each closed curve represents:
- All points that have the same height, i.e. the same cost $J(\theta)$.
In the contour plot:
- X-axis → $\theta_0$
- Y-axis → $\theta_1$

Why Is the Contour Plot Circular?
The contours are closed, nested curves because the cost function is a convex bowl. In general they are ellipses; they become close to circular when the features are on similar scales.
- Each contour joins all points that have the same cost.
- Smaller contours → smaller cost
- Larger contours → larger cost
Gradient descent moves towards the center of the contour plot, where the cost is minimum.
- It moves perpendicular to the contour lines because that is the direction of steepest descent.
🎢 Gradient Descent
Gradient descent is an optimization algorithm used to minimize the cost function by iteratively moving towards the minimum.
That’s just a fancy name for:
Try numbers → see error → improve numbers → repeat.
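How it works:
- Start somewhere (any initial values of $\theta$).
- Walk downhill, just like descending a hill.
- At each step, look for the steepest downward direction, take a step, and repeat until the cost stops decreasing.
- In general a cost function can have several local minima, and the starting point decides which one is reached. The MSE cost of linear regression is convex, so it has a single global minimum.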
Start somewhere → Take steps downhill → Reach minimum

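Algorithm: Single-Variable Linear Regression
Repeat until convergence, for parameter index $j = 0, 1$:
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$
For $j = 0$ and $j = 1$:
- Simultaneously compute the new values of $\theta_0$ and $\theta_1$, and store them in temporary variables
- Simultaneously update $\theta_0$ and $\theta_1$ from the temporary variables
Where:
- $\alpha$ = learning rate: how big a step we take downhill
- $j$ is the parameter index
- $:=$ is the assignment operation, e.g. a := a + 1
- $=$ is a truth assertion, e.g. a == a
This process is repeated until convergence.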
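The simultaneous update can be sketched as follows. This is an illustration under stated assumptions: the cost uses the $\frac{1}{2m}$ convention, a fixed iteration count stands in for a real convergence test, and the data is hypothetical.

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0   # start somewhere
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Compute both new values from the OLD parameters first ...
        temp0 = theta0 - alpha * sum(errors) / m
        temp1 = theta1 - alpha * sum(e * x for e, x in zip(errors, xs)) / m
        # ... then overwrite both at once (simultaneous update).
        theta0, theta1 = temp0, temp1
    return theta0, theta1

# Hypothetical data generated from y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

theta0, theta1 = gradient_descent(xs, ys)
print(round(theta0, 4), round(theta1, 4))   # 1.0 2.0
```

Updating `theta0` before computing `temp1` would mix old and new parameters within one step, which is exactly what the temporary variables prevent.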

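Algorithm: Multivariate Linear Regression
Steps:
- Repeat until convergence, for parameter index $j = 0, 1, \dots, n$:
$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
For $j = 0, 1, \dots, n$ (with $x_0^{(i)} = 1$):
- Simultaneously compute the new value of every $\theta_j$ and store it in a temporary variable
- Simultaneously update every $\theta_j$
Final Hypothesis Function
$$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n$$
Which is equivalent to:
$$h_\theta(x) = \theta^T x$$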
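The same loop generalizes to any number of features once every row starts with $x_0 = 1$. A hedged sketch, assuming the $\frac{1}{2m}$ cost convention, a fixed iteration count instead of a convergence test, and hypothetical data that is already on a small scale:

```python
def predict(theta, row):
    """h_theta(x) = theta^T x, where row already starts with x0 = 1."""
    return sum(t * x for t, x in zip(theta, row))

def gradient_descent(X, y, alpha=0.5, iterations=1000):
    m, n_params = len(X), len(X[0])
    theta = [0.0] * n_params
    for _ in range(iterations):
        errors = [predict(theta, row) - target for row, target in zip(X, y)]
        # Build the whole new vector from the old theta: simultaneous update.
        theta = [
            theta[j] - alpha * sum(e * row[j] for e, row in zip(errors, X)) / m
            for j in range(n_params)
        ]
    return theta

# Hypothetical, already-scaled data generated from y = 1 + 2*x1 + 3*x2.
features = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.25], [0.0, 0.0]]
y = [4.0, 3.0, 6.0, 2.75, 1.0]
X = [[1.0] + row for row in features]   # prepend x0 = 1

theta = gradient_descent(X, y)
print([round(t, 4) for t in theta])   # [1.0, 2.0, 3.0]
```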
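Learning Rate ($\alpha$)
$\alpha$ defines the rate of learning, i.e. the size of each step.
The update rule is:
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
Or, simplified:
$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
- Small $\alpha$ → slow learning
- Large $\alpha$ → overshooting the minimum
- Proper $\alpha$ → fast convergence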
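The three cases can be observed by running a few iterations with different step sizes and comparing the cost at the start and at the end. This sketch optimizes only the slope, with the intercept fixed at 0. The data and the three learning rates are hypothetical, and which values count as "small" or "large" depends entirely on the data:

```python
def cost(theta1, xs, ys):
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def run(alpha, xs, ys, iterations=10):
    """Gradient descent on theta1 only; returns the cost after every step."""
    m = len(xs)
    theta1 = 0.0
    history = [cost(theta1, xs, ys)]
    for _ in range(iterations):
        gradient = sum((theta1 * x - y) * x for x, y in zip(xs, ys)) / m
        theta1 = theta1 - alpha * gradient
        history.append(cost(theta1, xs, ys))
    return history

# Hypothetical data on the line y = 2x.
xs, ys = [1, 2, 3], [2, 4, 6]

for alpha in (0.01, 0.2, 0.5):   # small, reasonable, too large for this data
    history = run(alpha, xs, ys)
    print(alpha, round(history[0], 3), "->", round(history[-1], 3))
```

With these numbers the smallest rate is still far from the minimum after ten steps, the middle rate is essentially at zero cost, and the largest rate ends with a higher cost than it started with.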

Deciding Learning Rate ($\alpha$)
Plot the cost function $J(\theta)$ against the number of iterations of gradient descent.
If $J(\theta)$ decreases on every iteration, then you are probably using a good learning rate.
- Smooth, steadily decreasing
- Flattening near the minimum
- No wild jumps and no upward trend
Plot (Good Learning Rate): cost on the vertical axis against iterations on the horizontal axis; the curve falls steeply and then flattens out.
If $J(\theta)$ increases continuously, then you probably need to decrease $\alpha$.
- Reason: A large $\alpha$ overshoots the minimum, leading to divergence or oscillation.
- Fix: Reduce the learning rate.
Plot: cost against iterations; the curve rises steadily.
If $J(\theta)$ decreases but very slowly, then you probably need to increase $\alpha$.
- Reason: A small $\alpha$ causes slow convergence, taking many iterations to approach the minimum.
- Fix: Increase the learning rate.
Plot: cost against iterations; the curve decreases only gradually.
If $J(\theta)$ oscillates, then you probably need to reduce $\alpha$ and scale the features.
- Reason:
- A large $\alpha$ leads to oscillation around the minimum.
- Features are on different scales.
- Fix: Reduce the learning rate and apply feature scaling.
Plot: cost against iterations; the curve zig-zags up and down instead of settling.
Debugging Learning Rate ($\alpha$)
| Behavior | Problem | Fix |
|---|---|---|
| Cost increases | α too large | Reduce α |
| Oscillates | α too large | Reduce α + scale |
| Very slow decrease | α too small | Increase α |
| No improvement | Features not scaled | Apply scaling |
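Derivative Term
The derivative term $\frac{\partial}{\partial \theta_j} J(\theta)$ is the rate of change of the cost function with respect to $\theta_j$.
- At a local minimum the derivative term is 0.
- Gradient descent automatically takes smaller steps as it approaches a local minimum, because the derivative term shrinks there. A fixed $\alpha$ is therefore enough; it does not need to be decreased over time.
- The derivative term steers towards the minimum from both sides: on a positive slope the update decreases $\theta_j$, and on a negative slope it increases $\theta_j$.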
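For the squared-error cost, the derivative term can be worked out with the chain rule. The sketch below assumes the $\frac{1}{2m}$ normalization and the linear hypothesis $h_\theta(x) = \sum_j \theta_j x_j$ with $x_0 = 1$, so that $\frac{\partial}{\partial \theta_j} h_\theta(x^{(i)}) = x_j^{(i)}$:

$$
\begin{aligned}
\frac{\partial}{\partial \theta_j} J(\theta)
&= \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \\
&= \frac{1}{2m} \sum_{i=1}^{m} 2 \left( h_\theta(x^{(i)}) - y^{(i)} \right) \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)}) \\
&= \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
\end{aligned}
$$

The sum is zero when the prediction errors are uncorrelated with feature $j$, which is why the step size shrinks to nothing at the minimum.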

Feature Scaling
Feature scaling makes the contours of the cost function more circular, which allows gradient descent to converge faster.
- Make sure the features are on the same scale; otherwise the contours will be skewed ellipses.
- Try to get every feature into a similar, small range.
- Ideally each feature should fall roughly within $-1 \le x_j \le 1$.
Problem
A difference in the scale of the features can create skewed elliptical contours in the cost function.
- Example:
- Size of house (0-1000) vs. number of rooms (1-10)
A skewed ellipse has a long axis and a short axis.
- Gradient descent zig-zags back and forth across the narrow direction (the short axis) and makes only slow progress along the long axis, so it takes a long time to reach the minimum.

📏 Solution
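1. Min-Max Normalization
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$
This is done by subtracting the minimum value of the feature and dividing by the range (max - min).
- Scales features to [0, 1]
2. Mean Normalization
$$x' = \frac{x - \mu}{\max(x) - \min(x)}$$
This is done by subtracting the mean of the feature and dividing by the range (max - min).
- Scales features to have mean 0 and values within [-1, 1]
Alternatively, divide by the standard deviation instead of the range (standardization):
$$x' = \frac{x - \mu}{\sigma}$$
Where:
- $\mu$ = mean of the feature = Avg(X)
- $\sigma$ = standard deviation of the feature (not the range max(X) - min(X))
After standardization the feature has mean 0 and standard deviation 1, so most values fall within roughly [-3, 3].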
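The three scaling options can be sketched with the standard library alone. The function names and sample values are hypothetical, and the population standard deviation (`statistics.pstdev`) is an assumption; the sample standard deviation would serve equally well:

```python
from statistics import mean, pstdev

def min_max(values):
    """Rescale to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def mean_normalize(values):
    """Centre on the mean and divide by the range (max - min)."""
    mu, spread = mean(values), max(values) - min(values)
    return [(v - mu) / spread for v in values]

def standardize(values):
    """Centre on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

sizes = [0, 500, 1000]   # hypothetical house sizes

print(min_max(sizes))                              # [0.0, 0.5, 1.0]
print(mean_normalize(sizes))                       # [-0.5, 0.0, 0.5]
print([round(v, 3) for v in standardize(sizes)])   # [-1.225, 0.0, 1.225]
```

All three divide by zero for a constant feature (max equal to min), so such columns need to be dropped or handled separately. The scaling parameters computed on the training set should also be reused unchanged when scaling new inputs at prediction time.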
