
Neural Network Hypothesis and Intuition

Explore the hypothesis and intuition behind neural networks, including their structure, activation functions, and how they process inputs to produce outputs.

Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026


Neural Networks Overview

The Feature Explosion Problem

Why Do We Need Neural Networks?

Suppose we have $x_1, x_2, \dots, x_n$ as input features and we want to compute a hypothesis $h_\theta(x)$.

With linear features, we can apply logistic regression:

$$g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n)$$

For quadratic features:

$$g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n + \theta_{n+1} x_1^2 + \dots)$$

  • Quadratic terms grow roughly as $n^2/2$

So we will end up with about 5,000 additional features if we have 100 features.

For cubic features:

$$g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n + \theta_{n+1} x_1^2 + \dots + \theta_{n+k} x_1^3 + \dots)$$

  • Cubic terms grow as $O(n^3)$

  • So we will end up with 166,000 additional features if we have 100 features.

As the features become more complex, the number of parameters $\theta$ grows rapidly.

It becomes:

  • Computationally expensive to compute the hypothesis with many features.
  • Memory-heavy to store all the parameters.
  • Prone to overfitting due to the large number of parameters.

In this case, we can use a neural network to compute the hypothesis more efficiently.

Practical Example: Image Recognition

Suppose we have a 100 × 100 pixel image as input.

  • Each pixel is a feature, so we have 10,000 features.
  • For RGB images, we have 3 color channels, so we have 30,000 features.

If we want to compute a hypothesis with quadratic features, we would have on the order of 450 million features, which is computationally infeasible.
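These counts can be reproduced as combinations with repetition; a quick Python check (the exact counts come out slightly above the rough $n^2/2$ and $O(n^3)$ estimates quoted above):

```python
from math import comb

def poly_feature_count(n: int, degree: int) -> int:
    """Number of monomials of exactly `degree` in n variables:
    combinations with repetition, C(n + degree - 1, degree)."""
    return comb(n + degree - 1, degree)

# 100 raw features -> quadratic terms grow roughly as n^2 / 2
print(poly_feature_count(100, 2))      # 5050
# 100 raw features -> cubic terms grow as O(n^3)
print(poly_feature_count(100, 3))      # 171700
# 100x100 RGB image: 30,000 features -> quadratic blow-up
print(poly_feature_count(30_000, 2))   # 450015000 (~450 million)
```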

Conclusion

Polynomial logistic regression works for small $n$, but:

  • It explodes combinatorially for large $n$
  • It is computationally infeasible for large feature sets (like images)
  • We need a non-linear model that can capture complex relationships without explicitly generating all polynomial features.

Neural Networks as a Solution

Neural networks can compute complex hypotheses without explicitly generating all polynomial features.

Neurons as Computational Units

At a simple level, neurons are computational units.

They:

  • Dendrites (inputs): take inputs $x_1, x_2, \dots, x_n$
  • Cell body: processes them by applying weights and an activation function
  • Axon (output): produces an output $h_\theta(x)$

Artificial Neurons

In artificial neural networks, we model neurons as mathematical functions.

In our machine learning model:

Inputs are the features $x_1, x_2, \dots, x_n$:

$$x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$

  • $x_0$ is the bias unit, and it is always equal to 1

The neuron applies a weighted transformation to the input features using the parameter vector:

$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}$$

The output is the hypothesis $h_\theta(x)$, where:

$$h_\theta(x) = g(\theta^T x)$$

Writing $z = \theta^T x$, the hypothesis can be expressed as:

$$h_\theta(x) = g(z)$$

$g(z)$ is the activation function that introduces non-linearity into the model.

Activation Function

Neural networks use the same logistic (sigmoid) activation function as logistic regression:

$$h_\theta(x) = g(z)$$

$$g(z) = \frac{1}{1 + e^{-z}}$$

For a single unit:

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

This function is called the sigmoid activation function.

In neural networks:

  • Parameters $\theta$ are called weights
  • Outputs of neurons are called activations
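The single-unit computation above can be sketched in plain Python; the weights here are hypothetical, just to show the arithmetic:

```python
import math

def sigmoid(z: float) -> float:
    """Logistic activation g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(theta: list[float], x: list[float]) -> float:
    """Single unit: h_theta(x) = g(theta^T x).
    x[0] must be the bias unit, always 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# Hypothetical weights; x = [x0 (bias), x1, x2]
theta = [-1.0, 2.0, 0.5]
x = [1.0, 0.6, 0.4]
print(round(neuron(theta, x), 4))  # g(-1 + 1.2 + 0.2) = g(0.4) ≈ 0.5987
```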

Neural Network Structure

An Artificial Neural Network is a computational graph with layers of artificial neurons.

A simple network looks like:

$$\begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix} \rightarrow \text{Neuron} \rightarrow h_\theta(x)$$

Complex networks have multiple layers:

$$\text{Input} \rightarrow \text{Hidden} \rightarrow \text{Output}$$

$$\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix} \rightarrow \begin{bmatrix} a^{(2)}_1 \\ a^{(2)}_2 \\ a^{(2)}_3 \end{bmatrix} \rightarrow h_\theta(x)$$

```mermaid
graph LR
  Input --> Hidden-Layer
  Hidden-Layer --> Output
```
  • Layer 1 (input layer): takes features $x_1, x_2, \dots, x_n$ and the bias unit $x_0$
  • Layer 2 (hidden layer): computes intermediate activations $a^{(2)}_1, a^{(2)}_2, \dots$
  • Layer $n$ (output layer): gives us the final hypothesis $h_\theta(x)$

Where:

  • $a^{(j)}_i$ = activation of unit $i$ in layer $j$
  • $a^{(2)}_1$: first neuron in the hidden layer
  • $\Theta^{(j)}$ = weight matrix mapping layer $j$ to layer $j+1$

Each unit in a layer is densely connected to every unit of the next layer:

Top-down view:

```mermaid
graph TD

subgraph Input Layer
x1(((x1)))
x2(((x2)))
x3(((x3)))
end

subgraph Hidden Layer
a1{a1}
a2{a2}
a3{a3}
end

subgraph Output Layer
y(((hθx)))
end

x1 --> a1
x1 --> a2
x1 --> a3

x2 --> a1
x2 --> a2
x2 --> a3

x3 --> a1
x3 --> a2
x3 --> a3

a1 --> y
a2 --> y
a3 --> y
```

Left-to-right view:


```mermaid
graph LR

subgraph Input Layer
    x1(((x1)))
    x2(((x2)))
    x3(((x3)))
end

subgraph Hidden Layer
    a1{a1}
    a2{a2}
    a3{a3}
end

subgraph Output Layer
    y(((hθx)))
end

x1 --> a1
x1 --> a2
x1 --> a3

x2 --> a1
x2 --> a2
x2 --> a3

x3 --> a1
x3 --> a2
x3 --> a3

a1 --> y
a2 --> y
a3 --> y
```

Simplified:


```mermaid
graph LR

x1(((x1))) --> a1{a1}
x1 --> a2{a2}
x1 --> a3{a3}

x2(((x2))) --> a1
x2 --> a2
x2 --> a3

x3(((x3))) --> a1
x3 --> a2
x3 --> a3

a1 --> y(((hθx)))
a2 --> y
a3 --> y
```

Advanced 4-layer neural network:


```mermaid
graph LR

%% Input Layer
subgraph Input Layer
x1(((x1)))
x2(((x2)))
x3(((x3)))
end

%% Hidden Layer 1
subgraph Hidden Layer 1
a1{a1}
a2{a2}
a3{a3}
end

%% Hidden Layer 2
subgraph Hidden Layer 2
b1{b1}
b2{b2}
b3{b3}
end

%% Output Layer
subgraph Output Layer
y(((hθx)))
end

%% Connections: Input → Hidden 1
x1 --> a1
x1 --> a2
x1 --> a3

x2 --> a1
x2 --> a2
x2 --> a3

x3 --> a1
x3 --> a2
x3 --> a3

%% Connections: Hidden 1 → Hidden 2
a1 --> b1
a1 --> b2
a1 --> b3

a2 --> b1
a2 --> b2
a2 --> b3

a3 --> b1
a3 --> b2
a3 --> b3

%% Connections: Hidden 2 → Output
b1 --> y
b2 --> y
b3 --> y
```


Neural Network Hypothesis

Neural networks are simply multiple logistic regression units stacked together.

Each layer:

  1. Takes activations from the previous layer
  2. Multiplies them by a weight matrix
  3. Applies the sigmoid activation
  4. Passes the result forward

Computing Hidden Layer Activations

Each hidden unit is computed as:

$$a^{(2)}_1 = g(\Theta^{(1)}_{10}x_0 + \Theta^{(1)}_{11}x_1 + \Theta^{(1)}_{12}x_2 + \Theta^{(1)}_{13}x_3)$$

$$a^{(2)}_2 = g(\Theta^{(1)}_{20}x_0 + \Theta^{(1)}_{21}x_1 + \Theta^{(1)}_{22}x_2 + \Theta^{(1)}_{23}x_3)$$

$$a^{(2)}_3 = g(\Theta^{(1)}_{30}x_0 + \Theta^{(1)}_{31}x_1 + \Theta^{(1)}_{32}x_2 + \Theta^{(1)}_{33}x_3)$$

This means:

  • We use a 3 × 4 matrix of weights
  • Each row corresponds to one hidden unit
  • Each row multiplies all input features (including bias)
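A minimal sketch of this hidden-layer computation in plain Python, using a hypothetical 3 × 4 weight matrix (the weight values are made up for illustration):

```python
import math

def sigmoid(z: float) -> float:
    """Logistic activation g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_activations(Theta: list[list[float]], x: list[float]) -> list[float]:
    """Compute a_i = g(row_i . x) for each row of the weight matrix.
    Theta has shape s_{j+1} x (s_j + 1); x includes the bias unit x0 = 1."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in Theta]

# Hypothetical Theta^(1): 3 hidden units, 3 inputs + bias -> 3x4 matrix
Theta1 = [
    [0.1, 0.2, -0.3, 0.4],   # weights for a^(2)_1
    [-0.5, 0.6, 0.7, -0.8],  # weights for a^(2)_2
    [0.9, -1.0, 1.1, 0.2],   # weights for a^(2)_3
]
x = [1.0, 0.5, -0.2, 0.3]    # [bias x0, x1, x2, x3]
a2 = layer_activations(Theta1, x)
print([round(a, 3) for a in a2])
```

Each row of `Theta1` produces one hidden-unit activation, exactly as in the three equations above.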

Output Layer

The final hypothesis is:

$$h_\Theta(x) = a^{(3)}_1$$

$$a^{(3)}_1 = g(\Theta^{(2)}_{10}a^{(2)}_0 + \Theta^{(2)}_{11}a^{(2)}_1 + \Theta^{(2)}_{12}a^{(2)}_2 + \Theta^{(2)}_{13}a^{(2)}_3)$$

So:

  • Hidden layer outputs become inputs to the next layer
  • Another weight matrix Θ(2)\Theta^{(2)}Θ(2) is applied
  • Then the sigmoid function is applied again

Weight Matrix Dimensions

If:

  • Layer $j$ has $s_j$ units
  • Layer $j+1$ has $s_{j+1}$ units

Then:

$$\Theta^{(j)} \in \mathbb{R}^{\, s_{j+1} \times (s_j + 1)}$$

Why the +1?

Because of the bias unit.

Important detail:

  • Input side includes bias
  • Output side does NOT include bias
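The dimension rule can be checked with a tiny helper; the layer sizes below are those of the 3-input, 3-hidden-unit, 1-output network above:

```python
def theta_shape(s_j: int, s_j_next: int) -> tuple[int, int]:
    """Shape of Theta^(j) mapping a layer with s_j units to a layer
    with s_{j+1} units: s_{j+1} x (s_j + 1).
    The +1 accounts for the bias on the input side only."""
    return (s_j_next, s_j + 1)

print(theta_shape(3, 3))  # (3, 4): Theta^(1), input -> hidden
print(theta_shape(3, 1))  # (1, 4): Theta^(2), hidden -> output
```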
