
Neural Network Hypothesis and Intuition

Explore the hypothesis and intuition behind neural networks, including their structure, activation functions, and how they process inputs to produce outputs.

Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026


Neural Networks Overview

The Feature Explosion Problem

Why do we need Neural Networks?

Suppose we have $x_1, x_2, \dots, x_n$ as input features and we want to compute a hypothesis $h_\theta(x)$.

For linear features, we can compute:

$g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n)$

For quadratic features:

$g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n + \theta_{n+1} x_1^2 + \dots)$

  • Quadratic terms grow roughly as $n^2/2$, so we end up with about 5,000 additional features if we have 100 features.

For cubic features:

$g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n + \theta_{n+1} x_1^2 + \dots + \theta_{n+k} x_1^3 + \dots)$

  • Cubic terms grow as $O(n^3)$

  • So we end up with roughly 170,000 additional features if we have 100 features.

As the features become more complex, the number of parameters $\theta$ grows rapidly.

It becomes:

  • Computationally expensive to compute the hypothesis with many features.
  • Memory-heavy to store all the parameters.
  • Prone to overfitting due to the large number of parameters.

In this case, we can use a neural network to compute the hypothesis more efficiently.

Practical Example: Image Recognition

Suppose we have a 100 × 100 pixel image as input.

  • Each pixel is a feature, so we have 10,000 features.
  • For RGB images, we have 3 color channels, so we have 30,000 features.

If we want to compute a hypothesis with quadratic features, we would have on the order of 450 million features, which is computationally infeasible.
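To see the blow-up concretely, here is a small sketch (`poly_feature_count` is a name introduced for illustration) that counts the monomials of a given degree using the combinations-with-repetition formula:

```python
from math import comb

def poly_feature_count(n, degree):
    # Number of monomials of exactly this degree in n variables,
    # counted via combinations with repetition: C(n + degree - 1, degree).
    return comb(n + degree - 1, degree)

print(poly_feature_count(100, 2))     # 5050 quadratic terms
print(poly_feature_count(100, 3))     # 171700 cubic terms
print(poly_feature_count(30_000, 2))  # 450015000 -- ~450 million for a 100x100 RGB image
```

The counts match the rough $n^2/2$ and $O(n^3)$ estimates above.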

Conclusion

Polynomial logistic regression works for small $n$, but:

  • It explodes combinatorially for large $n$
  • It is computationally infeasible for large feature sets (like images)
  • We need a non-linear model that can capture complex relationships without explicitly generating all polynomial features.

Neural Networks as a Solution

Neural networks can compute complex hypotheses without explicitly generating all polynomial features.

NN Types and Applications

1. Standard Feedforward Networks

Often used for:

  • Housing price prediction
  • Online advertising

2. Convolutional Neural Networks (CNNs)

Used primarily for image data

  • Exploit spatial structure in images

3. Recurrent Neural Networks (RNNs)

Used for sequence data

Examples:

  • Audio (time series)
  • Language (word-by-word sequence)

Custom Neural Networks

Tailored for specific applications. Used in complex systems like autonomous driving:

  • CNNs for images
  • Other components for radar
  • Combined into custom architectures

Neurons as Computational Units

At a simple level, neurons are computational units.

They:

  • Dendrites (head): take inputs $x_1, x_2, \dots, x_n$
  • Process them: apply weights and an activation function
  • Axon (tail): produce an output $h_\theta(x)$

Artificial Neurons Model: Logistic Unit

In artificial neural networks, we model neurons as mathematical logistic units.


graph LR

subgraph Input Layer
x0(((x0)))    
x1(((x1)))
x2(((x2)))
x3(((x3)))
end

subgraph Activation Layer
a1{a1}
end

x0-->a1
x1-->a1
x2-->a1
x3-->a1


subgraph Output Layer
y(((hθx)))
end

a1-->y

A simple network looks like:

$\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix} \rightarrow \text{Neuron} \rightarrow h_\theta(x)$

In our machine learning model:

Inputs are features $x_1, x_2, \dots, x_n$:

$x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$

Parameters $\theta$ are called weights:

$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}$

Bias unit

$x_0$ is the bias unit (bias neuron) and is always equal to 1.

  • For simplicity we usually don't draw $x_0$

Output

Outputs of neurons are called activations. The final activation is the hypothesis $h_\theta(x)$,

where $h_\theta(x) = g(\theta^T x)$.

Writing $z = \theta^T x$, the hypothesis can be expressed as $h_\theta(x) = g(z)$.

$g(z)$ is the activation function that introduces non-linearity into the model.

Activation Function $g(\cdot)$

$g(z)$ is the activation function used in the hypothesis:

  • Examples: ReLU, sigmoid

Neural networks using the sigmoid activation function mirror logistic regression:

$h_\Theta(x) = g(z)$

where

$g(z) = \frac{1}{1 + e^{-z}}$

and $z = \theta^T x$.
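As a minimal sketch in Python (the weights and inputs here are made-up numbers), a single logistic unit is just a dot product followed by the sigmoid:

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-1.0, 2.0])  # weights [theta_0, theta_1], chosen arbitrarily
x = np.array([1.0, 0.5])       # [x_0 = 1 (bias), x_1]
z = theta @ x                  # z = theta^T x = 0.0
print(sigmoid(z))              # 0.5 -- h_theta(x) = g(z)
```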


Neural Network Structure

An Artificial Neural Network is a computational graph with layers of artificial neurons.

Neural networks are simply multiple logistic regression units stacked together.

Each layer:

  1. Takes activations from previous layer
  2. Multiplies by weight matrix
  3. Applies sigmoid activation
  4. Passes result forward
Input→Hidden→Output\text{Input} \rightarrow \text{Hidden} \rightarrow \text{Output}Input→Hidden→Output [x0x1x2x3]→[a1(2)a2(2)a3(2)]→hθ(x)\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix} \rightarrow \begin{bmatrix} a^{(2)}_1 \\ a^{(2)}_2 \\ a^{(2)}_3 \end{bmatrix} \rightarrow h_\theta(x)​x0​x1​x2​x3​​​→​a1(2)​a2(2)​a3(2)​​​→hθ​(x)
graph LR
    Input --> Hidden-Layer
    Hidden-Layer --> Output

1. Layer 1: Input layer

Takes features as Input

  • $x_1, x_2, \dots, x_n$
  • $x_0 = 1$ is the bias unit and is not drawn

2. Layer 2: Hidden layer

All intermediate layers between the input and output layers.

  • Computes intermediate activations $a^{(2)}_1, a^{(2)}_2, \dots, a^{(2)}_n$
  • $a^{(2)}_0 = 1$ is the bias unit and is not drawn

$a^{(j)}_i$ = activation output of the $i$th neuron in layer $j$

  • $i$ = neuron index inside that layer
  • $j$ = layer number

Example:

  • $a^{(2)}_1$ = first neuron in layer 2 (the first hidden layer)
  • $a^{(2)}_3$ = third neuron in layer 2

Computing Hidden Layer Activations

The activations can be computed as:

$a^{(2)}_1 = g(\Theta^{(1)}_{10}x_0 + \Theta^{(1)}_{11}x_1 + \Theta^{(1)}_{12}x_2 + \Theta^{(1)}_{13}x_3)$

$a^{(2)}_2 = g(\Theta^{(1)}_{20}x_0 + \Theta^{(1)}_{21}x_1 + \Theta^{(1)}_{22}x_2 + \Theta^{(1)}_{23}x_3)$

$a^{(2)}_3 = g(\Theta^{(1)}_{30}x_0 + \Theta^{(1)}_{31}x_1 + \Theta^{(1)}_{32}x_2 + \Theta^{(1)}_{33}x_3)$

where $g(\cdot)$ is the sigmoid function.
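Vectorizing these equations, all hidden activations come from one matrix-vector product (a sketch with arbitrary example weights):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Theta^{(1)}: 3 hidden units x (3 inputs + 1 bias) -- arbitrary example values
Theta1 = np.array([[0.1, 0.2, 0.3, 0.4],
                   [0.5, 0.6, 0.7, 0.8],
                   [0.9, 1.0, 1.1, 1.2]])
x = np.array([1.0, 0.5, -0.5, 2.0])  # x_0 = 1 is the bias unit
a2 = sigmoid(Theta1 @ x)             # a^{(2)}_1 ... a^{(2)}_3 at once
print(a2.shape)                      # (3,)
```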

Weight Matrices $\Theta^{(j)}$

$\Theta^{(j)}$ is the weight matrix of the $j$th layer:

  • The weight matrix maps layer $j$ to layer $j+1$
  • Each layer $j$ gets its own matrix of weights $\Theta^{(j)}$
  • Layers are indexed starting from 1, not 0.

Weight Matrix Dimensions

$\Theta^{(j)}$ is a matrix of dimension (output-layer neurons) × (input-layer neurons + 1):

  • The input side includes the bias
  • The output side does NOT include the bias

$\Theta^{(j)} \in \mathbb{R}^{s_{j+1} \times (s_j + 1)}$

Where:

  • $s_j$ = number of neurons/units in layer $j$
  • $s_{j+1}$ = number of units in the output layer $j+1$
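The dimension rule is easy to sanity-check with a tiny helper (`theta_shape` is a hypothetical name introduced here):

```python
def theta_shape(s_j, s_j_plus_1):
    # Theta^{(j)} is in R^{s_{j+1} x (s_j + 1)}: the +1 is the bias on the input side.
    return (s_j_plus_1, s_j + 1)

print(theta_shape(3, 4))  # (4, 4): 3 inputs + bias -> 4 hidden units
print(theta_shape(4, 1))  # (1, 5): 4 hidden units + bias -> 1 output
```

These two shapes match the practical example that follows.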

Practical Example:

Every unit in a layer is densely connected to every activation unit of the next layer:

  • Input layer: 3 units + 1 Bias
  • Hidden layer 1: 4 units + 1 Bias
  • Output layer: 1 unit

graph LR
    %% ===== Layer 1 =====
    subgraph "Layer 1: Input Layer"
        x0((x0 = 1))
        x1(((x1)))
        x2(((x2)))
        x3(((x3)))
    end

    %% ===== Layer 2 =====
    subgraph "Layer 2: Hidden Layer"
        a0{a0 = 1}
        a1{a1}
        a2{a2}
        a3{a3}
        a4{a4}
    end

    %% ===== Layer 3 =====
    subgraph "Layer 3: Output Layer"
        y(((hθx)))
    end

  
    %% Input → Hidden
    x0 --> a1
    x0 --> a2
    x0 --> a3
    x0 --> a4

    x1 --> a1
    x1 --> a2
    x1 --> a3
    x1 --> a4
  
    x2 --> a1
    x2 --> a2
    x2 --> a3
    x2 --> a4
  
    x3 --> a1
    x3 --> a2
    x3 --> a3
    x3 --> a4


%% Hidden → Output
    a0 --> y
    a1 --> y
    a2 --> y
    a3 --> y
    a4 --> y

Simplified

Showing the bias units, with only one connection each drawn for reference:


graph LR
    x0(((x0))) --> a1{a1}
    x1(((x1))) --> a1{a1}
    x1 --> a2{a2}
    x1 --> a3{a3}
    x1 --> a4{a4}
        
    x2(((x2))) --> a1
    x2 --> a2
    x2 --> a3
    x2 --> a4
  
    x3(((x3))) --> a1
    x3 --> a2
    x3 --> a3
    x3 --> a4

    a0{a0} --> y(((hθx)))
    a1 --> y
    a2 --> y
    a3 --> y
    a4 --> y

Given

  • Input layer: $x_0, x_1, x_2, x_3$
  • Hidden layer: $a_0, a_1, a_2, a_3, a_4$
  • Output layer: $h_\theta(x)$

Where:

  • $x_0 = 1$ → bias unit for the input layer
  • $a_0 = 1$ → bias unit for the hidden layer

Weight Matrix:

Layer 1 ($\Theta^{(1)}$)

Input Layer → Hidden Layer

$\Theta^{(1)} \in \mathbb{R}^{4 \times 4}$ (a 4 × 4 matrix)

=> 4 hidden units $(a_1, a_2, a_3, a_4)$ × 4 inputs $(x_0, x_1, x_2, x_3)$

Layer 2 ($\Theta^{(2)}$)

Hidden Layer → Output Layer

$\Theta^{(2)} \in \mathbb{R}^{1 \times 5}$ (a 1 × 5 matrix)

=> 1 output neuron × 5 hidden units $(a_0, a_1, a_2, a_3, a_4)$

Activation of Neurons in Layer 2

First neuron in layer 2:

$a^{(2)}_1 = g(\Theta^{(1)}_{10}x_0 + \Theta^{(1)}_{11}x_1 + \Theta^{(1)}_{12}x_2 + \Theta^{(1)}_{13}x_3)$

Second neuron in layer 2:

$a^{(2)}_2 = g(\Theta^{(1)}_{20}x_0 + \Theta^{(1)}_{21}x_1 + \Theta^{(1)}_{22}x_2 + \Theta^{(1)}_{23}x_3)$

Third neuron in layer 2:

$a^{(2)}_3 = g(\Theta^{(1)}_{30}x_0 + \Theta^{(1)}_{31}x_1 + \Theta^{(1)}_{32}x_2 + \Theta^{(1)}_{33}x_3)$

Where

  • $g$: sigmoid activation function applied to the weighted sum

3. Output layer

Gives us the final hypothesis $h_\Theta(x)$

  • Hidden layer outputs become inputs to the next layer
  • Another weight matrix $\Theta^{(2)}$ is applied
  • Then the sigmoid function is applied again

Output Layer Hypothesis

The final hypothesis is the first (and only) neuron of the 3rd layer:

$h_\Theta(x) = a^{(3)}_1$

which equals

$a^{(3)}_1 = g(\Theta^{(2)}_{10}a^{(2)}_0 + \Theta^{(2)}_{11}a^{(2)}_1 + \Theta^{(2)}_{12}a^{(2)}_2 + \Theta^{(2)}_{13}a^{(2)}_3 + \Theta^{(2)}_{14}a^{(2)}_4)$

Where

  • $g$: sigmoid applied to the final weighted sum
  • $\Theta^{(2)}$ is the weight matrix for the final output layer
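The whole 3-layer example can be sketched end to end (the weights here are random placeholders, not trained values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
Theta1 = rng.standard_normal((4, 4))  # Theta^{(1)}: 4 hidden x (3 inputs + bias)
Theta2 = rng.standard_normal((1, 5))  # Theta^{(2)}: 1 output x (4 hidden + bias)

x = np.array([0.5, -1.0, 2.0])     # three input features
a1 = np.concatenate(([1.0], x))    # prepend bias x_0 = 1
a2 = sigmoid(Theta1 @ a1)          # hidden activations a^{(2)}
a2 = np.concatenate(([1.0], a2))   # prepend bias a^{(2)}_0 = 1
h = sigmoid(Theta2 @ a2)           # h_Theta(x) = a^{(3)}_1
print(h.shape)                     # (1,)
```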

Advanced Example: 4-Layer Neural Network

  • Input layer: 3 units
  • Hidden layer 1: 3 units
  • Hidden layer 2: 3 units
  • Output layer: 1 unit

graph LR

%% Input Layer
    subgraph Input Layer
        x1(((x1)))
        x2(((x2)))
        x3(((x3)))
    end

%% Hidden Layer 1
    subgraph Hidden Layer 1
        a1{a1}
        a2{a2}
        a3{a3}
    end

%% Hidden Layer 2
    subgraph Hidden Layer 2
        b1{b1}
        b2{b2}
        b3{b3}
    end

%% Output Layer
    subgraph Output Layer
        y(((hθx)))
    end

%% Connections: Input → Hidden 1
    x1 --> a1
    x1 --> a2
    x1 --> a3
    x2 --> a1
    x2 --> a2
    x2 --> a3
    x3 --> a1
    x3 --> a2
    x3 --> a3
%% Connections: Hidden 1 → Hidden 2
    a1 --> b1
    a1 --> b2
    a1 --> b3
    a2 --> b1
    a2 --> b2
    a2 --> b3
    a3 --> b1
    a3 --> b2
    a3 --> b3
%% Connections: Hidden 2 → Output
    b1 --> y
    b2 --> y
    b3 --> y

Weight Matrix:

$\Theta^{(1)} \in \mathbb{R}^{3 \times 4}$

$\Theta^{(2)} \in \mathbb{R}^{3 \times 4}$

$\Theta^{(3)} \in \mathbb{R}^{1 \times 4}$

Activation of Neurons in Layer 2

First neuron in layer 2:

$a^{(2)}_1 = g(\Theta^{(1)}_{10}x_0 + \Theta^{(1)}_{11}x_1 + \Theta^{(1)}_{12}x_2 + \Theta^{(1)}_{13}x_3)$

Second neuron in layer 2:

$a^{(2)}_2 = g(\Theta^{(1)}_{20}x_0 + \Theta^{(1)}_{21}x_1 + \Theta^{(1)}_{22}x_2 + \Theta^{(1)}_{23}x_3)$

Third neuron in layer 2:

$a^{(2)}_3 = g(\Theta^{(1)}_{30}x_0 + \Theta^{(1)}_{31}x_1 + \Theta^{(1)}_{32}x_2 + \Theta^{(1)}_{33}x_3)$

Generalized:

$a^{(2)}_i = g(\Theta^{(1)}_{i0}x_0 + \Theta^{(1)}_{i1}x_1 + \Theta^{(1)}_{i2}x_2 + \Theta^{(1)}_{i3}x_3)$

for $i = 1, 2, 3$

Activation of Neurons in Layer 3

Following the same pattern:

$a^{(3)} = g(z^{(3)})$

For each neuron $i$ in layer 3:

$a^{(3)}_i = g\left( \Theta^{(2)}_{i0}a^{(2)}_0 + \Theta^{(2)}_{i1}a^{(2)}_1 + \Theta^{(2)}_{i2}a^{(2)}_2 + \Theta^{(2)}_{i3}a^{(2)}_3 \right)$

for $i = 1, 2, 3$

The output layer repeats the pattern once more, giving $h_\Theta(x) = a^{(4)}$.

Forward Pass:

$a^{(1)} = x$

$a^{(2)} = g\left(\Theta^{(1)} a^{(1)}\right)$

$a^{(3)} = g\left(\Theta^{(2)} a^{(2)}\right)$

$a^{(4)} = g\left(\Theta^{(3)} a^{(3)}\right)$

$h_\Theta(x) = a^{(4)}$

Output Layer Hypothesis

The final hypothesis is the single neuron of the 4th (output) layer:

$h_\Theta(x) = a^{(4)}_1$

Forward Propagation

For each layer:

$z^{(j+1)} = \Theta^{(j)} a^{(j)}$

$a^{(j+1)} = g\left(z^{(j+1)}\right)$
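The per-layer rule turns into a short loop (a sketch; `forward` is a hypothetical helper, and the weights are random placeholders for the 4-layer network above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(thetas, x):
    # For each layer: prepend the bias unit, then a^{(j+1)} = g(Theta^{(j)} a^{(j)}).
    a = x
    for Theta in thetas:
        a = np.concatenate(([1.0], a))
        a = sigmoid(Theta @ a)
    return a

rng = np.random.default_rng(1)
thetas = [rng.standard_normal((3, 4)),  # Theta^{(1)} in R^{3 x 4}
          rng.standard_normal((3, 4)),  # Theta^{(2)} in R^{3 x 4}
          rng.standard_normal((1, 4))]  # Theta^{(3)} in R^{1 x 4}
h = forward(thetas, np.array([0.2, -0.4, 1.0]))
print(h.shape)  # (1,)
```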