
Neural Network Hypothesis and Intuition

Explore the hypothesis and intuition behind neural networks, including their structure, activation functions, and how they process inputs to produce outputs.

Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026


Neural Networks Overview

The Feature Explosion Problem

Why Do We Need Neural Networks?

Suppose we have $x_1, x_2, \dots, x_n$ as input features and we want to compute a hypothesis $h_\theta(x)$.

With linear features, we can apply logistic regression:

$$g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n)$$

For quadratic features:

$$g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n + \theta_{n+1} x_1^2 + \dots)$$

  • Quadratic terms grow roughly as $n^2/2$

So we will end up with about 5,000 additional features if we have 100 features.

For cubic features:

$$g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n + \theta_{n+1} x_1^2 + \dots + \theta_{n+k} x_1^3 + \dots)$$

  • Cubic terms grow as $O(n^3)$

  • So we will end up with 166,000 additional features if we have 100 features.

As the features become more complex, the number of parameters $\theta$ grows rapidly.

It becomes:

  • Computationally expensive to compute the hypothesis with many features.
  • Memory-heavy to store all the parameters.
  • Prone to overfitting due to the large number of parameters.

In this case, we can use a neural network to compute the hypothesis more efficiently.

Practical Example: Image Recognition

Suppose we have a 100 × 100 pixel image as input.

  • Each pixel is a feature, so we have 10,000 features.
  • For RGB images, we have 3 color channels, so we have 30,000 features.

If we want to compute a hypothesis with quadratic features, we would have on the order of 450 million features, which is computationally infeasible.
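These counts can be reproduced as combinations with repetition; a quick Python check (the exact counts come out slightly above the rough $n^2/2$ and $O(n^3)$ estimates quoted above):

```python
from math import comb

def poly_feature_count(n: int, degree: int) -> int:
    """Number of monomials of exactly `degree` in n variables:
    combinations with repetition, C(n + degree - 1, degree)."""
    return comb(n + degree - 1, degree)

# 100 raw features -> quadratic terms grow roughly as n^2 / 2
print(poly_feature_count(100, 2))      # 5050
# 100 raw features -> cubic terms grow as O(n^3)
print(poly_feature_count(100, 3))      # 171700
# 100x100 RGB image: 30,000 features -> quadratic blow-up
print(poly_feature_count(30_000, 2))   # 450015000 (~450 million)
```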

Conclusion

Polynomial logistic regression works for small $n$, but:

  • It explodes combinatorially for large $n$
  • It is computationally infeasible for large feature sets (like images)
  • We need a non-linear model that can capture complex relationships without explicitly generating all polynomial features.

Neural Networks as a Solution

Neural networks can compute complex hypotheses without explicitly generating all polynomial features.

Neurons as Computational Units

At a simple level, neurons are computational units.

They:

  • Dendrites (inputs): take inputs $x_1, x_2, \dots, x_n$
  • Cell body: processes them by applying weights and an activation function
  • Axon (output): produces an output $h_\theta(x)$

Artificial Neurons

In artificial neural networks, we model neurons as mathematical functions.

In our machine learning model:

Inputs are the features $x_1, x_2, \dots, x_n$:

$$x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$

  • $x_0$ is the bias unit, and it is always equal to 1

The neuron applies a weighted transformation to the input features using the parameter vector:

$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}$$

The output is the hypothesis $h_\theta(x)$, where:

$$h_\theta(x) = g(\theta^T x)$$

Writing $z = \theta^T x$, the hypothesis can be expressed as:

$$h_\theta(x) = g(z)$$

$g(z)$ is the activation function that introduces non-linearity into the model.

Activation Function

Neural networks use the same logistic (sigmoid) activation function as logistic regression:

$$h_\theta(x) = g(z)$$

$$g(z) = \frac{1}{1 + e^{-z}}$$

For a single unit:

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

This function is called the sigmoid activation function.

In neural networks:

  • Parameters $\theta$ are called weights
  • Outputs of neurons are called activations
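The single-unit computation above can be sketched in plain Python; the weights here are hypothetical, just to show the arithmetic:

```python
import math

def sigmoid(z: float) -> float:
    """Logistic activation g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(theta: list[float], x: list[float]) -> float:
    """Single unit: h_theta(x) = g(theta^T x).
    x[0] must be the bias unit, always 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# Hypothetical weights; x = [x0 (bias), x1, x2]
theta = [-1.0, 2.0, 0.5]
x = [1.0, 0.6, 0.4]
print(round(neuron(theta, x), 4))  # g(-1 + 1.2 + 0.2) = g(0.4) ≈ 0.5987
```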

Neural Network Structure

An Artificial Neural Network is a computational graph with layers of artificial neurons.

A simple network looks like:

$$\begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix} \rightarrow \text{Neuron} \rightarrow h_\theta(x)$$

Complex networks have multiple layers:

$$\text{Input} \rightarrow \text{Hidden} \rightarrow \text{Output}$$

$$\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix} \rightarrow \begin{bmatrix} a^{(2)}_1 \\ a^{(2)}_2 \\ a^{(2)}_3 \end{bmatrix} \rightarrow h_\theta(x)$$

```mermaid
graph LR
  Input --> Hidden-Layer
  Hidden-Layer --> Output
```
  • Layer 1 (input layer): takes features $x_1, x_2, \dots, x_n$ and the bias unit $x_0$
  • Layer 2 (hidden layer): computes intermediate activations $a^{(2)}_1, a^{(2)}_2, \dots$
  • Layer $n$ (output layer): gives us the final hypothesis $h_\theta(x)$

Where:

  • $a^{(j)}_i$ = activation of unit $i$ in layer $j$
  • $a^{(2)}_1$: first neuron in the hidden layer
  • $\Theta^{(j)}$ = weight matrix mapping layer $j$ to layer $j+1$

Each unit in a layer is densely connected to every unit of the next layer:

Top-down view:

```mermaid
graph TD

subgraph Input Layer
x1(((x1)))
x2(((x2)))
x3(((x3)))
end

subgraph Hidden Layer
a1{a1}
a2{a2}
a3{a3}
end

subgraph Output Layer
y(((hθx)))
end

x1 --> a1
x1 --> a2
x1 --> a3

x2 --> a1
x2 --> a2
x2 --> a3

x3 --> a1
x3 --> a2
x3 --> a3

a1 --> y
a2 --> y
a3 --> y
```

Left-to-right view:


```mermaid
graph LR

subgraph Input Layer
    x1(((x1)))
    x2(((x2)))
    x3(((x3)))
end

subgraph Hidden Layer
    a1{a1}
    a2{a2}
    a3{a3}
end

subgraph Output Layer
    y(((hθx)))
end

x1 --> a1
x1 --> a2
x1 --> a3

x2 --> a1
x2 --> a2
x2 --> a3

x3 --> a1
x3 --> a2
x3 --> a3

a1 --> y
a2 --> y
a3 --> y
```

Simplified:


```mermaid
graph LR

x1(((x1))) --> a1{a1}
x1 --> a2{a2}
x1 --> a3{a3}

x2(((x2))) --> a1
x2 --> a2
x2 --> a3

x3(((x3))) --> a1
x3 --> a2
x3 --> a3

a1 --> y(((hθx)))
a2 --> y
a3 --> y
```

Advanced 4-layer neural network:


```mermaid
graph LR

%% Input Layer
subgraph Input Layer
x1(((x1)))
x2(((x2)))
x3(((x3)))
end

%% Hidden Layer 1
subgraph Hidden Layer 1
a1{a1}
a2{a2}
a3{a3}
end

%% Hidden Layer 2
subgraph Hidden Layer 2
b1{b1}
b2{b2}
b3{b3}
end

%% Output Layer
subgraph Output Layer
y(((hθx)))
end

%% Connections: Input → Hidden 1
x1 --> a1
x1 --> a2
x1 --> a3

x2 --> a1
x2 --> a2
x2 --> a3

x3 --> a1
x3 --> a2
x3 --> a3

%% Connections: Hidden 1 → Hidden 2
a1 --> b1
a1 --> b2
a1 --> b3

a2 --> b1
a2 --> b2
a2 --> b3

a3 --> b1
a3 --> b2
a3 --> b3

%% Connections: Hidden 2 → Output
b1 --> y
b2 --> y
b3 --> y
```


Neural Network Hypothesis

Neural networks are simply multiple logistic regression units stacked together.

Each layer:

  1. Takes activations from the previous layer
  2. Multiplies them by a weight matrix
  3. Applies the sigmoid activation
  4. Passes the result forward

Computing Hidden Layer Activations

Each hidden unit is computed as:

$$a^{(2)}_1 = g(\Theta^{(1)}_{10}x_0 + \Theta^{(1)}_{11}x_1 + \Theta^{(1)}_{12}x_2 + \Theta^{(1)}_{13}x_3)$$

$$a^{(2)}_2 = g(\Theta^{(1)}_{20}x_0 + \Theta^{(1)}_{21}x_1 + \Theta^{(1)}_{22}x_2 + \Theta^{(1)}_{23}x_3)$$

$$a^{(2)}_3 = g(\Theta^{(1)}_{30}x_0 + \Theta^{(1)}_{31}x_1 + \Theta^{(1)}_{32}x_2 + \Theta^{(1)}_{33}x_3)$$

This means:

  • We use a 3 × 4 matrix of weights
  • Each row corresponds to one hidden unit
  • Each row multiplies all input features (including bias)
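A minimal sketch of this hidden-layer computation in plain Python, using a hypothetical 3 × 4 weight matrix (the weight values are made up for illustration):

```python
import math

def sigmoid(z: float) -> float:
    """Logistic activation g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_activations(Theta: list[list[float]], x: list[float]) -> list[float]:
    """Compute a_i = g(row_i . x) for each row of the weight matrix.
    Theta has shape s_{j+1} x (s_j + 1); x includes the bias unit x0 = 1."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in Theta]

# Hypothetical Theta^(1): 3 hidden units, 3 inputs + bias -> 3x4 matrix
Theta1 = [
    [0.1, 0.2, -0.3, 0.4],   # weights for a^(2)_1
    [-0.5, 0.6, 0.7, -0.8],  # weights for a^(2)_2
    [0.9, -1.0, 1.1, 0.2],   # weights for a^(2)_3
]
x = [1.0, 0.5, -0.2, 0.3]    # [bias x0, x1, x2, x3]
a2 = layer_activations(Theta1, x)
print([round(a, 3) for a in a2])
```

Each row of `Theta1` produces one hidden-unit activation, exactly as in the three equations above.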

Output Layer

The final hypothesis is:

$$h_\Theta(x) = a^{(3)}_1$$

$$a^{(3)}_1 = g(\Theta^{(2)}_{10}a^{(2)}_0 + \Theta^{(2)}_{11}a^{(2)}_1 + \Theta^{(2)}_{12}a^{(2)}_2 + \Theta^{(2)}_{13}a^{(2)}_3)$$

So:

  • Hidden layer outputs become inputs to the next layer
  • Another weight matrix Θ(2)\Theta^{(2)}Θ(2) is applied
  • Then the sigmoid function is applied again

Weight Matrix Dimensions

If:

  • Layer $j$ has $s_j$ units
  • Layer $j+1$ has $s_{j+1}$ units

Then:

$$\Theta^{(j)} \in \mathbb{R}^{\, s_{j+1} \times (s_j + 1)}$$

Why the +1?

Because of the bias unit.

Important detail:

  • Input side includes bias
  • Output side does NOT include bias
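The dimension rule can be checked with a tiny helper; the layer sizes below are those of the 3-input, 3-hidden-unit, 1-output network above:

```python
def theta_shape(s_j: int, s_j_next: int) -> tuple[int, int]:
    """Shape of Theta^(j) mapping a layer with s_j units to a layer
    with s_{j+1} units: s_{j+1} x (s_j + 1).
    The +1 accounts for the bias on the input side only."""
    return (s_j_next, s_j + 1)

print(theta_shape(3, 3))  # (3, 4): Theta^(1), input -> hidden
print(theta_shape(3, 1))  # (1, 4): Theta^(2), hidden -> output
```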
