Forward Propagation
For any layer $j$:
Linear Step
Calculate the pre-activation term:
$$z^{(j)} = \Theta^{(j-1)} a^{(j-1)}$$
Activation Step
Apply the activation function:
$$a^{(j)} = g(z^{(j)})$$
This process is repeated, layer by layer, until the output layer is reached.
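As a minimal NumPy sketch of this loop (the function and variable names here are my own, not from the source):

```python
import numpy as np

def sigmoid(z):
    """Elementwise sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(thetas, x):
    """Forward propagation; `thetas` is a list of weight matrices."""
    a = x
    for theta in thetas:
        a = np.insert(a, 0, 1.0)  # prepend the bias unit a_0 = 1
        z = theta @ a             # linear step:     z^(j) = Theta^(j-1) a^(j-1)
        a = sigmoid(z)            # activation step: a^(j) = g(z^(j))
    return a                      # activation of the output layer
```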
From Scalar Equations to Vector Form
```mermaid
graph LR
    subgraph Input Layer
        x1(((x1)))
        x2(((x2)))
        x3(((x3)))
    end
    subgraph Hidden Layer 1
        a1{a1}
        a2{a2}
        a3{a3}
    end
    subgraph Output Layer
        y(((hθx)))
    end
    x1 --> a1
    x1 --> a2
    x1 --> a3
    x2 --> a1
    x2 --> a2
    x2 --> a3
    x3 --> a1
    x3 --> a2
    x3 --> a3
    a1 --> y
    a2 --> y
    a3 --> y
```
Previously, we wrote each neuron separately.
For the hidden layer:
$$a^{(2)}_1 = g\left(\Theta^{(1)}_{10}x_0 + \Theta^{(1)}_{11}x_1 + \Theta^{(1)}_{12}x_2 + \Theta^{(1)}_{13}x_3\right)$$
$$a^{(2)}_2 = g\left(\Theta^{(1)}_{20}x_0 + \Theta^{(1)}_{21}x_1 + \Theta^{(1)}_{22}x_2 + \Theta^{(1)}_{23}x_3\right)$$
$$a^{(2)}_3 = g\left(\Theta^{(1)}_{30}x_0 + \Theta^{(1)}_{31}x_1 + \Theta^{(1)}_{32}x_2 + \Theta^{(1)}_{33}x_3\right)$$
Where:
- the superscript in $a^{(2)}$ indicates layer 2 (the hidden layer), and
- $g(\cdot)$ is the sigmoid function.
The final hypothesis is:
$$h_\Theta(x) = a^{(3)}_1$$
Where
$$a^{(3)}_1 = g\left(\Theta^{(2)}_{10}a^{(2)}_0 + \Theta^{(2)}_{11}a^{(2)}_1 + \Theta^{(2)}_{12}a^{(2)}_2 + \Theta^{(2)}_{13}a^{(2)}_3\right)$$
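Spelled out in plain Python, the per-neuron computation looks like this (`Theta1` and `x` hold made-up illustrative values):

```python
import math

def g(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))

# Made-up weights: Theta1[k][i] connects input i to hidden unit k+1
Theta1 = [[0.1, 0.2, 0.3, 0.4],
          [0.5, 0.6, 0.7, 0.8],
          [0.9, 1.0, 1.1, 1.2]]
x = [1.0, 0.5, -0.5, 2.0]  # x[0] = 1 is the bias term x_0

# One line per hidden unit, mirroring the three equations above
a1 = g(sum(Theta1[0][i] * x[i] for i in range(4)))
a2 = g(sum(Theta1[1][i] * x[i] for i in range(4)))
a3 = g(sum(Theta1[2][i] * x[i] for i in range(4)))
```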
Writing each neuron separately like this does not scale, so we vectorize the computation for more complex networks.
Pre-Activation Term $z_k^{(j)}$
An intermediate variable that contains the weighted sum before the activation is applied.
Suppose
$$z_1^{(2)} = \Theta^{(1)}_{10}x_0 + \Theta^{(1)}_{11}x_1 + \Theta^{(1)}_{12}x_2 + \Theta^{(1)}_{13}x_3$$
$$z_2^{(2)} = \Theta^{(1)}_{20}x_0 + \Theta^{(1)}_{21}x_1 + \Theta^{(1)}_{22}x_2 + \Theta^{(1)}_{23}x_3$$
$$z_3^{(2)} = \Theta^{(1)}_{30}x_0 + \Theta^{(1)}_{31}x_1 + \Theta^{(1)}_{32}x_2 + \Theta^{(1)}_{33}x_3$$
🧠 Generalized pre-activation term:
$$z^{(j)}_k = \Theta^{(j-1)}_{k,0} a^{(j-1)}_0 + \Theta^{(j-1)}_{k,1} a^{(j-1)}_1 + \dots + \Theta^{(j-1)}_{k,n} a^{(j-1)}_n$$
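In code, this term is just a dot product between row $k$ of the weight matrix and the previous layer's activations (values below are illustrative; NumPy is 0-based where the text uses 1-based $k$):

```python
import numpy as np

# Row k of Theta holds the weights feeding unit k of layer j
Theta = np.array([[0.1, 0.2, 0.3, 0.4],
                  [0.5, 0.6, 0.7, 0.8],
                  [0.9, 1.0, 1.1, 1.2]])
a_prev = np.array([1.0, 0.5, -0.5, 2.0])  # a_0 = 1 is the bias unit

k = 0                      # unit index (0-based here)
z_k = Theta[k] @ a_prev    # z_k = sum_i Theta[k, i] * a_prev[i]
```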
Then
$$a^{(2)}_1 = g(z_1^{(2)})$$
$$a^{(2)}_2 = g(z_2^{(2)})$$
$$a^{(2)}_3 = g(z_3^{(2)})$$
🧠 Generalized Activation
$$a^{(j)}_k = g\left(z^{(j)}_k\right)$$
This separates the linear computation from the nonlinear activation.
Vector Representation
Input layer:
$$x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}$$
where $x_0 = 1$.
Let
$$a^{(1)} = x$$
Weighted sum vector:
$$z^{(j)} = \begin{bmatrix} z^{(j)}_1 \\ z^{(j)}_2 \\ \vdots \\ z^{(j)}_{s_j} \end{bmatrix}$$
Where:
$s_j$ is the number of units in layer $j$.
We can calculate $z$ as:
$$z^{(2)} = \Theta^{(1)}x$$
Since $x = a^{(1)}$, we can rewrite it as:
$$z^{(2)} = \Theta^{(1)} a^{(1)}$$
🧠 Generalized vectorized pre-activation term:
$$z^{(j)} = \Theta^{(j-1)} a^{(j-1)}$$
Where the dimensions are:
- $\Theta^{(j-1)} \in \mathbb{R}^{s_j \times (s_{j-1}+1)}$
- $a^{(j-1)} \in \mathbb{R}^{(s_{j-1}+1) \times 1}$
- $z^{(j)} \in \mathbb{R}^{s_j \times 1}$

Here $s_{j-1}+1$ counts the units of layer $j-1$ plus its bias unit.
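A quick shape check in NumPy (layer sizes here are arbitrary) confirms these dimensions:

```python
import numpy as np

s_prev, s_j = 3, 4                                   # assumed layer sizes
Theta = np.random.randn(s_j, s_prev + 1)             # s_j x (s_{j-1} + 1)
a_prev = np.insert(np.random.rand(s_prev), 0, 1.0)   # (s_{j-1} + 1,) with bias

z = Theta @ a_prev   # one pre-activation per unit in layer j
print(z.shape)       # (4,)
```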
Activation Function
Since
$$a^{(2)}_2 = g(z_2^{(2)})$$
Generalized Activation Function:
$$a^{(j)} = g\left(z^{(j)}\right)$$
where $g$ is applied elementwise.
If using sigmoid:
$$g(z) = \frac{1}{1 + e^{-z}}$$
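One way to implement this; NumPy's `exp` makes it apply elementwise, which is exactly what the vectorized $g(z^{(j)})$ needs:

```python
import numpy as np

def sigmoid(z):
    """Elementwise sigmoid: g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                          # 0.5
print(sigmoid(np.array([-1.0, 0.0, 1.0])))   # applies elementwise to a vector
```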
Add Bias Unit
After computing $a^{(2)}$, add:
$$a_0^{(2)} = 1$$
Now:
$$a^{(j)} = \begin{bmatrix} 1 \\ a^{(j)}_1 \\ \vdots \\ a^{(j)}_{s_j} \end{bmatrix}$$
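In NumPy this is a one-line prepend (the activation values below are illustrative):

```python
import numpy as np

a2 = np.array([0.3, 0.7, 0.9])   # a^(2) as computed by g(z^(2))
a2 = np.insert(a2, 0, 1.0)       # prepend the bias unit a_0^(2) = 1
print(a2)                        # [1.  0.3 0.7 0.9]
```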
Output Layer
Repeat the same process:
Calculate the linear term $z$:
$$z^{(j+1)} = \Theta^{(j)} a^{(j)}$$
Apply the sigmoid activation to $z$:
$$a^{(j+1)} = g\left(z^{(j+1)}\right)$$
Final hypothesis:
$$h_\Theta(x) = a^{(3)} = g(z^{(3)})$$
🧠 Generalized Hypothesis
$$h_\Theta(x) = a^{(j+1)} = g\left(z^{(j+1)}\right)$$
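Putting the steps together for the 3-3-1 network sketched earlier (the weights are random placeholders, not learned values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Placeholder weights for the 3-3-1 network
Theta1 = np.random.randn(3, 4)        # layer 1 (3 inputs + bias) -> layer 2
Theta2 = np.random.randn(1, 4)        # layer 2 (3 units + bias)  -> layer 3

x  = np.array([1.0, 0.5, -0.5, 2.0])  # input with x_0 = 1 already prepended
a2 = sigmoid(Theta1 @ x)              # z^(2) = Theta^(1) x,  a^(2) = g(z^(2))
a2 = np.insert(a2, 0, 1.0)            # add the bias unit a_0^(2) = 1
h  = sigmoid(Theta2 @ a2)             # h_Theta(x) = a^(3) = g(z^(3))
```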
The Big Picture
Each layer performs:
$$\text{Linear transformation:} \quad z = \Theta a$$
followed by
$$\text{Nonlinearity:} \quad a = g(z)$$
Stacking these layers allows neural networks to represent complex nonlinear functions.
Intuition
If we remove the hidden layer, the model becomes logistic regression:
$$h(x) = g(\theta^T x)$$
With hidden layers, the network instead uses learned features:
$$a_1, a_2, a_3$$
These are:
- computed by the hidden layer,
- learned from the data, and
- controlled by the parameters $\Theta^{(1)}$.
So a neural network is:
Logistic regression on learned features.