Laboratory Activity 3#

Define Inputs and Weights#

We begin by defining the input vector x, target output y, hidden layer weights (W_hidden), and output layer parameters (theta).
The learning rate (lr) is also set to control how much the weights are updated during backpropagation.

# Define inputs, weights, target, and learning rate (given values)
import numpy as np

print("Define inputs and weights (as given)")
x = np.array([1., 0., 1.])        # input vector
y = 1.                           # target scalar

W_hidden = np.array([            # shape (3,2) inputs->2 hidden units
    [0.2, -0.3],
    [0.4,  0.1],
    [-0.5, 0.2]
])
theta = np.array([-0.4, 0.2, 0.1])  # [bias, w_h1, w_h2]

lr = 0.001  # learning rate

print("x =", x)
print("y =", y)
print("W_hidden =\n", W_hidden)
print("theta =", theta)
print("learning rate =", lr)
Define inputs and weights (as given)
x = [1. 0. 1.]
y = 1.0
W_hidden =
 [[ 0.2 -0.3]
 [ 0.4  0.1]
 [-0.5  0.2]]
theta = [-0.4  0.2  0.1]
learning rate = 0.001

Explanation:

  • x = [1, 0, 1]: Represents one data sample with three input features.

  • y = 1: The expected (target) output value.

  • W_hidden: Weights connecting the input layer to the two hidden neurons.

  • theta: Parameters for the output layer, including bias and weights for each hidden neuron.

  • lr = 0.001: A small step size for updating the weights during learning.
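
As a quick sanity check before the forward pass (an added aside, not part of the original lab), the array shapes can be verified directly:

# Optional shape check (uses the variables defined above)
assert x.shape == (3,), "x should hold three input features"
assert W_hidden.shape == (3, 2), "W_hidden maps 3 inputs to 2 hidden units"
assert theta.shape == (3,), "theta holds [bias, w_h1, w_h2]"
print("shapes OK:", x.shape, W_hidden.shape, theta.shape)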

Forward Pass – Hidden Pre-Activation (z_hidden)#

This step computes the weighted sum of inputs for each hidden neuron before applying the activation function.

# Forward pass - hidden pre-activation
print("Forward pass - hidden pre-activation (z_hidden)")
z_hidden = x.dot(W_hidden)   # shape (2,)
print("z_hidden =", z_hidden)   # expect [-0.3, -0.1]
Forward pass - hidden pre-activation (z_hidden)
z_hidden = [-0.3 -0.1]

Explanation:

  • Formula: $z_{hidden} = x \cdot W_{hidden}$

  • Result: z_hidden = [-0.3, -0.1]

  • These pre-activation values determine whether each hidden neuron will activate (fire) once ReLU is applied: positive values pass through, negative values are zeroed.
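
To make the dot product concrete, the same values can be recomputed one hidden unit at a time (an added check using the variables above):

# Hand-computed pre-activations, one hidden unit at a time
z_h1 = x[0]*W_hidden[0, 0] + x[1]*W_hidden[1, 0] + x[2]*W_hidden[2, 0]   # 0.2 + 0.0 - 0.5 = -0.3
z_h2 = x[0]*W_hidden[0, 1] + x[1]*W_hidden[1, 1] + x[2]*W_hidden[2, 1]   # -0.3 + 0.0 + 0.2 = -0.1
print(np.allclose([z_h1, z_h2], z_hidden))   # True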

Hidden Activation using ReLU (a_hidden)#

We now apply the ReLU (Rectified Linear Unit) activation function to each hidden neuron output.

# Forward pass - hidden activation (ReLU)
print("Forward pass - hidden activation (a_hidden) using ReLU")
a_hidden = np.maximum(0, z_hidden)
relu_derivative = (z_hidden > 0).astype(float)  # derivative of ReLU wrt z_hidden
print("a_hidden =", a_hidden)
print("ReLU derivative (d a / d z) =", relu_derivative)
Forward pass - hidden activation (a_hidden) using ReLU
a_hidden = [0. 0.]
ReLU derivative (d a / d z) = [0. 0.]

Explanation:

  • Formula: $a_{hidden} = \max(0, z_{hidden})$

  • Result: a_hidden = [0, 0]

  • Since both pre-activation values are negative, ReLU outputs 0 for both neurons.

  • The ReLU derivative is [0, 0], meaning no gradient will flow backward through these neurons — they are “inactive.”
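
As an added aside, the mask (z_hidden > 0).astype(float) uses the common convention that the ReLU derivative at exactly z = 0 is taken to be 0:

# Small illustration of the ReLU derivative convention used above
sample = np.array([-2.0, 0.0, 3.0])
print(np.maximum(0, sample))           # [0. 0. 3.]  (ReLU output)
print((sample > 0).astype(float))      # [0. 0. 1.]  (derivative mask; 0 at z = 0 by convention)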

Output Pre-Activation (z_out) and Prediction (ŷ)#

We now compute the output neuron’s pre-activation value using the output layer parameters (theta).

# Forward pass - output pre-activation and prediction (identity)
print("Forward pass - output pre-activation (z_out) and prediction y_hat")
bias = theta[0]
w_h = theta[1:]   # [w_h1, w_h2]
z_out = bias + w_h.dot(a_hidden)    # scalar
y_hat = z_out  # identity activation
print("bias =", bias)
print("w_h =", w_h)
print("z_out =", z_out)
print("y_hat =", y_hat)
Forward pass - output pre-activation (z_out) and prediction y_hat
bias = -0.4
w_h = [0.2 0.1]
z_out = -0.4
y_hat = -0.4

Explanation:

  • Formula: $z_{out} = \text{bias} + a_{hidden} \cdot w_h$

  • With bias = -0.4, w_h = [0.2, 0.1], and a_hidden = [0, 0],
    z_out = -0.4

  • The prediction (ŷ) uses an identity activation, so ŷ = -0.4.

  • The network predicts -0.4, far from the true output y = 1.

Compute Loss (Mean Squared Error)#

We measure how far the prediction is from the true target using the Mean Squared Error (MSE) loss function.

# Compute loss (MSE: 0.5*(y - y_hat)^2)
print("Compute loss (MSE)")
loss = 0.5 * (y - y_hat)**2
print("loss =", loss)
Compute loss (MSE)
loss = 0.9799999999999999

Explanation:

  • Formula: $E = \frac{1}{2}(y - \hat{y})^2$

  • Result: E = 0.98

  • A high loss indicates a large prediction error.
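
Spelling out the arithmetic with the values above (an added check):

# Worked arithmetic for the loss
error = y - y_hat                      # 1 - (-0.4) = 1.4
print("error =", error)
print("loss  =", 0.5 * error**2)       # 0.5 * 1.96 ≈ 0.98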

Backpropagation – Gradients at Output Layer#

We now compute the gradients of the loss with respect to the output neuron parameters (theta).

# Backpropagation - gradients at output layer
# For identity output, d y_hat / d z_out = 1
print("Backpropagation - gradients at output layer")
dE_dyhat = -(y - y_hat)        # derivative of 0.5*(y-ŷ)^2 wrt ŷ = -(y-ŷ)
dyhat_dzout = 1.0
dE_dzout = dE_dyhat * dyhat_dzout
print("dE/dy_hat =", dE_dyhat)
print("dE/dz_out =", dE_dzout)

# Gradients w.r.t theta parameters: d z_out / d theta = [1, a_h1, a_h2]
dE_dtheta = np.array([dE_dzout * 1.0, dE_dzout * a_hidden[0], dE_dzout * a_hidden[1]])
print("dE/dtheta =", dE_dtheta)
Backpropagation - gradients at output layer
dE/dy_hat = -1.4
dE/dz_out = -1.4
dE/dtheta = [-1.4 -0.  -0. ]

Explanation:

  • The derivative of MSE w.r.t. the output is $\frac{\partial E}{\partial \hat{y}} = \hat{y} - y$.

  • So, dE/dy_hat = -1.4 and dE/dz_out = -1.4.

  • The gradients for the output-layer parameters are then $dE/d\theta = [-1.4,\ 0,\ 0]$.

  • Only the bias term receives a gradient because the hidden activations were both 0.
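
As an optional sanity check (not part of the original lab), the analytic gradient for theta can be compared against a central finite-difference estimate, holding a_hidden fixed:

# Numerical check of dE/dtheta via central finite differences
def loss_given_theta(t):
    """Loss as a function of the output-layer parameters, with a_hidden held fixed."""
    z = t[0] + t[1]*a_hidden[0] + t[2]*a_hidden[1]
    return 0.5 * (y - z)**2

eps = 1e-6
numeric_grad = np.zeros(3)
for k in range(3):
    t_plus, t_minus = theta.copy(), theta.copy()
    t_plus[k] += eps
    t_minus[k] -= eps
    numeric_grad[k] = (loss_given_theta(t_plus) - loss_given_theta(t_minus)) / (2 * eps)

print("numeric dE/dtheta =", numeric_grad)   # close to [-1.4, 0, 0]
print("matches analytic:", np.allclose(numeric_grad, dE_dtheta, atol=1e-6))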

Backpropagate to Hidden Layer#

Next, we compute how the error propagates back to the hidden layer weights.

# Backpropagate to hidden layer and compute gradients for W_hidden
print("Backpropagate to hidden layer and compute dE/dW_hidden")

# dE/d a_hidden = dE/d z_out * d z_out / d a_hidden = dE_dzout * w_h
dE_dah = dE_dzout * w_h  # shape (2,)
print("dE/d a_hidden =", dE_dah)

# d a / d z (ReLU derivative) computed earlier
dE_dzh = dE_dah * relu_derivative  # shape (2,)
print("dE/d z_hidden =", dE_dzh)

# d z_hidden_j / d W_hidden_ij = x_i  -> so dE/dW_hidden_ij = x_i * dE/dz_hidden_j
# We'll compute gradient matrix same shape as W_hidden (3,2)
dE_dW_hidden = np.zeros_like(W_hidden)
for i in range(W_hidden.shape[0]):   # inputs
    for j in range(W_hidden.shape[1]):  # hidden units
        dE_dW_hidden[i, j] = x[i] * dE_dzh[j]

print("dE/dW_hidden =\n", dE_dW_hidden)
Backpropagate to hidden layer and compute dE/dW_hidden
dE/d a_hidden = [-0.28 -0.14]
dE/d z_hidden = [-0. -0.]
dE/dW_hidden =
 [[-0. -0.]
 [-0. -0.]
 [-0. -0.]]

Explanation:

  • The gradients passed to the hidden activations: $dE/da_{hidden} = dE/dz_{out} \times w_h$

  • Result: dE/da_hidden = [-0.28, -0.14]

  • Because both pre-activations were negative, the ReLU derivative is 0 for each unit, so $dE/dz_{hidden} = [0, 0]$.

  • Therefore, no gradient reaches the hidden weights, and dE/dW_hidden is all zeros.
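
Because each entry is simply x_i multiplied by dE/dz_hidden_j, the double loop above is exactly an outer product; a vectorized equivalent (using the same variables) is:

# Vectorized form of the gradient loop above
dE_dW_hidden_vec = np.outer(x, dE_dzh)    # shape (3, 2): rows follow inputs, columns follow hidden units
print(np.allclose(dE_dW_hidden_vec, dE_dW_hidden))   # True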

Parameter Updates#

Now we update all weights and biases using the computed gradients and learning rate (lr = 0.001).

# Parameter updates (gradient descent)
print("Parameter updates using learning rate lr =", lr)

# Update theta: theta_new = theta - lr * dE/dtheta
theta_new = theta - lr * dE_dtheta
# Update W_hidden similarly
W_hidden_new = W_hidden - lr * dE_dW_hidden

print("theta_old =", theta)
print("dE/dtheta =", dE_dtheta)
print("theta_new =", theta_new)
print("\nW_hidden_old =\n", W_hidden)
print("dE/dW_hidden =\n", dE_dW_hidden)
print("W_hidden_new =\n", W_hidden_new)
Parameter updates using learning rate lr = 0.001
theta_old = [-0.4  0.2  0.1]
dE/dtheta = [-1.4 -0.  -0. ]
theta_new = [-0.3986  0.2     0.1   ]

W_hidden_old =
 [[ 0.2 -0.3]
 [ 0.4  0.1]
 [-0.5  0.2]]
dE/dW_hidden =
 [[-0. -0.]
 [-0. -0.]
 [-0. -0.]]
W_hidden_new =
 [[ 0.2 -0.3]
 [ 0.4  0.1]
 [-0.5  0.2]]

Explanation:

  • Update rule: $\theta_{new} = \theta_{old} - lr \times dE/d\theta$

  • Only the bias term changes slightly:
    theta_new = [-0.3986, 0.2, 0.1]

  • Hidden weights remain the same because their gradients were zero.
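
To see the effect of this single update, the forward pass can be repeated with the new parameters (an added illustration; only the bias moved, so the prediction barely changes):

# Forward pass with the updated parameters (illustrative only)
z_hidden_2 = x.dot(W_hidden_new)
a_hidden_2 = np.maximum(0, z_hidden_2)
y_hat_2 = theta_new[0] + theta_new[1:].dot(a_hidden_2)
loss_2 = 0.5 * (y - y_hat_2)**2
print("new prediction =", y_hat_2)   # -0.3986
print("new loss =", loss_2)          # slightly below the previous 0.98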

Summary and Interpretation#

# Summary of gradients and small commentary
print("Summary and interpretation")
print(f"Because both hidden ReLU units were inactive (a_hidden = [0,0]), the gradients flowing")
print("into the hidden weights are zero for inputs where x=0, and proportional to x where x=1.")
print("Numeric results:")
print("dE/dtheta:", dE_dtheta)
print("dE/dW_hidden:\n", dE_dW_hidden)
print("\nNote: Because a_hidden = [0,0], dE/dtheta's second and third components are zero.")
print("Also, the output error is large because prediction (-0.4) is far from target (1).")
print("-- End --\n")
Summary and interpretation
Because both hidden ReLU units were inactive (a_hidden = [0,0]), the ReLU derivative is 0
for each unit, so all gradients flowing into the hidden weights are zero on this step.
Numeric results:
dE/dtheta: [-1.4 -0.  -0. ]
dE/dW_hidden:
 [[-0. -0.]
 [-0. -0.]
 [-0. -0.]]

Note: Because a_hidden = [0,0], dE/dtheta's second and third components are zero.
Also, the output error is large because prediction (-0.4) is far from target (1).
-- End --

Final Analysis:

  • Both hidden ReLU units were inactive (a_hidden = [0, 0]), so they did not contribute to learning in this step.

  • As a result, the hidden weights did not update, and only the bias term changed slightly.

  • The prediction error remains high (ŷ = -0.4 vs y = 1), showing that the network did not learn effectively on this input.

Key takeaways:

  • ReLU can “die” (stop learning) if neurons receive only negative pre-activations; the Leaky ReLU sketch after this list shows one common remedy.

  • Proper initialization and varied inputs are crucial for training success.

  • Forward and backward propagation form the foundation of neural network learning.
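
To illustrate the first takeaway, here is a small sketch (not part of the original lab, with an assumed negative-side slope alpha = 0.01) of how a Leaky ReLU would have kept a gradient flowing through the hidden layer on this same step:

# Illustrative sketch: the same step with Leaky ReLU instead of ReLU
alpha = 0.01                                                     # assumed negative-side slope
a_leaky = np.where(z_hidden > 0, z_hidden, alpha * z_hidden)     # [-0.003, -0.001]
y_hat_leaky = theta[0] + theta[1:].dot(a_leaky)
dE_dzout_leaky = -(y - y_hat_leaky)                              # identity output, as before
dE_dzh_leaky = dE_dzout_leaky * theta[1:] * np.where(z_hidden > 0, 1.0, alpha)
dE_dW_leaky = np.outer(x, dE_dzh_leaky)
print("dE/dW_hidden with Leaky ReLU =\n", dE_dW_leaky)           # small but nonzero entries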