After logistic regression, it is time to move on the Neural Network Algorithm since we also need to use gradient descent here. So I will first introduce the basic logic first, and then calculate the mathematics, and finally the implementation.

# Basic Logic

The basic logic behind this algorithm is layers. The first layer, namely the input, will calculate its outputs to the next layer based on its own weights as the second layer’s inputs, and the second layer will also calculate its outputs to the third layer based on its own weights as the third layer’s inputs, so on so forth. The first layer is the input layer, the last layer is the output layer, and the middle layers are all hidden layers.

Inside each layer, there are many nodes. Each node is associated with some nodes in the previous layer except the first layer. Also, each node is associated with some weights that directs them to the next layer.

Please see the picture below.

So how do we update the weights? The process is, we first confirm the nodes we need and the initial weights, and then calculate the nodes layer by layer, and at the last layer we have both the predicted values and the true values, then use the values to calculate the errors (RSS) layer by layer in a back order. After that we have the RSS values for each layer, we can update the weights in each layer based on the gradient descent algorithm.

So how to achieve this?

# Theory Derivation

Let me first introduce some symbols I will use in this part. $w_{ij}^l$ is the weight, where i points to the next layer node and j points to the current layer node, l is the pointer of the layer number. $a_k^l$ is the predicted value for the kth node in the lth layer. $z_k^l$ is the value for $w_kj^{l-1} * a_j^{l-1}$, usually substitute in the sigmoid function.

So at the last layer, we have all $y$ and $a$ values in each node, and define RSS as:

$$ RSS = \frac{1}{2}\sum_{i=1}^n(y_i - a_i)^2 $$

Since the gradient descent formula for the $l-1$ layer is:

$$ w^{l-1}_{kj} := w^{l-1}_{kj} - \frac{\partial RSS}{\partial w_{kj}^{l-1}} $$

So we need to calculate $\frac{\partial RSS}{\partial w_{kj}^{l-1}}$.

By the chain principle,

$$

\begin{align}

\frac{\partial RSS}{\partial w_{kj}^{l-1}} &= \frac{\partial RSS}{\partial a^l_k}\frac{\partial a^l_k}{\partial z^l_k}\frac{\partial z^l_k}{\partial w^{l-1}_{kj}}

\end{align}

$$

since

$$

\frac{\partial RSS}{\partial a^l_k} = -(y_k - a_k^l)

$$

$$

\frac{\partial a^l_k}{\partial z^l_k} = \frac{\partial sigmoid(z^l_k)}{\partial z^l_k} = (1 - a_k^l)a_k^l

$$

$$

\frac{\partial z^l_k}{\partial w^{l-1}_{kj}} = a_j^{l-1}

$$

so

$$

\frac{\partial RSS}{\partial w_{kj}^{l-1}} = -(y_k - a_k^l)(1 - a_k^l)a_k^la_j^{l-1}

$$

Since the first three elements belong to the l layer, then we can substitute them as $\delta^l_k$:

$$

\delta^l_k := -(y_k - a_k^l)(1 - a_k^l)a_k^l

$$

Then

$$

\frac{\partial RSS}{\partial w_{kj}^{l-1}} = \delta^l_ka_j^{l-1}

$$

Now it is done for the $l-1$ layer. Now comes to the $l-2$ layer.

$$

\frac{\partial RSS}{\partial w_{ji}^{l-2}} = \frac{\partial RSS}{\partial a^{l-1}_j}\frac{\partial a^{l-1}_j}{\partial z^{l-1}_j}\frac{\partial z^{l-1}_j}{\partial w^{l-2}_{ji}}

$$

We know the value of the last two, now calculates the first one:

$$

\begin{align}

\frac{\partial RSS}{\partial a^{l-1}_j} &= \frac{\partial RSS}{\partial a^l_s}\frac{\partial a^l_s}{\partial z^l_s}\frac{\partial z^l_s}{\partial a^{l-1}_j}\\

&= -\sum_s((y_s - a_s^l)(1-a_s^l)a_s^lw^{l-1}_{sj})

\end{align}

$$

Then we have:

$$

\begin{align}

\frac{\partial RSS}{\partial w_{ji}^{l-2}} &= -\sum_s((y_s - a_s^l)(1-a_s^l)a_s^lw^{l-1}_{sj})*(1-a_j^{l-1})a_j^{l-1}a_i^{l-2}\\

&= \delta_j^{l-1}a_i^{l-2}

\end{align}

$$

As before, define the following as $\delta_j^{l-1}$

$$

\begin{align}

\delta_j^{l-1} &:= -\sum_s((y_s - a_s^l)(1-a_s^l)a_s^lw^{l-1}_{sj}) * (1-a_j^{l-1})a_j^{l-1}\\

&= \sum_s(\delta_s^lw^{l-1}_{sj}) * (1-a_j^{l-1})a_j^{l-1}

\end{align}

$$

So that is the formula for the $l-1$ layer. Note we can now write the generic formula:

$$

\frac{\partial RSS}{\partial w_{ji}^{l^{‘}}} = \delta_j^{l{‘}+1} * a_i^{l{‘}}

$$

where

$$

\delta_j^{l{‘}+1} = \sum_s(\delta_s^{l^{‘}+2}w^{l^{‘}+1}_{sj})*(1-a_j^{l^{‘}+1})a_j^{l^{‘}+1}

$$

# Implementation

First I claim that I need to have a matrix $n*l$ for $\delta$ and $a$, and a $n*n*l$ dimension array for $w$.