In Logistic Regression with a Neural Network mindset, we built a neural network that uses logistic regression to solve a linear classification problem. In this post, we will build a neural network with one hidden layer to solve a non-linear classification problem.

## What I Will Code

• Implement a 2-class classification neural network with a single hidden layer

• Use units with a non-linear activation function, such as tanh

• Compute the cross entropy loss

• Implement forward and backward propagation

## Defining the neural network structure

### layer_size()

This function will define three variables:

• n_x: the size of the input layer
• n_h: the size of the hidden layer (set this to 4)

• n_y: the size of the output layer
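A minimal sketch of layer_size(), under assumptions not stated above: X has shape (n_x, m), Y has shape (n_y, m), and the hidden size is a parameter that defaults to 4. The signature and the toy arrays are illustrative only.

```python
import numpy as np

def layer_size(X, Y, n_h=4):
    """Return (n_x, n_h, n_y), assuming X is (n_x, m) and Y is (n_y, m)."""
    n_x = X.shape[0]  # size of the input layer (number of features)
    n_y = Y.shape[0]  # size of the output layer
    return n_x, n_h, n_y

# Hypothetical toy data: 2 features, 1 label row, 5 examples
X = np.zeros((2, 5))
Y = np.zeros((1, 5))
n_x, n_h, n_y = layer_size(X, Y)
print(n_x, n_h, n_y)  # prints: 2 4 1
```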

## Initialize the model’s parameters

### initialize_parameters()

• Make sure the parameters’ sizes are right. Refer to the neural network figure above if needed.
• I will initialize the weight matrices with random values.

• Use: np.random.randn(a,b) * 0.01 to randomly initialize a matrix of shape (a,b).
• I will initialize the bias vectors as zeros.

• Use: np.zeros((a,b)) to initialize a matrix of shape (a,b) with zeros.
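The two rules above could be combined roughly as follows. This is a sketch, not necessarily the exact function used here: the `seed` argument is an illustrative addition for reproducibility, and the shapes assume one column per training example.

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y, seed=2):
    """Small random weights and zero biases. `seed` is an illustrative extra argument."""
    np.random.seed(seed)  # fixed seed only so the sketch is reproducible
    W1 = np.random.randn(n_h, n_x) * 0.01  # hidden-layer weights, shape (n_h, n_x)
    b1 = np.zeros((n_h, 1))                # hidden-layer biases, shape (n_h, 1)
    W2 = np.random.randn(n_y, n_h) * 0.01  # output-layer weights, shape (n_y, n_h)
    b2 = np.zeros((n_y, 1))                # output-layer biases, shape (n_y, 1)
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}

parameters = initialize_parameters(n_x=2, n_h=4, n_y=1)
for name, value in parameters.items():
    print(name, value.shape)
```

The weights are scaled by 0.01 so that tanh starts in its near-linear region, where gradients are not vanishingly small.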

## The Loop

### forward_propagation()

#### Steps

1. Retrieve each parameter from the dictionary “parameters” (which is the output of initialize_parameters()) by using parameters[".."].
1. Implement Forward Propagation. Compute $Z^{[1]}, A^{[1]}, Z^{[2]}$ and $A^{[2]}$ (the vector of all our predictions on all the examples in the training set).
1. Values needed for backpropagation are stored in “cache”. The cache will be given as an input to the backpropagation function.
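The three steps above might be sketched like this. It assumes a tanh hidden layer and a sigmoid output unit (a natural choice for 2-class classification, but an assumption here); the `sigmoid` helper and the all-zero demo parameters are illustrative.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(X, parameters):
    # Step 1: retrieve each parameter from the dictionary
    W1, b1 = parameters["W1"], parameters["b1"]
    W2, b2 = parameters["W2"], parameters["b2"]
    # Step 2: forward pass; b1 and b2 broadcast across the m columns
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)  # predictions for all examples, shape (n_y, m)
    # Step 3: keep the intermediate values for backpropagation
    cache = {"Z1": Z1, "A1": A1, "Z2": Z2, "A2": A2}
    return A2, cache

# Demo with all-zero parameters: every prediction is sigmoid(0) = 0.5
parameters = {"W1": np.zeros((4, 2)), "b1": np.zeros((4, 1)),
              "W2": np.zeros((1, 4)), "b2": np.zeros((1, 1))}
X = np.array([[1.0, -2.0, 0.5], [0.3, 0.0, -1.0]])
A2, cache = forward_propagation(X, parameters)
print(A2)
```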

### compute_cost()

Now that I have computed $A^{[2]}$ (in the Python variable “A2”), which contains $a^{[2](i)}$ for every example, I can compute the cost function as follows:
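Assuming the cost is the standard binary cross-entropy named in the goals at the top, $J = -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log a^{[2](i)} + (1 - y^{(i)}) \log (1 - a^{[2](i)}) \right)$, a minimal sketch of compute_cost() could look like this. The signature and the toy values are illustrative.

```python
import numpy as np

def compute_cost(A2, Y):
    """Binary cross-entropy averaged over the m examples (columns)."""
    m = Y.shape[1]
    logprobs = Y * np.log(A2) + (1 - Y) * np.log(1 - A2)
    return float(-np.sum(logprobs) / m)

# With every prediction at 0.5, the cost equals log(2) regardless of the labels
A2 = np.full((1, 4), 0.5)
Y = np.array([[0.0, 1.0, 1.0, 0.0]])
print(compute_cost(A2, Y))
```

In practice A2 may need clipping away from exactly 0 and 1 before taking the logarithm; this sketch leaves that out.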

### backward_propagation()

Backpropagation is usually the hardest (most mathematical) part of deep learning. Here is the slide from the lecture on backpropagation. I will use the six equations on the right of this slide, since I am building a vectorized implementation.

$\frac{\partial \mathcal{J} }{ \partial z_{2}^{(i)} } = \frac{1}{m} (a^{[2](i)} - y^{(i)})$

$\frac{\partial \mathcal{J} }{ \partial W_2 } = \frac{\partial \mathcal{J} }{ \partial z_{2}^{(i)} } a^{[1](i)T}$

$\frac{\partial \mathcal{J} }{ \partial b_2 } = \sum_i{\frac{\partial \mathcal{J} }{ \partial z_{2}^{(i)}}}$

$\frac{\partial \mathcal{J} }{ \partial z_{1}^{(i)} } = W_2^T \frac{\partial \mathcal{J} }{ \partial z_{2}^{(i)} } * \left( 1 - \left(a^{[1](i)}\right)^2 \right)$

$\frac{\partial \mathcal{J} }{ \partial W_1 } = \frac{\partial \mathcal{J} }{ \partial z_{1}^{(i)} } X^T$

$\frac{\partial \mathcal{J} }{ \partial b_1 } = \sum_i{\frac{\partial \mathcal{J} }{ \partial z_{1}^{(i)}}}$

• Note that $*$ denotes elementwise multiplication.
• The notation I will use is common in deep learning coding:
• dW1 = $\frac{\partial \mathcal{J} }{ \partial W_1 }$
• db1 = $\frac{\partial \mathcal{J} }{ \partial b_1 }$
• dW2 = $\frac{\partial \mathcal{J} }{ \partial W_2 }$
• db2 = $\frac{\partial \mathcal{J} }{ \partial b_2 }$
• Tips:
• To compute dZ1 we need to compute $g^{[1]'}(Z^{[1]})$. Since $g^{[1]}(.)$ is the tanh activation function, if $a = g^{[1]}(z)$ then $g^{[1]'}(z) = 1-a^2$. So we can compute $g^{[1]'}(Z^{[1]})$ using (1 - np.power(A1, 2)).
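The six equations translate to NumPy almost line by line. The sketch below assumes a sigmoid output unit, a cache holding A1 and A2, and a hypothetical signature; the toy data at the bottom only checks the gradient shapes.

```python
import numpy as np

def backward_propagation(parameters, cache, X, Y):
    """Gradients of the cross-entropy cost for a tanh hidden layer and a sigmoid output."""
    m = X.shape[1]
    W2 = parameters["W2"]
    A1, A2 = cache["A1"], cache["A2"]
    dZ2 = (A2 - Y) / m                           # the 1/m factor is folded in here
    dW2 = dZ2 @ A1.T                             # sums over examples via the matrix product
    db2 = np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * (1 - np.power(A1, 2))   # elementwise tanh derivative
    dW1 = dZ1 @ X.T
    db1 = np.sum(dZ1, axis=1, keepdims=True)
    return {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}

# Hypothetical toy data and parameters, used only to check shapes
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 5))
Y = np.array([[0.0, 1.0, 1.0, 0.0, 1.0]])
parameters = {"W1": rng.standard_normal((3, 2)), "b1": np.zeros((3, 1)),
              "W2": rng.standard_normal((1, 3)), "b2": np.zeros((1, 1))}
A1 = np.tanh(parameters["W1"] @ X + parameters["b1"])
A2 = 1.0 / (1.0 + np.exp(-(parameters["W2"] @ A1 + parameters["b2"])))
grads = backward_propagation(parameters, {"A1": A1, "A2": A2}, X, Y)
for name, value in grads.items():
    print(name, value.shape)
```

Each gradient has the same shape as the parameter it belongs to, which is a quick sanity check before running gradient descent.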

General gradient descent rule: $\theta = \theta - \alpha \frac{\partial J }{ \partial \theta }$ where $\alpha$ is the learning rate and $\theta$ represents a parameter.
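Applied to the parameter dictionary, the rule might be sketched as below. The function name update_parameters and the "d" + name key convention are assumptions that follow the dW1, db1 notation above; the numbers are made up.

```python
import numpy as np

def update_parameters(parameters, grads, learning_rate):
    """Apply theta = theta - alpha * d_theta to every parameter in the dictionary."""
    return {name: value - learning_rate * grads["d" + name]
            for name, value in parameters.items()}

# Made-up values, just to show one update step
parameters = {"W1": np.ones((2, 2)), "b1": np.zeros((2, 1))}
grads = {"dW1": np.full((2, 2), 0.5), "db1": np.full((2, 1), -1.0)}
updated = update_parameters(parameters, grads, learning_rate=0.1)
print(updated["W1"][0, 0], updated["b1"][0, 0])  # 1 - 0.1*0.5 and 0 - 0.1*(-1)
```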

Illustration: The gradient descent algorithm with a good learning rate (converging) and a bad learning rate (diverging). Images courtesy of Adam Harley.

With a well-chosen learning rate, gradient descent converges, as in the left GIF. With a badly chosen learning rate, it diverges, as in the right GIF.

## Predictions

Now I will use the model to make predictions by building predict().
It uses forward propagation to predict results.

Reminder: predictions = $y_{prediction} = \mathbb{1}\{\text{activation} > 0.5\} = \begin{cases} 1 & \text{if } \text{activation} > 0.5 \\ 0 & \text{otherwise} \end{cases}$

As an example, if we would like to set the entries of a matrix X to 0 and 1 based on a threshold we would do:
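A small sketch of that thresholding, followed by one possible predict(). The forward pass inside predict() assumes a tanh hidden layer and a sigmoid output, and the parameter values are hand-picked for illustration, not trained.

```python
import numpy as np

# Thresholding a matrix: entries above the threshold become 1, the rest 0
X = np.array([[0.2, 0.7], [0.5, 0.9]])
threshold = 0.5
X_new = (X > threshold).astype(int)
print(X_new)  # X_new is [[0, 1], [0, 1]]

def predict(parameters, X):
    """Run forward propagation, then threshold the output activation at 0.5."""
    A1 = np.tanh(parameters["W1"] @ X + parameters["b1"])
    A2 = 1.0 / (1.0 + np.exp(-(parameters["W2"] @ A1 + parameters["b2"])))
    return (A2 > 0.5).astype(int)

# Hand-picked hypothetical parameters: one hidden unit that looks at the first feature
parameters = {"W1": np.array([[1.0, 0.0]]), "b1": np.zeros((1, 1)),
              "W2": np.array([[5.0]]), "b2": np.zeros((1, 1))}
inputs = np.array([[2.0, -2.0], [0.0, 0.0]])  # two examples, one per column
print(predict(parameters, inputs))  # [[1 0]]
```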

It is time to run the model and see how it performs on a planar dataset:
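A self-contained sketch of such a run is shown below. The planar dataset used in this post is not reproduced, so make_toy_data() is a hypothetical stand-in (points inside a circle versus outside), and the learning rate and iteration count are arbitrary choices. The cost and accuracy it prints will therefore differ from the numbers reported in this post.

```python
import numpy as np

def make_toy_data(m=200, seed=1):
    """Hypothetical stand-in dataset: class 1 inside a circle, class 0 outside."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-2.0, 2.0, size=(2, m))                          # shape (2, m)
    Y = (np.sum(X ** 2, axis=0, keepdims=True) < 1.5).astype(float)  # shape (1, m)
    return X, Y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_model(X, Y, n_h=4, num_iterations=5000, learning_rate=0.3, print_cost=False):
    """Train a one-hidden-layer network with batch gradient descent."""
    rng = np.random.default_rng(2)
    n_x, n_y, m = X.shape[0], Y.shape[0], X.shape[1]
    W1, b1 = rng.standard_normal((n_h, n_x)) * 0.01, np.zeros((n_h, 1))
    W2, b2 = rng.standard_normal((n_y, n_h)) * 0.01, np.zeros((n_y, 1))
    costs = []
    for i in range(num_iterations):
        # Forward propagation
        A1 = np.tanh(W1 @ X + b1)
        A2 = sigmoid(W2 @ A1 + b2)
        # Cross-entropy cost, with predictions clipped to avoid log(0)
        A2c = np.clip(A2, 1e-12, 1 - 1e-12)
        cost = float(-np.mean(Y * np.log(A2c) + (1 - Y) * np.log(1 - A2c)))
        costs.append(cost)
        # Backward propagation
        dZ2 = (A2 - Y) / m
        dW2 = dZ2 @ A1.T
        db2 = np.sum(dZ2, axis=1, keepdims=True)
        dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
        dW1 = dZ1 @ X.T
        db1 = np.sum(dZ1, axis=1, keepdims=True)
        # Gradient descent update
        W1 -= learning_rate * dW1
        b1 -= learning_rate * db1
        W2 -= learning_rate * dW2
        b2 -= learning_rate * db2
        if print_cost and i % 1000 == 0:
            print(f"Cost after iteration {i}: {cost:f}")
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}, costs

def predict(parameters, X):
    A1 = np.tanh(parameters["W1"] @ X + parameters["b1"])
    A2 = sigmoid(parameters["W2"] @ A1 + parameters["b2"])
    return (A2 > 0.5).astype(int)

X, Y = make_toy_data()
parameters, costs = nn_model(X, Y, print_cost=True)
accuracy = np.mean(predict(parameters, X) == Y) * 100
print(f"Accuracy: {accuracy:.0f}%")
```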

Output:

 Cost after iteration 9000 0.218607

Output:

 Accuracy 90%

## Tuning hidden layer size (optional/ungraded exercise)

Output:

### Interpretation

• The larger models (with more hidden units) are able to fit the training set better, until eventually the largest models overfit the data.
• The best hidden layer size seems to be around n_h = 5. Indeed, a value around here seems to fit the data well without incurring noticeable overfitting.
• You will also learn later about regularization, which lets you use very large models (such as n_h = 50) without much overfitting.
