In Logistic Regression with a Neural Network Mindset, we built a model that uses logistic regression to solve a linear classification problem. In this post, we will build a neural network with one hidden layer to solve a non-linear classification problem:
What I will code:
Implement a 2-class classification neural network with a single hidden layer
Use units with a non-linear activation function, such as tanh
Compute the cross-entropy loss
Implement forward and backward propagation
Defining the neural network structure
layer_size()
This function will define three variables:
 n_x: the size of the input layer
n_h: the size of the hidden layer (set this to 4)
n_y: the size of the output layer
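The code block for this step did not survive in the post. Here is a minimal sketch of what layer_size() could look like, assuming X has shape (n_x, m) and Y has shape (n_y, m):

```python
import numpy as np

def layer_size(X, Y):
    """Return the layer sizes for a network with one hidden layer.

    X -- input data of shape (n_x, m)
    Y -- labels of shape (n_y, m)
    """
    n_x = X.shape[0]  # size of the input layer
    n_h = 4           # size of the hidden layer, fixed to 4 as stated above
    n_y = Y.shape[0]  # size of the output layer
    return (n_x, n_h, n_y)
```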
Initialize the model's parameters
initialize_parameters()
Make sure our parameters' sizes are right. Refer to the neural network figure above if needed.
I will initialize the weight matrices with random values. Use:
np.random.randn(a,b) * 0.01
to randomly initialize a matrix of shape (a,b).
I will initialize the bias vectors with zeros. Use:
np.zeros((a,b))
to initialize a matrix of shape (a,b) with zeros.
def initialize_parameters(n_x, n_h, n_y):
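A sketch of how initialize_parameters() might be filled in, following the two rules above (the fixed random seed is my addition, for reproducibility):

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y):
    """Random small weights, zero biases."""
    np.random.seed(2)  # assumption: seed fixed so runs are reproducible
    W1 = np.random.randn(n_h, n_x) * 0.01  # weights of layer 1, shape (n_h, n_x)
    b1 = np.zeros((n_h, 1))                # biases of layer 1
    W2 = np.random.randn(n_y, n_h) * 0.01  # weights of layer 2, shape (n_y, n_h)
    b2 = np.zeros((n_y, 1))                # biases of layer 2
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}
```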
The Loop
forward_propagation()
Steps:
Retrieve each parameter from the dictionary "parameters" (the output of initialize_parameters()) by using parameters[".."].
Implement forward propagation: compute $Z^{[1]}, A^{[1]}, Z^{[2]}$ and $A^{[2]}$ (the vector of all our predictions on all the examples in the training set).
Store the values needed for backpropagation in "cache". The cache will be given as an input to the backpropagation function.
Code
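Following these steps, forward_propagation() could look like this (tanh for the hidden layer, sigmoid for the output layer; the inline sigmoid() helper is my addition):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(X, parameters):
    """Compute A2 and cache the intermediate values for backpropagation."""
    W1, b1 = parameters["W1"], parameters["b1"]
    W2, b2 = parameters["W2"], parameters["b2"]
    Z1 = np.dot(W1, X) + b1
    A1 = np.tanh(Z1)          # hidden layer: tanh activation
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)          # output layer: sigmoid, so A2 is in (0, 1)
    cache = {"Z1": Z1, "A1": A1, "Z2": Z2, "A2": A2}
    return A2, cache
```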
compute_cost()
Now that I have computed $A^{[2]}$ (in the Python variable "A2"), which contains $a^{[2](i)}$ for every example, I can compute the cross-entropy cost as follows:

$J = -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log a^{[2](i)} + (1 - y^{(i)}) \log\left(1 - a^{[2](i)}\right) \right)$
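A possible compute_cost() implementing this formula (the np.squeeze call is a convenience so a plain scalar is returned):

```python
import numpy as np

def compute_cost(A2, Y):
    """Cross-entropy cost, averaged over the m training examples."""
    m = Y.shape[1]
    logprobs = Y * np.log(A2) + (1 - Y) * np.log(1 - A2)
    cost = -np.sum(logprobs) / m
    return float(np.squeeze(cost))
```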
backward_propagation()
Backpropagation is usually the hardest (most mathematical) part of deep learning. Here is the slide from the lecture on backpropagation. I will use the six equations on the right of this slide, since we are building a vectorized implementation.
$\frac{\partial \mathcal{J} }{ \partial z_{2}^{(i)} } = \frac{1}{m} (a^{[2](i)} - y^{(i)})$
$\frac{\partial \mathcal{J} }{ \partial W_2 } = \frac{\partial \mathcal{J} }{ \partial z_{2}^{(i)} } a^{[1](i)T}$
$\frac{\partial \mathcal{J} }{ \partial b_2 } = \sum_i \frac{\partial \mathcal{J} }{ \partial z_{2}^{(i)} }$
$\frac{\partial \mathcal{J} }{ \partial z_{1}^{(i)} } = W_2^T \frac{\partial \mathcal{J} }{ \partial z_{2}^{(i)} } * (1 - a^{[1](i)2})$
$\frac{\partial \mathcal{J} }{ \partial W_1 } = \frac{\partial \mathcal{J} }{ \partial z_{1}^{(i)} } X^T$
$\frac{\partial \mathcal{J} }{ \partial b_1 } = \sum_i \frac{\partial \mathcal{J} }{ \partial z_{1}^{(i)} }$
 Note that $*$ denotes elementwise multiplication.
 The notation I will use is common in deep learning coding:
 dW1 = $\frac{\partial \mathcal{J} }{ \partial W_1 }$
 db1 = $\frac{\partial \mathcal{J} }{ \partial b_1 }$
 dW2 = $\frac{\partial \mathcal{J} }{ \partial W_2 }$
 db2 = $\frac{\partial \mathcal{J} }{ \partial b_2 }$
 Tips:
To compute dZ1, we first need $g^{[1]\prime}(Z^{[1]})$. Since $g^{[1]}(\cdot)$ is the tanh activation function, if $a = g^{[1]}(z)$ then $g^{[1]\prime}(z) = 1 - a^2$. So we can compute $g^{[1]\prime}(Z^{[1]})$ using (1 - np.power(A1, 2)).
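The six equations above translate into a backward_propagation() sketch like this, with A1 and A2 taken from the forward-pass cache:

```python
import numpy as np

def backward_propagation(parameters, cache, X, Y):
    """Vectorized backprop for the six gradient equations above."""
    m = X.shape[1]
    W2 = parameters["W2"]
    A1, A2 = cache["A1"], cache["A2"]
    dZ2 = A2 - Y
    dW2 = np.dot(dZ2, A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = np.dot(W2.T, dZ2) * (1 - np.power(A1, 2))  # tanh'(Z1) = 1 - A1**2
    dW1 = np.dot(dZ1, X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}
```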
General gradient descent rule: $\theta = \theta - \alpha \frac{\partial J }{ \partial \theta }$ where $\alpha$ is the learning rate and $\theta$ represents a parameter.
Illustration: The gradient descent algorithm with a good learning rate (converging) and a bad learning rate (diverging). Images courtesy of Adam Harley.
With a well-chosen learning rate, the cost converges as in the left GIF; with a poorly chosen (too large) learning rate, it diverges as in the right GIF.
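A sketch of an update_parameters() step applying this rule to all four parameters (the default learning rate of 1.2 is my assumption, not necessarily the original's):

```python
def update_parameters(parameters, grads, learning_rate=1.2):
    """One gradient descent step: theta = theta - alpha * dtheta."""
    for name in ("W1", "b1", "W2", "b2"):
        parameters[name] = parameters[name] - learning_rate * grads["d" + name]
    return parameters
```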
Integrate the functions above in nn_model()
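A self-contained sketch of what nn_model() could look like, with the helper steps inlined; the fixed seed, the default learning rate of 1.2, and the @ matrix-multiply shorthand are my choices:

```python
import numpy as np

def nn_model(X, Y, n_h=4, num_iterations=10000, learning_rate=1.2, print_cost=False):
    """Initialize parameters, then loop:
    forward propagation -> cost -> backward propagation -> update."""
    np.random.seed(3)  # assumption: fixed seed for reproducibility
    n_x, n_y, m = X.shape[0], Y.shape[0], X.shape[1]
    W1 = np.random.randn(n_h, n_x) * 0.01; b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h) * 0.01; b2 = np.zeros((n_y, 1))
    for i in range(num_iterations):
        # forward propagation
        A1 = np.tanh(W1 @ X + b1)
        A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))
        # cross-entropy cost
        cost = -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m
        # backward propagation
        dZ2 = A2 - Y
        dW2 = dZ2 @ A1.T / m
        db2 = np.sum(dZ2, axis=1, keepdims=True) / m
        dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
        dW1 = dZ1 @ X.T / m
        db1 = np.sum(dZ1, axis=1, keepdims=True) / m
        # gradient descent update
        W1 -= learning_rate * dW1; b1 -= learning_rate * db1
        W2 -= learning_rate * dW2; b2 -= learning_rate * db2
        if print_cost and i % 1000 == 0:
            print("Cost after iteration %i: %f" % (i, cost))
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}
```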
Predictions
Now I will build predict() and use the model to make predictions.
Use forward propagation to predict results.
Reminder: predictions = $y_{prediction} = \mathbb{1}\{activation > 0.5\} = \begin{cases} 1 & \text{if}\ activation > 0.5 \\ 0 & \text{otherwise} \end{cases}$
As an example, if we would like to set the entries of a matrix X to 0 and 1 based on a threshold, we would do: X_new = (X > threshold)
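A sketch of predict() reusing the same forward pass and the vectorized threshold:

```python
import numpy as np

def predict(parameters, X):
    """Forward propagate X and threshold the output activation at 0.5."""
    W1, b1 = parameters["W1"], parameters["b1"]
    W2, b2 = parameters["W2"], parameters["b2"]
    A1 = np.tanh(np.dot(W1, X) + b1)
    A2 = 1.0 / (1.0 + np.exp(-(np.dot(W2, A1) + b2)))
    return (A2 > 0.5).astype(int)  # vectorized thresholding, no loops
```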
It is time to run the model and see how it performs on a planar dataset:
parameters = nn_model(X, Y, n_h = 4, num_iterations = 10000, print_cost=True)
Output:
Cost after iteration 9000: 0.218607
Print accuracy
predictions = predict(parameters, X)
Output:
Accuracy: 90%
Tuning hidden layer size (optional/ungraded exercise)
Interpretation
 The larger models (with more hidden units) are able to fit the training set better, until eventually the largest models overfit the data.
The best hidden layer size seems to be around n_h = 5. Indeed, a value around here seems to fit the data well without incurring noticeable overfitting.
 You will also learn later about regularization, which lets you use very large models (such as n_h = 50) without much overfitting.