Dense Layer#

How do we actually initialize a layer for a new neural network?

  • Initialize the weights with small random values.

    • Why? As Andrew Ng explains, if all the weights/parameters are initialized to zero (or to the same value), then all the hidden units in a layer become symmetric, identical nodes.

    • With identical nodes there is no real learning or decision making, because every node computes the same value and receives the same update.

    • If all the weights are zero, every multiplication with the weights is also zero, so the propagation result is not a conclusive one (a dead network).

  • The bias can be initialized to zero,

    • as randomness is already introduced by the weights. For smaller neural networks, however, it is advised not to initialize the bias with zero. (A small initialization sketch follows this list.)
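As a minimal sketch of this initialization scheme (the layer sizes here are arbitrary, chosen only for illustration):

[ ]:
import numpy as np

n_inputs, n_neurons = 3, 5                       # illustrative layer sizes
w = 0.01 * np.random.randn(n_inputs, n_neurons)  # small random weights break the symmetry
b = np.zeros((1, n_neurons))                     # bias can start at zero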

\begin{align*}
X &= \begin{bmatrix}
x_1^{(1)} & x_1^{(2)} & \dots & x_1^{(m)}\\
x_2^{(1)} & x_2^{(2)} & \dots & x_2^{(m)}\\
& & \vdots \\
x_n^{(1)} & x_n^{(2)} & \dots & x_n^{(m)}\\
\end{bmatrix}_{n \times m}\\
W &= \begin{bmatrix}
w_1^{(1)} & w_1^{(2)} & \dots & w_1^{(m)}\\
w_2^{(1)} & w_2^{(2)} & \dots & w_2^{(m)}\\
& & \vdots \\
w_n^{(1)} & w_n^{(2)} & \dots & w_n^{(m)}\\
\end{bmatrix}_{n \times m}\\
b &= \begin{bmatrix} b_1 & b_2 & \dots & b_n \end{bmatrix}_{1 \times n}\\
Z &= X W^T + b\\
\\
&= \begin{bmatrix}
x_1^{(1)} & x_1^{(2)} & \dots & x_1^{(m)}\\
x_2^{(1)} & x_2^{(2)} & \dots & x_2^{(m)}\\
& & \vdots \\
x_n^{(1)} & x_n^{(2)} & \dots & x_n^{(m)}\\
\end{bmatrix}_{n \times m}
\begin{bmatrix}
w_1^{(1)} & w_2^{(1)} & \dots & w_n^{(1)}\\
w_1^{(2)} & w_2^{(2)} & \dots & w_n^{(2)}\\
& & \vdots \\
w_1^{(m)} & w_2^{(m)} & \dots & w_n^{(m)}\\
\end{bmatrix}_{m \times n}
+ \begin{bmatrix} b_1 & b_2 & \dots & b_n \end{bmatrix}_{1 \times n}\\
\\
&= \begin{bmatrix}
x_1^{(1)}w_1^{(1)} + x_1^{(2)}w_1^{(2)} + \dots + x_1^{(m)}w_1^{(m)} & \dots & x_1^{(1)}w_n^{(1)} + x_1^{(2)}w_n^{(2)} + \dots + x_1^{(m)}w_n^{(m)} \\
x_2^{(1)}w_1^{(1)} + x_2^{(2)}w_1^{(2)} + \dots + x_2^{(m)}w_1^{(m)} & \dots & x_2^{(1)}w_n^{(1)} + x_2^{(2)}w_n^{(2)} + \dots + x_2^{(m)}w_n^{(m)} \\
& \vdots \\
x_n^{(1)}w_1^{(1)} + x_n^{(2)}w_1^{(2)} + \dots + x_n^{(m)}w_1^{(m)} & \dots & x_n^{(1)}w_n^{(1)} + x_n^{(2)}w_n^{(2)} + \dots + x_n^{(m)}w_n^{(m)}
\end{bmatrix}_{n \times n}
+ \begin{bmatrix}
b_1 & b_2 & \dots & b_n\\
b_1 & b_2 & \dots & b_n\\
& & \vdots\\
b_1 & b_2 & \dots & b_n\\
\end{bmatrix}_{n \times n \text{ broadcasting}}\\
\\
&= \begin{bmatrix}
x_1^{(1)}w_1^{(1)} + x_1^{(2)}w_1^{(2)} + \dots + x_1^{(m)}w_1^{(m)} + b_1 & \dots & x_1^{(1)}w_n^{(1)} + x_1^{(2)}w_n^{(2)} + \dots + x_1^{(m)}w_n^{(m)} + b_n \\
x_2^{(1)}w_1^{(1)} + x_2^{(2)}w_1^{(2)} + \dots + x_2^{(m)}w_1^{(m)} + b_1 & \dots & x_2^{(1)}w_n^{(1)} + x_2^{(2)}w_n^{(2)} + \dots + x_2^{(m)}w_n^{(m)} + b_n \\
& \vdots \\
x_n^{(1)}w_1^{(1)} + x_n^{(2)}w_1^{(2)} + \dots + x_n^{(m)}w_1^{(m)} + b_1 & \dots & x_n^{(1)}w_n^{(1)} + x_n^{(2)}w_n^{(2)} + \dots + x_n^{(m)}w_n^{(m)} + b_n
\end{bmatrix}_{n \times n}
\end{align*}
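As a quick sanity check of the shapes above, here is a small NumPy sketch (the sizes are toy values chosen for illustration, not taken from the cells below):

[ ]:
import numpy as np

n, m = 3, 10                    # n rows, m columns as in the matrices above
X = np.random.random((n, m))    # X is (n, m)
W = np.random.random((n, m))    # W is (n, m)
b = np.random.random((1, n))    # b is (1, n), broadcast across rows

Z = X @ W.T + b                 # (n, m) @ (m, n) + (1, n) -> (n, n)
print(Z.shape)                  # (3, 3)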

Forward#

\begin{align*} Z^{[1]} &= A^{[0]} W^{[1]T} + b^{[1]}\\ A^{[1]} &= g^{[1]}(Z^{[1]})\\ \\ Z^{[2]} &= A^{[1]} W^{[2]T} + b^{[2]}\\ A^{[2]} &= g^{[2]}(Z^{[2]})\\ \end{align*}

Generalized \begin{align*} Z^{[l]} &= A^{[l-1]} W^{[l]T} + b^{[l]}\\ A^{[l]} &= g^{[l]}(Z^{[l]}) \end{align*}
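As a minimal sketch of this recurrence written as a loop (the sigmoid activation and the list of (W, b) pairs are illustrative assumptions, not part of the cells below):

[ ]:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(a0, params):
    """params is a list of (W, b) pairs, one per layer; returns the final activation."""
    a = a0
    for W, b in params:
        z = a @ W.T + b   # Z[l] = A[l-1] W[l]^T + b[l]
        a = sigmoid(z)    # A[l] = g[l](Z[l])
    return a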

[1]:
from abc import ABC,abstractmethod
import numpy as np
import matplotlib.pyplot as plt

Let's take two layers:

  • Let's take layer 1 as the input layer. This means the input is x, i.e. \(a^{[0]}\)

    • let the number of nodes (columns) be \(n^{[0]} = 3\)

    • and the number of samples be \(m = 10\)

    • shape of \(a^{[0]} = (n^{[0]},m)\) (3, 10)

    • shape of \(w^{[1]} = (n^{[0]},m) = dw^{[1]}\) (3, 10)

    • shape of \(b^{[1]} = (1, n^{[0]}) = db^{[1]}\) (1, 3)

    • shape of \(z^{[1]} = (n^{[0]},m) (m, n^{[0]}) + (1, n^{[0]}) = (n^{[0]}, n^{[0]}) = dz^{[1]}\) (3, 10) (10, 3)+ (1, 3) = (3, 3)

    • shape of \(z^{[1]}\) = shape of \(a^{[1]} = (n^{[0]}, n^{[0]})\) (3, 3)

  • Let's take layer 2 as the next layer, i.e. the first hidden layer. The input to this layer is \(a^{[1]}\)

    • let the number of nodes in this layer be \(n^{[1]} = 5\)

    • shape of \(w^{[2]} = (n^{[1]},n^{[0]}) = dw^{[2]}\) (5 ,3)

    • shape of \(b^{[2]} = (1, n^{[1]}) = db^{[2]}\) (1, 5)

    • shape of \(z^{[2]} = (n^{[0]}, n^{[0]}) ( n^{[0]}, n^{[1]}) + (1, n^{[1]}) = (n^{[0]},n^{[1]}) = dz^{[2]}\) (3, 3) (3, 5) + (1, 5) = (3, 5)

[2]:
n0 = 3
n1 = 5
m = 10
[3]:
a0 = np.random.random((n0, m))
w1 = np.random.random((n0, m))
b1 = np.random.random((1, n0))
print(w1.shape, a0.shape,'+', b1.shape)
(3, 10) (3, 10) + (1, 3)
[4]:
z1 = (a0 @ w1.T) + b1
z1.shape
[4]:
(3, 3)
[5]:
a1 = 1/(1 + np.exp(-z1))

a1.shape
[5]:
(3, 3)
[6]:
w2 = np.random.random((n1, n0))
b2 = np.random.random((1, n1))
print(w2.shape, a1.shape,'+', b2.shape)
(5, 3) (3, 3) + (1, 5)
[7]:
z2 = (a1 @ w2.T) + b2
z2.shape
[7]:
(3, 5)
[8]:
a2 = 1/(1 + np.exp(-z2))
a2.shape
[8]:
(3, 5)

Backward#

\begin{align*}
& \text{parameters for this layer (this function starts working from here)}\\
dW &= dZ' \cdot A^T\\
dB &= \sum(dZ')\\
\\
& \text{input for the next layer (in backward propagation)}\\
dZ &= dZ' \cdot W^T
\end{align*}

[9]:
dz2 = np.random.random((n0,n1))
dz2.shape
[9]:
(3, 5)
[10]:
dw2 = dz2 @ a2.T
dw2.shape
[10]:
(3, 3)
[11]:
db2 = dz2.sum(axis=0,keepdims=True)
db2.shape
[11]:
(1, 5)
[12]:
dz1 = dz2 @ w2
dz1.shape
[12]:
(3, 3)
[13]:
dw1 = dz1 @ a1.T
dw1.shape
[13]:
(3, 3)
[14]:
db1 = dz1.sum(axis=0,keepdims=True)
db1.shape
[14]:
(1, 3)
[15]:
dz1 @ w1
[15]:
array([[1.00430696, 1.12665459, 1.27528356, 0.37028909, 1.83008842,
        0.86290497, 1.23745471, 1.23044548, 0.83923269, 1.65279249],
       [0.77372465, 0.99710549, 1.00752794, 0.2431575 , 1.27532378,
        0.7486004 , 1.11498651, 0.84140139, 0.61524338, 1.5975826 ],
       [1.57524514, 1.82300126, 2.04126706, 0.56541619, 2.87108659,
        1.40560442, 2.03900305, 1.88715055, 1.31532754, 2.74793296]])

Model#

[16]:
class LayerDense:
    """Dense layer module.

    It is recommended that the input data X is scaled (normalized)
    so that the values share a similar range while the meaning of the data stays the same.

    Args:
        n_inputs (int) : number of inputs (features entering the layer)
        n_neurons (int) : number of neurons in the layer
    """
    def __init__(self, n_inputs, n_neurons):
        self.w = 0.10 * np.random.randn(n_inputs, n_neurons)  # multiply by 0.1 to keep the random weights small
        self.b = np.zeros((1, n_neurons))                      # bias starts at zero; randomness comes from the weights

    def forward(self, a):
        """forward propagation calculation: z = a w + b
        """
        self.a = a  # cache the input for the backward pass
        self.z = np.dot(self.a, self.w) + self.b

    def backward(self, dz):
        """backward pass, given dz = dL/dz of this layer
        """
        # gradients on the parameters (same shapes as w and b)
        self.dw = self.a.T @ dz
        self.db = dz.sum(axis=0, keepdims=True)

        # gradient on the values / input to the next layer in backpropagation (same shape as a)
        self.dz = dz @ self.w.T
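A minimal usage sketch of the class above; the input, layer sizes, and the upstream gradient are made-up stand-ins (in a real network the upstream gradient would come from the loss via the next layer):

[ ]:
X = np.random.randn(10, 3)          # 10 samples, 3 features
layer = LayerDense(n_inputs=3, n_neurons=5)

layer.forward(X)                    # layer.z has shape (10, 5)
upstream = np.random.randn(10, 5)   # stand-in for dL/dz coming from the next layer
layer.backward(upstream)
print(layer.dw.shape, layer.db.shape, layer.dz.shape)   # (3, 5) (1, 5) (10, 3)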