Batch Normalization (BatchNorm)
The Batch Normalization (BatchNorm) method was proposed by Ioffe and Szegedy in 2015 and has been widely used in deep learning. Its purpose is to standardize the outputs of the intermediate layers of a neural network so that they become more stable.
Usually we standardize the input data of a neural network so that the processed samples follow a statistical distribution with mean 0 and variance 1. This is because a relatively fixed input distribution is conducive to the stability and convergence of the algorithm. For deep neural networks, however, the parameters are constantly being updated, so even if the input data has been standardized, the layers further back still receive inputs whose distributions change drastically, which usually leads to unstable values and makes it difficult for the model to converge. BatchNorm makes the outputs of the intermediate layers of the neural network more stable, and has the following three advantages:
Enables learning to occur quickly (allows the use of larger learning rates)
Reduces the sensitivity of the model to initial values
Suppresses overfitting to some extent
The main idea of BatchNorm is to normalize the values of the neurons over a mini-batch during training so that their distribution has mean 0 and variance 1. The specific calculation process (somewhat like computing a standard normal distribution) consists of the following steps, with the corresponding formulas given after the list:
1. Compute the mean of the samples within the mini-batch.
2. Compute the variance of the samples within the mini-batch.
3. Compute the standardized output.
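For a mini-batch of m samples x^(1), ..., x^(m), these three steps correspond to the standard BatchNorm formulas from Ioffe and Szegedy (2015), where $\epsilon$ is a small constant added for numerical stability:

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)} - \mu_B\right)^2, \qquad \hat{x}^{(i)} = \frac{x^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$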
It is left to the reader to verify that the normalized outputs x̂^(1), x̂^(2), ..., x̂^(m) of a mini-batch follow a distribution with mean 0 and variance 1.
Forcibly restricting the outputs of the intermediate layers to a standard normal distribution may cause certain feature patterns to be lost, so immediately after normalization BatchNorm applies a scale and shift to the data, as written out below.
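Written out, the scale-and-shift step is

$$y^{(i)} = \gamma\,\hat{x}^{(i)} + \beta$$

where the scale $\gamma$ and the shift $\beta$ are learnable parameters, initialized to 1 and 0 respectively (see the tip further below).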
Listed above is the computational logic of the BatchNorm method; below are examples for two input data formats. Flying Paddle supports input data with 2, 3, 4 or 5 dimensions; here we give examples with 2 and 4 dimensions.
Example 1: When the input data shape is [N, K], it generally corresponds to the output of a fully connected layer; the sample code is shown below.
In this case, the mean and variance are computed over the N samples for each of the K components separately. The data and parameters correspond as follows:
Input x: [N, K]
Output y: [N, K]
Mean μ_B: [K, ]
Variance σ_B²: [K, ]
Scale parameter γ: [K, ]
Shift parameter β: [K, ]
# Example when the input data shape is [N, K].
import numpy as np
import paddle
from paddle.nn import BatchNorm1D
# Create the data
data = np.array([[1,2,3], [4,5,6], [7,8,9]]).astype('float32')
# Calculate the normalized output using BatchNorm1D
# Input data dimension [N, K] with num_features equal to K
bn = BatchNorm1D(num_features=3)
x = paddle.to_tensor(data)
y = bn(x)
print('output of BatchNorm1D Layer: \n {}'.format(y.numpy()))
# Calculate mean, variance and normalized output using Numpy
# Validate the 0th feature here
a = np.array([1,4,7])
a_mean = a.mean()
a_std = a.std()
b = (a - a_mean) / a_std
print('mean {}, std {}, \n output {}'.format(a_mean, a_std, b))
# The reader is encouraged to verify the 1st and 2nd features and check whether the numpy results match the paddle results
Example 2: When the input data shape is [N, C, H, W], it generally corresponds to the output of a convolutional layer; the sample code is shown below.
In this case, the data is split along the channel dimension C: for each channel, the mean and variance are computed over the N × H × W pixels of the N samples. The data and parameters correspond as follows:
Input x: [N, C, H, W]
Output y: [N, C, H, W]
Mean μ_B: [C, ]
Variance σ_B²: [C, ]
Scale parameter γ: [C, ]
Shift parameter β: [C, ]
Tip:
Some readers may ask: "Doesn't BatchNorm also apply an affine transformation to the result after normalization? How can the Numpy computation match the output of the BatchNorm operator?" This is because the BatchNorm operator automatically initializes γ = 1 and β = 0, so at this point the affine transformation is an identity transformation. During training these two parameters are continuously learned, after which the affine transformation takes effect.
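A quick way to confirm this in code is the small sketch below; it assumes, as in Paddle's BatchNorm layers, that the scale and shift are exposed as the weight and bias parameters of the layer.

import paddle
from paddle.nn import BatchNorm1D
# A freshly created BatchNorm layer: gamma (weight) starts as ones and beta (bias) as zeros
bn = BatchNorm1D(num_features=3)
print('gamma (weight):', bn.weight.numpy())
print('beta (bias):', bn.bias.numpy())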
# Example of batchnorm when input data shape is [N, C, H, W].
import numpy as np
import paddle
from paddle.nn import BatchNorm2D
# Set the random number seed so that the results are consistent from run to run.
np.random.seed(100)
# Create the data
data = np.random.rand(2,3,3,3).astype('float32')
# Calculate the normalized output using BatchNorm2D
# Input data dimensions [N, C, H, W] with num_features equal to C
bn = BatchNorm2D(num_features=3)
x = paddle.to_tensor(data)
y = bn(x)
print('input of BatchNorm2D Layer: \n {}'.format(x.numpy()))
print('output of BatchNorm2D Layer: \n {}'.format(y.numpy()))
# Fetch the data for channel 0 in data.
# Calculate mean, variance and normalized output using numpy
a = data[:, 0, :, :]
a_mean = a.mean()
a_std = a.std()
b = (a - a_mean) / a_std
print('channel 0 of input data: \n {}'.format(a))
print('mean {}, std {}, \n output: \n {}'.format(a_mean, a_std, b))
# Hint: the output computed via numpy differs slightly from the result of the
# BatchNorm2D operator, because to ensure numerical stability the operator adds
# a small floating point number epsilon=1e-05 to the denominator.
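To see the effect of epsilon, the numpy calculation can be repeated with the same constant in the denominator; this short sketch reuses the variables a and a_mean defined above.

# Repeating the numpy calculation with epsilon=1e-05 in the denominator
# (matching the operator's default) should closely reproduce the operator output
b_eps = (a - a_mean) / np.sqrt(a.var() + 1e-5)
print('output with epsilon: \n {}'.format(b_eps))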
Using BatchNorm for Prediction
The above describes how BatchNorm normalizes a batch of samples during training. If the same method were used to normalize the samples to be predicted, the prediction results would become uncertain.
For example, computing the mean and variance with samples A and B as one batch, and computing them with samples A, C and D as another batch, generally gives different results. The prediction for sample A would then become uncertain, which is unreasonable for the prediction process.
The solution is to save the mean and variance accumulated over a large number of samples during training, and to use these saved values directly at prediction time without recalculating them.
In fact, in the concrete implementation of BatchNorm, a moving average of the mean and variance is maintained during training. In Flying Paddle, it is computed by default in the following way:
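A sketch of the update, assuming the default momentum of 0.9 (configurable through the layer's momentum argument), where $\mu_{saved}$ and $\sigma^2_{saved}$ denote the saved statistics:

$$\mu_{saved} \leftarrow \mu_{saved} \times 0.9 + \mu_B \times 0.1, \qquad \sigma^2_{saved} \leftarrow \sigma^2_{saved} \times 0.9 + \sigma_B^2 \times 0.1$$

At prediction time, switching the layer to evaluation mode (for example by calling bn.eval()) makes it normalize with these saved statistics instead of the statistics of the current batch.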
The variants of BatchNorm include Layer Normalization (LN), Group Normalization (GN) and Instance Normalization (IN), which are compared in the figure below.
Here N denotes the batch size, H and W denote the height and width of the feature map, C denotes the number of channels of the feature map, and the blue pixels indicate the elements that are normalized with the same mean and variance:
Figure 14: Normalization method
LN: computes the mean and variance over the [C, H, W] dimensions of each sample, i.e. normalizes along the channel direction. It is independent of the batch size and may work better when the batch size is small.
GN: first groups the channels, then normalizes over the [C_i, H, W] dimensions within each group. It is also independent of the batch size.
IN: normalizes only over the [H, W] dimensions. Image stylization tasks are well suited to the IN algorithm.
Figure 14 is from Yuxin Wu and Kaiming He, Group Normalization.
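For reference, these variants are also available as layers in Paddle. The following is a minimal sketch of how they can be applied to an [N, C, H, W] tensor; the layer names and arguments are assumed to follow the paddle.nn API.

import paddle

x = paddle.rand([2, 6, 4, 4])                            # [N, C, H, W]
ln = paddle.nn.LayerNorm(normalized_shape=[6, 4, 4])     # LN: normalize over [C, H, W]
gn = paddle.nn.GroupNorm(num_groups=3, num_channels=6)   # GN: 3 groups of 2 channels each
inorm = paddle.nn.InstanceNorm2D(num_features=6)         # IN: normalize over [H, W] per channel
print(ln(x).shape, gn(x).shape, inorm(x).shape)          # all outputs keep the shape [2, 6, 4, 4]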
Dropout
Dropout is a commonly used method for suppressing overfitting in deep learning. It works by randomly dropping a portion of the neurons during training: a subset of neurons is randomly selected and their outputs are set to 0, so that these neurons no longer transmit signals to the rest of the network.
Figure 15 is a schematic diagram of Dropout, with the complete neural network on the left and the network structure after applying Dropout on the right. After applying Dropout, neurons labeled with x are removed from the network so that they do not transmit signals to later layers. During the learning process, which neurons are dropped is decided randomly, so the model does not overly rely on certain neurons, and overfitting can be suppressed to some extent.
Figure 15 Schematic of Dropout
At prediction time, the signals of all neurons are passed forward, which leads to a new problem: because some neurons are randomly dropped during training, the overall magnitude of the output data becomes smaller during training. For example, its L1 norm becomes smaller than when Dropout is not used; but since no neurons are discarded at prediction time, the data distribution differs between training and prediction. To solve this problem, Flying Paddle supports the following two methods:
downscale_in_infer
During training, randomly drop a fraction r of the neurons and do not pass their signals on to subsequent layers; at prediction time, pass the signals of all neurons on, but multiply the value of each neuron by (1 - r).
upscale_in_train
During training, randomly drop a fraction p of the neurons and do not pass their signals on, but divide the values of the retained neurons by (1 - p); at prediction time, pass the signals of all neurons on without any further processing.
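As a small numerical sketch of the difference, assume a retained neuron with value 2.0 and a drop ratio of 0.5 (both numbers are arbitrary and only for illustration):

# downscale_in_infer: training keeps the raw value, prediction scales every neuron by (1 - r)
r = 0.5
value = 2.0
train_output = value               # retained neuron during training -> 2.0
infer_output = value * (1 - r)     # every neuron at prediction time -> 1.0

# upscale_in_train: training scales retained neurons by 1 / (1 - p), prediction leaves values untouched
p = 0.5
train_output_up = value / (1 - p)  # retained neuron during training -> 4.0
infer_output_up = value            # every neuron at prediction time -> 2.0

In both modes the expected output of a neuron is the same during training and prediction; the two modes only differ in when the rescaling is applied.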
In the Paddle Dropout API, the mode parameter specifies which of the two approaches is used:
paddle.nn.Dropout(p=0.5, axis=None, mode="upscale_in_train", name=None)
The main parameters are as follows:
p (float): the probability of setting an input element to 0, i.e. the dropout probability. Default: 0.5. This probability applies to each element independently, not to the elements as a whole; for example, for a matrix containing 12 numbers, dropout with probability 0.5 will not necessarily produce exactly 6 zeros.
mode (str): the implementation of the dropout method, either 'downscale_in_infer' or 'upscale_in_train'. Default: 'upscale_in_train'.
Description:
The default behavior of Dropout may differ between frameworks; readers can check the corresponding API documentation for details.
The following program shows the form of output data after Dropout.
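A minimal program along these lines applies paddle.nn.Dropout to a small random tensor in both training and evaluation mode (the seed and tensor shape below are chosen arbitrarily):

# Example of Dropout on a small input tensor
import numpy as np
import paddle

np.random.seed(100)
data = np.random.rand(2, 3, 3).astype('float32')
x = paddle.to_tensor(data)
# p=0.5: each element is set to 0 with probability 0.5
drop = paddle.nn.Dropout(p=0.5, mode='upscale_in_train')

drop.train()     # training mode: elements are randomly zeroed
y_train = drop(x)
drop.eval()      # evaluation mode: the input passes through unchanged
y_eval = drop(x)

print('input: \n {}'.format(x.numpy()))
print('output in train mode: \n {}'.format(y_train.numpy()))
print('output in eval mode: \n {}'.format(y_eval.numpy()))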
From the output of the above code, we can see that after Dropout some elements of the tensor become 0. This is the effect of Dropout: by randomly setting elements of the input data to 0, it reduces the co-adaptation between neuron nodes and enhances the generalization ability of the model.
There are many variants of Dropout, such as DropConnect, Standout, Gaussian Dropout, Spatial Dropout, Cutout, Max-Drop, RNNDrop, Cyclic Drop, etc. Several of these are briefly described below:
DropConnect was proposed by L. Wan et al. Instead of applying dropout directly to the neuron outputs, it is applied to the weights and biases connecting the neurons, and it can only be used on fully connected layers.
Standout was proposed by L. J. Ba and B. Frey. The probability p of dropping a neuron in layer i is not constant; it adapts to the values of the weights: the larger the weight, the higher the probability of being dropped.
Spatial Dropout was proposed by J. Tompson et al. Because neighboring pixels are highly correlated, instead of dropping individual pixels, dropout is applied to entire feature maps.
Cutout was proposed by T. DeVries and G. W. Taylor. It prevents overfitting by randomly masking out regions of the input image, improving the robustness (the ability of a system or device to maintain its functionality or performance under different environments or conditions) and the overall performance of the neural network.
Summary
After learning these concepts, you have the foundation needed to build a convolutional neural network. In the next section, we will apply these building blocks to a typical application in image classification: an eye disease screening task on medical images.
Assignment
1 Calculate the total number of multiplication and addition operations in the convolution
The input data shape is [10, 3, 224, 224], the convolution kernel size is k_h = k_w = 3, the number of output channels is 64, the stride is 1, and the padding is p_h = p_w = 1.
How many multiplication and addition operations are required in total to complete such a convolution?
Hint:
First work out how many multiplication and addition operations are needed to compute one output pixel, then calculate the total number of operations.
Submission method
Please reply with the number of multiplication and addition operations, e.g. multiplication 1000, addition 1000.
2 Calculate the shape of the output data and parameters of the network layer
The network structure is defined as shown in the code below, and the input data shape is [10,3,224,224].
Please calculate the output data shape of each layer and the shape of the parameters contained in each layer.
# Define the SimpleNet network structure
import paddle
from paddle.nn import Conv2D, MaxPool2D, Linear
import paddle.nn.functional as F

class SimpleNet(paddle.nn.Layer):
    def __init__(self, num_classes=1):
        super(SimpleNet, self).__init__()
        self.conv1 = Conv2D(in_channels=3, out_channels=6, kernel_size=5, stride=1, padding=2)
        self.max_pool1 = MaxPool2D(kernel_size=2, stride=2)
        self.conv2 = Conv2D(in_channels=6, out_channels=16, kernel_size=5, stride=1, padding=2)
        self.max_pool2 = MaxPool2D(kernel_size=2, stride=2)
        self.fc1 = Linear(in_features=50176, out_features=64)
        self.fc2 = Linear(in_features=64, out_features=num_classes)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.max_pool1(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = self.max_pool2(x)
        # Flatten the feature map before the fully connected layers
        x = paddle.reshape(x, [x.shape[0], -1])
        x = self.fc1(x)
        x = F.sigmoid(x)
        x = self.fc2(x)
        return x
Hint: the parameters of the first convolutional layer conv1 are given in the first row of the table below; the shape of its output feature map is [N, C_out, H_out, W_out] = [10, 6, 224, 224].
Please complete the table below:
| Name  | w shape      | Number of w parameters | b shape | Number of b parameters | Output shape      |
|-------|--------------|------------------------|---------|------------------------|-------------------|
| conv1 | [6, 3, 5, 5] | 450                    | [6]     | 6                      | [10, 6, 224, 224] |
| pool1 | None         | None                   | None    | None                   | [10, 6, 112, 112] |
| conv2 |              |                        |         |                        |                   |
| pool2 |              |                        |         |                        |                   |
| fc1   |              |                        |         |                        |                   |
| fc2   |              |                        |         |                        |                   |
To submit: post a screenshot of the table to the discussion board
Use the formulas from the "Padding" section: H_out = H + p_h1 + p_h2 - k_h + 1, W_out = W + p_w1 + p_w2 - k_w + 1.
Hints for the pooling and fully connected layers are not given here.
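After filling in the table by hand, an optional way to check the parameter shapes is a short sketch like the following; it assumes the SimpleNet class defined above has already been run.

# Optional check: print the shape of every parameter and the shape of the final output
model = SimpleNet()
for name, param in model.named_parameters():
    print(name, param.shape)
x = paddle.uniform([10, 3, 224, 224], dtype='float32')
print('output shape:', model(x).shape)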
In this section, we introduce a basic computer vision task development process using the ResNet introduced in the previous section as an example, covering the following topics:
Basic computer vision task development process: introduces the basic computer vision task development process.
Eye disease dataset: introduces the structure of the dataset and the data preprocessing methods.
ResNet network: how to use the eye disease dataset for model training and testing.