2.8 Training Debugging and Optimization for Handwritten Digit Recognition
In Section 2.7, we investigated resource deployment optimization, improving training efficiency with a single GPU and with distributed deployment. In this section, we continue along the “horizontal and vertical” approach shown in Figure 1 and explore debugging and optimization methods for the model training stage of the handwritten digit recognition task, so that the reported results genuinely reflect the model's performance.
Figure 1: “Horizontal and vertical” teaching method - training process
The optimization of the training process has the following five key aspects:
(1) Calculate the classification accuracy and observe the training effect of the model.
The cross-entropy loss serves only as an optimization objective; it does not directly and accurately measure how well the model classifies. Classification accuracy does measure the training effect directly, but because it is discrete it cannot be used as a loss function for optimizing a neural network.
(2) Check the model training process to identify potential problems
If the model's loss or evaluation metrics behave abnormally, it is usually necessary to print the inputs and outputs of each layer of the model to locate the problem, and analyze the contents of each layer to obtain the cause of the error.
(3) Add a validation or test set to better evaluate the model results
The ideal training result is high accuracy on both the training set and the validation set. If accuracy on the training set is lower than on the validation set, the network has not been trained sufficiently; if accuracy on the training set is noticeably higher than on the validation set, overfitting has likely occurred. Overfitting is addressed by adding a regularization term to the optimization objective.
(4) Add regularization terms to avoid model overfitting
The PaddlePaddle framework supports adding a regularization term over all parameters, which is the common practice. In addition, it also supports adding a regularization term to an individual layer or part of the network, so that parameter training can be fine-tuned (a short sketch is given after this list).
(5) Visualization and Analysis
Besides printing values or plotting curves with the matplotlib library, users can also use VisualDL, a more specialized visualization and analysis tool provided by PaddlePaddle (see the sketch after this list).
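As a companion to point (4), the sketch below shows two common ways to attach L2 regularization in Paddle: globally through the optimizer's weight_decay argument, and per-layer through a ParamAttr. It assumes a model instance named model already exists (such as the MNIST network defined later in this section), and the coefficient 1e-4 is an illustrative assumption, not a setting prescribed by this tutorial.

import paddle
from paddle.nn import Linear

# Global L2 regularization: applied to all trainable parameters via the optimizer
opt = paddle.optimizer.Adam(
    learning_rate=0.01,
    weight_decay=paddle.regularizer.L2Decay(coeff=1e-4),   # illustrative coefficient
    parameters=model.parameters())

# Per-layer L2 regularization: only the weights of this Linear layer are regularized
fc = Linear(
    in_features=980, out_features=10,
    weight_attr=paddle.ParamAttr(
        regularizer=paddle.regularizer.L2Decay(coeff=1e-4)))

For point (5), the usual VisualDL workflow is to write scalar values during training and then inspect the curves in the browser. A minimal sketch, assuming VisualDL is installed (pip install visualdl) and using dummy values in place of a real training loop:

from visualdl import LogWriter

# Record training loss and accuracy; view the curves with: visualdl --logdir ./log
with LogWriter(logdir="./log/mnist_train") as writer:
    for step in range(100):                              # placeholder for the real training loop
        loss, acc = 1.0 / (step + 1), step / 100.0       # dummy values for illustration
        writer.add_scalar(tag="train/loss", step=step, value=loss)
        writer.add_scalar(tag="train/acc", step=step, value=acc)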
2.8.1 Calculate the classification accuracy of the model
Accuracy is an intuitive measure of the effectiveness of a classification model; since the metric is discrete, it is not suitable for use as a loss in optimization. Usually, the smaller the cross-entropy loss, the higher the classification accuracy of the model. Based on classification accuracy, we can fairly compare models trained with different loss functions, such as the comparison between mean square error and cross entropy in the loss-function section of [Handwritten Digit Recognition].
Using the classification accuracy API provided by paddle, we can directly calculate the accuracy.
class paddle.metric.Accuracy
The input parameter input of this API is the predicted classification result, and the input parameter label is the true label of the data. paddle also provides more metrics for measuring model effectiveness; see the APIs under the paddle.metric package for details.
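To make the behaviour of this API concrete, here is a minimal, self-contained sketch (the toy scores and labels are made up for illustration): paddle.metric.accuracy compares the top-1 prediction of each row against the label and returns the fraction that match.

import paddle

# Toy batch: 4 samples, 3 classes; each row holds the predicted class scores
preds = paddle.to_tensor([[0.1, 0.7, 0.2],
                          [0.8, 0.1, 0.1],
                          [0.3, 0.3, 0.4],
                          [0.2, 0.5, 0.3]])
labels = paddle.to_tensor([[1], [0], [2], [0]])            # shape [4, 1], int64 labels

acc = paddle.metric.accuracy(input=preds, label=labels)    # top-1 accuracy
print(acc.numpy())                                         # 3 of the 4 rows are correct -> [0.75]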
In the following code, we calculate the classification accuracy in the forward function of the model's forward computation process, and print the classification accuracy for each batch of samples during training.
In [ ]
import paddle
from data_process import get_MNIST_dataloader
train_loader, test_loader = get_MNIST_dataloader()
In [ ]
# Define the model structure
import paddle.nn.functional as F
from paddle.nn import Conv2D, MaxPool2D, Linear

# Multilayer convolutional neural network implementation
class MNIST(paddle.nn.Layer):
    def __init__(self):
        super(MNIST, self).__init__()
        # Define the convolutional layer: out_channels=20, kernel_size=5, stride=1, padding=2
        self.conv1 = Conv2D(in_channels=1, out_channels=20, kernel_size=5, stride=1, padding=2)
        # Define the pooling layer: kernel_size=2, stride=2
        self.max_pool1 = MaxPool2D(kernel_size=2, stride=2)
        # Define the second convolutional layer: out_channels=20, kernel_size=5, stride=1, padding=2
        self.conv2 = Conv2D(in_channels=20, out_channels=20, kernel_size=5, stride=1, padding=2)
        # Define the second pooling layer: kernel_size=2, stride=2
        self.max_pool2 = MaxPool2D(kernel_size=2, stride=2)
        # Define a fully connected layer with an output dimension of 10
        self.fc = Linear(in_features=980, out_features=10)

    # Define the forward computation: each convolution is followed by pooling,
    # and the fully connected layer computes the final output.
    # The convolutional layers use ReLU; the softmax for the output layer is applied inside F.cross_entropy during training.
    def forward(self, inputs, label=None):
        x = self.conv1(inputs)
        x = F.relu(x)
        x = self.max_pool1(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = self.max_pool2(x)
        x = paddle.reshape(x, [x.shape[0], 980])
        x = self.fc(x)
        # If labels are provided, also compute and return the classification accuracy
        if label is not None:
            acc = paddle.metric.accuracy(input=x, label=label)
            return x, acc
        else:
            return x
# When using a GPU machine, you can set the use_gpu variable to True
use_gpu = True
paddle.set_device('gpu:0') if use_gpu else paddle.set_device('cpu')

# Only the settings of the optimization algorithm differ from the previous section
def train(model):
    model = MNIST()
    model.train()

    # Four optimization algorithms are available; you can try them one by one
    # opt = paddle.optimizer.SGD(learning_rate=0.01, parameters=model.parameters())
    # opt = paddle.optimizer.Momentum(learning_rate=0.01, momentum=0.9, parameters=model.parameters())
    # opt = paddle.optimizer.Adagrad(learning_rate=0.01, parameters=model.parameters())
    opt = paddle.optimizer.Adam(learning_rate=0.01, parameters=model.parameters())

    EPOCH_NUM = 5
    for epoch_id in range(EPOCH_NUM):
        for batch_id, data in enumerate(train_loader()):
            # Prepare the data
            images, labels = data
            images = paddle.to_tensor(images)
            labels = paddle.to_tensor(labels)

            # Forward computation: returns predictions and classification accuracy
            predicts, acc = model(images, labels)

            # Calculate the loss, averaged over the batch of samples
            loss = F.cross_entropy(predicts, labels)
            avg_loss = paddle.mean(loss)

            # Print the current loss and accuracy every 200 batches
            if batch_id % 200 == 0:
                print("epoch: {}, batch: {}, loss is: {}, acc is {}".format(
                    epoch_id, batch_id, avg_loss.numpy(), acc.numpy()))

            # Backward propagation, update parameters, clear gradients
            avg_loss.backward()
            opt.step()
            opt.clear_grad()

    # Save the model parameters
    paddle.save(model.state_dict(), 'mnist.pdparams')

# Create the model
model = MNIST()
# Start the training process
train(model)
W0905 14:35:13.571403 98 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0905 14:35:13.574612 98 device_context.cc:465] device: 0, cuDNN Version: 8.2.
epoch: 0, batch: 0, loss is: [3.8877463], acc is [0.09375]
epoch: 0, batch: 200, loss is: [0.205946], acc is [0.921875]
epoch: 0, batch: 400, loss is: [0.01945284], acc is [1.]
epoch: 0, batch: 600, loss is: [0.0915114], acc is [0.96875]
epoch: 0, batch: 800, loss is: [0.07086004], acc is [0.984375]
epoch: 4, batch: 0, loss is: [0.10021097], acc is [0.96875]
epoch: 4, batch: 200, loss is: [0.04378257], acc is [0.984375]
epoch: 4, batch: 400, loss is: [0.02809021], acc is [0.984375]
epoch: 4, batch: 600, loss is: [0.04106805], acc is [0.984375]
epoch: 4, batch: 800, loss is: [0.01189358], acc is [1.]
2.8.2 Examining the Model Training Process to Identify Potential Training Problems
With Paddle's dynamic graph programming, the execution of training can be viewed and debugged easily. In the forward function of the network definition, the dimensions of each layer's inputs and outputs, as well as each layer's parameters, can be printed. Examining this information not only gives a better understanding of how training executes, but can also reveal potential problems or inspire ideas for further optimization.
In the following program, the check_shape variable controls whether the shapes are printed, to verify that the network structure is correct, and the check_content variable controls whether the content values are printed, to verify that the data distribution is reasonable. If an intermediate layer's output is consistently 0 during training, that part of the network structure is poorly designed and not being fully utilized.
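Beyond eyeballing the printed values, a small helper can quantify how “dead” an intermediate activation is. The function below is a hypothetical utility, not part of the tutorial code: it returns the fraction of exactly-zero entries in a feature map (for example, the output of F.relu), so a value close to 1.0 suggests that the layer is barely being used.

import paddle

def zero_ratio(feature_map):
    # feature_map: a paddle.Tensor, e.g. an intermediate ReLU output
    mask = paddle.cast(feature_map == 0., 'float32')   # 1 where the activation is exactly zero
    return float(paddle.mean(mask).numpy())            # fraction of zero activations

# Example usage inside forward(): print("conv1 zero ratio:", zero_ratio(outputs2))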
In [ ]
import numpy as np
import paddle.nn.functional as F

# Define the model structure
class MNIST(paddle.nn.Layer):
    def __init__(self):
        super(MNIST, self).__init__()
        # Define the convolutional layer: out_channels=20, kernel_size=5, stride=1, padding=2
        self.conv1 = Conv2D(in_channels=1, out_channels=20, kernel_size=5, stride=1, padding=2)
        # Define the pooling layer: kernel_size=2, stride=2
        self.max_pool1 = MaxPool2D(kernel_size=2, stride=2)
        # Define the second convolutional layer: out_channels=20, kernel_size=5, stride=1, padding=2
        self.conv2 = Conv2D(in_channels=20, out_channels=20, kernel_size=5, stride=1, padding=2)
        # Define the second pooling layer: kernel_size=2, stride=2
        self.max_pool2 = MaxPool2D(kernel_size=2, stride=2)
        # Define a fully connected layer with an output dimension of 10
        self.fc = Linear(in_features=980, out_features=10)

    # Add printing of the dimensions and data content of each layer's inputs and outputs,
    # controlled by the check_shape and check_content parameters.
    # The convolutional layers use ReLU; softmax is applied to the output via F.softmax.
    def forward(self, inputs, label=None, check_shape=False, check_content=False):
        # Give different names to the outputs of different layers for debugging purposes
        outputs1 = self.conv1(inputs)
        outputs2 = F.relu(outputs1)
        outputs3 = self.max_pool1(outputs2)
        outputs4 = self.conv2(outputs3)
        outputs5 = F.relu(outputs4)
        outputs6 = self.max_pool2(outputs5)
        outputs6 = paddle.reshape(outputs6, [outputs6.shape[0], -1])
        outputs7 = self.fc(outputs6)

        # Print the parameter sizes and output sizes of each layer to verify the network structure
        if check_shape:
            # Print the hyperparameters of each layer: kernel size, stride, padding, etc.
            print("\n########## print network layer's superparams ##############")
            print("conv1-- kernel_size:{}, padding:{}, stride:{}".format(self.conv1.weight.shape, self.conv1._padding, self.conv1._stride))
            print("conv2-- kernel_size:{}, padding:{}, stride:{}".format(self.conv2.weight.shape, self.conv2._padding, self.conv2._stride))
            print("fc-- weight_size:{}, bias_size_{}".format(self.fc.weight.shape, self.fc.bias.shape))

            # Print the output shape of each layer
            print("\n########## print shape of features of every layer ###############")
            print("inputs_shape: {}".format(inputs.shape))
            print("outputs1_shape: {}".format(outputs1.shape))
            print("outputs2_shape: {}".format(outputs2.shape))
            print("outputs3_shape: {}".format(outputs3.shape))
            print("outputs4_shape: {}".format(outputs4.shape))
            print("outputs5_shape: {}".format(outputs5.shape))
            print("outputs6_shape: {}".format(outputs6.shape))
            print("outputs7_shape: {}".format(outputs7.shape))

        # Print the parameters and intermediate outputs of the training process for debugging
        if check_content:
            # Print the convolution kernel weights; only part of them is printed here
            print("\n########## print convolution layer's kernel ###############")
            print("conv1 params -- kernel weights:", self.conv1.weight[0][0])
            print("conv2 params -- kernel weights:", self.conv2.weight[0][0])
            # Pick a random channel and print its output, only for the first image in the batch
            idx1 = np.random.randint(0, outputs1.shape[1])
            idx2 = np.random.randint(0, outputs4.shape[1])
            print("\nThe {}th channel of conv1 layer: ".format(idx1), outputs1[0][idx1])
            print("The {}th channel of conv2 layer: ".format(idx2), outputs4[0][idx2])
            print("The output of last layer:", outputs7[0], '\n')

        # Calculate the classification accuracy and return it together with the predictions
        if label is not None:
            acc = paddle.metric.accuracy(input=F.softmax(outputs7), label=label)
            return outputs7, acc
        else:
            return outputs7
# When using a GPU machine, you can set the use_gpu variable to True
use_gpu = True
paddle.set_device('gpu:0') if use_gpu else paddle.set_device('cpu')

def train(model):
    model = MNIST()
    model.train()

    # Four optimization algorithms are available; you can try them one by one
    opt = paddle.optimizer.SGD(learning_rate=0.01, parameters=model.parameters())
    # opt = paddle.optimizer.Momentum(learning_rate=0.01, momentum=0.9, parameters=model.parameters())
    # opt = paddle.optimizer.Adagrad(learning_rate=0.01, parameters=model.parameters())
    # opt = paddle.optimizer.Adam(learning_rate=0.01, parameters=model.parameters())

    EPOCH_NUM = 1
    for epoch_id in range(EPOCH_NUM):
        for batch_id, data in enumerate(train_loader()):
            # Prepare the data
            images, labels = data
            images = paddle.to_tensor(images)
            labels = paddle.to_tensor(labels)

            # Forward computation: obtain the model outputs and classification accuracy
            if batch_id == 0 and epoch_id == 0:
                # Print the model parameters and the dimensions of each layer's outputs
                predicts, acc = model(images, labels, check_shape=True, check_content=False)
            elif batch_id == 401:
                # Print the model parameters and the values output by each layer
                predicts, acc = model(images, labels, check_shape=False, check_content=True)
            else:
                predicts, acc = model(images, labels)

            # Calculate the loss, averaged over the batch of samples
            loss = F.cross_entropy(predicts, labels)
            avg_loss = paddle.mean(loss)

            # Print the current loss and accuracy every 200 batches
            if batch_id % 200 == 0:
                print("epoch: {}, batch: {}, loss is: {}, acc is {}".format(
                    epoch_id, batch_id, avg_loss.numpy(), acc.numpy()))

            # Backward propagation to update the parameters
            avg_loss.backward()
            opt.step()
            opt.clear_grad()

    # Save the model parameters
    paddle.save(model.state_dict(), 'mnist_test.pdparams')

# Create the model
model = MNIST()
# Start the training process
train(model)
print("Model has been saved.")
########## print network layer's superparams ##############
conv1-- kernel_size:[20, 1, 5, 5], padding: 2, stride:[1, 1]
conv2-- kernel_size:[20, 20, 5, 5], padding:2, stride:[1, 1]
fc-- weight_size:[980, 10], bias_size_[10]
########## print shape of features of every layer ###############
inputs_shape: [64, 1, 28, 28]
outputs1_shape: [64, 20, 28, 28]
outputs2_shape: [64, 20, 28, 28]
outputs3_shape: [64, 20, 14, 14]
outputs4_shape: [64, 20, 14, 14]
outputs5_shape: [64, 20, 14, 14]
outputs6_shape: [64, 980]
outputs7_shape: [64, 10]
epoch: 0, batch: 0, loss is: [3.1953034], acc is [0.109375]
epoch: 0, batch: 200, loss is: [0.37904966], acc is [0.875]
epoch: 0, batch: 400, loss is: [0.16234186], acc is [0.984375]
########## print convolution layer's kernel ###############
conv1 params -- kernel weights: Tensor(shape=[5, 5], dtype=float32, place=CUDAPlace(0), stop_gradient=False,
       [[-0.06305157, -0.24847564, -0.63125026,  0.16255459,  0.05321766],
        [ 0.27163783,  0.03359267, -0.16793424, -0.33206284,  0.24191348],
        [-0.01796198,  0.59920996, -0.19074094,  0.30350232, -0.13594414],
        [-0.17397462, -0.28396046,  0.37263221, -0.46014515,  0.00446801],
        [-0.34503123,  0.21267484,  0.12081132,  0.25961810, -0.29448596]])
conv2 params -- kernel weights: Tensor(shape=[5, 5], dtype=float32, place=CUDAPlace(0), stop_gradient=False,
       [[ 0.11260822,  0.07491580, -0.00697492,  0.03264738, -0.02312732],
        [-0.13066202, -0.02998639,  0.01296079,  0.07414430, -0.07227730],
        [ 0.00029436,  0.01291541,  0.06429236,  0.09252947, -0.00228186],
        [-0.04436536, -0.02769227, -0.03541787, -0.04839976,  0.00429312],
        [ 0.11987270, -0.00784257, -0.12884265,  0.03420701,  0.04363669]])
2.8.3 Add Validation or Testing to Better Evaluate the Model Effect
During training we will see the model's loss on the training set decreasing. But does this mean the model will still perform well in future application scenarios? To verify the effectiveness of the model, the sample set is usually divided into three parts: the training set, the validation set, and the test set.
- Training set: used to train the parameters of the model, i.e., the main work done in the training process.
- Validation set: used for the selection of model hyperparameters, such as the adjustment of the network structure, the selection of the weight of the regularization term.
- Test set: used to simulate the real effect of the model after deployment. Because the test set is not involved in any model optimization or parameter training, it is a completely unknown sample set for the model. As long as the validation data is not used to optimize the network structure or model hyperparameters, the validation data and the test data play a similar role, and both reflect the model's real effect.
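The tutorial's get_MNIST_dataloader already provides separate train and test loaders, but as an illustration of the three-way split described above, here is a minimal sketch using paddle.io.random_split and paddle.vision.datasets.MNIST; the 54,000/6,000 split and the *_demo variable names are arbitrary assumptions, kept separate so the tutorial's own loaders are untouched.

import paddle
from paddle.vision.transforms import ToTensor
from paddle.vision.datasets import MNIST as MNISTDataset   # aliased to avoid clashing with the MNIST model class above

full_train = MNISTDataset(mode='train', transform=ToTensor())                 # 60,000 training images
train_subset, val_subset = paddle.io.random_split(full_train, [54000, 6000])  # hold out a validation split
test_set = MNISTDataset(mode='test', transform=ToTensor())                    # 10,000 images, never used for tuning

# A loader over the held-out validation split
val_loader_demo = paddle.io.DataLoader(val_subset, batch_size=64)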
The following program reads the model parameters saved in the previous training step, loads the validation data set, and tests the model's effect on it.
In [ ]
def evaluation(model):
    print('start evaluation .......')
    # Define the prediction process
    params_file_path = 'mnist.pdparams'
    # Load the model parameters
    param_dict = paddle.load(params_file_path)
    model.load_dict(param_dict)

    model.eval()
    eval_loader = test_loader

    acc_set = []
    avg_loss_set = []
    for batch_id, data in enumerate(eval_loader()):
        images, labels = data
        images = paddle.to_tensor(images)
        labels = paddle.to_tensor(labels)
        predicts, acc = model(images, labels)
        loss = F.cross_entropy(input=predicts, label=labels)
        avg_loss = paddle.mean(loss)
        acc_set.append(float(acc.numpy()))
        avg_loss_set.append(float(avg_loss.numpy()))

    # Calculate the average loss and accuracy over all batches
    acc_val_mean = np.array(acc_set).mean()
    avg_loss_val_mean = np.array(avg_loss_set).mean()
    print('loss={}, acc={}'.format(avg_loss_val_mean, acc_val_mean))

model = MNIST()
evaluation(model)
start evaluation .......
loss=0.06925583740963806, acc=0.9796974522292994
From the test results, the model still achieves about 98% accuracy on the validation set, demonstrating that it has predictive power.