2.4 Network Structure for Handwritten Digit Recognition
In Section 2.2 we tried to solve the handwritten digit recognition problem with the same simple neural network used for house price prediction, but the results were not satisfactory. The reason is that the input of handwritten digit recognition is 28×28 pixel values and the output is a digit label from 0 to 9; a linear regression model cannot capture the complex information embedded in two-dimensional image data, as shown in Figure 1. In both the Newton's second law task and the house price prediction task, the relationship between the input features and the output predictions can be portrayed as a “straight line” (expressed as a linear equation). However, the relationship between the input pixels and the output label of the handwritten digit recognition task is clearly not linear; in fact, it is so complex that it is difficult to grasp intuitively.
Figure 1: Input and output of a digit recognition task are not linear.
Therefore, we need to try other, more complex and powerful networks to model the handwritten digit recognition task and observe the training effect, i.e., to expand the “horizontal and vertical” teaching method along the horizontal direction, as shown in Figure 2. This section introduces two common network structures: the classical multi-layer fully connected neural network and the convolutional neural network.
Figure 2: “Horizontal and vertical” teaching method - network structure optimization
Data Processing
The data processing has been introduced in the previous section, and the encapsulated functions can be called directly here:
In [6]
from data_process import get_MNIST_dataloader
train_loader, _ = get_MNIST_dataloader()
2.4.1 Classical Fully Connected Neural Networks
Neurons are the basic units that make up a neural network, and their basic structure is shown in Figure 3.
Figure 3: Neuron structure
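As a minimal sketch of what a single neuron computes (with arbitrary example values, not taken from the figure), the code below forms a weighted sum of its inputs plus a bias and then applies an activation function, here the Sigmoid introduced later in this section:

import numpy as np

# Minimal sketch of a single neuron (arbitrary values, for illustration only):
# a weighted sum of the inputs plus a bias, followed by an activation function.
x = np.array([0.5, -1.0, 2.0])     # inputs
w = np.array([0.1, 0.4, -0.2])     # weights, one per input
b = 0.3                            # bias
z = np.dot(w, x) + b               # weighted sum
y = 1. / (1. + np.exp(-z))         # activation (here: sigmoid)
print(z, y)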
The classical fully connected neural network used here contains four layers: an input layer, two hidden layers, and an output layer. The handwritten digit recognition task represented by this fully connected neural network is shown in Figure 4.
Figure 4: Fully connected neural network structure for handwritten digit recognition task
Input layer: feeds the data into the neural network. In this task, the input layer has a scale of 28×28 pixel values.
Hidden layers: increase the depth and complexity of the network. The number of nodes in a hidden layer can be adjusted: more nodes give the neural network stronger representation ability, but also more parameters (see the parameter-count sketch after this list). In this task, the two hidden layers in the middle each have 10 nodes (a 10×10 structure). A hidden layer is usually smaller than the input layer so that it abstracts the key information; the activation function used is the common Sigmoid function.
Output layer: outputs the result of the network's computation; the number of nodes in the output layer is fixed. For a regression problem it is the number of values to be regressed; for a classification problem it is the number of classification labels. In this task, the model regresses a single number, so the size of the output layer is 1.
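As a rough illustration of how the hidden-layer size drives the parameter count, the minimal sketch below counts the parameters of a single fully connected layer (weights plus biases). The 784→10 case matches the 7,850 parameters that paddle.summary reports for the first hidden layer later in this section; a hypothetical 784→100 layer would need roughly ten times as many.

# Minimal sketch: a fully connected layer has in_features * out_features weights plus out_features biases
def fc_param_count(in_features, out_features):
    return in_features * out_features + out_features

print(fc_param_count(784, 10))    # 7850  -- the first hidden layer used in this section
print(fc_param_count(784, 100))   # 78500 -- a hypothetical wider hidden layer, about ten times the parameters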
Description:
The nonlinear activation function Sigmoid is introduced in the hidden layer to increase the nonlinear capability of the neural network.
For example, suppose a neural network uses only linear transformations, with four inputs x1~x4 and one output y. If the transformations of the first layer are z1=x1-x2 and z2=x3+x4, and the transformation of the second layer is y=z1+z2, then expanding the two layers gives y=x1-x2+x3+x4. In other words, no matter how many linear transformations are stacked in between, the relationship between the original input and the final output is still linear.
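The following minimal NumPy sketch (with randomly generated weights, purely for illustration) makes the same point numerically: applying two linear transformations in sequence produces exactly the same result as a single combined linear transformation.

import numpy as np

np.random.seed(0)
W1 = np.random.randn(4, 3)   # first linear transformation: 4 inputs -> 3 intermediate values
W2 = np.random.randn(3, 1)   # second linear transformation: 3 -> 1 output
x = np.random.randn(4)

y_stacked = (x @ W1) @ W2    # two linear layers applied one after the other
y_single = x @ (W1 @ W2)     # one equivalent linear layer
print(np.allclose(y_stacked, y_single))   # True: stacking linear layers stays linear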
Sigmoid is a common nonlinear transformation function in early neural network models; its formula is

$$y = \frac{1}{1 + e^{-x}}$$
With the following code, the function curve of Sigmoid is plotted.
In [2]
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    # Return the sigmoid value directly
    return 1. / (1. + np.exp(-x))

# params: start point, end point, spacing
x = np.arange(-8, 8, 0.2)
y = sigmoid(x)
plt.plot(x, y)
plt.show()
For the task of handwritten digit recognition, the network layer is designed as follows:
The input layer has a scale of 28 × 28, and one extra dimension (of size batch_size) is added uniformly for batch computation.
The two hidden layers in the middle are 10 × 10 in structure, and the activation function uses a Sigmoid function.
As with the house price prediction model, the output of the model is a regression to a number, with the size of the output layer set to 1.
The following code is an implementation of a classical fully connected neural network. After completing the definition of the network structure, the neural network can be trained.
Description:
The data iterator train_loader yields data of shape [batch_size, 1, 28, 28] at each iteration, so the data needs to be reshaped into vector form before being fed to the fully connected layers, as illustrated in the sketch below.
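As a quick illustration of this reshape, the sketch below uses a hypothetical batch of random values in place of real MNIST images and flattens a [batch_size, 1, 28, 28] tensor into [batch_size, 784]:

import paddle

# Hypothetical mini-batch with the same shape as one iteration of train_loader
batch = paddle.randn([8, 1, 28, 28])                 # [batch_size, channels, height, width]
flat = paddle.reshape(batch, [batch.shape[0], 784])  # flatten each image into a 784-dim vector
print(flat.shape)                                    # [8, 784]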
In [3]
import paddle
import paddle.nn.functional as F
from paddle.nn import Linear
# Define a multi-layer fully connected neural network
class MNIST(paddle.nn.Layer):
    def __init__(self):
        super(MNIST, self).__init__()
        # Define two fully connected hidden layers with an output dimension of 10;
        # the number of hidden nodes is set to 10 here and can be adjusted for the task
        self.fc1 = Linear(in_features=784, out_features=10)
        self.fc2 = Linear(in_features=10, out_features=10)
        # Define a fully connected output layer with an output dimension of 1
        self.fc3 = Linear(in_features=10, out_features=1)

    # Define the forward computation of the network: the hidden layers use the
    # sigmoid activation function, the output layer uses no activation function
    def forward(self, inputs):
        # Flatten [batch_size, 1, 28, 28] images into [batch_size, 784] vectors
        inputs = paddle.reshape(inputs, [inputs.shape[0], 784])
        outputs1 = self.fc1(inputs)
        outputs1 = F.sigmoid(outputs1)
        outputs2 = self.fc2(outputs1)
        outputs2 = F.sigmoid(outputs2)
        outputs_final = self.fc3(outputs2)
        return outputs_final
The paddle.summary API prints the basic structure and parameter information of the network. Its signature is:
paddle.summary(net, input_size=None, dtypes=None, input=None)
The meaning of the key parameters is as follows:
- net (Layer) - Network instance, must be a subclass of Layer.
- input_size (tuple|InputSpec|list[tuple|InputSpec]) - the size of the input tensor. If the network has only one input, set this to a tuple or InputSpec; if the model has multiple inputs, set it to a list[tuple|InputSpec] containing the shape of each input.
- dtypes (str, optional) - The data type of the input tensor, if not given, the float32 type is used by default. Default: None
Returns: a dictionary containing the total number of parameters and the total number of trainable parameters.
Below we print the basic structure and parameter information of the fully connected neural network defined above:
In [4]
model = MNIST()
params_info = paddle.summary(model, (1, 1, 28, 28))
print(params_info)
---------------------------------------------------------------------------
Layer (type) Input Shape Output Shape Param #
===========================================================================
Linear-1 [[1, 784]] [1, 10] 7,850
Linear-2 [[1, 10]] [1, 10] 110
Linear-3 [[1, 10]] [1, 1] 11
===========================================================================
Total params: 7,971
Trainable params: 7,971
Non-trainable params: 0
---------------------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.03
Estimated Total Size (MB): 0.03
---------------------------------------------------------------------------
{'total_params': 7971, 'trainable_params': 7971}
Train the classical fully connected neural network defined above on the MNIST dataset.
In [7]
# The code after the network structure section remains unchanged
def train(model):
    model.train()
    # Use the SGD optimizer with learning_rate set to 0.01
    opt = paddle.optimizer.SGD(learning_rate=0.01, parameters=model.parameters())
    # Train for 10 epochs
    EPOCH_NUM = 10
    loss_list = []
    for epoch_id in range(EPOCH_NUM):
        for batch_id, data in enumerate(train_loader()):
            # Prepare the data
            images, labels = data
            images = paddle.to_tensor(images)
            labels = paddle.to_tensor(labels, dtype="float32")
            # Forward computation
            predicts = model(images)
            # Calculate the loss, taking the mean of the losses of a batch of samples
            loss = F.square_error_cost(predicts, labels)
            avg_loss = paddle.mean(loss)
            # Print the current loss every 200 batches
            if batch_id % 200 == 0:
                loss_list.append(avg_loss.numpy()[0])
                print("epoch: {}, batch: {}, loss is: {}".format(epoch_id, batch_id, avg_loss.numpy()))
            # Backward propagation to compute the gradients
            avg_loss.backward()
            # Minimize the loss and update the parameters
            opt.step()
            # Clear the gradients
            opt.clear_grad()
    # Save the model parameters
    paddle.save(model.state_dict(), 'mnist.pdparams')
    return loss_list

model = MNIST()
loss_list = train(model)
epoch: 0, batch: 0, loss is: [26.372326]
epoch: 0, batch: 200, loss is: [5.287777]
epoch: 0, batch: 400, loss is: [3.5908165]
epoch: 0, batch: 600, loss is: [2.9941204]
epoch: 0, batch: 800, loss is: [2.546555]...
epoch: 9, batch: 0, loss is: [0.8851687]
epoch: 9, batch: 200, loss is: [1.3586344]...
epoch: 9, batch: 400, loss is: [1.5049815]
epoch: 9, batch: 600, loss is: [2.061728]
epoch: 9, batch: 800, loss is: [1.1433309]
Plot the curve of the change in the loss function:
In [10]
from tools import plot
plot(loss_list)
2.4.2 Convolutional Neural Networks
Although the use of classical fully connected neural networks improves accuracy somewhat, the form of their input data results in the loss of spatial information between image pixels, which affects the network's understanding of the image content. For computer vision problems, the most effective model remains the convolutional neural network. Convolutional neural networks optimize the network structure for the characteristics of vision problems and can directly process image data in its original form, retaining the spatial information between pixels, making them more suitable for dealing with vision problems.
The convolutional neural network consists of multiple convolutional layers and pooling layers, as shown in Figure 5. The convolutional layers are responsible for scanning the input to generate more abstract feature representations, and the pooling layer filters these feature representations to retain the most critical feature information.
Figure 5: Convolutional neural networks that shine in handling computer vision tasks
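To get a rough feel for how these layers change the shape of the data, the sketch below (using a random tensor in place of a real image batch) passes a 28×28 single-channel input through one convolution and one pooling layer with the same settings used in the network defined later in this section:

import paddle
from paddle.nn import Conv2D, MaxPool2D

x = paddle.randn([1, 1, 28, 28])    # a random "image" batch: [N, C, H, W]
conv = Conv2D(in_channels=1, out_channels=20, kernel_size=5, stride=1, padding=2)
pool = MaxPool2D(kernel_size=2, stride=2)

feat = conv(x)                      # padding keeps the spatial size: [1, 20, 28, 28]
down = pool(feat)                   # pooling halves the height and width: [1, 20, 14, 14]
print(feat.shape, down.shape)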
Description:
This section only briefly describes how to implement the handwritten digit recognition task with a convolutional neural network and the improvement in results it brings. For now, readers can simply think of the convolutional neural network as a more powerful model than the classical fully connected neural network; more detailed principles and implementations are covered later in Computer Vision - Fundamentals of Convolutional Neural Networks.
The convolutional network achieves a much lower loss on this problem.
The neural network implementation with two layers of convolution and pooling is shown below.
Note: This implementation is different from Figure 5, e.g., this implementation has an output length of 1.
In [11]
# Define the SimpleNet network structure
import paddle
from paddle.nn import Conv2D, MaxPool2D, Linear
import paddle.nn.functional as F

# Multi-layer convolutional neural network implementation
class MNIST(paddle.nn.Layer):
    def __init__(self):
        super(MNIST, self).__init__()
        # Define a convolutional layer with out_channels=20 output feature channels,
        # kernel_size=5, stride=1, and padding=2
        self.conv1 = Conv2D(in_channels=1, out_channels=20, kernel_size=5, stride=1, padding=2)
        # Define a pooling layer with kernel_size=2 and stride=2
        self.max_pool1 = MaxPool2D(kernel_size=2, stride=2)
        # Define a convolutional layer with out_channels=20 output feature channels,
        # kernel_size=5, stride=1, and padding=2
        self.conv2 = Conv2D(in_channels=20, out_channels=20, kernel_size=5, stride=1, padding=2)
        # Define a pooling layer with kernel_size=2 and stride=2
        self.max_pool2 = MaxPool2D(kernel_size=2, stride=2)
        # Define a fully connected layer with an output dimension of 1
        # (in_features=980 because the feature map is 20 channels x 7 x 7 after two poolings)
        self.fc = Linear(in_features=980, out_features=1)

    # Define the forward computation of the network: each convolution is immediately followed by
    # pooling, and a fully connected layer computes the final output.
    # The convolutional layers use the ReLU activation function; the fully connected layer uses none
    def forward(self, inputs):
        x = self.conv1(inputs)
        x = F.relu(x)
        x = self.max_pool1(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = self.max_pool2(x)
        # Flatten the feature map into a vector for the fully connected layer
        x = paddle.reshape(x, [x.shape[0], -1])
        x = self.fc(x)
        return x
Print the network structure:
In [12]
model = MNIST()
params_info = paddle.summary(model, (1, 1, 28, 28))
print(params_info)
---------------------------------------------------------------------------
Layer (type) Input Shape Output Shape Param #
===========================================================================
Conv2D-1 [[1, 1, 28, 28]] [1, 20, 28, 28] 520
MaxPool2D-1 [[1, 20, 28, 28]] [1, 20, 14, 14] 0
Conv2D-2 [[1, 20, 14, 14]] [1, 20, 14, 14] 10,020
MaxPool2D-2 [[1, 20, 14, 14]] [1, 20, 7, 7] 0
Linear-10 [[1, 980]] [1, 1] 981
===========================================================================
Total params: 11,521
Trainable params: 11,521
Non-trainable params: 0
---------------------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.19
Params size (MB): 0.04
Estimated Total Size (MB): 0.23
---------------------------------------------------------------------------
{'total_params': 11521, 'trainable_params': 11521}
The defined convolutional neural network is trained using the MNIST dataset as shown below.
In [13]
# The code after the network structure section remains unchanged
def train(model):
    model.train()
    learning_rate = 0.001
    # Use the SGD optimizer with the learning_rate set above
    opt = paddle.optimizer.SGD(learning_rate=learning_rate, parameters=model.parameters())
    # Train for 10 epochs
    EPOCH_NUM = 10
    # MNIST image height and width
    IMG_ROWS, IMG_COLS = 28, 28
    loss_list = []
    for epoch_id in range(EPOCH_NUM):
        for batch_id, data in enumerate(train_loader()):
            # Prepare the data
            images, labels = data
            images = paddle.to_tensor(images)
            labels = paddle.to_tensor(labels, dtype="float32")
            # Forward computation
            predicts = model(images)   # [batch_size, 1]
            # Calculate the loss, taking the mean of the losses of a batch of samples
            loss = F.square_error_cost(predicts, labels)
            avg_loss = paddle.mean(loss)
            # Print the current loss every 200 batches
            if batch_id % 200 == 0:
                loss_list.append(avg_loss.numpy()[0])
                print("epoch: {}, batch: {}, loss is: {}".format(epoch_id, batch_id, avg_loss.numpy()))
            # Backward propagation to compute the gradients
            avg_loss.backward()
            # Minimize the loss and update the parameters
            opt.step()
            # Clear the gradients
            opt.clear_grad()
    # Save the model parameters
    paddle.save(model.state_dict(), 'mnist.pdparams')
    return loss_list

model = MNIST()
loss_list_conv = train(model)
epoch: 0, batch: 0, loss is: [33.812702]
epoch: 0, batch: 200, loss is: [2.4981875]
epoch: 0, batch: 400, loss is: [2.748225]
epoch: 0, batch: 600, loss is: [2.2872992]
epoch: 0, batch: 800, loss is: [2.0726104]....
epoch: 9, batch: 0, loss is: [1.5385004]
epoch: 9, batch: 200, loss is: [1.3261033]
epoch: 9, batch: 400, loss is: [0.9551685]
epoch: 9, batch: 600, loss is: [1.0667143]
epoch: 9, batch: 800, loss is: [1.651695]
Simultaneously plot the change in the loss function during training of the two network structures:
In [14]
def plot_two_losses(loss_list_1, loss_list_2):
    plt.figure(figsize=(10, 5))
    freqs = [i for i in range(len(loss_list_1))]
    # Plot the training loss curves of the two networks
    plt.plot(freqs, loss_list_1, color='#e4007f', label="Train loss1")
    plt.plot(freqs, loss_list_2, color='#f19ec2', linestyle='--', label="Train loss2")
    # Plot the axes and legend
    plt.ylabel("loss", fontsize='large')
    plt.xlabel("freq", fontsize='large')
    plt.legend(loc='upper right', fontsize='x-large')
    plt.show()

plot_two_losses(loss_list, loss_list_conv)
The trend of the loss function shows that the fully connected neural network and the convolutional neural network converge at a comparable rate. Currently our convolutional neural network performs a regression task; next, we will replace the regression task with a classification task and see how well the convolutional neural network performs.