2.9 Resuming Training for Handwritten Digit Recognition
In the previous section, we described how to save a trained model to a disk file so that an application can load it at any time to perform prediction. In day-to-day training work, however, unexpected situations may interrupt the training process, whether deliberately or not. If training a model takes several days, retraining from the initial state after an interruption is unacceptable. Fortunately, PaddlePaddle supports resuming training from the last saved state: as long as we save the model state periodically during training, we never have to restart from scratch.
2.9.1 Constructing and training the network
The following section describes how to implement resumed training, again using the handwritten digit recognition case. The network definition is unchanged, so the encapsulated code is imported and called directly.
In [1]
import paddle
import os
from data_process import get_MNIST_dataloader
from MNIST_network import MNIST
import paddle.nn.functional as F
train_loader, test_loader = get_MNIST_dataloader()
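The MNIST class and the data loader imported above are defined in MNIST_network.py and data_process.py and are not reproduced here. For readers who do not have those files at hand, the block below is a minimal sketch of a comparable Paddle classifier; it is an illustrative assumption, not the book's exact MNIST definition.
# Sketch: a simple Paddle MNIST classifier, assumed to be comparable to the
# MNIST class imported from MNIST_network.py (the book's version may differ).
import paddle
import paddle.nn as nn
import paddle.nn.functional as F

class SimpleMNIST(nn.Layer):
    def __init__(self):
        super(SimpleMNIST, self).__init__()
        # two convolution + pooling stages followed by a fully connected classifier
        self.conv1 = nn.Conv2D(in_channels=1, out_channels=20, kernel_size=5, stride=1, padding=2)
        self.pool1 = nn.MaxPool2D(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2D(in_channels=20, out_channels=20, kernel_size=5, stride=1, padding=2)
        self.pool2 = nn.MaxPool2D(kernel_size=2, stride=2)
        self.fc = nn.Linear(in_features=980, out_features=10)

    def forward(self, inputs):
        x = F.relu(self.conv1(inputs))
        x = self.pool1(x)
        x = F.relu(self.conv2(x))
        x = self.pool2(x)
        x = paddle.reshape(x, [x.shape[0], -1])
        return self.fc(x)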
Define the Trainer class, which encapsulates the training loop and model saving.
In [2]
class Trainer(object):
    def __init__(self, model_path, model, optimizer):
        self.model_path = model_path   # model save path
        self.model = model             # defined model
        self.optimizer = optimizer     # optimizer

    def save(self):
        # save the model parameters
        paddle.save(self.model.state_dict(), self.model_path)

    def train_step(self, data):
        images, labels = data
        # forward computation
        predicts = self.model(images)
        # calculate the loss
        loss = F.cross_entropy(predicts, labels)
        avg_loss = paddle.mean(loss)
        # backward propagation and parameter update
        avg_loss.backward()
        self.optimizer.step()
        self.optimizer.clear_grad()
        return avg_loss

    def train_epoch(self, datasets, epoch):
        self.model.train()
        for batch_id, data in enumerate(datasets()):
            loss = self.train_step(data)
            # print the current loss every 500 batches
            if batch_id % 500 == 0:
                print("epoch_id: {}, batch_id: {}, loss is: {}".format(epoch, batch_id, loss.numpy()))

    def train(self, train_datasets, start_epoch, end_epoch, save_path):
        if not os.path.exists(save_path):
            os.makedirs(save_path)
        for i in range(start_epoch, end_epoch):
            self.train_epoch(train_datasets, i)
            # save the optimizer and model parameters of every epoch as a checkpoint
            paddle.save(self.optimizer.state_dict(), './{}/mnist_epoch{}'.format(save_path, i) + '.pdopt')
            paddle.save(self.model.state_dict(), './{}/mnist_epoch{}'.format(save_path, i) + '.pdparams')
        self.save()
Before introducing how to resume training with PaddlePaddle, we first train a model normally. The optimizer is Adam with a dynamically varying learning rate that decays from 0.01 to 0.001, and the model is saved once after each training epoch. Resuming training will then start from the parameters saved at the end of one of those epochs, so we can verify that the model behaves similarly whether it is trained in one pass or interrupted and resumed.
Note that the resumption procedure saves not only the model parameters but also the optimizer parameters. This is because some optimizers carry state that changes over the course of training; for example, Adam, Adagrad, and similar optimizers use a variable learning-rate strategy that gradually reduces the learning rate as training progresses. These optimizer parameters are crucial for resuming training correctly.
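As a quick illustration (a sketch using a throwaway linear layer rather than the MNIST model), printing the keys of an optimizer's state_dict shows that the learning-rate scheduler state is stored alongside the optimizer's internal accumulators, which is exactly the state that must survive an interruption:
# Sketch: inspect what an optimizer state_dict carries (the layer is only a placeholder).
layer = paddle.nn.Linear(3, 4)
scheduler = paddle.optimizer.lr.PolynomialDecay(learning_rate=0.01, decay_steps=100, end_lr=0.001)
opt = paddle.optimizer.Adam(learning_rate=scheduler, parameters=layer.parameters())
# The state dict typically holds the LR scheduler state plus Adam's moment accumulators,
# which is why it must be saved and restored together with the model parameters.
print(opt.state_dict().keys())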
To demonstrate this feature, the training procedure uses the Adam optimizer together with the PolynomialDecay learning-rate schedule, which decays the learning rate from 0.01 to 0.001.
class paddle.optimizer.lr.PolynomialDecay(learning_rate, decay_steps, end_lr=0.0001, power=1.0, cycle=False, last_epoch=-1, verbose=False)
The parameters are described below:
- learning_rate (float): initial learning rate, data type is Python float.
- decay_steps (int): the step size to perform the decay, this determines the decay period.
- end_lr (float, optional): minimum final learning rate, default value is 0.0001.
- power (float, optional): the power of the polynomial, default value is 1.0.
- last_epoch (int, optional): the epoch index of the last round; when resuming training, set it to the epoch number of the last completed round. The default value is -1, which means starting from the initial learning rate.
- verbose (bool, optional): if True, output a message in standard output stdout at each round of update, default value is False.
- cycle (bool, optional): whether the learning rate rises again after decaying. If True, the learning rate rises again once it has decayed to the minimum value; if False, it decreases monotonically. The default value is False.
The change curve of PolynomialDecay decays the learning rate from the initial value down to end_lr over decay_steps steps (linearly when the default power=1.0 is used).
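The decay can also be inspected directly by stepping a standalone scheduler, as in the sketch below; decay_steps=10 is an arbitrary small value chosen only to keep the printout short, whereas the actual training below uses total_steps.
# Sketch: trace the PolynomialDecay schedule step by step.
lr_sketch = paddle.optimizer.lr.PolynomialDecay(learning_rate=0.01, decay_steps=10, end_lr=0.001)
for step in range(12):
    print("step {}: lr = {:.4f}".format(step, lr_sketch.get_lr()))
    lr_sketch.step()
# After decay_steps the learning rate stays at end_lr because cycle=False by default.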
In [ ]
# When using a GPU machine, the use_gpu variable can be set to True
use_gpu = True
paddle.set_device('gpu:0') if use_gpu else paddle.set_device('cpu')

paddle.seed(1024)

epochs = 3
BATCH_SIZE = 32
model_path = './mnist.pdparams'

model = MNIST()
# total number of optimizer steps, used as the decay period of the schedule
total_steps = (int(50000 // BATCH_SIZE) + 1) * epochs
lr = paddle.optimizer.lr.PolynomialDecay(learning_rate=0.01, decay_steps=total_steps, end_lr=0.001)
optimizer = paddle.optimizer.Adam(learning_rate=lr, parameters=model.parameters())

trainer = Trainer(
    model_path=model_path,
    model=model,
    optimizer=optimizer
)
trainer.train(train_datasets=train_loader, start_epoch = 0, end_epoch = epochs, save_path='checkpoint')
W0901 17:29:43.145830 192 device_context.cc:447] PLEASE NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 10.1
W0901 17:29:43.152843 192 device_context.cc:465] device: 0, cuDNN Version: 7.6.
epoch_id: 0, batch_id: 0, loss is: [3.900196]
epoch_id: 0, batch_id: 500, loss is: [0.1180672]
epoch_id: 1, batch_id: 0, loss is: [0.05891193]
epoch_id: 1, batch_id: 500, loss is: [0.16080402]
epoch_id: 2, batch_id: 0, loss is: [0.05444674]
epoch_id: 2, batch_id: 500, loss is: [0.02639692]
2.9.2 Resuming Training
Resuming training requires rebuilding the network, so we need to restart AI Studio, re-run the data processing code, the MNIST network definition, and the Trainer class above, and then execute the resumption code.
In the training code above, we trained for 3 epochs. At the end of each epoch, we saved the model parameters and the optimizer-related parameters.
- Use model.state_dict() to get the model parameters.
- Use opt.state_dict() to get the optimizer and learning-rate related parameters.
- Call paddle.save to save the parameters locally.
For example, the files saved after the first epoch are mnist_epoch0.pdparams and mnist_epoch0.pdopt, which store the model parameters and the optimizer parameters respectively.
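When training is interrupted at an unknown point, one practical way to decide where to resume from is to look for the latest epoch in the checkpoint directory. The snippet below is only a sketch; it assumes the 'checkpoint' directory and the mnist_epoch{i} file naming produced by the Trainer above.
# Sketch: find the most recently saved epoch, assuming the 'checkpoint'
# directory and the 'mnist_epoch{i}.pdparams' naming used earlier.
import os
import re

ckpt_dir = 'checkpoint'
saved_epochs = [int(re.search(r'mnist_epoch(\d+)\.pdparams', name).group(1))
                for name in os.listdir(ckpt_dir) if name.endswith('.pdparams')]
last_epoch = max(saved_epochs)
print('latest checkpoint: epoch', last_epoch)
# resuming would then load checkpoint/mnist_epoch{last_epoch}.pdparams and .pdopt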
Use paddle.load to load the model parameters and the optimizer parameters respectively, as shown in the following code.
params_dict = paddle.load(params_path + '.pdparams')
opt_dict = paddle.load(params_path + '.pdopt')
How can we tell if the model is accurately resuming training?
Ideally, once training is resumed, the model returns to the state it was in at the moment of interruption, and the subsequent gradient updates follow the same trajectory they would have taken without interruption. Based on this, we can judge whether the method above resumes training accurately by examining the loss after resumption: we resume from the model parameters and optimizer state saved at the end of epoch 0 and check whether the loss during the following epoch (epoch 1) is similar to that of the uninterrupted training.
Description:
Resuming training relies on the following two points:
- When saving the model, save the model parameters and optimizer parameters separately.
- When restoring parameters, restore the model parameters and optimizer parameters separately.
The following code demonstrates the process of resuming training and verifies that it succeeds. The model parameters are loaded and training restarts from epoch 1, so that the reader can compare the change in loss after resumption with the uninterrupted run.
In [3]
# resume training of the MNIST model
paddle.seed(1024)

epochs = 3
BATCH_SIZE = 32
model_path = './mnist_retrain.pdparams'

model = MNIST()
total_steps = (int(50000 // BATCH_SIZE) + 1) * epochs
lr = paddle.optimizer.lr.PolynomialDecay(learning_rate=0.01, decay_steps=total_steps, end_lr=0.001)
optimizer = paddle.optimizer.Adam(learning_rate=lr, parameters=model.parameters())

# load the checkpoint saved at the end of epoch 0
params_dict = paddle.load('./checkpoint/mnist_epoch0.pdparams')
opt_dict = paddle.load('./checkpoint/mnist_epoch0.pdopt')

# restore the parameters into the model and the optimizer
model.set_state_dict(params_dict)
optimizer.set_state_dict(opt_dict)

trainer = Trainer(
    model_path=model_path,
    model=model,
    optimizer=optimizer
)
# The earlier checkpoints were saved under 'checkpoint'; save_path is set to a new
# directory here, but in actual training the same directory can be used.
trainer.train(train_datasets=train_loader, start_epoch=1, end_epoch=epochs, save_path='checkpoint_con')
W0901 17:31:41.696164 509 device_context.cc:447] PLEASE NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 10.1
W0901 17:31:41.700848 509 device_context.cc:465] device: 0, cuDNN Version: 7.6.
epoch_id: 1, batch_id: 0, loss is: [0.03602091]
epoch_id: 1, batch_id: 500, loss is: [0.4263561]
epoch_id: 2, batch_id: 0, loss is: [0.0733113]
epoch_id: 2, batch_id: 500, loss is: [0.13029066]
Besides saving the model parameters and optimizer parameters in separate files, we can also pack them, together with other information such as the current epoch, into a single object and save that, as shown in the example below:
In [4]
import paddle
from paddle import nn
from paddle.optimizer import Adam
# build a simple layer and optimizer
layer = paddle.nn.Linear(3, 4)
adam = Adam(learning_rate=0.001, parameters=layer.parameters())

# pack the model parameters, the optimizer parameters, and any extra
# information (such as the current epoch) into a single dictionary and save it
obj = {'model': layer.state_dict(), 'opt': adam.state_dict(), 'epoch': 100}
path = 'example/model.pdparams'
paddle.save(obj, path)
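To restore from such a combined checkpoint, the saved dictionary can be loaded with paddle.load and each entry handed back to its owner. The following sketch simply continues the example above:
# Sketch: restore the combined checkpoint saved above.
obj = paddle.load(path)
layer.set_state_dict(obj['model'])   # model parameters
adam.set_state_dict(obj['opt'])      # optimizer (and learning-rate) state
start_epoch = obj['epoch']           # extra information, e.g. where to resume from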
From the loss changes of the resumed run, the loss values obtained by loading the saved parameters and continuing training are consistent with those of normal, uninterrupted training, which shows that implementing resumed training with Paddle is extremely simple. To summarize:
- Save both model parameters and optimizer parameters when saving the model;
paddle.save(opt.state_dict(), 'model.pdopt')
paddle.save(model.state_dict(), 'model.pdparams')
- Restore both model parameters and optimizer parameters when restoring parameters.
model_dict = paddle.load('model.pdparams')
opt_dict = paddle.load('model.pdopt')
model.set_state_dict(model_dict)
opt.set_state_dict(opt_dict)