Automatic Mixed Precision Using PyTorch

Larger deep learning models need more compute and memory, and new techniques have been developed to train deep neural networks faster. Instead of FP32 (the full-precision, 32-bit floating-point format), you can use FP16 (the half-precision, 16-bit floating-point format), and researchers have found that using the two in tandem is an even better option.

Mixed precision allows for half-precision training while still preserving much of the single-precision network accuracy. The term "mixed precision technique" refers to the fact that this method makes use of both single and half-precision representations.

In this overview of Automatic Mixed Precision (Amp) training with PyTorch, we demonstrate how the technique works, walk step by step through the process of using Amp, and discuss more advanced applications of Amp techniques, with code scaffolds that you can later integrate into your own code.

To follow this tutorial, you will need the following:

Basic Knowledge of PyTorch: Familiarity with PyTorch, including its core concepts like tensors, modules, and the training loop.

Understanding of Deep Learning Fundamentals: Concepts such as neural networks, backpropagation, and optimization.

Knowledge of Mixed Precision Training: Awareness of the benefits and drawbacks of mixed precision training, including reduced memory usage and faster computation.

Access to Compatible Hardware: A GPU that supports mixed precision, such as NVIDIA GPUs with Tensor Cores (e.g., Volta, Turing, Ampere architectures).

Python and CUDA Setup: A working Python environment with PyTorch installed and CUDA configured for GPU acceleration.

Like most deep learning frameworks, PyTorch normally trains on 32-bit floating-point data (FP32). FP32, however, isn't always necessary: a 16-bit floating-point format can be used for many operations where FP32 costs extra time and memory.

Consequently, NVIDIA engineers developed a technique allowing mixed-precision training to be performed in FP32 for a small number of operations while most of the network runs in FP16.

For mixed-precision training, PyTorch offers a wealth of features already built-in.

A module's parameters are converted to FP16 when you call the .half() method on it, and a tensor's data is converted to FP16 when you call .half() on the tensor. Fast FP16 arithmetic is used to execute any operations on these modules or tensors. PyTorch has solid support for NVIDIA's math libraries (cuBLAS and cuDNN), and FP16 data is processed with Tensor Cores to carry out GEMMs and convolutions. To employ Tensor Cores in cuBLAS, the dimensions of a GEMM ([M, K] x [K, N] -> [M, N]) must be multiples of 8.
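
For instance, a minimal sketch of converting a module and a tensor to FP16 might look like this (the layer sizes are arbitrary):

    import torch
    import torch.nn as nn

    # Convert a module's parameters and a tensor's data to FP16 with .half().
    layer = nn.Linear(1024, 1024).cuda().half()
    x = torch.randn(8, 1024, device="cuda").half()

    # Operations on FP16 modules/tensors use fast FP16 arithmetic; with dimensions
    # that are multiples of 8, cuBLAS can dispatch this GEMM to Tensor Cores.
    y = layer(x)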

Apex's mixed-precision utilities are meant to increase training speed while keeping the accuracy and stability of single-precision training. Apex can perform operations in FP16 or FP32, automatically handle master parameter conversion, and automatically scale losses.

Apex was created to make it easier for researchers to include mixed-precision training in their models. Amp, short for Automatic Mixed-Precision, is one of the features of Apex, a lightweight PyTorch extension. A few more lines on their networks are all it takes for users to benefit from mixed precision training with Amp. Apex was launched at CVPR 2018, and it is worth noting that the PyTorch community has shown strong support for Apex since its release.

By making just minor changes to the running model, Amp makes it such that you don't have to worry about mixed types while creating or running your script. Amp's assumptions may not fit as well in models that utilize PyTorch in less usual ways, but there are hooks to adjust those assumptions as required.

Amp offers all of the advantages of mixed-precision training without requiring loss scaling or type conversions to be managed explicitly. Apex's GitHub repository contains installation instructions and links to the official API documentation.

Amp utilizes a whitelist/blacklist paradigm at the logical level. PyTorch's universe of tensor operations includes neural network functions such as torch.nn.functional.conv2d, simple math functions such as torch.log, and tensor methods such as torch.Tensor.add_. There are three main categories of functions in this universe: those that can benefit from FP16's Tensor Core speedups (the whitelist), those that need the extra precision of FP32 (the blacklist), and everything else.

Amp's task is straightforward, at least in theory. Amp determines if a PyTorch function is whitelisted, blacklisted, or neither before calling it. All arguments should be converted to FP16 if whitelisted or FP32 if blacklisted. If neither, just ensure that all arguments are of the same type. This policy is not as simple to apply in reality as it seems.

To integrate Amp into an existing PyTorch script, follow these steps: import Amp from the Apex library, initialize Amp so it can patch the model, optimizer, and PyTorch internals, and mark where the backward pass occurs so the loss can be scaled.

There is just one line of code for the first step:
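
Following Apex's documented usage, that line is:

    from apex import amp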

The neural network model and the optimizer used for training must already be defined to complete this step, which is also just one line long.
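
A minimal sketch, assuming model and optimizer are already constructed (the opt_level choices are discussed next):

    # Wrap the model and optimizer so Amp can patch them for mixed precision.
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")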

Additional settings let you fine-tune how Amp adjusts tensor and operation types. The function amp.initialize() accepts many parameters; here we will only specify three of them, the most important being opt_level:

O0 for FP32 training: This is effectively a no-op. Your incoming model should already be FP32, so O0 is mainly useful for establishing an accuracy baseline.

O1 for Mixed Precision (recommended for typical use): Patch all Tensor and Torch methods to use a whitelist/blacklist input casting scheme. Tensor Core-friendly whitelist ops, such as GEMMs and convolutions, are carried out in FP16, while blacklist ops that require FP32 precision, such as softmax, remain in FP32. Unless explicitly overridden, O1 also employs dynamic loss scaling.

O2 for "Almost FP16" Mixed Precision: O2 casts the model weights to FP16, patches the model's forward method to cast input data to FP16, keeps batchnorms in FP32, maintains FP32 master weights, updates the optimizer's param_groups so that the optimizer.step() acts directly on the FP32 weights, and implements dynamic loss scaling (unless overridden). Unlike O1, O2 does not patch Torch functions or Tensor methods.

O3 for FP16 training: O3 is not true mixed precision and may be less stable than O1 and O2. It can, however, be useful for establishing a speed baseline for your model, against which the efficiency of O1 and O2 can be evaluated.

If your model uses batch normalization, the additional property override keep_batchnorm_fp32=True (which enables cudnn batchnorm) may help O3 reach that "speed of light" baseline.

O0 and O3 are not true mixed precision, but they are useful for setting accuracy and speed baselines, respectively. O1 and O2 are the two true mixed-precision modes.

You can try both and see which improves performance and accuracy the most for your particular model.
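
For illustration, here is how different opt_level choices might be passed to amp.initialize (the values are examples, and only one call would be used in practice):

    # O2: "almost FP16" mixed precision
    model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

    # O3: pure FP16 "speed of light" baseline, keeping batchnorm in FP32 (cudnn batchnorm)
    model, optimizer = amp.initialize(model, optimizer, opt_level="O3",
                                      keep_batchnorm_fp32=True)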

Make sure you identify where the backward pass occurs in your code.

A few lines of code that look like this will appear:
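
(The exact names vary by script; loss_fn, inputs, and targets below are placeholders.)

    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()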

Using the Amp context manager, you can enable loss scaling by simply wrapping the backward pass:
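
(This follows Apex's documented amp.scale_loss API.)

    # Amp scales the loss before backward() and unscales gradients afterwards.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()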

That's all. You may now rerun your script with mixed-precision training turned on.

Because PyTorch is so flexible and dynamic, it lacks a static model object or graph to latch onto and insert the casts described above. Instead, Amp "monkey patches" the required functions so it can intercept and cast arguments dynamically.

As an example, you can use the code below to ensure that the arguments to torch.nn.functional.linear are always cast to FP16:
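
(This is a simplified sketch of the idea, not the exact patch Amp installs.)

    import torch
    import torch.nn.functional as F

    orig_linear = F.linear

    def wrapped_linear(*args, **kwargs):
        # Cast floating-point tensor arguments to FP16 before calling the original op.
        casted_args = [
            a.half() if torch.is_tensor(a) and torch.is_floating_point(a) else a
            for a in args
        ]
        return orig_linear(*casted_args, **kwargs)

    # Install the patch so future calls to F.linear go through the wrapper.
    torch.nn.functional.linear = wrapped_linear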

Although Amp adds refinements to make the code more resilient, calling amp.initialize() effectively inserts monkey patches into all relevant PyTorch functions so that arguments are cast correctly at runtime.

Each weight is only cast FP32 -> FP16 once per iteration, since Amp keeps an internal cache of all parameter casts and reuses them as needed. The context manager around the backward pass tells Amp when to clear the cache at each iteration.

"Automated mixed precision training" refers to the combination of torch.cuda.amp.autocast and torch.cuda.amp.GradScaler. Using torch.cuda.amp.autocast, you may set up autocasting just for certain areas. Autocasting automatically selects the precision for GPU operations to optimize efficiency while maintaining accuracy.

The torch.cuda.amp.GradScaler instances make it easier to perform the gradient scaling steps. Gradient scaling reduces gradient underflow, which helps networks with float16 gradients achieve better convergence.

Here's some code to demonstrate how to use autocast() to get automatic mixed precision in PyTorch:
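
(model, optimizer, loss_fn, and data_loader are assumed to exist already.)

    import torch

    # Create a GradScaler once at the beginning of training.
    scaler = torch.cuda.amp.GradScaler()

    for inputs, targets in data_loader:
        optimizer.zero_grad()

        # Run the forward pass under autocast so eligible ops execute in FP16.
        with torch.cuda.amp.autocast():
            outputs = model(inputs)
            loss = loss_fn(outputs, targets)

        # Scale the loss, backpropagate, step the optimizer, and update the scale.
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()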

If the forward pass for a particular operation has float16 inputs, then the backward pass for that operation produces float16 gradients, and float16 may not be able to represent gradients with small magnitudes.

The update for the related parameters will be lost if these values are flushed to zero ("underflow").

Gradient scaling is a technique that uses a scale factor to multiply the network's losses and then performs a backward pass on the scaled loss to avoid underflow. It is also necessary to scale backward-flowing gradients through the network by this same factor. Consequently, gradient values have a larger magnitude, which prevents them from flushing to zero.

Before updating parameters, each parameter's gradient (.grad attribute) should be unscaled so that the scale factor does not interfere with the learning rate. Both autocast and GradScaler can be used independently since they are modular.

We can unscale all of the gradients held by an optimizer's parameters by calling scaler.unscale_(optimizer). The .grad attributes of the parameters must be unscaled between backward() and scaler.step(optimizer) before you modify or inspect them. If you want to limit the global norm (see torch.nn.utils.clip_grad_norm_()) or maximum magnitude (see torch.nn.utils.clip_grad_value_()) of your gradients to be less than or equal to some user-imposed threshold, you can use a technique called "gradient clipping."

Clipping without unscaling would clip the scaled gradients' norms/maximum magnitudes, invalidating your requested threshold (which was meant to be a threshold for unscaled gradients). scaler.unscale_(optimizer) unscales the gradients held by that optimizer's assigned parameters.

You may unscale the gradients of other parameters that were previously assigned to another optimizer (such as optimizer1) by calling scaler.unscale_(optimizer1). We can illustrate this concept by adding two lines of code, as shown in the sketch below:
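
(The clipping threshold here is an arbitrary example value.)

    scaler.scale(loss).backward()

    # Unscale this optimizer's gradients in place so clipping sees their true magnitudes.
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # step() knows these gradients are already unscaled and will not unscale them again.
    scaler.step(optimizer)
    scaler.update()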

Gradient accumulation is based on a remarkably simple idea: instead of updating the model parameters after every batch, the loss and gradients are still computed per batch, but the gradients are accumulated across several successive batches.

After a certain number of batches, the parameters are updated using the accumulated gradient. Here's a snippet showing how to use gradient accumulation with AMP in PyTorch:
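
(accumulation_steps and the other names are illustrative.)

    accumulation_steps = 4  # number of micro-batches per optimizer update (example value)
    scaler = torch.cuda.amp.GradScaler()

    for i, (inputs, targets) in enumerate(data_loader):
        with torch.cuda.amp.autocast():
            outputs = model(inputs)
            # Average the loss over the accumulation window.
            loss = loss_fn(outputs, targets) / accumulation_steps

        # Gradients accumulate in .grad, still scaled by the current scale factor.
        scaler.scale(loss).backward()

        if (i + 1) % accumulation_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()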

If gradients are unscaled (or the scale factor changes) before accumulation is complete, the next backward pass will add scaled gradients to unscaled gradients (or to gradients scaled by a different factor), after which it is impossible to recover the accumulated, unscaled gradients that step must apply. Unscaling and scale updates should therefore happen only after all the gradients for a full effective batch have been accumulated.

When implementing a gradient penalty, torch.autograd.grad() is used to build the gradients, which are combined to form the penalty value and then added to the loss. The example below shows an ordinary L2 penalty without gradient scaling or autocasting.
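
(Placeholder names are used for the model, loss function, and data.)

    for inputs, targets in data_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)

        # Build gradients w.r.t. the parameters, keeping the graph so the
        # penalty itself can be backpropagated.
        grad_params = torch.autograd.grad(loss, model.parameters(), create_graph=True)
        grad_norm = torch.sqrt(sum(p.pow(2).sum() for p in grad_params))
        loss = loss + grad_norm

        loss.backward()
        optimizer.step()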

To implement a gradient penalty with gradient scaling, the loss passed to torch.autograd.grad() should be scaled, and the resulting gradients must be unscaled before they are combined to form the penalty value. Since the penalty term computation is part of the forward pass, it should take place inside an autocast context.

Here is how the same L2 penalty looks with gradient scaling and autocasting:
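
(This follows the pattern in PyTorch's AMP examples; the names are placeholders.)

    scaler = torch.cuda.amp.GradScaler()

    for inputs, targets in data_loader:
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            outputs = model(inputs)
            loss = loss_fn(outputs, targets)

        # Scaled loss -> scaled gradients for the penalty term.
        scaled_grad_params = torch.autograd.grad(
            scaler.scale(loss), model.parameters(), create_graph=True
        )

        # Unscale the gradients before forming the penalty value.
        inv_scale = 1.0 / scaler.get_scale()
        grad_params = [p * inv_scale for p in scaled_grad_params]

        # The penalty computation is part of the forward pass, so it goes inside autocast.
        with torch.cuda.amp.autocast():
            grad_norm = torch.sqrt(sum(p.pow(2).sum() for p in grad_params))
            loss = loss + grad_norm

        # Usual scaled backward pass, step, and scale update.
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()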

If your network has multiple losses, scaler.scale must be called on each of them individually.

If you have multiple optimizers in your network, you may call scaler.unscale_ on any of them individually, and you must call scaler.step on each of them. However, scaler.update should only be called once, after all optimizers used in this iteration have been stepped:
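
(A sketch with two models, two losses, and two optimizers; all names are placeholders.)

    scaler = torch.cuda.amp.GradScaler()

    with torch.cuda.amp.autocast():
        output0 = model0(inputs)
        output1 = model1(inputs)
        loss0 = loss_fn(output0, targets)
        loss1 = loss_fn(output1, targets)

    # scale() is called on each loss.
    scaler.scale(loss0).backward()
    scaler.scale(loss1).backward()

    # unscale_ may be called on any subset of the optimizers (here, only optimizer0).
    scaler.unscale_(optimizer0)

    # step() is called on every optimizer, but update() only once per iteration.
    scaler.step(optimizer0)
    scaler.step(optimizer1)
    scaler.update()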

Each optimizer examines its gradients for infs/NaNs and makes an independent decision about whether to skip the step, so some optimizers may skip the step while others do not. Since step skipping should only happen once every few hundred iterations, it should not affect convergence. If you observe poor convergence in a multiple-optimizer model after adding gradient scaling, report the issue.

One of the most significant issues with deep learning models is that they are growing too large to train on a single GPU. Training on a single GPU can simply take too long, and multi-GPU training is required to get models ready as quickly as possible. Well-known efforts have shortened ImageNet training from two weeks to 18 minutes, and trained the largest, most advanced Transformer-XL models in two weeks instead of four years.

Without compromising quality, PyTorch offers the best combination of ease of use and control. nn.DataParallel and nn.parallel.DistributedDataParallel are two PyTorch features for distributing training across multiple GPUs. You can use these easy-to-use wrappers and changes to train the network on multiple GPUs.

In a single machine, DataParallel helps spread training over many GPUs.

Let's take a closer look at how DataParallel really works in practice.

When utilizing DataParallel to train a neural network, the following stages take place: the input batch is split along the batch dimension and scattered across the GPUs, the model is replicated onto each GPU, each replica runs the forward pass on its slice of the data in parallel, and the outputs are gathered back onto the default GPU, where the loss is computed before the gradients flow back through the replicas and are reduced onto the original model.

It is worth noting that the concerns discussed here apply only to autocast; GradScaler's behavior is unchanged. torch.nn.DataParallel spawns a thread for each device to run the forward pass. The autocast state is propagated to each of these threads, so the following will work:
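
(MyModel, inputs, targets, and loss_fn are placeholder names.)

    model = MyModel().cuda()
    dp_model = torch.nn.DataParallel(model)

    # Autocast set in the main thread is propagated to DataParallel's per-device threads.
    with torch.cuda.amp.autocast():
        outputs = dp_model(inputs)
        loss = loss_fn(outputs, targets)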

The documentation for torch.nn.parallel.DistributedDataParallel advises using one GPU per process for best performance. In this situation, DistributedDataParallel does not launch threads internally; hence the use of autocast and GradScaler is not affected.

When using multiple GPUs per process, torch.nn.parallel.DistributedDataParallel may spawn a side thread to run the forward pass on each device, just like torch.nn.DataParallel. The fix is the same: apply autocast as part of your model's forward method to ensure it is enabled in the side threads.
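
(The module body below is a placeholder; only the autocast placement matters.)

    import torch
    import torch.nn as nn

    class MyModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Linear(512, 10)  # placeholder model body

        def forward(self, x):
            # Enabling autocast inside forward guarantees it is active in any side
            # threads spawned by DataParallel / DistributedDataParallel.
            with torch.cuda.amp.autocast():
                return self.net(x)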

In this article, we have: explained what mixed-precision training is and why it helps, walked through Apex's Amp API and its optimization levels, demonstrated automatic mixed precision with torch.cuda.amp.autocast and GradScaler, covered gradient scaling, clipping, accumulation, and gradient penalties, and looked at how autocast interacts with DataParallel and DistributedDataParallel.
