.. role:: hidden
    :class: hidden-section

Advanced Amp Usage
===================================

GANs
----

GANs are an interesting synthesis of several topics below. A `comprehensive example`_
is under construction.

.. _`comprehensive example`:
    https://github.com/NVIDIA/apex/tree/master/examples/dcgan

Gradient clipping
-----------------

Amp calls the params owned directly by the optimizer's ``param_groups`` the "master params."

These master params may be fully or partially distinct from ``model.parameters()``.
For example, with `opt_level="O2"`_, ``amp.initialize`` casts most model params to FP16,
creates an FP32 master param outside the model for each newly-FP16 model param,
and updates the optimizer's ``param_groups`` to point to these FP32 params.

The master params owned by the optimizer's ``param_groups`` may also fully coincide with the
model params, which is typically true for ``opt_level``\ s ``O0``, ``O1``, and ``O3``.

In all cases, correct practice is to clip the gradients of the params that are guaranteed to be
owned **by the optimizer's** ``param_groups``, instead of those retrieved via ``model.parameters()``.

Also, if Amp uses loss scaling, gradients must be clipped after they have been unscaled
(which occurs during exit from the ``amp.scale_loss`` context manager).

The following pattern should be correct for any ``opt_level``::

    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    # Gradients are unscaled during context manager exit.
    # Now it's safe to clip.  Replace
    # torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    # with
    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm)
    # or
    torch.nn.utils.clip_grad_value_(amp.master_params(optimizer), max_)

Note the use of the utility function ``amp.master_params(optimizer)``,
which returns a generator-expression that iterates over the
params in the optimizer's ``param_groups``.

Also note that ``clip_grad_norm_(amp.master_params(optimizer), max_norm)`` is invoked
*instead of*, not *in addition to*, ``clip_grad_norm_(model.parameters(), max_norm)``.

.. _`opt_level="O2"`:
    https://nvidia.github.io/apex/amp.html#o2-fast-mixed-precision
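
If it helps to see these pieces in context, here is a minimal sketch of a single-optimizer
training loop that applies the pattern above.  It is only an illustration, not part of the Amp
API; ``loader``, the model, the loss function, and ``max_norm`` are placeholders::

    import torch
    from apex import amp

    model = torch.nn.Linear(1024, 1024).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

    max_norm = 1.0
    for data, target in loader:  # loader is assumed to yield cuda Tensors
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(data), target)
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        # Gradients are unscaled at this point, so clipping the master params is safe:
        torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm)
        optimizer.step()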

Custom/user-defined autograd functions
--------------------------------------

The old Amp API for `registering user functions`_ is still considered correct.  Functions must
be registered before calling ``amp.initialize``.

.. _`registering user functions`:
    https://github.com/NVIDIA/apex/tree/master/apex/amp#annotating-user-functions
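
For orientation, the two registration styles described at that link look roughly like the
following sketch.  ``stable_softmax`` is a made-up user function, and the ``torch.einsum``
registration is only an example of annotating a function you don't own; consult the linked
README for the authoritative details::

    import torch
    from apex import amp

    # Decorator form, applied where the function is defined.  float_function asks Amp
    # to run this function's floating-point Tensor inputs in FP32:
    @amp.float_function
    def stable_softmax(x):
        return torch.nn.functional.softmax(x, dim=-1)

    # Explicit registration form, for functions you don't own.  This asks Amp to run
    # torch.einsum in FP16:
    amp.register_half_function(torch, 'einsum')

    # Both must happen before amp.initialize (model and optimizer constructed as usual):
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")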

Forcing particular layers/functions to a desired type
------------------------------------------------------

I'm still working on a generalizable exposure for this that won't require user-side code divergence
across different ``opt_level``\ s.

Multiple models/optimizers/losses
---------------------------------

Initialization with multiple models/optimizers
************************************************

``amp.initialize``'s optimizer argument may be a single optimizer or a list of optimizers,
as long as the output you accept has the same type.
Similarly, the ``model`` argument may be a single model or a list of models, as long as the accepted
output matches.  The following calls are all legal::

    model, optim = amp.initialize(model, optim,...)
    model, [optim0, optim1] = amp.initialize(model, [optim0, optim1],...)
    [model0, model1], optim = amp.initialize([model0, model1], optim,...)
    [model0, model1], [optim0, optim1] = amp.initialize([model0, model1], [optim0, optim1],...)

Backward passes with multiple optimizers
******************************************

Whenever you invoke a backward pass, the ``amp.scale_loss`` context manager must receive
**all the optimizers that own any params for which the current backward pass is creating gradients.**
This is true even if each optimizer owns only some, but not all, of the params that are about to
receive gradients.

If, for a given backward pass, there's only one optimizer whose params are about to receive gradients,
you may pass that optimizer directly to ``amp.scale_loss``.  Otherwise, you must pass the
list of optimizers whose params are about to receive gradients.  Example with 3 losses and 2 optimizers::

    # loss0 accumulates gradients only into params owned by optim0:
    with amp.scale_loss(loss0, optim0) as scaled_loss:
        scaled_loss.backward()

    # loss1 accumulates gradients only into params owned by optim1:
    with amp.scale_loss(loss1, optim1) as scaled_loss:
        scaled_loss.backward()

    # loss2 accumulates gradients into some params owned by optim0
    # and some params owned by optim1
    with amp.scale_loss(loss2, [optim0, optim1]) as scaled_loss:
        scaled_loss.backward()

Optionally have Amp use a different loss scaler per-loss
**********************************************************

By default, Amp maintains a single global loss scaler that will be used for all backward passes
(all invocations of ``with amp.scale_loss(...)``).  No additional arguments to ``amp.initialize``
or ``amp.scale_loss`` are required to use the global loss scaler.  The code snippets above with
multiple optimizers/backward passes use the single global loss scaler under the hood,
and they should "just work."

However, you can optionally tell Amp to maintain a loss scaler per-loss, which gives Amp increased
numerical flexibility.  This is accomplished by supplying the ``num_losses`` argument to
``amp.initialize`` (which tells Amp how many backward passes you plan to invoke, and therefore
how many loss scalers Amp should create), then supplying the ``loss_id`` argument to each of your
backward passes (which tells Amp the loss scaler to use for this particular backward pass)::

    model, [optim0, optim1] = amp.initialize(model, [optim0, optim1], ..., num_losses=3)

    with amp.scale_loss(loss0, optim0, loss_id=0) as scaled_loss:
        scaled_loss.backward()

    with amp.scale_loss(loss1, optim1, loss_id=1) as scaled_loss:
        scaled_loss.backward()

    with amp.scale_loss(loss2, [optim0, optim1], loss_id=2) as scaled_loss:
        scaled_loss.backward()

``num_losses`` and ``loss_id``\ s should be specified purely based on the set of
losses/backward passes.  The use of multiple optimizers, or association of single or
multiple optimizers with each backward pass, is unrelated.
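
For example, a script with a single optimizer but three separate backward passes would still use
``num_losses=3``, with one ``loss_id`` per backward pass (a sketch in the same style as above)::

    model, optimizer = amp.initialize(model, optimizer, ..., num_losses=3)

    with amp.scale_loss(loss0, optimizer, loss_id=0) as scaled_loss:
        scaled_loss.backward()

    with amp.scale_loss(loss1, optimizer, loss_id=1) as scaled_loss:
        scaled_loss.backward()

    with amp.scale_loss(loss2, optimizer, loss_id=2) as scaled_loss:
        scaled_loss.backward()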

Gradient accumulation across iterations
----------------------------------------

The following should "just work," and properly accommodate multiple models/optimizers/losses, as well as
gradient clipping via the `instructions above`_::

    # If your intent is to simulate a larger batch size using gradient accumulation,
    # you can divide the loss by the number of accumulation iterations (so that gradients
    # will be averaged over that many iterations):
    loss = loss/iters_to_accumulate

    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()

    # Every iters_to_accumulate iterations, call step() and reset gradients:
    if iter%iters_to_accumulate == 0:
        # Gradient clipping if desired:
        # torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm)
        optimizer.step()
        optimizer.zero_grad()
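
Written out as a complete loop, that pattern might look like the following sketch (``loader``,
``loss_fn``, and ``iters_to_accumulate`` are placeholders you would define yourself)::

    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

    optimizer.zero_grad()
    for i, (data, target) in enumerate(loader):
        loss = loss_fn(model(data), target)
        # Average gradients over the accumulation window:
        loss = loss/iters_to_accumulate

        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()

        if (i + 1) % iters_to_accumulate == 0:
            # Gradient clipping if desired:
            # torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm)
            optimizer.step()
            optimizer.zero_grad()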

As a minor performance optimization, you can pass ``delay_unscale=True``
to ``amp.scale_loss`` until you're ready to ``step()``.  You should only attempt ``delay_unscale=True``
if you're sure you know what you're doing, because the interaction with gradient clipping and
multiple models/optimizers/losses can become tricky::

    if iter%iters_to_accumulate == 0:
        # Every iters_to_accumulate iterations, unscale and step
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    else:
        # Otherwise, accumulate gradients, don't unscale or step.
        with amp.scale_loss(loss, optimizer, delay_unscale=True) as scaled_loss:
            scaled_loss.backward()

.. _`instructions above`:
    https://nvidia.github.io/apex/advanced.html#gradient-clipping

Custom data batch types
------------------------

The intention of Amp is that you never need to cast your input data manually, regardless of
``opt_level``.  Amp accomplishes this by patching any models' ``forward`` methods to cast
incoming data appropriately for the ``opt_level``.  But to cast incoming data,
Amp needs to know how.  The patched ``forward`` will recognize and cast floating-point Tensors
(non-floating-point Tensors like IntTensors are not touched) and
Python containers of floating-point Tensors.  However, if you wrap your Tensors in a custom class,
the casting logic doesn't know how to drill
through the tough custom shell to access and cast the juicy Tensor meat within.  You need to tell
Amp how to cast your custom batch class, by assigning it a ``to`` method that accepts a ``torch.dtype``
(e.g., ``torch.float16`` or ``torch.float32``) and returns an instance of the custom batch cast to
``dtype``.  The patched ``forward`` checks for the presence of your ``to`` method, and will
invoke it with the correct type for the ``opt_level``.

Example::

    class CustomData(object):
        def __init__(self):
            self.tensor = torch.cuda.FloatTensor([1,2,3])

        def to(self, dtype):
            self.tensor = self.tensor.to(dtype)
            return self
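
With that ``to`` method in place, no change to the calling code is needed.  As a sketch of the
intended usage (``MyModel`` is just an illustrative module that accepts a ``CustomData`` batch),
under ``opt_level="O2"`` the patched ``forward`` would call ``batch.to(torch.float16)`` before
your original ``forward`` runs::

    model = MyModel().cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

    batch = CustomData()
    # The patched forward finds CustomData's to() method and invokes it with the
    # dtype appropriate for the opt_level (torch.float16 for O2):
    output = model(batch)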

.. warning::
    Amp also forwards numpy ndarrays without casting them.  If you send input data as a raw, unwrapped
    ndarray, then later use it to create a Tensor within your ``model.forward``, this Tensor's type will
    not depend on the ``opt_level``, and may or may not be correct.  Users are encouraged to pass
    castable data inputs (Tensors, collections of Tensors, or custom classes with a ``to`` method)
    wherever possible.

.. note::
    Amp does not call ``.cuda()`` on any Tensors for you.  Amp assumes that your original script
    is already set up to move Tensors from the host to the device as needed.