.. role:: hidden
    :class: hidden-section

apex.amp
===================================

This page documents the updated API for Amp (Automatic Mixed Precision),
a tool to enable Tensor Core-accelerated training in only 3 lines of Python.

A `runnable, comprehensive Imagenet example`_ demonstrating good practices can be found
on the Github page.

GANs are a tricky case that many people have requested. A `comprehensive DCGAN example`_
is under construction.

If you already implemented Amp based on the instructions below, but it isn't behaving as expected,
please review `Advanced Amp Usage`_ to see if any topics match your use case. If that doesn't help,
`file an issue`_.

.. _`file an issue`:
    https://github.com/NVIDIA/apex/issues

``opt_level``\ s and Properties
-------------------------------

Amp allows users to easily experiment with different pure and mixed precision modes.
Commonly-used default modes are chosen by
selecting an "optimization level" or ``opt_level``; each ``opt_level`` establishes a set of
properties that govern Amp's implementation of pure or mixed precision training.
Finer-grained control of how a given ``opt_level`` behaves can be achieved by passing values for
particular properties directly to ``amp.initialize``. These manually specified values
override the defaults established by the ``opt_level``.

Example::

    # Declare model and optimizer as usual, with default (FP32) precision
    model = torch.nn.Linear(D_in, D_out).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    # Allow Amp to perform casts as required by the opt_level
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
    ...

    # loss.backward() becomes:
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    ...

Users **should not** manually cast their model or data to ``.half()``, regardless of what ``opt_level``
or properties are chosen. Amp intends that users start with an existing default (FP32) script,
add the three lines corresponding to the Amp API, and begin training with mixed precision.
Amp can also be disabled, in which case the original script will behave exactly as it used to.
In this way, there's no risk in adhering to the Amp API, and a lot of potential performance benefit.
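
One convenient way to keep a single script that runs with or without Amp is to gate the call behind
a flag. The sketch below assumes the ``enabled`` keyword argument of ``amp.initialize``; when it is
``False``, the Amp calls are intended to be no-ops and the script behaves like the original FP32
version::

    import torch
    from apex import amp

    use_amp = True  # flip to False to recover the unmodified FP32 script

    model = torch.nn.Linear(D_in, D_out).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    # With enabled=False, initialize (and the later scale_loss calls) pass
    # everything through unchanged.
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1", enabled=use_amp)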

.. note::
    Because it's never necessary to manually cast your model (aside from the call to ``amp.initialize``)
    or input data, a script that adheres to the new API
    can switch between different ``opt_level``\ s without having to make any other changes.

.. _`runnable, comprehensive Imagenet example`:
    https://github.com/NVIDIA/apex/tree/master/examples/imagenet

.. _`comprehensive DCGAN example`:
    https://github.com/NVIDIA/apex/tree/master/examples/dcgan

.. _`Advanced Amp Usage`:
    https://nvidia.github.io/apex/advanced.html

Properties
**********

Currently, the under-the-hood properties that govern pure or mixed precision training are the following:

- ``cast_model_type``: Casts your model's parameters and buffers to the desired type.
- ``patch_torch_functions``: Patch all Torch functions and Tensor methods to perform Tensor Core-friendly ops like GEMMs and convolutions in FP16, and any ops that benefit from FP32 precision in FP32.
- ``keep_batchnorm_fp32``: To enhance precision and enable cudnn batchnorm (which improves performance), it's often beneficial to keep batchnorm weights in FP32 even if the rest of the model is FP16.
- ``master_weights``: Maintain FP32 master weights to accompany any FP16 model weights. FP32 master weights are stepped by the optimizer to enhance precision and capture small gradients.
- ``loss_scale``: If ``loss_scale`` is a float value, use this value as the static (fixed) loss scale. If ``loss_scale`` is the string ``"dynamic"``, adaptively adjust the loss scale over time. Dynamic loss scale adjustments are performed by Amp automatically.

Again, you often don't need to specify these properties by hand. Instead, select an ``opt_level``,
which will set them up for you. After selecting an ``opt_level``, you can optionally pass property
kwargs as manual overrides.

If you attempt to override a property that does not make sense for the selected ``opt_level``,
Amp will raise an error with an explanation. For example, selecting ``opt_level="O1"`` combined with
the override ``master_weights=True`` does not make sense. ``O1`` inserts casts
around Torch functions rather than model weights. Data, activations, and weights are recast
out-of-place on the fly as they flow through patched functions. Therefore, the model weights themselves
can (and should) remain FP32, and there is no need to maintain separate FP32 master weights.
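
As a concrete illustration of a sensible override, the snippet below keeps ``O2``'s defaults but
replaces dynamic loss scaling with a static scale (the property name is the one listed above; the
value ``128.0`` is only an example)::

    model, optimizer = amp.initialize(
        model,
        optimizer,
        opt_level="O2",
        loss_scale=128.0  # override O2's default loss_scale="dynamic"
    )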

``opt_level``\ s
****************

Recognized ``opt_level``\ s are ``"O0"``, ``"O1"``, ``"O2"``, and ``"O3"``.

``O0`` and ``O3`` are not true mixed precision, but they are useful for establishing accuracy and
speed baselines, respectively.

``O1`` and ``O2`` are different implementations of mixed precision. Try both, and see
what gives the best speedup and accuracy for your model.

``O0``: FP32 training
^^^^^^^^^^^^^^^^^^^^^^

Your incoming model should be FP32 already, so this is likely a no-op.
``O0`` can be useful to establish an accuracy baseline.

| Default properties set by ``O0``:
| ``cast_model_type=torch.float32``
| ``patch_torch_functions=False``
| ``keep_batchnorm_fp32=None`` (effectively, "not applicable," everything is FP32)
| ``master_weights=False``
| ``loss_scale=1.0``
|
|

``O1``: Mixed Precision (recommended for typical use)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Patch all Torch functions and Tensor methods to cast their inputs according to a whitelist-blacklist
model. Whitelist ops (for example, Tensor Core-friendly ops like GEMMs and convolutions) are performed
in FP16. Blacklist ops that benefit from FP32 precision (for example, softmax)
are performed in FP32. ``O1`` also uses dynamic loss scaling, unless overridden.
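
You can observe the patching by checking the dtype an op returns after ``amp.initialize``. The
sketch below assumes ``torch.mm`` is on the whitelist (it is in current Apex versions); the shapes
are arbitrary::

    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

    a = torch.randn(1024, 1024, device="cuda")  # created in FP32, as usual
    b = torch.randn(1024, 1024, device="cuda")
    c = torch.mm(a, b)       # inputs are cast to FP16 on the fly for this whitelist op
    print(c.dtype)           # expected: torch.float16
    print(a.dtype, b.dtype)  # still torch.float32; casts are out-of-place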

| Default properties set by ``O1``:
| ``cast_model_type=None`` (not applicable)
| ``patch_torch_functions=True``
| ``keep_batchnorm_fp32=None`` (again, not applicable, all model weights remain FP32)
| ``master_weights=None`` (not applicable, model weights remain FP32)
| ``loss_scale="dynamic"``
|
|

``O2``: "Almost FP16" Mixed Precision
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``O2`` casts the model weights to FP16,
patches the model's ``forward`` method to cast input
data to FP16, keeps batchnorms in FP32, maintains FP32 master weights,
updates the optimizer's ``param_groups`` so that ``optimizer.step()``
acts directly on the FP32 weights (followed by FP32 master weight->FP16 model weight
copies if necessary),
and implements dynamic loss scaling (unless overridden).
Unlike ``O1``, ``O2`` does not patch Torch functions or Tensor methods.
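
Because ``O2`` keeps separate FP32 master weights, anything that needs to inspect or modify the
gradients the optimizer will actually step (gradient clipping, for example) should go through
``amp.master_params``. A minimal sketch (the clip value is only an example)::

    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()

    # Clip the gradients held by the FP32 master weights, which
    # amp.master_params(optimizer) iterates over.
    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm=1.0)

    optimizer.step()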

| Default properties set by ``O2``:
| ``cast_model_type=torch.float16``
| ``patch_torch_functions=False``
| ``keep_batchnorm_fp32=True``
| ``master_weights=True``
| ``loss_scale="dynamic"``
|
|

``O3``: FP16 training
^^^^^^^^^^^^^^^^^^^^^^

``O3`` may not achieve the stability of the true mixed precision options ``O1`` and ``O2``.
However, it can be useful to establish a speed baseline for your model, against which
the performance of ``O1`` and ``O2`` can be compared. If your model uses batch normalization,
to establish "speed of light" you can try ``O3`` with the additional property override
``keep_batchnorm_fp32=True`` (which enables cudnn batchnorm, as stated earlier).
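
For example, such a speed-of-light run might be initialized as follows (a sketch; the override is
the one named above)::

    model, optimizer = amp.initialize(
        model, optimizer, opt_level="O3", keep_batchnorm_fp32=True
    )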

| Default properties set by ``O3``:
| ``cast_model_type=torch.float16``
| ``patch_torch_functions=False``
| ``keep_batchnorm_fp32=False``
| ``master_weights=False``
| ``loss_scale=1.0``
|
|

Unified API
-----------

.. automodule:: apex.amp
.. currentmodule:: apex.amp

.. autofunction:: initialize

.. autofunction:: scale_loss

.. autofunction:: master_params

Checkpointing
-------------

To properly save and load your Amp training, we introduce ``amp.state_dict()``, which contains all ``loss_scaler``\ s and their corresponding unskipped steps, and ``amp.load_state_dict()`` to restore these attributes.

In order to get bitwise accuracy, we recommend the following workflow::

    # Initialization
    opt_level = 'O1'
    model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)

    # Train your model
    ...

    # Save checkpoint
    checkpoint = {
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'amp': amp.state_dict()
    }
    torch.save(checkpoint, 'amp_checkpoint.pt')
    ...

    # Restore
    model = ...
    optimizer = ...
    checkpoint = torch.load('amp_checkpoint.pt')

    model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)
    model.load_state_dict(checkpoint['model'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    amp.load_state_dict(checkpoint['amp'])

    # Continue training
    ...

Note that we recommend restoring the model using the same ``opt_level``, and calling the ``load_state_dict`` methods after ``amp.initialize``.

Advanced use cases
------------------

The unified Amp API supports gradient accumulation across iterations,
multiple backward passes per iteration, multiple models/optimizers,
custom/user-defined autograd functions, and custom data batch classes. Gradient clipping and GANs also
require special treatment, but this treatment does not need to change
for different ``opt_level``\ s. Further details can be found here:

.. toctree::
   :maxdepth: 1

   advanced
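
As a small taste, gradient accumulation under the unified API mirrors the pattern described in the
advanced documentation. The sketch below uses the ``delay_unscale`` keyword of ``amp.scale_loss``;
treat the accumulation count and the loss normalization as illustrative choices rather than Apex
requirements::

    iters_to_accumulate = 4  # illustrative value

    optimizer.zero_grad()
    for i, (data, target) in enumerate(data_loader):
        loss = criterion(model(data), target) / iters_to_accumulate
        step_now = ((i + 1) % iters_to_accumulate == 0)
        # On iterations where we don't step, delay gradient unscaling so that the
        # scaled gradients keep accumulating consistently in .grad.
        with amp.scale_loss(loss, optimizer, delay_unscale=not step_now) as scaled_loss:
            scaled_loss.backward()
        if step_now:
            optimizer.step()
            optimizer.zero_grad()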

Transition guide for old API users
----------------------------------

We strongly encourage moving to the new Amp API, because it's more versatile, easier to use, and future-proof. The original :class:`FP16_Optimizer` and the old "Amp" API are deprecated, and subject to removal at any time.

For users of the old "Amp" API
******************************

In the new API, ``opt_level="O1"`` performs the same patching of the Torch namespace as the old,
now-deprecated "Amp" API.

However, the new API allows static or dynamic loss scaling, while the old API only allowed dynamic loss scaling.

In the new API, the old call to ``amp_handle = amp.init()``, and the returned ``amp_handle``, are no
longer exposed or necessary. The new ``amp.initialize()`` does the duty of ``amp.init()`` (and more).
Therefore, any existing calls to ``amp_handle = amp.init()`` should be deleted.

The functions formerly exposed through ``amp_handle`` are now free
functions accessible through the ``amp`` module.

The backward context manager must be changed accordingly::

    # old API
    with amp_handle.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    ->
    # new API
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()

For now, the deprecated "Amp" API documentation can still be found on the Github README: https://github.com/NVIDIA/apex/tree/master/apex/amp. The old API calls that `annotate user functions`_ to run
with a particular precision are still honored by the new API.

.. _`annotate user functions`:
    https://github.com/NVIDIA/apex/tree/master/apex/amp#annotating-user-functions
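
For reference, those annotations look roughly like the following. The decorator names
(``amp.half_function``, ``amp.float_function``) come from the old API's README linked above;
confirm against that page before relying on them::

    from apex import amp

    # Ask Amp to run this user-defined function in FP16.
    @amp.half_function
    def fused_scores(q, k):
        return q @ k.transpose(-2, -1)

    # Ask Amp to run this numerically sensitive function in FP32.
    @amp.float_function
    def stable_log_sum_exp(x):
        return torch.logsumexp(x, dim=-1)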

For users of the old FP16_Optimizer
***********************************

``opt_level="O2"`` is equivalent to :class:`FP16_Optimizer` with ``dynamic_loss_scale=True``.

Once again, the backward pass must be changed to the unified version::

    optimizer.backward(loss)
    ->
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()

One annoying aspect of FP16_Optimizer was that the user had to manually convert their model to half
(either by calling ``.half()`` on it, or using a function or module wrapper from
``apex.fp16_utils``), and also manually call ``.half()`` on input data. **Neither of these is
necessary in the new API. No matter what ``opt_level``
you choose, you can and should simply build your model and pass input data in the default FP32 format.**
The new Amp API will perform the right conversions during
``model, optimizer = amp.initialize(model, optimizer, opt_level=....)`` based on the ``opt_level``
and any overridden flags. Floating point input data may be FP32 or FP16, but you may as well just
let it be FP32, because the ``model`` returned by ``amp.initialize`` will have its ``forward``
method patched to cast the input data appropriately.
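
A minimal sketch of that behavior under ``O2`` (the model and shapes reuse the earlier example and
are purely illustrative)::

    model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

    input = torch.randn(N, D_in, device="cuda")  # plain FP32 input, no manual .half()
    output = model(input)  # the patched forward casts the input to FP16 for you
    print(output.dtype)    # expected: torch.float16 under O2's defaults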
|