Subnormal Floats and IWAE

close encounters with floats of the other kind

[Yesterday I was talking with Murphy from Mount Holyoke and mentioned this anecdote about subnormal floats. It happened more than a year ago, while I was working on the U-statistics for IWVI paper (so mentally replace “recently” below with “not so recently”), but I thought it was worth sharing.]

Recently, I found an interesting issue while training Variational Autoencoders (VAEs). In particular, I was using the Importance-Weighted Autoencoder (IWAE), with gradients computed using some variations of the Doubly Reparameterized Gradient Estimator (DReG).

The figure shows the time per epoch, in seconds, for different variations of the gradient estimators/methods (different colors) as training progresses. The specifics of the methods are not relevant here; what’s interesting is that the curves were not flat! Somehow, for some methods, the amount of computation per epoch increased as training progressed and then decreased again.

What exactly caused this extra computation? It turns out that the culprits were some tiny objects called subnormal floats. A float is usually represented in a form like \(1.3493 \times 10^{32}\), where the exponent is chosen so that the leading digit is non-zero. With a 32-bit float, the smallest (in absolute value) such number is roughly \(1.17549 \times 10^{-38}\); to be precise, that is the smallest number that can be represented with a non-zero leading digit. Anything smaller than that has a leading digit of zero and is encoded with an all-zero exponent field. These are the subnormal (aka denormal) numbers (https://en.wikipedia.org/wiki/Subnormal_number). Operations on these numbers carry a performance penalty and can be up to 100 times slower, especially in code running on (Intel) CPUs.
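If you want to see this on your own machine, here is a small sketch I put together for illustration (not taken from any of the issues linked below). It prints the float32 threshold and crudely times elementwise multiplications on normal versus subnormal values with PyTorch on CPU; the exact slowdown, if any, depends on your CPU and on how the kernel is vectorized.

```python
import time
import torch

# Smallest positive *normal* float32; anything strictly between 0 and this
# value is stored as a subnormal.
print(torch.finfo(torch.float32).tiny)   # ~1.1755e-38

def bench(x, reps=20):
    # Time repeated elementwise multiplications on the same tensor.
    t0 = time.perf_counter()
    for _ in range(reps):
        _ = x * 1.0001   # keeps subnormal inputs subnormal
    return time.perf_counter() - t0

normal = torch.full((2000, 2000), 1e-30)   # ordinary (normal) floats
subnrm = torch.full((2000, 2000), 1e-40)   # subnormal floats
print("normal:   ", bench(normal))
print("subnormal:", bench(subnrm))         # often much slower on CPUs that
                                           # don't flush denormals
```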

We don’t need to get into the specifics of the gradient estimator. To explain this phenomenon, however, we only need to recall that it has a form like \(\sum_{i=1}^K \Bigl(\frac{w_i}{\sum_j w_j}\Bigr)^2 \cdot \mathrm{something}\). That is, we are squaring self-normalized weights. Early in the optimization, many of those numbers became very small, eventually becoming subnormal. As the optimization progressed, the smallest weights grew again, and the performance penalty disappeared. Apparently, this is not an issue on GPUs as long as you only perform matrix-related operations; if your operations involve functions like sqrt, you might still have a problem.
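As a toy illustration of that point (made-up log weights, not the actual estimator code), self-normalized weights that are themselves perfectly ordinary float32 numbers can land in the subnormal range as soon as they are squared:

```python
import torch

# Toy sketch: hypothetical log importance weights with a wide spread,
# as happens early in training.
K = 64
log_w = torch.linspace(0.0, -50.0, K)
w = torch.softmax(log_w, dim=0)            # w_i / sum_j w_j
w_sq = w ** 2                              # the squared term in the estimator

tiny = torch.finfo(torch.float32).tiny     # smallest positive normal float32
print(((w > 0) & (w < tiny)).sum().item())       # 0: no subnormals before squaring
print(((w_sq > 0) & (w_sq < tiny)).sum().item()) # several subnormals after squaring
```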

What I find fascinating is that this experiment turned out to be a great proxy for a quantity of interest: the density of particles with almost no weight (recall that this is a form of self-normalized importance weighting). This density increased early in the optimization and decreased later. It reminds me a lot of those physics experiments where the quantity of interest can only be observed or measured indirectly.

A solution to this problem is a processor flag that treats subnormal floats as zero. Flushing them to zero introduces some numerical error, but it is often negligible. I enabled the flag with PyTorch’s torch.set_flush_denormal(True), and my problem was solved.
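In code this is a one-liner; the call returns True if the hardware supports flushing (on x86 it relies on SSE3), and the snippet below just reproduces the behaviour described in the PyTorch docs:

```python
import torch

# Returns True if the hardware supports flushing denormals to zero.
print(torch.set_flush_denormal(True))

# With the flag on, a subnormal double is flushed to zero:
print(torch.tensor([1e-323], dtype=torch.float64))   # tensor([0.])

# Turn it off again and the subnormal value survives:
torch.set_flush_denormal(False)
print(torch.tensor([1e-323], dtype=torch.float64))   # ~9.88e-324
```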

You can read more about this in:

PyTorch:Issue-4737, PyTorch:Issue-2554, PyTorch:Issue-19652, https://github.com/chainer/daz, https://discuss.pytorch.org/t/matrix-multiplication-slows-down-on-cpu-single-and-double-precisions/12473, and finally, in relation to GPUs, in https://developer.nvidia.com/blog/cuda-pro-tip-flush-denormals-confidence/

For completeness, this is the same plot after setting the flag. I had to change the scale to highlight the difference in time between the different methods. The curves are now flat, as expected.

Extra references:

IWAE: Burda, Yuri, Roger Grosse, and Ruslan Salakhutdinov. “Importance Weighted Autoencoders.”

DReG: Tucker, George, et al. “Doubly Reparameterized Gradient Estimators for Monte Carlo Objectives.”