
Friday, September 29, 2023

Paper Review: Relative gradient optimization of the Jacobian term in unsupervised deep learning



[Research Log: more Probability Density Mapping without the matrix inverse]

I finally got around to reading Relative gradient optimization of the Jacobian term in unsupervised deep learning*, which essentially provides a method for doing Probability Density Mapping (PDM) without the matrix inverse.

At first I was unsure if I liked the approach, but after a related tangent I gave it some more thought and have warmed up to it. I'll explain the key insight here with a less formal and much simpler derivation than in the paper.

Practically speaking, a PDM training implementation of a layer has a locally generated term which tries to expand the probability density through the layer's transform, and a back-propagated term from above which tries to pull the final z value back into its high-probability region. (See PDM eqn 1)

So, for instance, for a straight-through, linear scaling layer, `z = s x`, we need the gradient of our pattern probability, `p(x)`**, with respect to the scale (our free parameter), which is, for some back-propagated `b = (d p)/(d z)`:

` (d p(x))/(d s) = b x + 1/s` (1)

This implies the scale, s, is optimal when `s = -1/(b x)` (speaking of the average bx here).
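
Here's a quick numerical sanity check of (1), just a sketch of my own rather than anything from the paper: I'm assuming a standard normal prior on z (so the back-propagated term is `b = -z`, as in the footnote below) and comparing the analytic gradient against a finite difference.

```python
# Sanity check of eqn (1) for z = s x, assuming a standard normal prior on z
# (so b = d log p(z) / d z = -z). This is a toy example, not the paper's code.
import numpy as np

def log_p_x(x, s):
    # log p(x) = log N(z) + log|s|, with z = s * x (dropping the constant term)
    z = s * x
    return -0.5 * z**2 + np.log(abs(s))

def analytic_grad(x, s):
    # eqn (1): d log p(x) / d s = b x + 1/s, with b = -z here
    z = s * x
    return (-z) * x + 1.0 / s

x, s, eps = 0.7, 1.3, 1e-6
numeric = (log_p_x(x, s + eps) - log_p_x(x, s - eps)) / (2 * eps)
print(numeric, analytic_grad(x, s))  # should agree to ~1e-6
```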

Now, practically speaking, we're usually using a gradient optimizer (like SMORMS3 or whatever) which is going to non-uniformly scale the various components of our gradient by some values in order to optimize the signal-to-noise. So, strictly speaking, we're not usually going in the direction of the true gradient, but rather some skewed version of it. Presumably, as long as the dot product of our pragmatic gradient with the true gradient is positive, we're still making forward progress, and that's satisfied as long as those adaptive scaling values are positive.

All of which is to say: we're basically free to scale our gradient by any positive value (that's not a function of x) and not only are we still going categorically in the right direction, but chances are our gradient optimizer will adjust for the change anyway.

So, for instance, we might choose to scale the above gradient by s*s (which, being a square, is always positive) in order to avoid the divide:

` (d p(x))/(d s) \propto b x s^2 + s` (2)

Note that this still vanishes at `s = -1/(b x)`, so `p(x)` is still maximized there. The main difference, adaptive optimizations aside, is that instead of keeping s away from 0 by presenting a near-infinite gradient pushing away from 0 as s approaches it, this new version keeps s away from 0 simply by slowing down faster than s can get there. It is a trade-off in numerical stability, in terms of where the bad behavior is likely to be -- near 0, or at large values of s. In any event, it still works fine, a gradient optimizer is likely to undo the difference anyway, and it avoids a divide.
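
To make that concrete, here's a small sketch (treating the back-propagated b as a fixed value for illustration) showing that (1) and (2) share the same zero and the same sign everywhere, so the rescaled gradient still pushes s toward the same optimum, just at a different rate:

```python
# Compare the raw gradient (1) with the rescaled gradient (2): same zero at
# s = -1/(b x), same sign everywhere, but (2) needs no divide. Toy illustration.
import numpy as np

x = 0.7
b = -0.9                              # stand-in for the back-propagated d log p / d z
s = np.linspace(0.05, 5.0, 200)

g_raw      = b * x + 1.0 / s          # eqn (1)
g_rescaled = b * x * s**2 + s         # eqn (2) = eqn (1) scaled by s^2

assert np.all(np.sign(g_raw) == np.sign(g_rescaled))   # the direction never flips
print("both cross zero near s =", -1.0 / (b * x))
```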

That last bit isn't too important there, but you can see where it's leading: a matrix product is analytically akin to a scale, and the default PDM gradient, analogous to the above, is:

` (d p(x))/(d W) = b x^T + W^-T` (3)

(Note this is essentially equation 5 here, where `b = -z` is the gradient of the log of a standard normal.)

Which, by the same process, suggests:

` (d p(x))/(d W) \propto b x^T W^T W + W^-T W^T W = b x^T W^T W + W` (4)

And that's it -- poof, the inverse is gone.
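
As a toy illustration of what that buys (my own sketch, not the paper's code, again using a standard normal prior so `b = -z`): the update in (4) needs only matrix multiplies, and it is exactly (3) right-multiplied by `W^T W`.

```python
# Eqn (3) vs eqn (4) for a single linear layer z = W x, standard normal prior on z.
# Toy sketch: (4) avoids the matrix inverse and equals (3) right-multiplied by W^T W.
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = np.eye(d) + 0.3 * rng.standard_normal((d, d))    # start near the identity
x = rng.standard_normal((d, 1))

z = W @ x
b = -z                                                # d log p(z) / d z for a standard normal

grad_naive    = b @ x.T + np.linalg.inv(W).T          # eqn (3): needs the inverse
grad_relative = b @ x.T @ W.T @ W + W                 # eqn (4): matmuls only

print(np.allclose(grad_naive @ W.T @ W, grad_relative))   # True
```

In a real layer this would be accumulated over a batch and handed to the gradient optimizer like any other gradient.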

Now, as for these being matrices and not just scalars, it's useful to remember that the inverse of a matrix can be found by simply inverting the singular values in its SVD and transposing, which means the only difference between a matrix's transpose and its inverse is the magnitude (not the sign) of its singular values--both have identical singular vectors and so map displacements through in the same directions, plus or minus a skew. And in particular, the `W^T W` factor in (4) is symmetric, so it has no rotational component (its eigenvectors are the right singular vectors of W), and its eigenvalues are strictly positive--which assures we will never reverse the sense of our gradient. (And note also that the same singularity-avoidance logic applies: the modified version shrinks the gradient along the directions of the small singular values as they approach 0, so the gradient still avoids creating a non-invertible matrix, provided a reasonable learning rate.)
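
That direction-preservation claim is easy to check numerically: for any nonzero gradient G, the (Frobenius) dot product between G and G `W^T W` is `trace(G^T G W^T W)`, which is positive because both factors are positive (semi-)definite. A quick toy check, with G standing in for the true gradient:

```python
# Check that right-multiplying a gradient by W^T W never reverses its direction:
# <G, G W^T W> = trace(G^T G W^T W) > 0 for full-rank W and nonzero G. Toy sketch.
import numpy as np

rng = np.random.default_rng(1)
d = 8
W = rng.standard_normal((d, d))
G = rng.standard_normal((d, d))        # stand-in for the true gradient b x^T + W^-T

dot = np.sum(G * (G @ W.T @ W))        # Frobenius inner product <G, G W^T W>
print(dot > 0)                         # True
```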

And in a sense, that's all there is to it--it's that simple. The paper derives the same result as an optimal direction given the premise of the gradient being relative to `W` (but as mentioned I suspect any gradient optimizer obviates that anyway). Most usefully, they show it works in practice.

I'm still running some experiments myself to see how this compares with the orthonormalizing approach. The latter potentially makes convolutions easy (provided you cycle through a set of kernels in order to retain a 1:1 mapping) because it would enforce orthonormality of the effective full matrix, ultimately meaning synthesis could be performed by the same convolution in reverse (because the inverse would equal the transpose, and the transpose has the same structure and sparsity as the original). But I haven't yet convinced myself that it learns as well as the simple and efficient unconstrained form provided by this paper. TBD.


* I think the paper reverses z and delta in a number of equations, such that, for instance, equation 15 is wrong (typo wrong, not conceptually wrong). Their sample code has it the other way. It's possible I'm just mis-reading, but that's my impression.

** I say p(x) as a shorthand, but really it should be log p(x) everywhere.





Simon Funk / simonfunk@gmail.com