For a long time, I did not get how contrastive divergence (CD) works. I was stumped by the bracket notation, and by “maximizing the log probability of the data”.
This made everything clearer: http://www.robots.ox.ac.uk/~ojw/files/NotesOnCD.pdf.
Local copy here (in case the website is down).
The only math needed is integrals, partial derivatives, sums, products, and the derivative of the log of an arbitrary function:
(log u)' = u' / u.
Just so the math fits on one screen, I’ll put the equations below.
x is a data point.
f(x; Θ) is our function.
Θ is a vector of model parameters.
Z(Θ) is the partition function.
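For reference, the (1) and (2) cited below are these two definitions (reconstructed here from the linked notes, since the original equation images are missing):

    p(x; Θ) = f(x; Θ) / Z(Θ)     (1)
    Z(Θ) = ∫ f(x; Θ) dx          (2)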
We learn our model parameters, Θ, by maximizing the probability of a training set of data, X = {x_1, ..., x_K}, given as
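(reconstructed from the linked notes; each data point contributes one factor of p):

    p(X; Θ) = Π_k ( f(x_k; Θ) / Z(Θ) )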
This is the same as minimizing the energy
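(reconstructed from the linked notes: the negative log probability per data point):

    E(X; Θ) = log Z(Θ) − (1/K) Σ_k log f(x_k; Θ)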
We differentiate the energy with respect to the model parameters:
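Reconstructed from the linked notes, this derivative, equations (7) and (8), reads:

    dE(X; Θ)/dΘ = d log Z(Θ)/dΘ − (1/K) Σ_k d log f(x_k; Θ)/dΘ     (7)
                = d log Z(Θ)/dΘ − ⟨ d log f(x; Θ)/dΘ ⟩_X            (8)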
Quick explanation for (7): for the first bit, log Z(Θ), we just take the d/dΘ of it. For the second bit, we move the d/dΘ inside the sum.
Quick explanation for (8): The sum over the K training points is rewritten with the bracket notation ⟨·⟩_X, the average over the set X.
Here we calculate the first bit of the energy derivative (7), d log Z(Θ)/dΘ.
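Reconstructed from the linked notes, the chain (9)-(13) is:

    d log Z(Θ)/dΘ = (1/Z(Θ)) dZ(Θ)/dΘ                                 (9)
                  = (1/Z(Θ)) d/dΘ ∫ f(x; Θ) dx                        (10)
                  = (1/Z(Θ)) ∫ df(x; Θ)/dΘ dx                         (11)
                  = (1/Z(Θ)) ∫ f(x; Θ) d log f(x; Θ)/dΘ dx            (12)
                  = ∫ p(x; Θ) d log f(x; Θ)/dΘ dx
                  = ⟨ d log f(x; Θ)/dΘ ⟩_{p(x; Θ)}                    (13)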
(9): We use (log u)' = u' / u.
(9 -> 10): We use the definition of Z. See (2).
(10 -> 11): We move the d/dΘ inside the integral.
(11 -> 12): We use (log u)' = u' / u again, this time in the form df/dΘ = f · d log f/dΘ.
(12 -> 13): We use the definition of p. See (1).
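To convince yourself that d log Z(Θ)/dΘ really equals the expectation of d log f/dΘ under p, you can check it numerically. Below is a minimal sketch on a hypothetical toy model of my own choosing (not from the notes): f(x; Θ) = exp(Θ·x) over a small discrete domain, so the integral defining Z becomes a sum we can evaluate exactly.

```python
import math

# Hypothetical toy model: unnormalised f(x; theta) = exp(theta * x)
# over a small discrete domain, so the integral in (2) becomes a sum.
xs = [0.0, 1.0, 2.0, 3.0]
theta = 0.7

def f(x, th):
    return math.exp(th * x)

Z = sum(f(x, theta) for x in xs)           # partition function, as in (2)
p = [f(x, theta) / Z for x in xs]          # normalised probabilities, as in (1)

# Left side of (9)-(13): d log Z / dTheta, via central finite differences.
h = 1e-6
log_Z_plus = math.log(sum(f(x, theta + h) for x in xs))
log_Z_minus = math.log(sum(f(x, theta - h) for x in xs))
lhs = (log_Z_plus - log_Z_minus) / (2 * h)

# Right side, (13): the average of d log f / dTheta under p(x; theta).
# For this f, d log f / dTheta = x, so it is just the mean of x under p.
rhs = sum(pi * x for pi, x in zip(p, xs))

print(abs(lhs - rhs))  # agreement up to finite-difference error
```

The point of CD is that for real models this expectation under p(x; Θ) cannot be computed exactly, so it is approximated with a few steps of sampling; the toy domain here is small enough to do it exactly.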