For a long time, I did not get how contrastive divergence (CD) works. I was stumped by the bracket notation, and by “maximizing the log probability of the data”.

This made everything clearer: http://www.robots.ox.ac.uk/~ojw/files/NotesOnCD.pdf.

Local copy here (in case the website is down).

The only math needed is integrals, partial derivatives, sums, products, and the derivative of the log of an arbitrary function: (log u)' = u' / u.

Just so the math fits on one screen, I’ll put the equations below.

x is a data point. f(x; Θ) is our model's unnormalized probability function. Θ is a vector of model parameters.

Z(Θ) is the partition function.
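
As defined in the notes:

p(x; Θ) = f(x; Θ) / Z(Θ)    (1)

Z(Θ) = ∫ f(x; Θ) dx    (2)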

We learn our model parameters, Θ, by maximizing the probability of a training set of data, X = x_1, ..., x_K, given as
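
p(X; Θ) = ∏_{k=1..K} f(x_k; Θ) / Z(Θ)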

This is the same as minimizing the energy, E(X; Θ), taken here as the negative log probability averaged over the K data points:
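
E(X; Θ) = −(1/K) log p(X; Θ) = log Z(Θ) − (1/K) ∑_{k=1..K} log f(x_k; Θ)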

We differentiate the energy with respect to the model parameters, Θ:
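
∂E(X; Θ)/∂Θ = ∂ log Z(Θ)/∂Θ − (1/K) ∑_{k=1..K} ∂ log f(x_k; Θ)/∂Θ    (7)

            = ∂ log Z(Θ)/∂Θ − ⟨∂ log f(x; Θ)/∂Θ⟩_X    (8)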

Quick explanation for (7): for the first bit, log Z(Θ), we just take its derivative, ∂ log Z(Θ)/∂Θ. For the second bit, we move the ∂/∂Θ inside the sum.

Quick explanation for (8): the sum over the training points is rewritten with the bracket notation, ⟨·⟩_X, which denotes the average over the data set X.

Here we calculate the first bit of the energy derivative in (7), ∂ log Z(Θ)/∂Θ:
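
∂ log Z(Θ)/∂Θ = (1/Z(Θ)) ∂Z(Θ)/∂Θ    (9)

= (1/Z(Θ)) ∂/∂Θ ∫ f(x; Θ) dx    (10)

= (1/Z(Θ)) ∫ ∂f(x; Θ)/∂Θ dx    (11)

= (1/Z(Θ)) ∫ f(x; Θ) ∂ log f(x; Θ)/∂Θ dx    (12)

= ∫ p(x; Θ) ∂ log f(x; Θ)/∂Θ dx = ⟨∂ log f(x; Θ)/∂Θ⟩_{p(x; Θ)}    (13)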

(9): We use (log u)' = u' / u, with u = Z(Θ).

(9 -> 10): We use the definition of Z. See (2).

(10 -> 11): We move the ∂/∂Θ inside the integral.

(11 -> 12): We use (log u)' = u' / u again, now with u = f(x; Θ), to rewrite ∂f/∂Θ as f · ∂ log f/∂Θ.

(12 -> 13): We use the definition of p. See (1).
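
Putting (8) and (13) together, the gradient of the energy is

∂E(X; Θ)/∂Θ = ⟨∂ log f(x; Θ)/∂Θ⟩_{p(x; Θ)} − ⟨∂ log f(x; Θ)/∂Θ⟩_X

i.e. an expectation under the model minus an average over the data. As a sanity check on (13), here is a quick numerical sketch (my own, not from the notes) with a toy model f(x; θ) = exp(−θ x²), where ∂ log Z/∂θ should come out equal to ⟨−x²⟩ under p(x; θ), namely −1/(2θ):

import numpy as np

# Toy unnormalized model: f(x; theta) = exp(-theta * x^2).
theta = 1.5
dx = 1e-4
xs = np.arange(-10.0, 10.0, dx)

f = np.exp(-theta * xs**2)       # f(x; theta)
Z = f.sum() * dx                 # partition function Z(theta), eq. (2)
p = f / Z                        # normalized density p(x; theta), eq. (1)

# Left side: d log Z / d theta, by finite differences.
eps = 1e-6
Z_eps = np.exp(-(theta + eps) * xs**2).sum() * dx
lhs = (np.log(Z_eps) - np.log(Z)) / eps

# Right side of (13): < d log f / d theta > under p(x; theta).
dlogf = -xs**2                   # d/dtheta of log f(x; theta)
rhs = (p * dlogf).sum() * dx

print(lhs, rhs, -1.0 / (2 * theta))   # all three ≈ -0.3333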