# Deriving contrastive divergence

For a long time, I did not get how contrastive divergence (CD) works. I was stumped by the bracket notation, and by “maximizing the log probability of the data”.

This made everything clearer: http://www.robots.ox.ac.uk/~ojw/files/NotesOnCD.pdf.

Local copy here (in case website is down)

The only math needed is integrals, partial derivatives, sum, products, and the derivative of the log of an arbitrary
function. `(log u)' = u' / u`

.

Just so the math fits on one screen, I’ll put the equations below.

`x`

is a data point. `f(x; Θ)`

is our function. `Θ`

is a vector of model parameters.

`Z(Θ)`

is the partition function.

We learn our model parameters, `Θ`

, by maximizing the probability of a training set of data,
`X = x1,..,K`

, given as

Which is the same as minimizing the energy `E(X; Θ)`

We derive the energy with respect to the model parameters, `Θ`

Quick explanation for (7): for first bit `d log Z(Θ)`

, we just put the `d/dΘ`

on top of it. For the second bit,
we put the `d/dΘ`

inside the sum.

Quick explanation for (8): The sum is rewritten with the bracket notation.

Here we calculate the first bit of the energy derivative of (7)

(9): We use `(log u)' = u' / u`

(9 -> 10): We use the definition of Z. See (2)

(10-> 11): Easy

(11 -> 12): We use `(log u)' = u' / u`

again.

(12 -> 13): We use the definition of p. See (1)