In this post, I want to explore the idea of querying a generative black box as a way to get infinite labelled training data.

A generative black box is any program P that can generate a data vector x given a description vector (or code vector) y. It could be, for example, a 3D graphics engine: y would then be a scene description (lights, objects in the scene, position of the camera) and x would be the rendered image.

We could consider P (in our example, the graphics program) as an input to a learning procedure.

This type of setup is similar to a Helmholtz machine (HM) and to analysis by synthesis (ABS). For a good introduction, I recommend:

P can be turned into an infinite labelled training set: take any valid code y, feed it into P to get x = P(y), and collect the resulting (y, x) pairs.
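As a concrete illustration, here is a minimal Python sketch of that loop. Everything in it is a placeholder: P is a toy stand-in for a real black box (a genuine graphics engine would be far richer), and the uniform prior over codes is just the naive choice, not a recommendation.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(32, 3))  # fixed random map, part of the toy P

def P(y):
    """Toy generative black box: a 3-dim code y -> a 32-dim 'image' x."""
    return np.tanh(A @ y)

def make_dataset(sample_y, n):
    """Draw n codes, push each through P, return the labelled pairs."""
    ys = np.stack([sample_y() for _ in range(n)])
    xs = np.stack([P(y) for y in ys])
    return xs, ys

# Sampling codes uniformly is the naive choice; as discussed next,
# getting *plausible* codes is the real problem.
xs, ys = make_dataset(lambda: rng.uniform(-1, 1, size=3), n=1000)
```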

One obvious immediate question is how to get valid (and, more importantly, plausible and meaningful) ys. Not all scene descriptions are equally plausible; some might be physically impossible or nonsensical. The breeder learning algorithm takes the following approach. We need an extra set X of n data vectors {x1, x2, ..., xn}. We also have a recognition network Rw (e.g. a feedforward neural network). The idea is to recognize the extra xs with Rw to get sensible ys, and then to play around with those ys. In the paper, "playing around" consists of: perturbing each y to get y', regenerating from it to get x' = P(y'), re-recognizing x' to get y'' = Rw(x'), and updating w to reduce the error between y'' and y', i.e. learning from the labelled pair (x', y').
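Continuing the toy sketch above, here is roughly what one breeder learning update could look like. This is my reading of the loop, not the paper's code: the recognition network is collapsed to a bare linear map to keep it short, and the learning rate and noise scale are made up.

```python
def breeder_step(w, x, P, sigma=0.1, lr=0.01):
    """One breeder-style update for a linear recognizer Rw(x) = w @ x."""
    y = w @ x                                        # recognize: y = Rw(x)
    y_prime = y + sigma * rng.normal(size=y.shape)   # perturb the code
    x_prime = P(y_prime)                             # regenerate through P
    y_second = w @ x_prime                           # re-recognize: y'' = Rw(x')
    # learn on the labelled pair (x', y'): gradient of 0.5 * ||y'' - y'||^2
    return w - lr * np.outer(y_second - y_prime, x_prime)

w = 0.01 * rng.normal(size=(3, 32))  # recognizer weights
for x in xs:                         # xs plays the role of the extra set X
    w = breeder_step(w, x, P)
```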

This approach assumes that an unlabelled dataset X is available. What could we do without such a dataset? Is it possible to learn a model of P in a purely exploratory manner? Probably. I think we can turn to Schmidhuber's formal theory of fun and creativity for that.
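For what it's worth, here is one very rough way to cash that out in the toy setup: use the recognizer's one-step learning progress on a candidate code as the intrinsic reward, and query P where progress is expected to be largest. This is a crude caricature; Schmidhuber's theory measures compression progress over the agent's whole history, and the candidate-proposal scheme below is entirely made up.

```python
def learning_progress(w, y, P, lr=0.01):
    """Intrinsic reward: how much one update on (P(y), y) reduces the error."""
    x = P(y)
    err_before = np.sum((w @ x - y) ** 2)
    w_new = w - lr * np.outer(w @ x - y, x)
    err_after = np.sum((w_new @ x - y) ** 2)
    return err_before - err_after

# Exploratory loop: no unlabelled set X, only self-generated queries to P.
for _ in range(100):
    candidates = [rng.uniform(-1, 1, size=3) for _ in range(10)]
    y = max(candidates, key=lambda c: learning_progress(w, c, P))
    x = P(y)
    w = w - 0.01 * np.outer(w @ x - y, x)  # learn from the chosen query
```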