captionbot.ai works pretty well.

It is probably a CNN hooked up to an RNN (LSTM or GRU). The CNN transforms the image into a feature vector. The feature vector is then fed into the RNN, which generates the caption one word at a time.
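
To make the encoder/decoder idea concrete, here is a toy sketch in plain Python. It is not how captionbot.ai is actually built (that is not public, as far as I know); everything in it is an assumption for illustration: the tiny `VOCAB`, the one-layer "CNN", the plain tanh RNN standing in for an LSTM/GRU, and the random untrained weights. It only shows how data flows from image to feature vector to words, so the caption it produces is gibberish.

```python
import math
import random

# Hypothetical tiny vocabulary; a real system would have thousands of words.
VOCAB = ["<start>", "<end>", "a", "dog", "on", "grass"]

def conv2d_relu(image, kernel):
    """Valid 2-D convolution (cross-correlation) followed by ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            s = sum(image[i + a][j + b] * kernel[a][b]
                    for a in range(kh) for b in range(kw))
            row.append(max(0.0, s))
        out.append(row)
    return out

def encode(image, kernels):
    """Toy CNN encoder: one conv layer, then global average pooling.
    Returns one number per kernel, i.e. the image feature vector."""
    feats = []
    for k in kernels:
        fmap = conv2d_relu(image, k)
        feats.append(sum(map(sum, fmap)) / (len(fmap) * len(fmap[0])))
    return feats

def matvec(W, x):
    """Matrix-vector product on nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def decode(features, params, max_len=8):
    """Toy RNN decoder with greedy decoding (plain tanh RNN, not LSTM/GRU)."""
    W_init, W_xh, W_hh, W_hy, embed = params
    # The image feature vector initialises the hidden state.
    h = [math.tanh(v) for v in matvec(W_init, features)]
    token = VOCAB.index("<start>")
    caption = []
    for _ in range(max_len):
        x = embed[token]  # embedding of the previous word
        h = [math.tanh(a + b)
             for a, b in zip(matvec(W_xh, x), matvec(W_hh, h))]
        logits = matvec(W_hy, h)  # one score per vocabulary word
        token = max(range(len(VOCAB)), key=logits.__getitem__)
        if VOCAB[token] == "<end>":
            break
        caption.append(VOCAB[token])
    return caption

def rand_matrix(rows, cols, rng):
    """Random weights in [-1, 1]; stands in for trained parameters."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_kernels, hidden, emb = 4, 5, 3
kernels = [rand_matrix(3, 3, rng) for _ in range(n_kernels)]
params = (
    rand_matrix(hidden, n_kernels, rng),   # W_init: features -> hidden
    rand_matrix(hidden, emb, rng),         # W_xh: word embedding -> hidden
    rand_matrix(hidden, hidden, rng),      # W_hh: hidden -> hidden
    rand_matrix(len(VOCAB), hidden, rng),  # W_hy: hidden -> word scores
    rand_matrix(len(VOCAB), emb, rng),     # embed: one row per word
)
image = rand_matrix(8, 8, rng)  # stand-in for an 8x8 grayscale image

features = encode(image, kernels)
caption = decode(features, params)
print(len(features), caption)
```

In a real system the encoder would be a deep pretrained CNN, the decoder an LSTM or GRU, and both would be trained on image/caption pairs. Greedy decoding is also often replaced by beam search.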