When trying to find ways to improve the training time, I looked at others’ models and found that some classmates who got pretty good results had surprisingly simple models. For example, the other student in the class who trained PixelCNN (Sherjil Ozair) assumed that the pixels in the 32×32 patch were all generated independently, conditioned only on the border of the image. He therefore only needed to zero out the 32×32 center of the image and train the network as a conventional convolutional neural network.
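The data preparation for this setup is simple enough to sketch in a few lines of NumPy. This is my own illustration, not his code: I assume here that the full images are 64×64 RGB (the exact outer size is not stated above), and the helper name `mask_center` is made up for the example.

```python
import numpy as np

def mask_center(batch, patch=32):
    """Zero out the central patch×patch region of each image.

    batch: (N, H, W, C) array. Returns (inputs, targets): inputs have
    the center zeroed (the border is the conditioning), targets are
    the original center pixels the network must predict.
    """
    n, h, w, c = batch.shape
    top = (h - patch) // 2
    left = (w - patch) // 2
    targets = batch[:, top:top + patch, left:left + patch, :].copy()
    inputs = batch.copy()
    inputs[:, top:top + patch, left:left + patch, :] = 0
    return inputs, targets

# hypothetical 64×64 RGB batch
batch = np.random.randint(0, 256, size=(4, 64, 64, 3)).astype(np.float32)
x, y = mask_center(batch)
assert (x[:, 16:48, 16:48, :] == 0).all()
assert y.shape == (4, 32, 32, 3)
```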
Calling it PixelCNN might be an abuse of language, but the network is effective and less complicated; this simplification let me get rid of the complicated masks on W (which account for the colors) and of the 5D and 6D tensors. Computation time improved, and since the layers were thinner, I was able to train a deeper network.
Except for the masks, the architecture of this new network stayed the same; i.e. I used the residual blocks defined in Oord et al. 2016 and ‘same’ convolutions. The first 7×7 convolution is followed by 13 residual blocks of 32 (simple) feature maps (64 in, 32 in between, 64 out), and then by 2 simple 1×1 convolution layers of 32 feature maps. Except for the output layer, I used a “clipped” ReLU, where the values are kept between 0 and 256. I finally opted for cross-entropy as the loss function, since the network seemed to converge faster than with the MSE.
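The architecture above can be sketched as follows. This is a hedged PyTorch reconstruction, not the original (Theano-era) code: the input/output channel counts of the head (3 color channels in, 256 logits per channel out for the cross-entropy) are my assumptions, since those details are not spelled out above.

```python
import torch
import torch.nn as nn

def clipped_relu(x):
    # "clipped" ReLU: activations kept between 0 and 256
    return torch.clamp(x, 0.0, 256.0)

class ResBlock(nn.Module):
    """Residual block in the style of van den Oord et al. 2016:
    1x1 (64->32), 3x3 (32->32, 'same'), 1x1 (32->64), plus skip."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(64, 32, 1)
        self.conv2 = nn.Conv2d(32, 32, 3, padding=1)  # 'same' convolution
        self.conv3 = nn.Conv2d(32, 64, 1)

    def forward(self, x):
        h = clipped_relu(self.conv1(x))
        h = clipped_relu(self.conv2(h))
        return clipped_relu(x + self.conv3(h))

class SimpleCNN(nn.Module):
    def __init__(self, n_blocks=13):
        super().__init__()
        self.first = nn.Conv2d(3, 64, 7, padding=3)   # first 7x7 'same' conv
        self.blocks = nn.Sequential(*[ResBlock() for _ in range(n_blocks)])
        self.head1 = nn.Conv2d(64, 32, 1)             # two simple 1x1 layers
        self.head2 = nn.Conv2d(32, 3 * 256, 1)        # assumed: 256-way logits per channel

    def forward(self, x):
        h = clipped_relu(self.first(x))
        h = self.blocks(h)
        h = clipped_relu(self.head1(h))
        return self.head2(h)  # raw logits for the cross-entropy loss
```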
Here are the results after a few epochs:
It still needs training, but the results are already more promising than before. I later found out that I had not accounted for the filter flipping in the convolution operation of my previous model, so I was not generating the pixels in the right order (nor applying the right mask on W). This would explain why I was only getting dark generations; had I sampled the pixels from right to left and from bottom to top, I might have obtained different results.
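The flipping issue is easy to demonstrate in isolation. True convolution (which Theano’s conv op implements by default) flips the filter, while cross-correlation does not, so an asymmetric mask on W ends up looking in the opposite spatial direction. A minimal SciPy sketch, with a kernel whose single nonzero weight sits to the right of center:

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

img = np.arange(9, dtype=float).reshape(3, 3)
k = np.array([[0., 0., 0.],
              [0., 0., 1.],    # one weight, to the right of center
              [0., 0., 0.]])

conv = convolve2d(img, k, mode='same')    # true convolution: kernel is flipped
corr = correlate2d(img, k, mode='same')   # cross-correlation: no flip

# pre-flipping the kernel makes correlation match convolution
assert np.allclose(conv, correlate2d(img, k[::-1, ::-1], mode='same'))
# with an asymmetric kernel, the two operations pull from opposite sides
assert not np.allclose(conv, corr)
```

With a PixelCNN-style causal mask this is exactly the failure mode described above: a mask meant to expose only pixels above and to the left instead exposes pixels below and to the right unless it is flipped first.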