A first version of PixelCNN was trained. The model uses the same kind of architecture as in the original article by Oord et al., 2016, i.e. it does not yet use the captions or the gated convolutional layers.
The model is built as follows:
- 1 convolutional layer with a 7×7 kernel and mask ‘a’
- 4 residual blocks of 32 (triple) feature maps each (64 input, 64 output, 32 in-between; see 3rd figure for more details), with mask ‘b’
- 2 simple 1×1 convolutional layers of 32 (triple) feature maps (with mask ‘b’)
- Sigmoid layer for (binary) classification
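As a concrete illustration of the masks mentioned above, a raster-scan mask can be built as follows (a minimal NumPy sketch; the kernel sizes match the layers listed, but the function name and shapes are my own):

```python
import numpy as np

def pixelcnn_mask(kernel_size, mask_type):
    """Raster-scan mask for a masked convolution.

    Mask 'a' (first layer) blocks the centre pixel as well, so a pixel
    never sees itself; mask 'b' (later layers) lets the centre through.
    """
    k = kernel_size
    c = k // 2
    mask = np.ones((k, k), dtype=np.float32)
    mask[c, c + (1 if mask_type == 'b' else 0):] = 0.0  # same row, at/after centre
    mask[c + 1:, :] = 0.0                               # all rows below
    return mask

mask_a = pixelcnn_mask(7, 'a')  # for the first 7×7 layer
mask_b = pixelcnn_mask(3, 'b')  # e.g. inside a residual block
```

The mask is multiplied element-wise with the kernel weights before each forward pass, which is what keeps the model autoregressive.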
The loss function used was the binary cross-entropy. Each pixel of the input image was divided by 255 so that the targets lie in [0, 1].
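Concretely, the per-pixel loss on rescaled intensities looks like this (a toy NumPy sketch with made-up values, not the actual training code):

```python
import numpy as np

# Hypothetical 4x4 grayscale image with raw intensities in [0, 255]
raw = np.array([[0, 64, 128, 255]] * 4, dtype=np.float32)
target = raw / 255.0  # rescale to [0, 1]

# Dummy sigmoid output, clipped away from 0/1 for numerical stability
pred = np.clip(np.full_like(target, 0.5), 1e-7, 1 - 1e-7)

# Binary cross-entropy averaged over all pixels
bce = -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))
```

Note that the targets here are fractional "grey" values rather than hard 0/1 labels, which is exactly the situation discussed below.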
The model was initialized with Xavier initialization and trained with the Adam optimizer on batches of size 16 for 3 epochs. No regularizer or normalization technique was (yet) used. All convolutions are ‘same’ convolutions, meaning that the feature maps keep the same spatial size as the input image throughout the network.
The generated images come out very dark. I suspect the choice of loss function is the cause: using fractions of 255 as targets loses precision, and binary cross-entropy behaves best when the targets are exactly 0 or 1 (never values in between). The article used a 256-way softmax instead, but had to add noise to the data (I assume to convey a notion of “proximity” between neighboring “classes”). A regression loss such as the mean squared error therefore seems more appropriate here. The model may also underfit because it is too shallow and under-trained (the time remaining and my resources unfortunately do not allow me to add more capacity to the model).
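The point about fractional targets can be checked numerically: both losses are minimised when the prediction equals the target, but for a grey target the binary cross-entropy's minimum is strictly positive (the entropy of the target), whereas the MSE floor is zero. A small sketch, with an arbitrary grey value:

```python
import numpy as np

t = 0.3                          # an arbitrary fractional ("grey") target
p = np.linspace(0.01, 0.99, 99)  # candidate predicted probabilities

bce = -(t * np.log(p) + (1 - t) * np.log(1 - p))
mse = (p - t) ** 2

best_bce = p[np.argmin(bce)]  # both losses are minimised at p == t ...
best_mse = p[np.argmin(mse)]
floor_bce = bce.min()         # ... but the BCE minimum is the entropy of t, > 0
floor_mse = mse.min()
```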
In future versions I will attempt to implement the architecture from the second article by Oord et al., 2016 (Conditional Image Generation with PixelCNN Decoders), so as to condition on the captions as well and to see whether gated convolutional layers improve training. I will also give the mean squared error a go.
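For reference, the gated unit from that paper replaces the ReLU with a tanh/sigmoid product over two halves of the feature maps; a minimal NumPy sketch (the channel-axis split convention is my assumption):

```python
import numpy as np

def gated_activation(features):
    """Gated unit: split the feature maps into two halves along the
    channel axis, then combine them as tanh(f) * sigmoid(g)."""
    f, g = np.split(features, 2, axis=0)
    return np.tanh(f) * (1.0 / (1.0 + np.exp(-g)))

# Toy input: 4 feature maps of size 2x2 -> 2 gated output maps
feats = np.zeros((4, 2, 2))
out = gated_activation(feats)
```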
To accelerate generation, it would also be nice to do parallel sampling as suggested by Reed et al., 2017. We could sample 16 pixels at a time, each of them 8 pixels apart (since the first convolution uses 7×7 kernels), making image generation 16 times faster. However, this requires training the model with the non-trivial mask from the multi-scale solution, which will take some thought to implement.
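To make the 16-pixel claim concrete, here is how the interleaved sampling groups could be enumerated on a hypothetical 32×32 image (the image size is my own illustration, not fixed by the post):

```python
# Hypothetical 32x32 image; pixels whose coordinates agree modulo 8 are at
# least 8 apart, so (given the multi-scale mask) they could be sampled together.
H = W = 32
stride = 8
groups = {}
for r in range(H):
    for c in range(W):
        groups.setdefault((r % stride, c % stride), []).append((r, c))

n_groups = len(groups)                  # sequential sampling steps
pixels_per_group = len(groups[(0, 0)])  # pixels sampled in parallel
```

With 16 pixels per step, the number of sequential sampling steps drops from 32×32 = 1024 to 64, i.e. the 16× speed-up mentioned above.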