After training for about a hundred epochs, the PixelCNN/autoencoder achieved fairly good results:
Unfortunately, it suffers from the same problem as the autoencoder models: the generated samples remain blurry.
I added a residual block to the network, since I calculated that with my current architecture at least 14 residual blocks are needed for the 2×2 middle part to condition on the initial contour. I plan to add a few more to see whether that helps with the lack of sharpness in the images.
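As a quick sanity check, the receptive field of a stack of convolutions can be computed in a few lines. This is a generic sketch, not my exact architecture: the kernel sizes and strides below are placeholders.

```python
def receptive_field(layers):
    """Receptive field of stacked convolutions.

    layers: list of (kernel_size, stride) pairs, in input-to-output order.
    Returns the side length (in input pixels) seen by one output unit.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# A residual block built from a single 3x3 stride-1 convolution widens the
# receptive field by only 2 pixels, so depth adds up slowly:
print(receptive_field([(3, 1)] * 14))  # -> 29
```

With stride-1 blocks the growth is linear in depth, which is why so many blocks are needed before a small central region can see the whole contour.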
I also tried adding caption information to the input by embedding the text with a char-CNN-GRU, as suggested by Reed et al. (2016) in their experiments on the Caltech-UCSD Birds dataset. To keep things simple, I embedded the text into a 64×64 2D space by first embedding it as a vector of length 64 and then taking the outer product of that vector with itself. I then simply added the resulting plane to the image as an extra channel. This way I can train the whole network end to end rather than optimizing separate objectives for the text embedding and the image generation. Unfortunately, the results are not (yet) very good:
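Concretely, the outer-product trick turns a length-64 caption vector into a 64×64 plane that can be stacked onto the image as a fourth channel. A minimal numpy sketch (the 64×64×3 image shape and the random vectors are stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
caption_vec = rng.normal(size=64)            # length-64 text embedding
plane = np.outer(caption_vec, caption_vec)   # (64, 64) spatial map
image = rng.normal(size=(64, 64, 3))         # assumed 64x64 RGB input
conditioned = np.concatenate([image, plane[..., None]], axis=-1)
print(conditioned.shape)  # (64, 64, 4)
```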
Here are the architecture details:
- Encode a randomly chosen caption (there are 5 to 7 per image) as 256 one-hot vectors, each representing one character of the caption.
- Do a 1D ‘same’ convolution with a 5×1 kernel to obtain 64 features per « character »
- Halve the sequence length with a stride-2 convolution using 4×1 kernels (still 64 features, but 128 « characters »)
- Perform another ‘same’ convolution to encode each « character » in 32 features
- Halve the sequence again to obtain 64 « characters », each encoded as a vector of 32 units
- Pass this through 2 recurrent layers, each using gated recurrent units:
- 1 going from left to right
- the other going from right to left
- Concatenate the two directions' outputs to obtain 64 features for each « character »
- Perform a final dot product with a 64×64 matrix of parameters as some form of attention
- Take the outer product of the resulting length-64 vector with itself to produce the 64×64 embedding plane
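The steps above can be traced shape-by-shape with random, untrained weights. This is a sketch of the tensor flow only; the alphabet size and the final reduction over characters (a mean before the outer product) are my guesses, not details fixed by the architecture description.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_same(x, k):
    """'same' 1D convolution. x: (L, Cin), k: (W, Cin, Cout) -> (L, Cout)."""
    W = k.shape[0]
    xp = np.pad(x, ((W // 2, W // 2), (0, 0)))
    return np.stack([np.tensordot(xp[i:i + W], k, axes=([0, 1], [0, 1]))
                     for i in range(x.shape[0])])

def conv1d_stride2(x, k):
    """Stride-2 1D convolution halving the sequence length."""
    W = k.shape[0]
    pad = (W - 2) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([np.tensordot(xp[2 * i:2 * i + W], k, axes=([0, 1], [0, 1]))
                     for i in range(x.shape[0] // 2)])

def gru(x, H=32):
    """Minimal GRU with random weights. x: (L, C) -> (L, H)."""
    C = x.shape[1]
    Wz, Wr, Wh = (rng.normal(0, 0.1, (C, H)) for _ in range(3))
    Uz, Ur, Uh = (rng.normal(0, 0.1, (H, H)) for _ in range(3))
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    h, out = np.zeros(H), []
    for t in range(x.shape[0]):
        z = sig(x[t] @ Wz + h @ Uz)          # update gate
        r = sig(x[t] @ Wr + h @ Ur)          # reset gate
        cand = np.tanh(x[t] @ Wh + (r * h) @ Uh)
        h = (1 - z) * h + z * cand
        out.append(h)
    return np.stack(out)

V = 70                                                     # alphabet size (assumed)
caption = np.eye(V)[rng.integers(0, V, 256)]               # (256, V) one-hot
x = conv1d_same(caption, rng.normal(0, 0.1, (5, V, 64)))   # (256, 64)
x = conv1d_stride2(x, rng.normal(0, 0.1, (4, 64, 64)))     # (128, 64)
x = conv1d_same(x, rng.normal(0, 0.1, (5, 64, 32)))        # (128, 32)
x = conv1d_stride2(x, rng.normal(0, 0.1, (4, 32, 32)))     # (64, 32)
h = np.concatenate([gru(x), gru(x[::-1])[::-1]], axis=1)   # (64, 64) bidirectional
v = (h @ rng.normal(0, 0.1, (64, 64))).mean(axis=0)        # attention-like reduction
plane = np.outer(v, v)                                     # (64, 64) extra channel
print(plane.shape)  # (64, 64)
```

Nothing here is trained; the point is only that each step produces the shapes claimed above, ending in a single 64×64 plane.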
Why all of the generations come out the same color remains a mystery for the moment. It seems the text-embedding architecture somehow limits the capacity of the model. I plan to experiment with different feature sizes, maybe replacing the stride-2 convolutions with 1D max-pooling, or generating the 2D embedding space differently (rather than by taking the outer product).
Training the whole network end to end might also simply be too hard a task. I will try to think of a way to train the text parameters separately from the image parameters.