Here are the final results of the model I used for the IFT6266 project, i.e. image generation conditioned on a contour and a caption.
The architecture was inspired by Oord et al., 2016's PixelCNN. However, it made the simplifying assumption that the pixels – and channels – of the 32×32 center image were not conditioned on one another, only on the contour. Thus defined, the model can simply be thought of as a kind of encoder-decoder using the same architecture as PixelCNN, minus the masks on the filters. In detail, the final model was composed of:
- 1 (‘same’) convolutional layer with a 7×7 kernel;
- 16 residual blocks of 32 feature maps each (64 input, 64 output, 32 in between; see the 3rd figure of the PixelCNN paper for more details);
- 2 simple 1×1 (‘same’) convolutional layers of 32 feature maps;
- a sigmoid layer for (binary) classification.
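The layers above can be sketched as follows. This is a minimal PyTorch sketch under stated assumptions: the input channel count, the padding choices, and the 3-channel sigmoid output head are guesses, since the post does not give every hyperparameter.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """PixelCNN-style residual block without masks: 64 -> 32 -> 32 -> 64 feature maps."""
    def __init__(self, channels=64, hidden=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.ReLU(), nn.Conv2d(channels, hidden, kernel_size=1),
            nn.ReLU(), nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),  # 'same'
            nn.ReLU(), nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, x):
        return x + self.body(x)  # skip connection

class ContourDecoder(nn.Module):
    def __init__(self, in_channels=3):
        super().__init__()
        # 7x7 'same' convolution, then 16 unmasked residual blocks
        self.stem = nn.Conv2d(in_channels, 64, kernel_size=7, padding=3)
        self.blocks = nn.Sequential(*[ResidualBlock() for _ in range(16)])
        # two 1x1 convolutions of 32 feature maps, then a sigmoid output
        self.head = nn.Sequential(
            nn.Conv2d(64, 32, kernel_size=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=1), nn.ReLU(),
            nn.Conv2d(32, 3, kernel_size=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.head(self.blocks(self.stem(x)))
```

Because every convolution is 'same', the spatial resolution is preserved end to end; only the masks of the original PixelCNN are dropped.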
To condition on the captions, a char-CNN-GRU was used, as in Reed et al., 2016 for their experiments on the Caltech-UCSD Birds dataset. Here is the architecture of this module in detail:
- encoding of a randomly chosen caption (since there are 5 to 7 per image) as 256 one-hot vectors, each representing a character of the caption, with each ‘bit’ of the vector corresponding to a character of the alphabet;
- a 1-d ‘same’ convolution with a 5×1 kernel to obtain 64 features per ‘character’;
- a 1-d convolution with stride 2 and 4×1 kernels (still 64 features, but 128 ‘characters’);
- a 1-d ‘same’ convolution again to encode each ‘character’ in 32 features;
- a 1-d convolution with stride 2 and 4×1 kernels again to obtain 64 ‘characters’, each encoded in a vector of 32 units;
- 2 recurrent layers, each using gated recurrent units:
- one going from left to right,
- the other going from right to left;
- concatenation of the forward and backward outputs into 64 features for each ‘character’;
- a fully connected layer to obtain a vector of length 64;
- an outer product of the vector with itself to get a 64×64 matrix.
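The pipeline above can be sketched like this. It is a PyTorch sketch under stated assumptions: the alphabet size, the exact paddings needed to hit the stated sequence lengths, and mean pooling before the fully connected layer are all guesses.

```python
import torch
import torch.nn as nn

class CaptionEncoder(nn.Module):
    """Sketch of the char-CNN-GRU caption encoder (hyperparameters assumed)."""
    def __init__(self, alphabet_size=70, max_len=256):
        super().__init__()
        self.conv = nn.Sequential(
            # 'same' conv, 5x1 kernel: 64 features x 256 positions
            nn.Conv1d(alphabet_size, 64, kernel_size=5, padding=2), nn.ReLU(),
            # stride-2, 4x1 kernel: 64 features x 128 positions
            nn.Conv1d(64, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            # 'same' conv: 32 features x 128 positions
            nn.Conv1d(64, 32, kernel_size=5, padding=2), nn.ReLU(),
            # stride-2, 4x1 kernel: 32 features x 64 positions
            nn.Conv1d(32, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
        )
        # bidirectional GRU: 32 units per direction, concatenated to 64
        self.gru = nn.GRU(32, 32, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(64, 64)

    def forward(self, onehot):                 # onehot: (batch, alphabet, 256)
        h = self.conv(onehot)                  # (batch, 32, 64)
        h, _ = self.gru(h.transpose(1, 2))     # (batch, 64, 64): forward + backward
        v = self.fc(h.mean(dim=1))             # pool over positions -> (batch, 64)
        return v[:, :, None] * v[:, None, :]   # outer product -> (batch, 64, 64)
```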
The matrix obtained is then fed to the last 2 layers of the model (the 1×1 convolutions with 32 units). Plugging it in earlier in the model did not yield good results.
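One plausible way to feed the 64×64 matrix into those last layers – an assumption, since the post does not say exactly how the two are combined – is to concatenate it as an extra feature map before the 1×1 convolution, so that every pixel sees the caption code:

```python
import torch
import torch.nn as nn

feats = torch.zeros(1, 32, 64, 64)   # activations entering the last 1x1 convs (assumed shape)
caption = torch.zeros(1, 64, 64)     # outer-product matrix from the caption encoder
x = torch.cat([feats, caption.unsqueeze(1)], dim=1)  # (1, 33, 64, 64)
conv = nn.Conv2d(33, 32, kernel_size=1)  # 1x1 conv mixes the caption channel into each pixel
y = conv(x)
```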
The final architecture was initialized with the previously trained parameters; the new parameters were initialized using Xavier initialization (except for the GRUs, which were initialized orthogonally). Here are some generations after 16 epochs:
The generations are pretty good but still blurry. The test loss (negative log-likelihood) is 1871, i.e. about 1.83 per pixel. It is difficult to say whether the captions really improved the results: the lowest loss achievable without them was about 1879. Better results could probably be achieved by training the model adversarially.
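As a quick sanity check on the reported numbers, dividing the total NLL by the 32×32 generated pixels recovers the per-pixel figure:

```python
total_nll = 1871
pixels = 32 * 32            # only the 32x32 center region is generated
per_pixel = total_nll / pixels
print(round(per_pixel, 2))  # 1.83
```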