More news on the latest model

After training for about a hundred epochs, the PixelCNN/autoencoder model achieved fairly good results:

Unfortunately, it suffers from the same problem as the autoencoder models: the generated images stay blurry.

I added a residual block to the network because I calculated that, with my current architecture, at least 14 residual blocks are needed for the 2×2 middle part to be conditioned on the initial contour. I plan to add a few more to see whether that helps with the lack of sharpness in the images.

I also tried to add caption information to the input by embedding the text with a char-CNN-GRU, as suggested by Reed et al. 2017 in their experiments on the Caltech-UCSD Birds dataset. To keep things simple, I embedded the text in a 64×64 2D space by first embedding it as a vector of length 64 and then taking the outer product of that vector with itself. Then I simply added the resulting map as an extra channel to the image. This way I can train the whole network at once rather than using a separate score for the text embedding and for the image generation. Unfortunately the results are not (yet) very good:
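The outer-product trick described above can be sketched in a few lines of NumPy. This is only an illustration, not the code used for the project: the function names are hypothetical, the image is assumed to be 64×64 RGB in channels-last layout, and random values stand in for the real char-CNN-GRU output.

```python
import numpy as np

def caption_channel(text_vec):
    """Turn a length-64 text embedding into a 64x64 map via an outer product."""
    text_vec = np.asarray(text_vec, dtype=np.float32)
    return np.outer(text_vec, text_vec)

def add_caption_channel(image, text_vec):
    """Append the caption map as an extra channel (channels-last layout assumed)."""
    cmap = caption_channel(text_vec)
    if image.shape[:2] != cmap.shape:
        raise ValueError("image and caption map must share height and width")
    return np.concatenate([image, cmap[..., None]], axis=-1)

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3), dtype=np.float32)       # stand-in for an RGB input
text_vec = rng.standard_normal(64).astype(np.float32)   # stand-in for the text embedding
conditioned = add_caption_channel(image, text_vec)
print(conditioned.shape)  # (64, 64, 4)
```

One thing worth noting about this construction: the outer product of a vector with itself is a symmetric, rank-one matrix, so the 64×64 channel is fully determined by the 64 numbers of the embedding.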

Here are the architecture details:

Encode a randomly chosen caption (there are 5 to 7 per image) as 256 one-hot vectors, each representing one character of the caption.
Apply a 1D ‘same’ convolution with a 5×1 kernel to obtain 64 features per “character”.
Halve the input with a stride-2 convolution using 4×1 kernels (still 64 features, but now 128 “characters”).
Apply a ‘same’ convolution again to encode each “character” in 32 features.
Halve the input again to obtain 64 “characters”, each encoded as a vector of 32 units.
Pass this through 2 recurrent layers of gated recurrent units (GRUs):
- one going from left to right
- the other going from right to left
Concatenate the 64 features obtained into one for each “character”.
Perform a final dot product with a 64×64 matrix of parameters, as a form of attention.
Take the outer product of the result with itself.
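To check that the sequence lengths in the steps above line up (256, then 128, then 64), here is a small shape trace using the standard output-length formula for a 1D convolution. Several values are assumptions, since the list does not state them: a padding of 1 on the stride-2 convolutions so that the length halves exactly, a 5×1 kernel for the second ‘same’ convolution, and the same 4×1 stride-2 settings for the second halving. The helper names are hypothetical.

```python
def conv1d_out_len(length, kernel, stride=1, padding=0):
    """Output length of a 1D convolution (standard formula, no dilation)."""
    return (length + 2 * padding - kernel) // stride + 1

def trace_text_encoder(seq_len=256):
    """Return (step, sequence length, features per position) for each layer."""
    shapes = [("one-hot caption", seq_len, None)]  # alphabet size is not stated
    # 'same' convolution with a 5x1 kernel -> 64 features per character
    n = conv1d_out_len(seq_len, kernel=5, stride=1, padding=2)
    shapes.append(("conv 5x1, same", n, 64))
    # stride-2 convolution with 4x1 kernels; padding=1 is an assumption
    n = conv1d_out_len(n, kernel=4, stride=2, padding=1)
    shapes.append(("conv 4x1, stride 2", n, 64))
    # second 'same' convolution -> 32 features; kernel size 5 is an assumption
    n = conv1d_out_len(n, kernel=5, stride=1, padding=2)
    shapes.append(("conv same", n, 32))
    # second halving; 4x1 kernel, stride 2 and padding 1 are assumptions
    n = conv1d_out_len(n, kernel=4, stride=2, padding=1)
    shapes.append(("conv stride 2", n, 32))
    # the recurrent layers keep the sequence length; hidden size is not stated
    shapes.append(("two GRU layers", n, None))
    return shapes

for name, length, feats in trace_text_encoder():
    shown = feats if feats is not None else "not stated"
    print(f"{name:<20} length={length:<4} features={shown}")
```

With these assumed paddings, the trace ends with 64 positions, which matches the 64 “characters” fed to the recurrent layers.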

Why all of the generations come out the same color remains a mystery for the moment. It seems that the text embedding architecture somehow limits the capacity of the model. I plan to experiment with different feature sizes, perhaps replace the stride-2 convolutions with 1D max-pooling, or generate the 2D embedding space differently (rather than by taking the outer product).
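For reference, here is a minimal NumPy sketch of what 1D max-pooling would do in place of a stride-2 convolution. It halves the sequence length in the same way but has no learned parameters. The window size of 2, the (length, features) layout, and the function name are assumptions made for this illustration.

```python
import numpy as np

def max_pool_1d(x, window=2):
    """Non-overlapping 1D max-pooling over the first axis of a (length, features) array."""
    length, features = x.shape
    usable = (length // window) * window  # drop a trailing remainder, if any
    return x[:usable].reshape(length // window, window, features).max(axis=1)

x = np.arange(8 * 2, dtype=np.float32).reshape(8, 2)  # toy sequence: 8 positions, 2 features
pooled = max_pool_1d(x)
print(pooled.shape)  # (4, 2)
```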

Training the whole network jointly might also be too hard a task. I will try to think of a way to train the text parameters separately from the image parameters.

Conditional Image Generation

Mariane Maynard's blog for the IFT6266 project
