Switching to mean squared error as the loss function has not produced any improvement yet, which may indicate that the model is indeed underfitting the data. I will keep training for as many epochs as possible until the end of the project to see whether it eventually generates something.
The way this autoregressive model learns is probably not helping either. Because of how the mask and the convolution are applied, the model cannot learn from the right and bottom borders of the picture; it conditions only on the pixels above and to the left. What I would need is a clever way to apply the mask and convolution so that the model could also use those regions, ideally in a multi-scale fashion as well, so that samples could be generated faster.
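To illustrate the restriction described above, here is a minimal sketch of the standard PixelCNN-style causal mask (the `causal_mask` helper is hypothetical, written for illustration, not the project's actual code). Everything to the right of the centre pixel and every row below it is zeroed out, which is exactly why the model never conditions on the right and bottom context:

```python
import numpy as np

def causal_mask(kernel_size, mask_type="A"):
    """Build a PixelCNN-style mask for a square conv kernel.

    Pixels above the centre row, and pixels left of the centre in the
    centre row, are kept; everything to the right and below is zeroed,
    so the model only conditions on top/left context. Type "A" also
    zeroes the centre pixel (used in the first layer so a pixel never
    sees its own value); type "B" keeps it (later layers).
    """
    k = kernel_size
    mask = np.ones((k, k), dtype=np.float32)
    centre = k // 2
    # zero the centre pixel for type "A", keep it for type "B"
    mask[centre, centre + (mask_type == "B"):] = 0.0
    # zero every row below the centre row
    mask[centre + 1:, :] = 0.0
    return mask

print(causal_mask(3, "A"))
print(causal_mask(3, "B"))
```

Multiplying the convolution weights elementwise by this mask before each forward pass is what enforces the top/left-only dependency; lifting the restriction would mean redesigning this mask (or adding passes in other scan orders).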
However, given the results and the time remaining, the focus will be on implementing a way to include the captions in the learning process. If time and the available resources permit, the conditioning on the borders will be fixed as well.
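One common way to inject captions, sketched here as an assumption about the eventual implementation (the function and shapes are hypothetical, not the project's code), follows the conditional PixelCNN formulation: project the caption embedding through a learned matrix and add it as a bias to every spatial position of a feature map:

```python
import numpy as np

def conditioned_feature_map(feature_map, caption_embedding, V):
    """Add a caption-dependent bias to every spatial position,
    as in conditional PixelCNN: h = conv(x) + V @ c, with the
    bias broadcast over height and width.

    feature_map:       (channels, height, width)
    caption_embedding: (embed_dim,)
    V:                 (channels, embed_dim), learned projection
    """
    bias = V @ caption_embedding          # shape: (channels,)
    return feature_map + bias[:, None, None]

# toy shapes: 8 channels, a 4x4 feature map, 16-dim caption embedding
rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 4, 4))
cap = rng.standard_normal(16)
V = rng.standard_normal((8, 16))
out = conditioned_feature_map(fmap, cap, V)
print(out.shape)  # (8, 4, 4)
```

Because the bias is constant across spatial positions, this conditions the whole generation on the caption without interfering with the causal masking; a per-position conditioning (e.g. from a spatial attention map) would be a heavier alternative.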