I now have the convolutional-deconvolutional VAE working as a standalone script, and am training it on LFW to see the results. The code can be found here: https://gist.github.com/kastnerkyle/f3f67424adda343fef40
I have also finished coding a convnet in pure Theano, which heavily overfits the dogs vs. cats data. See convnet.py here: https://github.com/kastnerkyle/ift6266h15
Current training stats:
Epoch 272
Train Accuracy 0.993350
Valid Accuracy 0.501600
Loss 0.002335
The architecture is:
load in data as color and resize all to 48x48
1000 epochs, batch size 128
SGD with 0.01 learning rate, no momentum
layer 1 - 10 filters, 3x3 kernel, 2x2 max pool, relu
layer 2 - 10 filters, 3x3 kernel, 1x1 max pool, relu
layer 3 - 10 filters, 3x3 kernel, 1x1 max pool, relu
layer 4 - fully connected 3610x100, relu
layer 5 - softmax
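To make the layer stack concrete, here is a rough Theano sketch of that architecture. This is not the actual convnet.py; the variable names, initialization, omission of biases, and the 2-class output layer are my own assumptions. Note how the fully connected input size of 3610 falls out of the shapes: three valid 3x3 convolutions and one 2x2 pool take a 48x48 input down to 10 feature maps of 19x19, and 10 * 19 * 19 = 3610.

import numpy as np
import theano
import theano.tensor as T
from theano.tensor.signal.downsample import max_pool_2d  # theano.tensor.signal.pool in newer Theano

rng = np.random.RandomState(1999)

def shared_w(shape):
    # small random weights; biases omitted to keep the sketch short
    return theano.shared(rng.randn(*shape).astype(theano.config.floatX) * 0.01)

X = T.tensor4('X')   # (batch, 3, 48, 48) color images
y = T.ivector('y')   # integer labels, e.g. 0 = cat, 1 = dog

w1 = shared_w((10, 3, 3, 3))
w2 = shared_w((10, 10, 3, 3))
w3 = shared_w((10, 10, 3, 3))
w4 = shared_w((3610, 100))   # 10 maps * 19 * 19 = 3610 after the convs/pooling
w5 = shared_w((100, 2))

relu = lambda a: T.maximum(a, 0.)

h1 = max_pool_2d(relu(T.nnet.conv2d(X, w1)), (2, 2), ignore_border=True)  # -> (b, 10, 23, 23)
h2 = relu(T.nnet.conv2d(h1, w2))   # 1x1 "pool" is a no-op, so it is skipped -> (b, 10, 21, 21)
h3 = relu(T.nnet.conv2d(h2, w3))   # -> (b, 10, 19, 19)
h4 = relu(T.dot(h3.flatten(2), w4))
p_y = T.nnet.softmax(T.dot(h4, w5))

loss = T.nnet.categorical_crossentropy(p_y, y).mean()
params = [w1, w2, w3, w4, w5]
grads = T.grad(loss, params)
updates = [(p, p - 0.01 * g) for p, g in zip(params, grads)]  # plain SGD, lr 0.01, no momentum

train = theano.function([X, y], loss, updates=updates)
predict = theano.function([X], T.argmax(p_y, axis=1))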
The next step is quite obviously to add dropout. With this much overfitting, I am hopeful that this architecture can get me above 80%. Other things to potentially add include ZCA preprocessing, maxout instead of relu, network-in-network, inception layers, and more. I am also considering bumping the default image size to 64x64, random subcrops, image flipping, and other preprocessing tricks.
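For the dropout step, a minimal inverted-dropout sketch in Theano is below. Again, this is my own sketch rather than code from convnet.py: units are kept with probability 1 - p at train time and activations are rescaled by 1 / (1 - p), so the test-time expression can be left unchanged.

import theano
import theano.tensor as T
from theano.tensor.shared_randomstreams import RandomStreams

srng = RandomStreams(seed=1999)

def dropout(activations, p=0.5):
    # keep each unit with probability 1 - p and rescale (inverted dropout)
    keep = 1. - p
    mask = srng.binomial(n=1, p=keep, size=activations.shape,
                         dtype=theano.config.floatX)
    return activations * mask / keep

# Usage (train-time graph only), e.g. on the fully connected layer:
#   h4_train = dropout(relu(T.dot(h3.flatten(2), w4)), p=0.5)
# and keep the plain h4 expression for the test/validation graph.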
Once above 80%, I want to experiment with some of the "special sauce" from Dr. Ben Graham - fractional max pooling and spatially sparse convolution. His minibatch dropout also seems quite nice!
I think you should definitely bump up the input image to 64x64, and perhaps even more. Otherwise it will be impossible to tell the difference between cats and dogs, because (at low resolution) they have very similar textures, and both cats and dogs appear in all kinds of colors, shapes, and sizes. For example, people getting SOTA on ImageNet also seem to use 221x221 input images. Although it is possible to get high accuracies on a dataset like CIFAR10 (with 32x32 images), the objects in those images are very different from each other.