Batch normalization (and Nesterov momentum) seem to help. After only 11 epochs, a network ~50% smaller reaches equivalent validation performance.
Train Accuracy 0.874000 / Valid Accuracy 0.802000
The code for the batch normalization layer is here:
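As a rough illustration of what such a layer computes, here is a minimal NumPy sketch of the training-time forward pass, following the standard Ioffe & Szegedy formulation (the names `gamma`, `beta`, and the running-statistics bookkeeping are placeholders, not the exact implementation):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      momentum=0.9, eps=1e-5):
    # x: (batch, features) minibatch
    mu = x.mean(axis=0)                     # per-feature minibatch mean
    var = x.var(axis=0)                     # per-feature minibatch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to zero mean, unit variance
    out = gamma * x_hat + beta              # learned scale and shift
    # track running statistics for use at test time
    running_mean = momentum * running_mean + (1.0 - momentum) * mu
    running_var = momentum * running_var + (1.0 - momentum) * var
    return out, running_mean, running_var
```

At test time the stored running mean and variance replace the minibatch statistics, so predictions don't depend on batch composition.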
With the same-sized network as before, validation accuracy stays fairly consistent around 80%, but the network begins to overfit heavily. The best validation scores, with 0.95 Nesterov momentum, are:
Train Accuracy 0.875050 / Valid Accuracy 0.813800
Train Accuracy 0.967650 / Valid Accuracy 0.815800
Train Accuracy 0.992100 / Valid Accuracy 0.822000
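For reference, Nesterov momentum evaluates the gradient at a look-ahead point rather than at the current weights. A minimal sketch of one update step (mu = 0.95 as in the runs above; `grad_fn` is a stand-in for the network's gradient function):

```python
def nesterov_step(w, v, grad_fn, lr=0.01, mu=0.95):
    # gradient is taken at the look-ahead point w + mu * v
    v = mu * v - lr * grad_fn(w + mu * v)
    return w + v, v
```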
I next plan to try batch normalization on fully connected and convolutional VAEs: first on MNIST, then LFW, then probably cats and dogs. It would be nice to either a) dig into batch normalization and do it properly, *or* simplify the equations somehow, or b) do some reinforcement learning like Julian is doing, but on the minibatch selection process. However, time is short!