In the final experiment, I performed a hyper-parameter search for the unsupervised pre-training. I restricted the search to my DAE model with one hidden layer, since the other models did not give good results. First, I checked whether tying the auto-encoder's weights makes any difference, so I ran two experiments with the DAE model, one with ZCA and one with GCN pre-processing. Here are the results:
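To make the tied-weights comparison concrete, here is a minimal NumPy sketch of a one-hidden-layer denoising auto-encoder with an optional tied decoder. The class name, masking-noise level, and sigmoid units are illustrative assumptions, not the exact configuration used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DAE:
    """One-hidden-layer denoising auto-encoder (illustrative sketch)."""

    def __init__(self, n_vis, n_hid, tied=True):
        self.tied = tied
        self.W = rng.normal(0.0, 0.01, (n_vis, n_hid))        # encoder weights
        # With tied weights the decoder reuses W.T; otherwise it gets
        # its own, independently trained matrix.
        self.W_dec = None if tied else rng.normal(0.0, 0.01, (n_hid, n_vis))
        self.b_hid = np.zeros(n_hid)
        self.b_vis = np.zeros(n_vis)

    def reconstruct(self, x, noise=0.2):
        x_corrupt = x * (rng.random(x.shape) > noise)         # masking noise
        h = sigmoid(x_corrupt @ self.W + self.b_hid)          # encode
        dec = self.W.T if self.tied else self.W_dec
        return sigmoid(h @ dec + self.b_vis)                  # decode
```

Tying halves the number of weight parameters, which acts as a mild regularizer; whether that helps depends on the data, which is what the two experiments above probe.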
Tied weights clearly hurt with GCN pre-processing, while with ZCA they improve the training results but worsen the validation results; that is, the model overfits slightly with ZCA. With some regularization, however, tied weights might still improve the results. Next, I took the best GCN and ZCA models and changed the encoder function. As mentioned in “Practical Recommendations for Gradient-Based Training of Deep Architectures”, Section 3.2, subsection “Neuron non-linearity”, rectifiers in the decoder are not useful because when a unit's input is below zero, the gradient does not propagate through the network. So I only tried rectified linear units (ReLU) and softplus, which behaves very similarly to ReLU (both as the encoder).
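The gradient argument can be checked numerically: for negative pre-activations the ReLU derivative is exactly zero, so no error signal flows back through those units, while softplus, log(1 + e^x), keeps a strictly positive gradient everywhere. A quick NumPy illustration:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, exactly 0 otherwise.
    return (x > 0).astype(float)

def softplus(x):
    # Smooth approximation of ReLU.
    return np.log1p(np.exp(x))

def softplus_grad(x):
    # Derivative of softplus is the sigmoid, which is never zero.
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu_grad(x))      # [0. 0. 1. 1.]  -> negative inputs get no gradient
print(softplus_grad(x))  # all strictly positive
```

This is why the softplus decoder can still learn on examples that drive its units negative, whereas a ReLU decoder cannot.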
Here are the results for the model with GCN pre-processing. In both experiments I had to set the learning rate to 1e-5; training diverged with higher learning rates.
As the results demonstrate, sigmoid units work better with GCN.
I ran the same experiment with ZCA pre-processing. With softplus units I used a learning rate of 1e-4, and with ReLU units a learning rate of 1e-7; learning diverged with higher values of the learning rate.
As with GCN, ZCA gave worse results with non-sigmoid units. In general, the pre-training was very time-consuming and took several hours to complete. I set the number of iterations high because pre-training tries to model the distribution of the data, which is a more complex task than the regression performed in the supervised part of training. As a result, I could not perform an extensive hyper-parameter search. Yoshua also suggested doing unsupervised pre-training with the convolutional models, which should work better, especially for images with large dimensions. Time constraints did not allow exploring that avenue, but it would likely improve the results.
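For reference, the two pre-processing steps compared throughout can be sketched in a few lines of NumPy. This is only a sketch under standard definitions of GCN and ZCA; the `eps` regularization values are illustrative assumptions, not the ones used in the experiments:

```python
import numpy as np

def gcn(X, eps=1e-8):
    """Global contrast normalization: per-example mean-subtraction
    followed by scaling each example to unit norm."""
    X = X - X.mean(axis=1, keepdims=True)
    norms = np.sqrt((X ** 2).sum(axis=1, keepdims=True)) + eps
    return X / norms

def zca_whiten(X, eps=1e-2):
    """ZCA whitening: decorrelate features while staying as close as
    possible to the original (e.g. pixel) space."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / Xc.shape[0]
    U, S, _ = np.linalg.svd(cov)
    # Rotate into the eigenbasis, rescale, rotate back.
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T
    return Xc @ W
```

GCN is a cheap per-example operation, while ZCA requires an SVD of the feature covariance, which is one reason the pipeline becomes expensive for high-dimensional images.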
In both Kaggle contests, pre-processing had a considerable impact on the results, more noticeable than other adjustments to the algorithms (e.g., the choice of extensions, regularizer, drop-out, etc.), which is also related to the tasks we were dealing with. Among the models, I would say convolutional models were more influential than the others, and this was quite evident in the results of the previous competition. However, a fine-tuned MLP model could also achieve reasonably good results.