Fine tuning of unsupervised pretraining

In the final experiment, I did some hyper-parameter search for the unsupervised pre-training. I only searched over the hyper-parameters of my DAE model with one hidden layer, since the other models did not give very good results. First, I checked whether tying the weights of the auto-encoder changes anything. So, I ran two experiments for my DAE model with ZCA and GCN pre-processing. Here are the results:

Preprocessing ZCA GCN ZCA GCN
Tied Weights Yes Yes No No
train_objective 11.45151233 49.88327026 23.5317306519 25.3235340118
valid_objective 138.7301025 106.3289260 117.575469971 82.4680023193

Tied weights definitely hurt with GCN pre-processing. With ZCA they improve the training objective but make the validation objective worse, so the model overfits a little with ZCA; some regularization may improve those results. Next, I took the best models for GCN and ZCA and changed the encoder function. As mentioned in “Practical Recommendations for Gradient-Based Training of Deep Architectures”, Section 3.2, subsection “Neuron non-linearity”, having rectifiers in the decoder is not useful because when the input of the unit is less than zero the gradient does not propagate through the network. So, I only tried rectified linear units (ReLU) and softplus, which behaves very similarly to ReLU, and both only as the encoder.
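To make the gradient argument concrete, here is a minimal sketch in plain Python (not the pylearn2 implementation; the function names are my own) comparing the two encoder non-linearities. It only illustrates that the ReLU derivative is exactly zero for negative inputs, while the softplus derivative is the logistic sigmoid and never vanishes completely.

```python
import math

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Zero for negative pre-activations: no gradient flows back through the unit.
    return 1.0 if x > 0 else 0.0

def softplus(x):
    # Numerically stable form of log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def softplus_grad(x):
    # The derivative of softplus is the logistic sigmoid (computed stably).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

if __name__ == "__main__":
    for x in (-2.0, -0.5, 0.5, 2.0):
        print(x, relu(x), relu_grad(x), softplus(x), softplus_grad(x))
```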

Here are the results for the model with GCN pre-processing. In both experiments I had to set the learning rate to 1e-5; training diverged with higher learning rates.

Encoder ReLU Softplus
train_objective 46.9691467285 36.7554893494
valid_objective 133.35369873 142.489044189

As the results demonstrate, sigmoid units work better with GCN.

I ran the same experiment with ZCA pre-processing. For softplus units I used a learning rate of 1e-4, and for ReLU units a learning rate of 1e-7. Training diverged with higher learning rates.

Encoder ReLU Softplus
train_objective 530.757324219 4.37229394913
valid_objective 1487.81176758 228.77412414

As with GCN, ZCA got worse results with non-sigmoid units. In general, the pre-training was really time-consuming and took several hours to complete. I set the number of iterations high because pre-training tries to model the distribution of the data, which is a more complex task than the regression performed in the supervised part of the training. So, I could not perform an extensive search for hyper-parameters. Yoshua also suggested doing unsupervised pre-training with convolutional models, which should work better, especially for images with large dimensions. Time constraints did not allow me to explore that avenue, but I expect it would improve the results.

In both Kaggle contests, pre-processing had a considerable impact on the results, more noticeable than other adjustments of the algorithms (e.g. the choice of extensions, regularizer, drop-out, etc.), which is also related to the task we were dealing with. Among the models, I would say convolutional models have been more influential than the others, and this was quite evident in the results of the previous competition. However, a fine-tuned MLP model could also get reasonably good results.


Unsupervised Pretraining

In the next attempt, I used unsupervised pre-training. For this experiment, I first pre-processed the data using ZCA. The models are pre-trained using SGD with the learning rate set to 0.001 and the batch size set to 64. The unsupervised pre-training is run for 2000 epochs (each epoch does 144 updates, so in total 288,000 updates were done by SGD). I tried four different models:

(1) pre-training one layer of Gaussian-RBM and then using it in an MLP with sigmoid units in the first layer and linear units in the second layer. The number of hidden units was set to 2000, which was the size of the best MLP model I had trained before. Here is an example of unsupervised pre-training and then training the supervised model.

(2) pre-training one layer of Gaussian-RBM as the first hidden layer and one layer of binary-RBM on top of it, and then using them in an MLP with two layers of sigmoid units and one layer of linear units on top of them. The size of the first hidden layer was set to 2000 and the second layer was set to 500 units. Here is an example of unsupervised pre-training and then training the supervised model.

In both models (1) and (2), I set the initial values of the biases to zero and the init_sigma value of the GRBM layer to 0.12, which is the standard deviation of the data after applying ZCA. This parameter gets updated during learning. The weights are initialized randomly from U(-0.05, 0.05). Here are the results:

model GRBM GRBM+binary_RBM
train_objective 118.946777344 229.345596313
valid_objective 165.663757324 192.070770264

The results are not better than my previous best results. In the next two experiments, I use unsupervised pre-trained denoising auto-encoders (DAE).

(3) pre-training one layer of DAE and then using it in an MLP with sigmoid units in the first layer and linear units in the second layer. The number of hidden units was set to 2000. Here is an example of unsupervised pre-training and then training the supervised model.

(4) pre-training two layers of DAE and then using them in an MLP with two layers of sigmoid units and one layer of linear units on top of them. The size of the first hidden layer was set to 2000 and the second layer was set to 500 units. Here is an example of unsupervised pre-training and then training the supervised model.

In models (3) and (4), sigmoid units are used for both the encoder and decoder layers. The weights are tied and a binomial corruptor with corruption_level=0.5 is used. The weights of the model are initialized from U(-0.001, 0.001).
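The DAE setup just described can be sketched in plain Python as follows. This is only an illustration under my own assumptions, not the pylearn2 code: the helper names are hypothetical, the layer sizes are tiny, and the mean squared reconstruction error is an assumed cost, since the post does not say which reconstruction cost was used.

```python
import math
import random

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def corrupt(x, corruption_level, rng):
    # Binomial corruption: each input is zeroed independently
    # with probability corruption_level.
    return [0.0 if rng.random() < corruption_level else v for v in x]

def encode(x, W, b_hid):
    # W has shape (n_visible, n_hidden).
    return [sigmoid(sum(x[i] * W[i][j] for i in range(len(x))) + b_hid[j])
            for j in range(len(b_hid))]

def decode(h, W, b_vis):
    # Tied weights: the decoder reuses the transpose of the encoder matrix W.
    return [sigmoid(sum(h[j] * W[i][j] for j in range(len(h))) + b_vis[i])
            for i in range(len(b_vis))]

def reconstruction_error(x, x_hat):
    # Assumed cost: mean squared error against the clean (uncorrupted) input.
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

if __name__ == "__main__":
    rng = random.Random(0)
    n_vis, n_hid = 6, 3  # toy sizes, not the 2000 hidden units used above
    W = [[rng.uniform(-0.001, 0.001) for _ in range(n_hid)] for _ in range(n_vis)]
    x = [rng.random() for _ in range(n_vis)]
    x_tilde = corrupt(x, 0.5, rng)
    x_hat = decode(encode(x_tilde, W, [0.0] * n_hid), W, [0.0] * n_vis)
    print(reconstruction_error(x, x_hat))
```

With untied weights, the decoder would simply get its own matrix of shape (n_hidden, n_visible) instead of reusing W.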

Here is the result of the unsupervised pre-training with DAE:

model DAE_1_layer DAE_2_layer
train_objective 11.4515123367 226.965240479
valid_objective 138.730102539 195.842666626

Since GCN pre-processing was also giving good results, I ran another experiment for DAE_1_layer with GCN pre-processing. Here is the result:

preprocessing ZCA GCN
train_objective 11.4515123367 49.8832702637
valid_objective 138.730102539 106.328926086

Unsupervised pre-training seems promising. However, it does not beat the previous best results. It needs a hyper-parameter search to fine-tune the models.

Next, I will search for better hyper-parameters for the DAE with one hidden layer and check whether regularization helps reduce the validation error. If time permits, I may also look into unsupervised pre-trained DAE models with 2 hidden layers that work better than the one I got here.


Before moving to unsupervised pre-training, I tried some regularizers to check if they can improve the results, especially for ZCA, where the model was obviously over-fitting. I used drop-out in three different settings: (1) applied only on the first sigmoid layer, (2) applied only on the linear layer, and (3) applied on both layers. I tried it with both GCN and ZCA pre-processing. My model is the same as in the previous experiment, an MLP with a layer of 2000 sigmoid units and a layer of 30 linear units on top of it. I used SGD with a learning rate of 0.001. Here is an example of using ZCA and dropout. Here are the results with ZCA pre-processing and drop-out:

Drop-out applied layer Sigmoid Linear Both
train_objective 4590.81396484 396.242431641 5077.44580078
valid_objective 11949.8779297 11717.328125 12575.6894531

And here are the results with GCN pre-processing and drop-out:

Drop-out applied layer Sigmoid Linear Both
train_objective 6301.25830078 7764.77001953 7980.63720703
valid_objective 8424.89257812 9282.47753906 9296.05761719
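For reference, applying drop-out to the input of a layer can be sketched as below. This is a simplified illustration rather than the pylearn2 code; the names include_prob and scale are my own choice of parameter names (include_prob follows the "probability of inclusion" wording used elsewhere in these posts), and the default scale of 1/include_prob is an assumption.

```python
import random

def dropout(activations, include_prob, rng, scale=None):
    # Keep each unit with probability include_prob and zero it otherwise.
    # Kept units are multiplied by scale so that the expected activation
    # stays the same; 1 / include_prob is assumed as the default here.
    if scale is None:
        scale = 1.0 / include_prob
    return [a * scale if rng.random() < include_prob else 0.0
            for a in activations]

if __name__ == "__main__":
    rng = random.Random(0)
    print(dropout([0.2, 0.9, 0.5, 0.7], include_prob=0.5, rng=rng))
```

Setting (1) above corresponds to calling this on the input of the sigmoid layer only, (2) on the input of the linear layer only, and (3) on both.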

In both cases, the results are considerably worse with drop-out. Mahdi, who has been working with dropout recently, suggested not using drop-out with sigmoid units. So, I tried rectified linear units (ReLU) instead to check their performance with dropout. I checked the performance of ReLU units both with and without dropout, and in each case with GCN and with ZCA pre-processing. Drop-out was used for both layers. I used a learning rate of 1e-4 for all cases except GCN with dropout, where I had to use a learning rate of 1e-5 to get it running. Here is an example of using ReLU and dropout. Here are the results:

Drop-out used No No Yes Yes
Pre-processing GCN ZCA GCN ZCA
train_objective 115.597937797 8.75792694092 6272.90283203 5097.07080078
valid_objective 130.858248391 15905.6787109 8550.16796875 31898.7890625

ReLU layer with GCN is working better than other models. ZCA is obviously overfitting and drop-out is not helping to reduce the overfitting in these experiments. Instead, I used Weight-Decay for the GCN and ZCA experiments without dropout. I set the weight decay to 0.5 for both layers. Here is an example of using ReLU and weight decay. Here is the result:

Pre-processing GCN ZCA
train_objective 701.39440918 82.8944015503
valid_objective 775.842163086 186.633499146
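As a reminder of what the weight decay coefficient does, here is a minimal sketch assuming a plain L2 penalty: the coefficient times the sum of squared weights is added to the cost for each layer. The function names are hypothetical and this is not the pylearn2 cost class.

```python
def l2_penalty(weights, coeff):
    # coeff * sum of squared weights for one layer (weights is a list of rows).
    return coeff * sum(w * w for row in weights for w in row)

def total_cost(data_cost, layer_weights, coeffs):
    # Data term plus one L2 penalty per layer, each with its own coefficient.
    return data_cost + sum(l2_penalty(W, c)
                           for W, c in zip(layer_weights, coeffs))

def sgd_step(w, grad, learning_rate, coeff):
    # The gradient of coeff * w^2 is 2 * coeff * w, so every update
    # also shrinks the weight towards zero.
    return w - learning_rate * (grad + 2.0 * coeff * w)

if __name__ == "__main__":
    W1 = [[1.0, 2.0], [3.0, 0.0]]
    W2 = [[0.5], [-0.5]]
    # Same coefficient of 0.5 for both layers, as in the experiment above.
    print(total_cost(10.0, [W1, W2], [0.5, 0.5]))
```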

Weight decay obviously helps in the case of ZCA pre-processing. I also used weight decay with the three pre-processings applied in the previous experiment, namely GCN, ZCA and Standardize, with sigmoid units. Here are the results:

Preprocessing GCN ZCA Standardize
train_objective 192.404678345 201.947662354 218.928588867
valid_objective 187.206420898 259.510498047 220.174514771

Weight Decay helped reduce over-fitting in the model with ZCA pre-processing, but it didn’t help much with GCN. I didn’t do an extensive search (e.g. grid search) for hyper-parameter selection, e.g. in the case of the optimal value of the regularizer, due to both lack of time and GPU resources. Since the results here are not considerably better compared to my previous best result, I decided to go directly to the un-supervised pre-training experiment taking the models with GCN and ZCA pre-processing for further evaluation.

Pre-processing the data

I tried the common pre-processors to check if they can improve the results:

1) Global Contrast Normalization (with the option of removing the mean and dividing by the standard deviation):

sqrt_bias = 10 (default) sqrt_bias = 0.1
train_objective 27.7662734985 27.7245540619
valid_objective 67.9419555664 67.9540863037
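A minimal sketch of per-image global contrast normalization, assuming the common form in which sqrt_bias is added to the variance under the square root. The exact formula used by the library may differ in details such as the variance estimator, and min_divisor is a hypothetical guard I added.

```python
import math

def global_contrast_normalize(image, sqrt_bias=10.0, min_divisor=1e-8):
    # image is a flat list of pixel values for ONE image.
    n = len(image)
    mean = sum(image) / n
    centered = [p - mean for p in image]          # remove the image mean
    var = sum(c * c for c in centered) / n        # per-image variance
    divisor = math.sqrt(sqrt_bias + var)          # biased standard deviation
    if divisor < min_divisor:                     # guard against flat images
        divisor = 1.0
    return [c / divisor for c in centered]

if __name__ == "__main__":
    img = [0.0, 50.0, 100.0, 150.0, 200.0, 250.0]
    print(global_contrast_normalize(img, sqrt_bias=10.0))
    print(global_contrast_normalize(img, sqrt_bias=0.1))
```

If the pixel values are on a 0 to 255 scale (an assumption; the post does not state the range), the per-image variance is usually far larger than either sqrt_bias value, which would be consistent with the two columns above being nearly identical.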

2) Standardize: I ran two experiments:

1- removing the mean of each pixel (feature) across all images and then dividing by the standard deviation of the same feature. (global_mean=False, global_std=False)

2- removing the mean of all pixels across all images (global mean) and dividing by the standard deviation of all pixels in all images (global standard deviation). (global_mean=True, global_std=True)

global_mean & global_std False True
train_objective 31.039180148 54.3752174377
valid_objective 81.7226135125 104.83846283
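The difference between the two settings can be sketched as follows. This is my own plain-Python illustration of the description above, not the library code; std_eps is a hypothetical small constant added to avoid division by zero.

```python
import math

def standardize(X, global_mean=False, global_std=False, std_eps=1e-4):
    # X is a list of images, each a flat list of d pixel values.
    n, d = len(X), len(X[0])
    overall_mean = sum(v for row in X for v in row) / (n * d)
    col_means = [sum(row[j] for row in X) / n for j in range(d)]

    # Mean to subtract: one number for everything, or one per pixel.
    means = [overall_mean] * d if global_mean else col_means

    if global_std:
        s = math.sqrt(sum((v - overall_mean) ** 2
                          for row in X for v in row) / (n * d))
        stds = [s] * d
    else:
        stds = [math.sqrt(sum((row[j] - col_means[j]) ** 2 for row in X) / n)
                for j in range(d)]

    return [[(row[j] - means[j]) / (stds[j] + std_eps) for j in range(d)]
            for row in X]

if __name__ == "__main__":
    X = [[1.0, 10.0], [3.0, 30.0]]
    print(standardize(X))                                     # per-pixel
    print(standardize(X, global_mean=True, global_std=True))  # global
```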

3) ZCA: In order to get this pre-processing working, I had to set filter_bias to a high value of 8.0. With smaller values the ZCA computation was invalid: some eigenvalues were negative, and a high filter_bias is needed to shift them into the positive range before taking their square root. Moreover, the default number of components used by ZCA is the number of pixels, which in this competition is 96*96 = 9216. I tried reducing the number of components to observe its influence; however, it did not yield better results:

filter_bias n_components train_objective valid_objective
8.0 9216 17.3555030823 123.846374512
9.0 9216 17.6998901367 123.712715149
8.0 7680 169.795501709 192.400527954
8.0 2304 229.467651367 192.554016113
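To show where filter_bias and n_components enter the computation, here is a hedged NumPy sketch of ZCA whitening. It is not the library implementation; fit_zca and apply_zca are hypothetical names, and I am assuming the bias is added to the eigenvalues of the covariance matrix before the inverse square root is taken.

```python
import numpy as np

def fit_zca(X, filter_bias=0.1, n_components=None):
    # X has shape (n_examples, n_features).
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]             # largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    if n_components is not None:                  # keep only the top components
        eigvals, eigvecs = eigvals[:n_components], eigvecs[:, :n_components]
    # Round-off can make small eigenvalues slightly negative, which would make
    # the square root invalid; filter_bias shifts them into the positive range.
    scale = 1.0 / np.sqrt(eigvals + filter_bias)
    P = (eigvecs * scale) @ eigvecs.T             # symmetric whitening matrix
    return mean, P

def apply_zca(X, mean, P):
    return (X - mean) @ P

if __name__ == "__main__":
    rng = np.random.RandomState(0)
    X = rng.randn(200, 5) @ rng.randn(5, 5)       # toy correlated data
    mean, P = fit_zca(X, filter_bias=1e-8)
    Z = apply_zca(X, mean, P)
    print(np.round(Z.T @ Z / Z.shape[0], 3))      # close to the identity
```

A larger filter_bias also damps the amplification of low-variance directions, so it acts as a form of regularization on the whitening.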

Despite the smaller error values produced here, when I submitted the best result in each category to Kaggle, it did not beat my previous best result. I’m not sure whether the data-set at our disposal and the one on Kaggle come from the same distribution, or whether there is another explanation for this issue.

Starting with MLP on Keypoint Contest

In the first attempt, I tried using an MLP, which is the simplest tool to use. However, given the results and reports of Vincent, Gabriel, Xavier and Pierre Luc, it seems that an MLP cannot handle these problems properly. Gabriel’s experiments explore the size of the layers of MLP models, the effects of drop-out and learning rate, and also pre-processing such as standardize. So, I did not want to explore the same domains and get the same results. Moreover, the experiments of both Gabriel (using a probability of inclusion of 0.0000000005 for the first layer) and Xavier (an experiment feeding only noise to the MLP, which yields almost the same result) show that this model in its raw form is not suited for this task. All these experiments show that there is an optimization problem, which results in a high training error. Given the current problem in reducing the training error, it does not make sense to search for regularization methods at this point.

There is still one angle that I found had not been explored by others. To make sure that the choice of the MLP architecture could not make any difference, I ran some extra experiments. Here are the results:

First Layer Second Layer train_objective valid_objective
Rectifier: 500 Sigmoid: 500 234.6371 140.6402
Rectifier: 500 Tanh: 500 236.99388 136.77894
Tanh: 500 Tanh: 500 234.6358 139.3267
Sigmoid: 500 Sigmoid: 500 233.8239 142.4371
Sigmoid: 500 Rectifier: 500 234.4540 138.4603
Tanh: 500 Rectifier: 500 234.76416 138.12779
Sigmoid: 2000 nothing 215.3746 119.57048

In these experiments, the last layer is mlp.linear with 30 output units. I also tried using only Tanh or only Rectifier units before the last layer, but I was getting NaNs that I could not get rid of. Here is an example of the model I’m running here.

I submitted both the best result I got here and the one reported by Vincent. There are some improvements here, but the results are quite close. So, my conclusion is that the ‘raw’ MLP model is rather unsuitable for this task and, given the complexity of the task, the learning model should be better directed to find the keypoints through some architecture that simplifies the learning task.

Final Experiment with Maxout

In the last experiment I used Maxout networks to observe the accuracy of emotion prediction with this model. I used one Maxout layer with one softmax layer on top of it. Pierre Luc’s experiments on bigger Maxout networks resulted in overfitting, so I did not use more than one maxout layer. In the first run, I used the transformations that I had not experimented with in my previous post. Based on the observations of Xavier, Pierre Luc and Gabriel, I performed three experiments:

  • Using all transformations
  • Using all transformations except the noisy transformation
  • Using only the occlusion

Transformations All transformations All except noisy Only Occlusion
Test misclass 0.20000000298 0.1950000 0.14833334
Train misclass 0.00259172031 0.00248376 0.00032061

Unlike Xavier, Pierre Luc and Gabriel, who used “All except noisy”, my experiment got its best result using only the occlusion. It has to be said that I used Maxout instead of a CNN and I did not change the default values of these transformations.
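For readers unfamiliar with the maxout layer used in these experiments, here is a minimal sketch of what one such layer computes: each unit takes the maximum over several linear "pieces" of the input. The function name and the nested-list layout are my own, and the sizes are toy values, not the ones from my yaml file.

```python
def maxout_layer(x, W, b):
    # W[u][p] is the weight vector of piece p of unit u,
    # b[u][p] is the matching bias.
    # Each unit outputs the maximum of its linear pieces.
    return [max(sum(xi * wi for xi, wi in zip(x, W[u][p])) + b[u][p]
                for p in range(len(W[u])))
            for u in range(len(W))]

if __name__ == "__main__":
    x = [1.0, -2.0]
    W = [[[1.0, 0.0], [0.0, 1.0]],     # unit 0: max(x0, x1)
         [[-1.0, 0.0], [0.0, -1.0]]]   # unit 1: max(-x0, -x1)
    b = [[0.0, 0.0], [0.0, 0.0]]
    print(maxout_layer(x, W, b))
```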

In the second experiment, I set my occlusion transformer to the parameter values that Xavier found to perform best, which he mentioned recently in his blog. I ran three experiments with the following results:

probability 1.0 0.01 1.0 0.05
n 10 10 10 10
Low 5.2 2.0 2.0 2.0
high 10 10 10 20
Test misclass 0.1800000 0.1533333 0.170000 0.148333
Train misclass 0.001823 0.000131 0.001353 0.000320

The variables in the above table correspond to the ones described in Xavier’s blog. I got better results using the default values, which correspond to the last column. Again, it has to be mentioned that my experiments are on maxout, which may produce different results.

Next, following the results of Gabriel, I used zero padding to see if that changes the results. It reduced the test misclassification by about 0.01, as shown here:

Padding Size 0 4
Test misclass 0.14833334 0.138333335
Train misclass 0.00032061 0.000368303
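Zero padding itself is simple; the sketch below shows what a padding size of p does to a single-channel image stored as a list of rows. It is a plain-Python illustration with a hypothetical function name, not the layer option used in my yaml file.

```python
def zero_pad(image, pad):
    # Surround a 2-D image (list of rows) with `pad` rows and columns of zeros.
    width = len(image[0])
    blank = [0.0] * (width + 2 * pad)
    padded = [list(blank) for _ in range(pad)]
    for row in image:
        padded.append([0.0] * pad + list(row) + [0.0] * pad)
    padded.extend(list(blank) for _ in range(pad))
    return padded

if __name__ == "__main__":
    img = [[1.0, 2.0],
           [3.0, 4.0]]
    for row in zero_pad(img, 1):
        print(row)
```

An H x W image becomes (H + 2p) x (W + 2p), so filters can be centred on pixels at the image border.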

Then, I ran some experiments by changing the weight decay:

First Layer 0.0010 0 0 0
Second Layer 0.0010 0.0010 0.005 0
Test misclass 0.1449999 0.1383333 0.1500000 0.150000
Train misclass 0.0005913 0.000368 0.0006848 0.0009121

The weight decay works better for the second layer only.

Later, I also observed the influence of drop-outs. I ran some experiments by both including and excluding drop-outs. Since drop-outs in this network can only be used for the Maxout layer, the values in the following table are for this layer:

Drop-out value No drop-outs 0.9 0.75 0.5 0.25
Test misclass 0.1449999 0.14833332 0.1433333 0.1383333 0.1416666
Train misclass 0.0002589 0.00059273 0.1433333 0.000368 0.000778

Finally, Ian recommended using high learning rates with drop-outs. Based on that, I ran all my previous experiments with a learning rate of 0.1. To observe its influence, I ran some experiments with other values as well:

Learning rate 0.01 0.05 0.1 0.2
Test misclass 0.1966666 0.146666 0.1383333 0.143333
Train misclass 0.001442 0.00144277 0.000368 0.0004975

Drop-outs can improve the performance of the network; however, they need to be fine-tuned to work properly for a given model. Otherwise, they will deteriorate the results.


Dealing with Overfitting

In the next attempt, I ran some experiments to find ways to reduce the overfitting of my convolutional nets. First, I used drop-outs on my 2 convolutional layers and the ReLU layer. I used drop-out values of 0.25, 0.5 and 0.75 for these layers, plus one model without drop-outs.

Drop-out 0.25 0.50 0.75 No drop_out
Test misclass 0.700000 0.700000 0.700000 0.700000
Train misclass 0.7145 0.7145 0.7145 0.7145

The results show some problem in learning. The only difference between the last model and the other ones is the removal of drop-out. I even tried using drop-out only in the first convolutional layer, and in another try only in the ReLU layer (in Alex’s paper it is used in the first 2 fully connected layers), with the same probability of 0.5 and also with weight scaling, as discussed in the paper. Yet, I got the same results. In these experiments, I did not use sparse initialization, since I was specifying the range of the weights with the irange parameter. Note that only one of these can be used during training. If sparse_init is used, the weights are drawn from N(0, sparse_stdev^2); otherwise, with irange, the weights are drawn from U(-irange, irange).
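The two initialization schemes can be sketched as below. This is an illustration under assumptions, not the library code: I am assuming that sparse_init is the number of non-zero incoming weights per unit, with all other weights left at zero, and the function names are hypothetical.

```python
import random

def init_irange(n_in, n_out, irange, rng):
    # Dense initialization: every weight is drawn from U(-irange, irange).
    return [[rng.uniform(-irange, irange) for _ in range(n_out)]
            for _ in range(n_in)]

def init_sparse(n_in, n_out, sparse_init, sparse_stdev, rng):
    # Sparse initialization (assumed behaviour): each output unit gets
    # exactly `sparse_init` non-zero incoming weights drawn from
    # N(0, sparse_stdev^2); all other weights stay at zero.
    W = [[0.0] * n_out for _ in range(n_in)]
    for j in range(n_out):
        for i in rng.sample(range(n_in), sparse_init):
            W[i][j] = rng.gauss(0.0, sparse_stdev)
    return W

if __name__ == "__main__":
    rng = random.Random(0)
    dense = init_irange(8, 3, 0.05, rng)
    sparse = init_sparse(8, 3, sparse_init=2, sparse_stdev=1.0, rng=rng)
    for row in sparse:
        print(row)
```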

In the next attempt, I used sparse-initialization instead of irange. I added the sparse-initialization to the ReLU layer and the convolutional layers. The only difference is using sparse initialization instead of the irange. Here is the result:

Drop-out 0.25 0.50 0.75 No drop_out
Test misclass 0.69833 0.313333 0.2166666 0.234999
Train misclass 0.71457 0.304285 0.1791428 0.08028

I also tried sparse-initialization for the individual layers:

Sparse init ReLU Conv_1 Conv_2 Conv 1 & 2
Test misclass 0.19166 0.185000 0.260000 0.18666
Train misclass 0.00314 0.08457 0.187142 0.015714

So, in my experiments, drop-outs seem to be much more effective when combined with sparse initialization, and overall I found sparse initialization to be much more influential than drop-outs. Without it, it was difficult to get compelling results. For my further experiments, I chose the last model, since its training error is lower and its test error is almost the same as that of the model using sparse initialization only in the first convolutional layer. I also used weight decay and checked its influence on different layers:

Weight decay value for all layers 0.0005 0.0010
Test misclass 0.186666 0.200000
Train misclass 0.015714 0.106857

Then, I set the weight decay to zero for different layers, keeping it at 0.0005 for all other layers:

Weight Decay zero Layer Softmax ReLU Conv_2 Conv_1
Test misclass 0.19333 0.191666 0.181666 0.200000
Train misclass 0.060285 0.040285 0.03085 0.131428

In another attempt, I used the image-transformation techniques provided by Xavier and Pierre-Luc. I ran a couple of experiments and got a test error of 0.228333 and a train error of 0.0034285. I found their tools quite useful. However, I did not test with it extensively.

Finally, there is a technique called bagging that is used to overcome overfitting in models that have high variance. In this approach, if we have trained a couple of models, we can use all of them for prediction: we take the prediction of each model and use their majority vote. Bagging reduces variance, especially for complex models. So, if you have trained some models that are all reasonably good, you can use bagging on them. I have modified the make_submission file so that it takes a couple of models as input and creates the bagging submission file. It loads the models, takes their majority vote and writes it to a .csv file which can be submitted to the Kaggle website. I have uploaded this file on my website. Here is an example of how to use it. Give your comma-separated models (enclosed in double quotes) as input, and the result is written to a .csv file. Example:

python “model_1.pkl , model_2.pkl” result.csv
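The majority vote at the heart of that script can be sketched as follows. This is a simplified stand-alone version rather than the modified make_submission file itself: it assumes each model's predictions are already available as a list of class labels, and the function name is hypothetical.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    # predictions_per_model[m][i] is the class predicted by model m
    # for example i; all lists must have the same length.
    n_examples = len(predictions_per_model[0])
    voted = []
    for i in range(n_examples):
        votes = Counter(preds[i] for preds in predictions_per_model)
        # On a tie, the label seen first wins, i.e. the one predicted
        # by the earliest model in the list.
        voted.append(votes.most_common(1)[0][0])
    return voted

if __name__ == "__main__":
    model_1 = [0, 1, 2, 3]
    model_2 = [0, 2, 2, 3]
    model_3 = [1, 2, 2, 4]
    print(majority_vote([model_1, model_2, model_3]))
```

Using an odd number of models avoids most ties; with an even number, the order in which the models are listed decides.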

Bagging can be useful for models with diversity, like the ones using the image-transformation tool, where each model can have quite different results due to the random image manipulations. I could not experiment with it extensively, but feel free to use it if you find it useful. Here is an example of my yaml file for running these experiments.