A bit of background: this was a competition to blend together predictors that had been created as part of the Netflix prize. There were two tasks (an RMSE task, with the same goal as the Netflix prize, and an AUC task, where the goal was to predict a binary-valued attribute rather than to regress) and three dataset sizes: small, medium and large. The competition was decided on the average rank across the medium and large AUC and RMSE tasks, with the large results counting for twice the medium.
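For concreteness, the two metrics can be sketched as below. This is a generic illustration of RMSE and of the rank-sum (Mann–Whitney) formulation of AUC, not the competition’s actual scoring code:

```python
import math

def rmse(preds, targets):
    """Root mean squared error: the Netflix-style regression metric."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds))

def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formulation."""
    # Assign 1-based ranks to the scores, averaging ranks over tied blocks.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank for the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    pos = sum(1 for l in labels if l == 1)
    neg = len(labels) - pos
    rank_sum = sum(r for r, l in zip(ranks, labels) if l == 1)
    return (rank_sum - pos * (pos + 1) / 2) / (pos * neg)
```

The key difference for blending is that RMSE cares about the actual predicted values, while AUC only cares about their relative ordering, so a monotonic rescaling of a blend changes its RMSE but not its AUC.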
This was a better result than I had anticipated. Partly this is because some of the stronger entries on the small dataset didn’t end up submitting anything for the final competition (either because they were knowingly overfitting in a way that couldn’t be generalized, or because they used techniques that didn’t scale). I also placed much better than I had expected on the RMSE sets: second in the large RMSE and eighth in the medium RMSE. On the other hand, my AUC performance was about what I had expected: third for medium and fourth for large.
Given that the rankings of my models improved from the small dataset (where I was roughly 15th to 20th) through the medium to the large one, it appears that other teams either overfit the small dataset or used models that were effective on constrained data but sub-optimal with more abundant data.
Phil Brierley, who ran the competition, put together a report collecting the write-ups of all of the teams (though, unfortunately, there was no overall analysis). As I wasn’t present at the conference, I didn’t hear what kinds of discussions were had; it would be interesting to hear from anyone who has a summary of what was said.
Looking quickly through the reports from the other teams, we see that:
- Andrzej Janusz from the University of Warsaw was easily the winner of the contest: he came first on the medium AUC and large RMSE, and second on the others. He used a large number of models (neural nets, regressions, …) and combined them using a genetic algorithm to learn a very sparse representation.
- Vladimir Nikulin from the University of Queensland came in second. He combined random models using another sparsity-inducing technique, which examined the stability of each model’s influence on the blend and excluded models whose influence was unstable.
- Tom Au, Rong Duan, Guangqin Ma, and Rensheng Wang from AT&T Labs created a model that was significantly better than the others on the large AUC task. They added a number of extra features describing the statistical properties of each example (for instance, they fitted a beta distribution to the distribution of model outputs for each example), and used a simple boosted logistic regression to combine them. Their approach is interesting in that they didn’t attempt to obtain a sparse representation.
- C. Balakarmekan and R. Boobesh from team LatentView were the highest placed in the medium RMSE task. However, they simply averaged together two kinds of regression trees. It seems to me that a substantial amount of luck was probably involved in their result.
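As an illustration of the sparse-blending idea behind several of these entries, here is a minimal sketch loosely inspired by the genetic-algorithm approach: it evolves a binary inclusion mask over a set of model predictions, with a small penalty per included model to encourage sparsity. This is my own toy version under arbitrary parameter choices, not a reconstruction of any team’s actual method:

```python
import random

def blend_error(mask, model_preds, targets):
    """RMSE of the plain average of the selected models' predictions."""
    chosen = [p for m, p in zip(mask, model_preds) if m]
    if not chosen:
        return float("inf")  # an empty blend is invalid
    n = len(targets)
    err = 0.0
    for i in range(n):
        b = sum(p[i] for p in chosen) / len(chosen)
        err += (b - targets[i]) ** 2
    return (err / n) ** 0.5

def ga_select(model_preds, targets, pop_size=30, gens=40, p_mut=0.05, seed=0):
    """Evolve a binary inclusion mask over models, favouring sparse blends."""
    rng = random.Random(seed)
    n_models = len(model_preds)
    # Initialise a population of sparse random masks.
    pop = [[1 if rng.random() < 0.2 else 0 for _ in range(n_models)]
           for _ in range(pop_size)]
    def fitness(mask):
        # Small penalty per included model encourages sparsity.
        return blend_error(mask, model_preds, targets) + 0.001 * sum(mask)
    for _ in range(gens):
        pop.sort(key=fitness)
        survivors = pop[: pop_size // 2]   # elitism: keep the best half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_models) if n_models > 1 else 0
            child = a[:cut] + b[cut:]      # one-point crossover
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            children.append(child)
        pop = survivors + children
    return min(pop, key=fitness)
```

In practice one would evaluate the fitness on a held-out validation fold rather than the fit set, since the whole point of the sparsity is to avoid overfitting the blend.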
For me, the conclusions are:
- The RMSE metric is not useful for measuring progress on this kind of noisy task, as the amount of noise present and its distribution mean that the results are necessarily a lottery;
- I am missing from my “toolbox” a means of performing feature selection.
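One standard candidate for that toolbox is greedy forward selection: repeatedly add whichever feature (or model) most improves a validation score, and stop when no addition helps. This is a generic sketch under my own naming (the `mean_blend_rmse` scorer is just an example criterion), not something any team reported using:

```python
def mean_blend_rmse(idx, features, targets):
    """Score a subset of features by the RMSE of their plain average."""
    n = len(targets)
    err = 0.0
    for i in range(n):
        b = sum(features[j][i] for j in idx) / len(idx)
        err += (b - targets[i]) ** 2
    return (err / n) ** 0.5

def forward_select(features, targets, score, max_features=None):
    """Greedy forward selection: at each step, add the single feature
    that most improves the score; stop when no candidate improves it."""
    selected = []
    remaining = list(range(len(features)))
    best_score = float("inf")
    while remaining and (max_features is None or len(selected) < max_features):
        scored = [(score(selected + [j], features, targets), j) for j in remaining]
        s, j = min(scored)
        if s >= best_score:
            break  # no candidate improves the current subset
        best_score = s
        selected.append(j)
        remaining.remove(j)
    return selected, best_score
```

The same loop works for model selection in a blend by treating each model’s predictions as a “feature”; as with the genetic approach, the score should be computed on held-out data to keep the selection honest.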