Jeremy Barnes' Blog

Learning to Rank Challenge: Yahoo Misses the Point

2010-03-16T00:00:00-07:00

A few weeks ago, Yahoo announced their Learning to Rank Challenge. Having recently done a few similar challenges, and worked with similar data in the past, I was quite excited. But since I’ve downloaded the data and looked at it, that’s turned into a sense of absolute apathy. It seems like they have done everything possible to remove all value from their data before they released it; and the fact that they released it at all in such a form shows a total misunderstanding of the value of Machine Learning research and the value proposition behind such competitions.

The Million Dollar Dataset?

The dataset that Yahoo released is, at face value, a dream dataset for anyone who’s interested in information retrieval. It is generated from about 36,000 queries and contains human relevance judgements for about 1 million URLs, or about 25 URLs per query. To generate this dataset yourself, you would need to, at the very minimum:

Determine a representative list of 36,000 queries (not an easy task);
Determine a useful set of URLs to annotate for each query (again, much more complicated than you would expect);
Build an infrastructure to allow 1 million URLs to be effectively annotated;
Decide on guidelines for annotation so that the job would be as consistent as possible (this document would invariably end up with at least 100 pages);
Assuming 1 second per URL to tag, spend 250 hours of mind-numbing work to tag them.

In real life (I’ve done a lot of this kind of work) the error rate on annotating 1 million URLs with one person in 250 hours would be far too high to be useful to train Machine Learning algorithms of any precision, especially on the less common cases. In order to do a good job, I’d expect more like 1-5 minutes per query from more than one annotator (the annotators would have to actually visit each of the URLs: it’s surprising how often the title of a page is completely unrepresentative of the content, and how often relevant material is found hiding in a completely unrelated document).

In addition, the examples that provide the most information are those where the result is close to (or defines) the decision boundary: the marginal cases. These also take a lot longer to annotate. Finally, it is usually necessary to come back to those examples which are the most surprising to the algorithms that are learning from them to ensure that the tagging of these examples is correct—many of the more powerful machine learning algorithms can be thrown off by a relatively small amount of noise as they weight examples non-uniformly.

Overall, I’d say that this dataset represents 5-20 man-years of work given the resources that Yahoo already has, and would cost anyone else somewhere between $500,000 and $2,000,000 to generate for themselves, depending upon the accuracy and how meticulously it was annotated.
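To make the annotation arithmetic above concrete, here is a back-of-envelope sketch. It covers tagging time only; the two-annotator count is an illustrative assumption (I only said "more than one" above), and none of the numbers come from Yahoo.

```python
SECONDS_PER_HOUR = 3600


def annotation_hours(items, seconds_per_item, annotators=1):
    """Total person-hours to tag `items` things at a fixed pace."""
    return items * seconds_per_item * annotators / SECONDS_PER_HOUR


# Quick pass: 1 second per URL, one annotator, 1 million URLs.
quick_pass = annotation_hours(1_000_000, 1)

# Careful pass: 1 to 5 minutes per query, 36,000 queries.
# Two annotators is an assumption chosen for illustration.
careful_low = annotation_hours(36_000, 60, annotators=2)
careful_high = annotation_hours(36_000, 5 * 60, annotators=2)

print(f"quick pass:   {quick_pass:.0f} hours")
print(f"careful pass: {careful_low:.0f} to {careful_high:.0f} hours")
```

The quick pass comes out in the same ballpark as the 250 hours quoted in the list. The careful pass is only the tagging itself: choosing the queries and URLs, building the tooling, writing the guidelines and re-checking the surprising examples all come on top of it.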

In addition, Yahoo is putting up $30,000 in prize money, for 8 prizes ranging from $8,000 to $1,000. Compared to the $1 million offered by Netflix it’s paltry, but it also reflects the paltry value of the competition to the Machine Learning community.

This Horse has Rotten Teeth

For reasons that we’ll get to in a moment, the dataset has been eviscerated of anything that could have any substantial value. Instead of releasing a dataset that can be explored and analyzed (like Netflix did in their competition) and that could contribute to the state of the art in information retrieval, Yahoo has removed almost the entire value:

The actual text of the queries isn’t released, only a query number.
The URLs that the documents correspond to haven’t been released either.
There is a feature set of about 600 features for each document, which have been obfuscated via rescaling and are only identified via a number.
Once the contest period has ended, you are obliged to delete the dataset from your disk. As a result, nobody (except Yahoo) can investigate the algorithms that others have generated at the end of the contest, nor perform follow-on work.
Instead of the typical 4-level judgement labels (1=very relevant to 4=completely irrelevant), Yahoo have just released binary labels, without saying to which categories they belong. This information is important as 4s are particularly unreliable. (Note: I was wrong about this point. The dataset does include graded relevance labels, from 0 to 4. Thanks to Jon Elsas for bringing this to my attention.)

These restrictions completely disassociate the dataset from its problem domain (information retrieval or “Learning to Rank”), and in doing so reduce the value to something approaching zero. In effect, Yahoo is outsourcing (or “crowdsourcing” if you will) the work of discovering which Machine Learning paradigm is most suitable for their data, information which is only really useful to Yahoo themselves as only they have this kind of data.

What is Machine Learning good for?

Having just made a value judgement, I should describe the tenets of my position. To me, the value of Machine Learning to society is that it allows us to solve problems that we couldn’t solve without it. This doesn’t mean that all machine learning research should be of an applied nature; far from it. But at some point, as a society, when we decide to allocate resources to Machine Learning instead of (say) to growing potatoes or curing cancer, we’re making some kind of a judgement about the value of one activity over the other.

By creating this competition, Yahoo is causing resources to be allocated away from other Machine Learning research and towards their own problem, both via a price signal (“this problem is worth $30,000 for someone to solve”) and via the competitive spirits of those involved. In my opinion, this reallocation of resources is likely to be towards solving a dead-end problem. In addition, the value that Yahoo is providing to Machine Learning in return is derisory (due to the restrictions enumerated above), and will not come close to compensating for the time spent on solving the problem. As a result, it will cause a net loss to society. And that’s a real shame.

Getting it Right

It’s instructive to compare Yahoo’s challenge with the Netflix Prize. (I didn’t participate in the Netflix Prize, though I followed it closely, and I know several people who did participate). As I argued above, the effect of Yahoo in organizing their challenge will be to allocate some Machine Learning research towards improving Yahoo’s search engine, in exactly the same way as the Netflix Contest allocated (a staggering amount of) research capacity towards the goal of improving Netflix’s recommender system.

There are big differences between these two competitions, however. Improving Netflix’s recommender system also had big benefits for the broader community. The data for Netflix was open enough that people could work directly towards the problem of recommending movies. The availability of this dataset significantly lowered the barrier to entry for researchers in the field and allowed a popularization of the entire field. Significant research findings were made by people with little formal training at all. Many predictors and techniques were discovered, some specific to movies and some of a more general utility. And we could go on and on. It was clearly a win both for Netflix and society due to the residual value that was left (and even despite the fact that the dataset couldn’t be donated to the machine learning community at large as was originally intended, due to privacy concerns).

Compare with Yahoo’s effort. Nothing will be learnt about ranking of search results. Some improvement will be made on Yahoo’s dataset (I suspect by using many different classifiers and learning to blend them sparsely), but this will not aid search engine ranking in general. At the end of the competition, only Yahoo will have benefited; teams will not even be allowed to test other teams’ solutions on the competition data. In my opinion, it’s a net loss to society (taking researchers’ attention away from more useful or fruitful problems), and only Yahoo stands to benefit. A dataset that has been neutered in this way is only really of interest to the Machine Learning community, and so there won’t be any influx of new researchers and ideas like there was with Netflix.

So isn’t it useful to have another dataset, even if it’s just a black box that will self-destruct after 3 months, to test algorithms on? This is a contentious, subjective and philosophical topic, so I won’t go too much into it here. In my opinion, the answer is no. With no link to a problem, the value to society is lost. That’s not to say that all research should be linked to a problem, far from it. But this kind of dataset simply encourages more “UCI papers” (my phrase): egoistical papers that try the author’s algorithm of choice over a large number of datasets (normally a selection from the UCI repository), with no regard for the problem that each dataset represented and only superficial analysis, and describe on which datasets the algorithm was better. (Mercifully, these have fallen out of fashion in the last 10 years.) There is plenty of interesting research being done by treating datasets as black boxes, which has very strong links to real-world problems, and certainly no shortage of black-box data.

In Yahoo’s Defence

Yahoo, in their FAQ about the competition, respond to this kind of criticism as follows:

Q. Why not release the actual queries and urls?
A. For two reasons of competitive nature:
Feature engineering is a critical component of any learning to rank system. For this reason, search engine companies rarely disclose the features they use. Releasing the queries and urls would lead to a risk of reverse engineering of our features.
Our editorial judgments are a valuable asset, and along with queries and urls, it could be used to train a ranking model. We would thus give a competitive advantage to potential competitors by allowing them to use our editorial judgments to train their model.

I’m sympathetic to these concerns. If high-tech has been a battlefield recently, the bloodiest front-line is in search, and Yahoo must be pretty battle-weary by now. To really push the metaphor, the last thing that they want is to open a second front against an upstart that got their big break with Yahoo’s own data. That being said, I don’t think that their arguments hold much water.

Let’s deal with point number 2 first, as that’s the easiest one to debunk. Yahoo, just like Netflix when they released their dataset, could simply state that nobody is allowed to use that dataset to train a commercial search engine.

Anyone who was big enough to hope to compete with Yahoo on search would be a rather large entity, for whom two million dollars (to generate the dataset themselves) would not be such a large amount of money, and who would most likely be paranoid about tainting their service with someone else’s data (or being either sued out of existence by Yahoo or sued out of their upside by their investors). I’ve heard that Google, for example, doesn’t want to hear anything about other people’s ideas (they explicitly require that people not tell them anything about their intellectual property in meetings), for fear that they’ll end up (like Microsoft in the early 2000s) with a bad reputation and bogged down in legal disputes.

On point number 1, I have to admire their frankness as I couldn’t have argued the point better myself. Feature engineering is a critical component of any learning to rank system. It’s feature engineering that provides the link between the problem domain and Machine Learning. That’s exactly why the competition they created is useless for learning to rank, as it severs this link and precludes any kind of feature engineering.

Now I can imagine that they don’t want search engine optimizers to know their features for fear that they structure their sites to exploit them. I’m also sure that Yahoo have put a lot of creative energy into designing their features, and they don’t want to give this information away for free to their competitors. Still, it strikes me as a knee-jerk reaction. Google had some of their most important features published from the beginning, and it doesn’t exactly seem to have hurt them. Indeed, a lot of useful features these days must come from subsidiary sources of data (such as feedback from click-through data), which are only available to established players. Finally, the ranking function just isn’t a major differentiator any more, and won’t be until someone comes up with some really new ideas. Witness the wrecks of wannabe Google killers littered over the landscape, most of which added something new (and had broadly comparable ranking performance) but were obliterated in the marketplace anyway.

Yahoo’s reasoning for not including the URLs and queries with the features was due to the possibility of people inferring their features from the dataset. I think that this is a bit far-fetched, though not beyond the realm of possibility (and surely the less obvious the feature, the less likely it is to be inferred like this). Again, one way to avoid this problem would be to not include any features that they considered to be part of their “special sauce”; then they could release the URLs and queries.

And why not just leave out any features that don’t match a list of obvious or broadly known features, such as Eigenvector Centrality (or similar), TF-IDF, collocation of query terms, etc? A survey of the TREC conferences would provide a good list of those that could be given away.

Finally, I’m sure that Yahoo takes the privacy of its users very seriously, and had some serious reservations about releasing the dataset for this reason. This is a laudable and sensible attitude; similar issues just forced Netflix to cancel their second competition and caused AOL all kinds of problems in 2006. However, if they were really serious about providing something of value to the community in return for free analysis of their data, 36,000 queries is not such a large number to check manually for personally identifying information. They could be checked largely automatically with a marginal extra cost over that incurred to tag the results. (If a query has been submitted to their search engine 1,000 times or more, then it’s almost certainly not a personal query. These could be double-checked quickly, leaving only the tail queries to be checked thoroughly.)
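The frequency check described in the parenthesis above could be sketched roughly as follows. The `query_log` input (one entry per submitted query) and the function name are hypothetical; I have no idea what Yahoo’s logs actually look like, and the threshold of 1,000 is just the figure suggested above.

```python
from collections import Counter


def split_head_and_tail(query_log, min_count=1000):
    """Partition distinct queries by how often they were submitted.

    Queries seen at least `min_count` times go to the head (quick
    double-check); everything else is tail (thorough manual check).
    """
    counts = Counter(query_log)
    head = {query for query, n in counts.items() if n >= min_count}
    tail = set(counts) - head
    return head, tail


# Toy log with a lowered threshold so the example stays small.
toy_log = ["weather"] * 3 + ["cheap flights"] * 3 + ["jane doe 12 elm street"]
head, tail = split_head_and_tail(toy_log, min_count=3)
print("quick check:   ", sorted(head))
print("thorough check:", sorted(tail))
```

A frequent query can still contain something sensitive, which is why the head set gets a quick look rather than no look at all.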

Dysfunctional or Paranoid?

In my opinion, Yahoo’s Learning to Rank challenge is a wasted opportunity for Yahoo to add a bit of shine to their brand and add some creative energy to their besieged engineering team, whilst simultaneously boosting an area of research that is important to their success. Instead, they come out looking cheap, out of touch and paranoid.

Don’t get me wrong: I think that Machine Learning competitions are a positive development and have their place. Competitions have been proved to be a powerful tool to rapidly develop a field or to pull research out of stagnation. However, these competitions need to be designed with careful thought about the value that they propose over and above the prize money.

From what I’ve heard about working with Yahoo, turf wars and organizational mayhem are par for the course. There are no doubt still many fine engineers working there; I think that it’s unlikely that their R&D team decided on their own account to neuter the dataset. This looks to me like it comes from their organizational malaise, where maintaining the status quo (and what a status quo it is) is more important than taking a risk and allowing for innovation. The Netflix contest had support from Reed Hastings (the CEO) downwards, and they had enough vision to pull it off. It’s hard to see where such vision could still be hiding at Yahoo.

Do I think that people should stop doing this competition? It’s not my place to say. But think carefully about why it’s more valuable to you than whatever you would have done instead, before you start providing free contracting services to a billion dollar corporation.

Recoset-Starting a Company

2010-03-09T00:00:00-08:00

It’s been official for a couple of weeks now: I’m starting a company. Well, we’re starting a company, Daniel and I. We’ve even got share certificates on official-looking paper.

The company is called Recoset (web site isn’t up yet). The name is designed to invoke “recommendation”, as in recommendation engines, most famously used at Netflix. Apart from that there’s no meaning in it: it’s a name for which we could obtain the .com and twitter accounts; which means nothing offensive in any major language; and which is not too hard to remember or spell.

So what are we doing? From my point of view, the essence of the business is:

To give small e-commerce businesses access to the tools that the big guys use.

You see, there are plenty of places that have lots of historical sales data piled up, but can’t afford to buy software that comes complete with a planeload of strangely accented guys in blue suits to install it.
Our expertise is in dealing with difficult data: when there’s not much of it, when it’s noisy, when the structure is hard to tease out. So we’re going to take our expertise and use it to let smaller e-commerce businesses make the most out of their data.

Initially, we’re focusing on recommendations, based upon a scaling-down of the traditional algorithms to small amounts of data (this is much harder to do than scaling up, by the way, especially because you can’t just coast along in the wake of Moore’s law). By restricting ourselves to just e-commerce and not trying to make a general recommendation tool, we can dig really deeply into the problem domain and bring a lot of domain-specific knowledge to bear. And a lot of what a recommendation engine needs to know to do its job is, when presented well, very useful to the people running the business, so we’re going to provide access to that.

Of course, we have bigger plans for the future. But they might not be big in the way that people might expect.

3rd Place in AUSDM Competition

2009-12-05T00:00:00-08:00

I ended up in 3rd place in the AUSDM Competition. My report, which is probably the most detailed of those submitted, is available here.

A bit of background: this was a competition to blend together predictors that had been created as part of the Netflix prize. There were two tasks (RMSE, with the same goal as in Netflix, and AUC, with the goal to predict a binary-valued attribute rather than regress). There were three dataset sizes: small, medium and large. The competition was decided on the average rank of the medium and large AUC and RMSE, where the large counted for twice the small.
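For readers who haven’t met the two metrics, here is a minimal sketch of each in plain Python. This is the textbook definition written out for illustration, not the competition’s scoring code, and the toy numbers are made up.

```python
import math


def rmse(predictions, targets):
    """Root mean squared error: lower is better."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def auc(scores, labels):
    """Area under the ROC curve: higher is better.

    Computed as the fraction of (positive, negative) pairs in which the
    positive example gets the higher score; ties count as half a win.
    Quadratic in the number of examples, so only suitable for toy data.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

RMSE cares about how far each prediction is from the target, whereas AUC only cares about the ordering of the scores, which is part of why the same blend can rank differently on the two tasks.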

This was a better result than I had anticipated. Partly this is because some of the stronger entries over the small dataset didn’t end up submitting anything for the final competition (due either to them knowingly overfitting in a way that couldn’t be generalized, or using techniques that didn’t scale). I also placed much better than I had expected in the RMSE sets: second in the large RMSE and eighth in the medium RMSE. On the other hand, my AUC performance was about what I expected: 3rd for medium and 4th for large.

Due to the improvement of the rankings of my models from the small (where I was about 15th to 20th) to medium to large datasets, it appears that other teams either overfit the small dataset or used models that were efficient on constrained data but sub-optimal with more abundant data.

Phil Brierly, who ran the competition, put together a report containing the reports of all of the teams (though, unfortunately, there was no analysis). As I wasn’t present at the conference, I didn’t hear about what kinds of discussions were had; it would be interesting to hear from anyone who had a summary of what was said.

Looking quickly through the reports from the other teams, we see that:

Andrzej Janusz from the University of Warsaw was easily the winner of the contest: he was first on the medium AUC and large RMSE, and second on the others. He used a lot of models (neural nets, regressions, …) and combined these together using a genetic algorithm to learn a very sparse representation.
Vladimir Nikulin from the University of Queensland came in second. He used random models which were combined with another sparse technique which looked at the stability of the influence of each model and excluded those whose influence was unstable.
Tom Au, Rong Duan, Guangqin Ma, and Rensheng Wang from AT&T Labs created a model that was significantly better than the others on the large AUC task. They used a lot of extra features to describe the statistical properties of each of the examples (for example, they fitted a beta distribution to the distribution of model outputs for each example), and used a simple boosted logistic regression to combine them. Their techniques are interesting in that they didn’t attempt to obtain a sparse representation.
C. Balakarmekan and R. Boobesh from team LatentView were the highest placed in the medium RMSE task. However, they just used two kinds of regression trees averaged together. It seems to me like there was probably a substantial amount of luck involved in their model.

For me, the conclusions are:

The RMSE metric is not useful for measuring progress on this kind of a noisy task, as the noise present and its distribution mean that the results are necessarily a lottery;
I am missing from my “toolbox” a means of performing feature selection.
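As a starting point for that missing tool, one very simple way to get a sparse blend is greedy forward selection: keep adding whichever model most improves the averaged prediction, and stop when nothing helps. To be clear, this is not what any of the teams above did (they used a genetic algorithm and a stability criterion respectively); it is a minimal sketch with made-up model names and data.

```python
import math


def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences."""
    total = sum((p - t) ** 2 for p, t in zip(predictions, targets))
    return math.sqrt(total / len(targets))


def greedy_select(models, targets, max_models=3):
    """Greedy forward selection of models for an equal-weight blend.

    `models` maps a (hypothetical) model name to its list of predictions.
    Returns the chosen names and the RMSE of their simple average.
    """
    chosen = []
    best_error = float("inf")
    while len(chosen) < max_models:
        best_name = None
        for name in models:
            if name in chosen:
                continue
            trial = chosen + [name]
            blend = [
                sum(models[m][i] for m in trial) / len(trial)
                for i in range(len(targets))
            ]
            error = rmse(blend, targets)
            if error < best_error:
                best_error, best_name = error, name
        if best_name is None:  # no remaining model improves the blend
            break
        chosen.append(best_name)
    return chosen, best_error


targets = [1.0, 2.0, 3.0]
models = {
    "a": [1.5, 2.5, 3.5],      # consistently too high
    "b": [0.5, 1.5, 2.5],      # consistently too low
    "noise": [3.0, 1.0, 2.0],  # unrelated to the target
}
chosen, error = greedy_select(models, targets)
print(chosen, error)
```

On real data the selection would have to be scored on held-out examples, otherwise it overfits in exactly the way that seems to have hurt some entries on the small dataset.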

Welcome

2009-11-29T00:00:00-08:00

Welcome to my blog!

The goal of this blog is to present my thoughts on Machine Learning and everything associated with it, as well as write about my experiences as a consultant.

The content is going to be technical, but I will try to keep it to a level where interested lay-persons can understand it.

I’m yet to see how often I will update it.