The release of AOL search data risks demonising search log analysis as a tool
The biggest kerfuffle in the search world last week was the furore surrounding the release of search log information by AOL. Although in theory AOL felt it had anonymised the data, it was still possible to extract all of the search queries an individual user had made, so it didn't take long for journalists to track real people down via their search terms.
It is interesting that a lot of media outlets are terming it a data "leak", when in fact it was quite deliberately released into the academic community by information science researchers at AOL - a case where seeking forgiveness rather than permission didn't work out so well.
Some of the reactions around the internet have been amusing, from the typical rubbish spluttered by Orlowski saying there is nothing to be gained from analysing search and, anyway, isn't science rubbish these days, to those who have used the story as an excuse to ramp up their unstinting brand loyalty to Google - "Google, I trust you". The story has also proved one other internet truism - that rather like the genie in the bottle, once you've put something out on the internet it is very difficult to retract it. Already not only are there mirror sites hosting the data, but someone has built a search interface onto the data.
The problem with all this fuss is that it runs the risk of demonising the practice of search log analysis. Suddenly proclaiming in public that you don't store or retain personal data on usage has become a marketing angle for smaller search engines - consider this quote in an otherwise fairly balanced piece about the AOL story:
The only way to ensure your Internet search records are private? Consider IX-Quick, the Netherlands-based company does not record who typed in what search terms.
Search log analysis is one of the most valuable tools for people assessing the way users interact with a site. Frankly I think it is commercial suicide for anybody to be running a website and not be analysing their search queries - Joel Angiolillo described them as "the pained voices of customers who are desperately looking for help".
However, it is always a hazard that people will type personally identifiable data into search boxes. If search logs also keep a record of IP addresses, and people type their postcode into the search box, you can really narrow them down to a specific area of the UK. Whether analysing search logs for the BBC or for small companies or for public service sites, I've always stumbled upon personal details that people have typed in - their names, their customer number, their postcode, a phone number. That gives businesses a duty of care to protect that data, and in the UK it also has Data Protection Act (DPA) and Freedom of Information (FOI) implications.
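As a rough illustration of what that duty of care might look like in practice, here is a minimal Python sketch that masks the most obvious personal details in a query before it goes anywhere near an analyst. The function name `redact_query` and all three patterns are my own assumptions for the sake of example: the postcode and phone expressions are deliberately simplified and will both miss some real values and catch some innocent ones (a query like "a4 5mm" looks like a postcode to it). It also does nothing about names, and masking individual queries does not stop someone piecing together an identity from a whole run of queries by the same user, which is exactly what went wrong with the AOL data.

```python
import re

# Simplified, illustrative patterns (assumptions, not authoritative formats).
# UK postcode, e.g. "SW1A 1AA" or "m1 1ab".
UK_POSTCODE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b", re.IGNORECASE)
# UK-style phone number: a leading 0 or +44, then 9 or 10 more digits,
# optionally separated by single spaces or hyphens.
UK_PHONE = re.compile(r"(?<!\d)(?:\+44\s?|0)\d(?:[\s-]?\d){8,9}(?!\d)")
# Any remaining run of six or more digits, treated as a possible
# customer or account number.
LONG_NUMBER = re.compile(r"\b\d{6,}\b")


def redact_query(query: str) -> str:
    """Return the query with likely personal details replaced by placeholders."""
    query = UK_POSTCODE.sub("[POSTCODE]", query)
    query = UK_PHONE.sub("[PHONE]", query)
    query = LONG_NUMBER.sub("[NUMBER]", query)
    return query


if __name__ == "__main__":
    for q in ["plumber near SW1A 1AA",
              "call 020 7946 0018",
              "where is my order 12345678",
              "bbc weather"]:
        print(redact_query(q))
```

Replacing values with placeholders rather than deleting them keeps the shape of the query intact, so you can still see that people are typing postcodes or customer numbers into the box, which is a useful finding in itself.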
But what the AOL fuss absolutely shouldn't mean is that researchers are denied access to properly anonymised data that helps develop a better understanding of how people use search. Whether it is the logs of web search from AOL, or site search from the smallest site, amongst all the information we can record about internet usage, search logs still have the most vocal story to tell about the way people interact with a site.