backstage.bbc.co.uk discussion about my old BBC homepage stats article
Last week the delightful and ever-entertaining[1] Kim Plowright published a couple of breakdowns of browser and OS usage on the BBC's site to the backstage.bbc.co.uk mailing list. Although the page impression figures were stripped out, she gave percentage breakdowns for the use of browsers across the whole of the site. In the course of the debate reference was made to my own article from the end of 2005 about the user-agent strings sent by browsers visiting the BBC homepage.
Whilst the comparison between the two sets of figures is a little too close to apples / oranges for comfort, the principal findings were basically the same: only Internet Explorer, Firefox and Safari enjoy any kind of significant market share amongst browsers visiting the BBC site - and Firefox's market share seems to have come chiefly at the expense of Internet Explorer.
The mention of my previous article on the topic sparked some debate, which echoed almost exactly the various points that were made in a Slashdot thread discussing the article when it was first published[2].
And I certainly made at least one factual error in the article. In a turn-of-phrase I described being surprised that more visits were coming from Windows 95 than Windows 2006. Clearly, I hadn't realised it was going to be more accurate in the end to describe Windows Vista as Windows 2007.
However, I do want to address some of the criticism that has been made of my article, especially since some people at the BBC have been accused of being 'misleading' based on the figures I derived. Here I am specifically looking at the accuracy of the method of using User Agents as a means of gathering statistics, and the assertion that this would lead to the level of Linux usage being under-reported.
One of the question marks raised over the figures was:
Detection software may not have been as tuned to recognize a Linux OS, after all many distros don't call themselves 'Linux', it may not be in the user agent string. (simply looking for the word Linux is not good enough).
Well, I can certainly say that the detection "software" used to compile the figures was a little more sophisticated than me just using CTRL+F "Linux" to look for various distributions.
When I went through the nearly 11,000 different user agents by hand, the thing I was most alert for was anything that differed from Windows/IE. That is why my article talks about discovering, during the course of the research, the delights of the Kummclient, the Japanese open-source browser Sunrise, or the concept of Shonenware. The fact that Shonenware software, even when registered, can't be guaranteed not to wake Godzilla, brings a wry grin to my face every time I think of it.
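As a rough illustration only, and not the process behind the original figures (which were checked by hand), here is a minimal Python sketch of why a bare search for the word "Linux" can fall short, and how a slightly wider token list catches more. The token lists, the `classify_os` function and the sample strings are all assumptions made up for this example; none of them come from the BBC logs.

```python
import re

# Hypothetical token lists, for illustration only. A real classifier
# would need far more cases (BSDs, Solaris, mobile devices, bots...).
OS_PATTERNS = [
    ("Windows", re.compile(r"Windows|Win(?:95|98|NT)", re.IGNORECASE)),
    ("Mac", re.compile(r"Mac OS X|Macintosh|Mac_PowerPC", re.IGNORECASE)),
    ("Linux", re.compile(r"Linux|Ubuntu|Debian|Fedora|SUSE|Gentoo", re.IGNORECASE)),
]


def classify_os(user_agent: str) -> str:
    """Return the first OS family whose tokens appear in the string."""
    for name, pattern in OS_PATTERNS:
        if pattern.search(user_agent):
            return name
    return "Other"


# Illustrative strings in the style of 2005-era browsers (not real log data).
samples = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)",
    "Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/412 (KHTML, like Gecko) Safari/412",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20050922 Firefox/1.0.7",
    # No literal "Linux" here, so a plain substring search would miss it.
    "Mozilla/5.0 (compatible; Konqueror/3.4) KHTML/3.4.3 (like Gecko) (Debian package 4:3.4.3-2)",
]

for ua in samples:
    naive = "Linux" in ua  # the CTRL+F approach
    print(f"{classify_os(ua):8} naive_linux={naive}  {ua}")
```

Even a list like this only narrows the gap; strings that name no operating system at all still end up in "Other", which is where looking at them by hand comes in.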
Another complaint was:
A Linux user may have been misreporting the Operating System
Rather like debating the extent of 'undetected' crime, it is a self-defeating circular argument to suggest that there is a deliberate attempt to mis-represent the statistics for Linux usage, on the grounds that Linux users themselves were being deceptive. I've no doubt that the figures may have been skewed a little by this kind of user agent subterfuge, but as the Linux usage I did detect was around 130,000 requests out of the 32 million that I counted, I'm unconvinced that spoofed user agents would generate a significant statistical shift.
To put it another way, to catch up with the market share of the next biggest OS in the figures, Apple, you'd have to argue that 10 out of 11 Linux users were spoofing their user agent when visiting the BBC. And even then, they would still have achieved an OS market share of around 4%.
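The back-of-envelope sum above can be sketched in a few lines of Python. The two inputs are the rounded figures quoted above; the function name `share_if_spoofing` and the assumption that spoofed requests were already counted under another OS (so the 32 million total stays fixed) are illustrative choices, not part of the original analysis.

```python
TOTAL_REQUESTS = 32_000_000   # rounded total quoted above
LINUX_DETECTED = 130_000      # rounded Linux figure quoted above


def share_if_spoofing(detected: int, total: int, spoof_fraction: float) -> float:
    """Linux share of all requests if `spoof_fraction` of Linux requests
    had reported a different OS.

    Assumes the spoofed requests are already inside `total`, just filed
    under another operating system, so the denominator does not change.
    """
    actual_linux = detected / (1 - spoof_fraction)
    return actual_linux / total


detected_share = LINUX_DETECTED / TOTAL_REQUESTS
hypothetical_share = share_if_spoofing(LINUX_DETECTED, TOTAL_REQUESTS, 10 / 11)

print(f"Detected Linux share:         {detected_share:.2%}")
print(f"Share if 10 in 11 were hiding: {hypothetical_share:.2%}")
```

With these rounded inputs the detected share comes out at about 0.4%, and the 10-in-11 scenario at roughly 4.5%, which is in line with the "around 4%" mentioned above.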
The person making these points also stated that the Linux usage figures I gave more accurately represent:
0.4% of users WHERE DETECTED AS using a Linux operating system AT THE TIME THEY VISITED THE BBC SITE.
and
This is the great thing about statistics people like you claim they show something and try to cover up the failings of how the sampling was done.
It shows only as much as it records. The number of recognized User Agent strings for hits on the BBC website.
Well, people like me can only point to the caveats I put around the original figures:
User agent strings aren't an exact science. Or rather, they ought to be, but in the real world they come out a right mess. I've done my best to untangle them
...
I only counted user agents that had made more than 50 requests, but between [those making] 6 million and 50 requests there were nearly 11,000 different user agents to look at. Examining that number of requests accounted for 95% of the reported traffic, but only around 1/3 of the stats report. I initially suspected that counting the whole of the tail was likely to increase the market share I derived for the quirkier set-ups, but a random sample showed that a large proportion of the tail consisted of the most popular browsers and operating systems, but with different installed toolbars or corporate network messages that distinguished them as a unique string.
And I must stress again, these figures don't represent the breakdown of visitors to the BBC site as a whole, they are based on requests to the homepage alone, over the course of one week in September. Nevertheless I think they provide an interesting snapshot of web activity.
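The cut-off described in that caveat can be sketched as follows. This is a toy example under stated assumptions: the `coverage` function, the labels and the request counts are invented for illustration and are not the BBC figures. It simply shows how a small share of distinct user-agent strings can still account for most of the traffic.

```python
from collections import Counter


def coverage(counts: Counter, min_requests: int = 50) -> tuple[float, float]:
    """Return (share of traffic, share of distinct strings) covered by
    user agents that made MORE than `min_requests` requests."""
    kept = {ua: n for ua, n in counts.items() if n > min_requests}
    traffic_share = sum(kept.values()) / sum(counts.values())
    string_share = len(kept) / len(counts)
    return traffic_share, string_share


# Invented counts: two popular set-ups plus a tail of variants of the
# same browsers that differ only by a toolbar or corporate proxy note.
toy_counts = Counter({
    "IE6 / XP": 900,
    "Firefox / XP": 80,
    "IE6 / XP + toolbar A": 10,
    "IE6 / XP + toolbar B": 6,
    "Safari / Mac + proxy note": 4,
})

traffic, strings = coverage(toy_counts, min_requests=50)
print(f"{traffic:.0%} of requests from {strings:.0%} of distinct strings")
```

In this toy data the tail is made up of the same popular browsers under slightly different strings, which is the pattern the random sample of the real tail suggested.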
And my conclusion opened with:
I don't think there is a huge conclusion to be drawn from this, aside from the fact that Firefox has clearly made large gains at the expense of Microsoft's Internet Explorer in the browser market for PCs, and that similarly Safari as a latecomer to the market has seized the high ground amongst Mac users.
Let me make my own Linux credentials plain here. currybetdotnet is DIY LAMP through-and-through, and when I worked at the BBC I championed using open source software components in systems wherever it was appropriate. One of the projects I was producer on was one of the first applications to be using Red Hat within the BBC News server farm.
At school I was always taught to sense-check my maths work. If I was set a sum combining various amounts of apples and oranges, and my answer turned out to be two pears and an elephant, it was probably a clue that something was wrong.
So let us sense-check the reported figures of Linux usage being less than one half of 1% of visits to the BBC homepage.
I want you to think, out of all the people you know, how many regularly use a Linux desktop OS to browse the web.
Well, OK. Given that you are probably either a regular reader of currybetdotnet, or have bothered to read this far down the page, I'm going to guess that you work in, or have more than a passing interest in, the area of computers.
So I'll re-phrase the question.
Of the people you know, excluding those who work in the computer or IT industry, how many of them regularly use a Linux OS to browse the web?
In one of the posts on the backstage.bbc.co.uk mailing list it is suggested that:
my point was that the true figure may not be quite as low as stated. I did not say it would be greatly higher, certainly not higher than WindowsXP (by a long way). I would be quite surprised if it was more than 10%.
Well, I can certainly agree with the last part of the suggestion. I'd be greatly surprised if anyone reading this can claim that 1 in 10 of their non-computer industry friends regularly uses a Linux OS to browse the web.
The thing that makes me sad about the debate about whether my Linux usage figures were put together with any degree of accuracy, or whether I put them together under orders from the BBC's pro-Microsoft FUD department, is that it detracts from what I think was the most significant finding, and the thing that I was trying to publicise:
It started with a casual enquiry from a colleague - "I wonder how many Firefox users visit the BBC homepage?" - and before I knew it I was involved in a lengthy statistical analysis of the browsers and operating systems that request the BBC homepage at http://www.bbc.co.uk.
In the conclusion of the original article I stated that:
At the BBC we tend to make some assumptions about the computer usage of the audience to the BBC homepage. We expect it to be skewed towards people who are newer to the internet. We expect it to be skewed towards corporate environments, where it will often be deemed an "acceptable" site to visit even if many are restricted by IT policy. We also expect it to be widely used at home on shared computers. All of those lead me to assume that there would be a lower than average take-up of Microsoft alternatives, and a higher skew towards the usage of security patched Internet Explorer and XP.
So I felt it was brilliant news that nearly ten per cent of the user agents hitting the BBC homepage were proudly declaring themselves to be Firefox. What an amazing success story for the open source movement, to get such mainstream adoption for a piece of software, simply by being such a better product than the main alternative, even when at such a marketing and competitive disadvantage.
Internet Explorer is "good enough" for the majority of people who are surfing the web. Firefox gobbled up market share by being a significantly better, friendlier, safer and easier to use product than Internet Explorer.
Microsoft's Windows OS in whatever flavour isn't perfect software. But it is "good enough" for most people to get most of what they want done on their computers.
So far there is no desktop distribution of Linux that even comes close to being so much better than "good enough" that it will encourage non-enthusiasts to switch.
And that, for me, is the real lesson here in the debate about desktop Linux.
Growing the market share for desktop Linux isn't about setting up straw-man arguments about why some statistical analysis of Linux usage might be wrong. It is about making it into a better, easier product for the mainstream public, if it is ever to emulate the success of Firefox.
[1] At one point in the discussion, Kim suggested that asking for the stats to be published regularly and automatically on bbc.co.uk would cause the man who runs the system to give her "a look like I'd strangled his puppy". Since I used to make similarly outlandish requests to him on a regular basis, I know just the look she meant.
[2] Mind you there was one point made on the Slashdot thread where I didn't have much of a leg to stand on - someone posted that regardless of the statistics, currybetdotnet was "*comic book guy voice* ugliest website ever". I can only assume that back at the end of 2005 that person hadn't visited MySpace very much.
I can't even get my other half to use Linux, and that's with me using it for nearly ten years :)
Most of the non-BBC IT people I know are all Windows too!
I work in IT. I don't know ANYONE who admits to using Linux. I know some Mac evangelists who think that people should care more about operating systems. Does that count?
At the risk of actually making a substantive contribution for a change, this long-time Linux user would offer that:
a) It is an unusually-low number for Unices, but
b) Even though I know several relatively-non-geeky Linux users [some are in *nix work environments, some are perpetual neophytes whom I got sick of disinfecting every time we met and who are plenty happy now],
c) I still don't know anybody at all who visits the BBC front door.
Based on nothing but looking at large numbers of other sites over a period of years, I imagine that the percentages look rather more like one would expect if one measured news.bbc.co.uk/ instead - but still lower than the average because of the crushing weight of the mean in the Beeb's specific case.
(FWIW I monitor numbers on a bunch of sites, both geeky and not, and I'd expect to see 1% to 1.5% for Linux on all of them. Until about 18 months ago it was roughly equal with MacOS variants but there's no arguing, IMHO, that Apple's selling a lot of kit these days because that's now not true by a factor of between 2 and 3. FWIW that's based on a half dozen sites skewed heavily on the untechy side, carrying about 4m pages a month - yes, not remotely in the BBC's area [see above] but still more than the average commenter knows about.)