How many British newspapers use sitemap.xml to help search engine indexing?
When I started my 'Newspaper Site Search Smackdown' the other week, within a few minutes of the first part being published Bruce had posted the following comment.
" 'a newspaper site search ought to be able to index content directly from a CMS faster than Google can crawl a site' - Perhaps, but this seems kind of irrelevant. Any well built site will be creating a sitemap.xml and pinging Google whenever any updates are made so that it's indexes will be updated immediately... "
For those unfamiliar with the protocol, a Sitemap in this sense is an XML file that should list the URLs of all of the pages a site wants to see included in search engine indexes, with additional information about how regularly they are updated, and how important they are relative to each other. Unlike ACAP, the protocol is actively supported by Google, Microsoft and Yahoo!
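To make that concrete, here is a minimal sketch of how a CMS might generate such a file. It uses the `urlset`, `url`, `loc`, `changefreq` and `priority` elements from the sitemaps.org schema; the `build_sitemap` function and the example.com URLs are purely illustrative assumptions, not anything a particular newspaper uses.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML for an iterable of (loc, changefreq, priority).

    Hypothetical helper: a real CMS would pull these values from its
    own database of published pages.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, changefreq, priority in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc                # page address
        ET.SubElement(url, "changefreq").text = changefreq  # how often it changes
        ET.SubElement(url, "priority").text = f"{priority:.1f}"  # 0.0 to 1.0
    return ET.tostring(urlset, encoding="unicode")


# Made-up pages: a busy section front and an old archive story.
sitemap_xml = build_sitemap([
    ("http://www.example.com/news/", "hourly", 1.0),
    ("http://www.example.com/news/2000/archive-story.html", "never", 0.3),
])
print(sitemap_xml)
```

The relative `priority` values only rank pages against others on the same site; they say nothing about how the site compares with anyone else's.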
Now, I'm guessing that Bruce is not too familiar with newspaper and news organisation content management systems if he expects that, over the last decade or so, they have all been built to generate well-formed XML sitemaps and automatically ping Google. Of course, it raised the question: how many British newspapers do have a Sitemap file?
The answer?
Four (corrected from two, then three: The Telegraph have one tucked away somewhere, as do The Mirror)
The Daily Mail has a sitemap.xml file at the root of their domain. This in turn lists a whole set of individual section sitemaps which list pages on the Daily Mail site going back to 2000.
The other newspaper to implement the protocol is The Scotsman, which always seems to have a very forward-thinking website. Rather than split their sitemap.xml file up as the Mail does, they've chosen to include the URLs in a single file at the root of the domain.
From the client-side, though, it doesn't appear that any of the other papers are currently taking advantage of the technology.
"Rather than split their sitemap.xml file up as the Mail does, they've chosen to include the URLs in a single file at the root of the domain."
I'm not sure we can read much into this as a 'decision'; it's simply because there's a limit on how many URLs can be in a single sitemap file (50,000, or 10MB). Once you surpass that, you have to have multiple files, with an index file linking to them.
Erm, not really Frankie. It clearly *is* a decision by The Scotsman - they could just as easily have made lots of automatically generated smaller sectional sitemaps for each directory, which would be a much more scalable solution in five years' time.
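For anyone wondering what the split looks like in practice, here is a rough sketch of the index-file arrangement Frankie describes. It only enforces the 50,000 URL cap, not the file-size limit, and the `sitemap-N.xml` naming, the helper functions and the example.com addresses are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000  # per-file URL limit mentioned above


def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into lists of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_index(base_url, file_count):
    """Return a sitemap index pointing at file_count child sitemaps.

    The sitemap-1.xml, sitemap-2.xml ... naming is a hypothetical
    convention; the protocol does not require any particular names.
    """
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for n in range(1, file_count + 1):
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = f"{base_url}/sitemap-{n}.xml"
    return ET.tostring(index, encoding="unicode")


# 120,000 made-up article URLs need three child files.
urls = [f"http://www.example.com/article/{i}" for i in range(120_000)]
chunks = chunk_urls(urls)
index_xml = build_index("http://www.example.com", len(chunks))
```

Splitting by section or directory, as the Mail does, is the same idea with a different grouping rule in place of the fixed-size chunks.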
www.telegraph.co.uk does have a sitemap. It is built to the spec defined by Google. Good to see you continuing your evaluation of British newspaper websites. Keep up the good work Martin.
I wondered how many would creep out of the woodwork ;-)
I suspected that several papers might have the XML file required, but placed somewhere other than at the root of the domain.
For those who have never set one up, you can put a sitemap anywhere, and use Google's Webmaster Tools to tell Google where it is specifically. You might want to do that in order to help Google, but avoid giving HTML scrapers an easier attack vector.
However, there is a drawback to that approach: whilst Google knows where it is, any other search engines that support the protocol (or bloggers that want to write about it!) haven't been informed of the URL, and can't make use of it.
For info, we also have xml sitemaps across our network of newspaper sites. And thanks for more interesting coverage.
Just to note, we've got one at Metro as well (not unexpectedly, of course, as Associated Northcliffe Digital are responsible for us as well as the Daily Mail.)
You can make your Sitemap auto-discoverable, without having to name it 'sitemap.xml' in the root folder, by specifying
Sitemap: http://sitemaps.example.com/sitemap-www.xml
in your robots.txt file (which DOES have to be in your root folder).
See Google Webmaster Central for info
Indeed Frankie. The Mirror, for example, do specify the custom address of their sitemap file in their robots.txt, but The Telegraph don't.
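A search engine (or a nosy blogger) could pick up that robots.txt line with something like the sketch below. It works on the text of a robots.txt file that has already been fetched; the `find_sitemaps` name and the sample file contents are assumptions for illustration, apart from the Sitemap line itself, which is the example given above.

```python
def find_sitemaps(robots_txt):
    """Return the URLs declared on 'Sitemap:' lines in robots.txt text.

    The field name is matched case-insensitively, and whole-line
    comments (starting with '#') are ignored.
    """
    sitemaps = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so the URL's own colons survive.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap":
            sitemaps.append(value.strip())
    return sitemaps


# Sample file contents, invented apart from the Sitemap line.
sample = """\
# robots.txt for www.example.com
User-agent: *
Disallow: /search/

Sitemap: http://sitemaps.example.com/sitemap-www.xml
"""
print(find_sitemaps(sample))
```

A site that keeps its sitemap address out of robots.txt simply returns an empty list here, which is exactly the drawback described earlier.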