Latent semantic indexing for the web
Clara Yu, John Cuadrado, Maciej Ceglowski, and J. Scott Payne have published "Patterns in Unstructured Data" under a Creative Commons License. It is a really interesting look at the practice, problems and applications of latent semantic indexing of documents for information retrieval systems. Although the mathematics can be a bit head-spinning, they explain very well, graphically, how to represent a series of documents as mathematical values. And the key, of course, is that computers do maths well and understand language very badly.
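To make "documents as mathematical values" a little more concrete, here is a minimal sketch of the starting point only: each document becomes a vector of word counts, and two documents are compared by the angle between their vectors. This is not the paper's code, and it stops short of latent semantic indexing proper, which goes on to apply a singular value decomposition to the term-document matrix. The sample sentences and the function names vectorize and cosine are my own assumptions for illustration.

```python
import math
from collections import Counter


def vectorize(docs):
    """Turn each document into a term-count vector over a shared vocabulary."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    # Counter returns 0 for terms a document does not contain.
    return vocab, [[c[term] for term in vocab] for c in counts]


def cosine(a, b):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical sample documents, not taken from the paper.
docs = [
    "train delayed at station",
    "station closed train cancelled",
    "staff training course announced",
]
vocab, vectors = vectorize(docs)

# The two rail stories share words; the staff-training story shares none.
rail_similarity = cosine(vectors[0], vectors[1])
unrelated_similarity = cosine(vectors[0], vectors[2])
```

With raw counts like these, documents only look alike when they literally share words. As I understand it, the appeal of the latent semantic approach is that the reduced space can also bring together documents that use different but related vocabulary.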
The point that struck me most of all, though, was a comment in their list of potential applications. This could:
"allow an archivist to graphically manipulate data, draw boundaries between clusters, examine content relationships and add classifiers"
The paper demonstrates very well how, within a specific type of document (in this case news feeds), semantic meaning can be inferred from the common appearance of stemmed content words in the index. However, I believe that identifying and classifying those groups of documents is still a process that requires a human. And that is why we should value the information scientists and taxonomists who work not only on defining higher-level concepts of classification, but also on the coal-face practice of classifying a tide of content that is currently in danger of swamping us with information.
The paper also talks about stemming. I have concerns about the value of stemming: I am constantly faced with relevancy issues where stemming has, for example, reduced both "trains" and "training" to "train". They may have the same word root, but they are not the same thing. It seems to me that the stemming tools currently available are not yet sophisticated enough to deal with the nuances of language, and that if someone could produce a truly context-sensitive stemming engine they would have a very saleable commodity indeed.
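The "trains" and "training" problem is easy to reproduce with a deliberately crude suffix stripper. The sketch below is not any real stemming algorithm and not the one used in the paper; naive_stem and the two sample sentences are hypothetical, written only to show how two unrelated documents end up under the same index entry.

```python
def naive_stem(word):
    """Strip a trailing 'ing' or 's', keeping at least three letters.

    A toy rule for illustration only; real stemmers use far more rules.
    """
    word = word.lower()
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical documents: one about railways, one about staff development.
documents = ["the trains were late", "staff training starts monday"]

# Build an inverted index keyed on the stemmed form of each word.
index = {}
for doc_id, text in enumerate(documents):
    for token in text.split():
        index.setdefault(naive_stem(token), set()).add(doc_id)

# Both documents are now filed under the single stem "train".
matches_for_train = index["train"]
```

A search for "training" against this index would return the railway story as well, which is exactly the kind of relevancy issue I keep running into.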
Maciej Ceglowski has also been asking Movable Type users to donate content for a semantic indexing engine, which I assume is related, and uses a fantastic Eastern European propaganda poster graphic to drive the point home.