In my last post I claimed that taxonomy work is important for resolving vocabulary variations to common concepts, so that patterns can be discerned across multiple data sources. The New York Times article "Computing Crime and Punishment" is a beautiful example of how a thesaurus can be used to recognise similar concepts in unstructured data sources over very long periods of time: in this case, 121 million words describing 197,000 trials over 239 years. Vocabularies changed over that span, of course, but Roget's Thesaurus turned out to be an effective instrument when layered on top of the data.
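To make the idea concrete, here is a minimal sketch of resolving variant words to shared concepts before counting them. It is a toy illustration only: the `THESAURUS` mapping, the concept labels, the `concept_counts` helper and the two sample records are all invented for this example. They are not taken from Roget's Thesaurus or from the study the article describes.

```python
from collections import Counter
import re

# Hypothetical toy thesaurus: variant term -> shared concept label.
# These groupings are invented for illustration, not taken from Roget's.
THESAURUS = {
    "stole": "theft",
    "stolen": "theft",
    "pilfered": "theft",
    "purloined": "theft",
    "struck": "violence",
    "beat": "violence",
    "assaulted": "violence",
}

def concept_counts(text):
    """Count how often each concept appears in a piece of free text."""
    # Lowercase the text and split it into alphabetic words.
    words = re.findall(r"[a-z]+", text.lower())
    # Keep only words the thesaurus knows, replaced by their concept label.
    return Counter(THESAURUS[w] for w in words if w in THESAURUS)

# Two invented records that describe similar events in different vocabulary.
early_record = "The prisoner purloined a watch and struck the watchman."
later_record = "The defendant stole a watch and assaulted the officer."

early = concept_counts(early_record)
later = concept_counts(later_record)
```

Counted word by word, the two records share almost no content vocabulary. Counted by concept, both register one instance of theft and one of violence, which is what makes comparison across long periods possible. A real thesaurus layer would also have to handle multi-word terms and words that belong to more than one category, which this sketch ignores.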
By Patrick Lambe