The data team at the Guardian have created a word-tree visualisation tool which lets us query what commenters on the Daily Mail website have to say on various topics. The logic of the tool is pretty simple, as they explain:
It uses the most recent ten comments from over 100 stories featuring the words "young offenders institution" posted by the MailOnline since 2009. To use it just put in any word and it will say what comes after in any of the comments in the database. For example, if you put in the word "scum" then you can see that many users are happy to throw that word around to describe offenders. "Scum and scummer" was one inventive way that a user got their point across.
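To make the mechanics concrete, here is a minimal Python sketch of the kind of lookup described above: type in a word or phrase, get back whatever follows it in each comment. It is my own illustration, not the Guardian's code. The function name `continuations`, the `width` parameter, the crude tokeniser and the sample comments are all invented for the example.

```python
import re
from collections import Counter


def continuations(comments, query, width=3):
    """Count the word sequences (up to `width` words) that follow `query`."""
    query_tokens = query.lower().split()
    n = len(query_tokens)
    counts = Counter()
    for comment in comments:
        # Crude tokeniser (an assumption): lower-case runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", comment.lower())
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == query_tokens:
                following = tokens[i + n:i + n + width]
                if following:
                    counts[" ".join(following)] += 1
    return counts


# Invented comments, purely for illustration; not taken from any real site.
sample = [
    "Bring back the old timetable",
    "Please bring back the old timetable soon",
    "Bring back free parking",
]

print(continuations(sample, "bring back", width=2).most_common())
# [('the old', 2), ('free parking', 1)]
```

A real word tree then draws those continuations as branches, sized by how often each one occurs.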
The article is a cracking read - as you'd expect, it provides example after example of crazy Daily Mail comment-bait.
But that's where I have a problem with the application of this tool - both as a linguist and as someone who works a lot on the fair and balanced use of data. Word-tree visualisation tools are useful for evaluating usage by spotting frequency patterns above the word level.
Here, though, the Guardian aren't using the tool to evaluate usage so much as to judge users.
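By way of contrast, evaluating usage would normally mean letting the frequencies decide which phrases deserve a closer look, rather than choosing them in advance. A rough sketch of that in Python, assuming nothing more than a list of comment strings (the function name `top_ngrams`, the tokeniser and the sample sentences are mine, made up for illustration):

```python
import re
from collections import Counter


def top_ngrams(comments, n=2, k=5):
    """Return the k most frequent n-word sequences across all comments."""
    counts = Counter()
    for comment in comments:
        # Same crude tokeniser assumption: lower-case letters and apostrophes.
        tokens = re.findall(r"[a-z']+", comment.lower())
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts.most_common(k)


# Invented sentences, purely for illustration.
sample = [
    "The bus was late again",
    "The bus was full",
    "My train was late again",
]

for phrase, count in top_ngrams(sample, n=2, k=4):
    print(count, phrase)
```

The point of starting here is that the phrases which surface are the ones the data makes prominent, not the ones the analyst was hoping to find.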
Anyone who takes at face value the Guardian's (genuine, serious and long-standing) commitment to data-driven journalism should question the approach they've taken in this piece. A quick look suggests that they've picked lemmas (words and phrases) that are geared towards providing a quick thrill for the Guardian's readers, who (I rather suspect) enjoy taking a dim view of the parochial opinions of Mail readers. Lemmas they pick for analysis include:
- bring back
- this country
The piece continues:
Some of the old bugbears of the Daily Mail such as "human rights", "Labour", "jail", "prison", "tax" and "the judge" also make for fun reads.
I can't say this strongly enough. If you go into a piece of analysis with strongly-held prejudices, you will tend to find things that confirm those prejudices, because that's what you'll be looking for. The bigger the data set, the more you will find to confirm what you already believe. (It's one of the reasons why big data analysis in data-rich but complex domains like economics is so fraught with error.) That kind of lazy poke-the-monkey analysis is, I'm sad to say, exactly what the Guardian are guilty of here. They should know better, and should do better.
# Alex Steer (27/05/2013)