Less than you assume, more than you imagine: Futureproofing online privacy
998 words | ~5 min
Bit of a long read, this. Blame cross-country rail travel.
Henry Porter, writing in the Guardian, is apoplectic about alleged efforts by GCHQ and the NSA to collect vast quantities of internet data direct from the fibre-optic cables that form the backbone of the net. He writes:
The story ... must surely shake that complacency and demand a review of the profit-and-loss account in the safety versus liberty debate. And that must take in the effect the actions and views of a generation of middle-aged politicians, journalists and spies will have on people aged under 25, who may have to live with total surveillance under regimes that may be much less benign than the ones we know.
Despite this being a classic piece of slippery-slope rhetoric, I tend to agree. But since plenty of people will be writing about this story (as they have already) in terms of liberty vs security, I'm going to talk instead about expectations of privacy online.
What does it mean to have privacy online? In one sense, not much. Online activity is activity in a domain defined by communication: the transmission of information between parts of a network. By communicating over the network you invite third parties not just to overhear your communication but to be part of it. Asking for privacy in the classic sense of not being overheard is a little like asking for privacy in a game of Chinese Whispers.
But obviously this isn't satisfactory, so it can't really be what we mean when we talk informally about online privacy. Imagine that we are playing Chinese Whispers. You want to get a message to me, so you pass it through a chain of other people. They, rather obviously, know what your message is. You do not expect privacy of communication from them. But you do, reasonably, expect that they will keep your message confidential and not pass it on to others who are not in the chain.
So if I send a message from Machine A to Machine Z, and it passes through Machines B, C, D, and so on, can I reasonably expect that a stored copy of it will not be read by any person or machine not involved in its direct transmission to its destination? Or should I expect this to happen and adjust my behaviour accordingly?
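To make the relay point concrete, here is a minimal sketch (the machine names and the message are invented, and real networks are rather more complicated): a plaintext message passed hop by hop, where every intermediary has to read what it forwards, and nothing stops it keeping a copy.

```python
# A minimal sketch (names and message invented): a plaintext message
# relayed hop by hop from A towards Z. Every intermediary must read
# the message in order to forward it, and nothing in this "protocol"
# stops any of them keeping a copy.

def relay(message, chain):
    """Pass message along chain, showing what each hop gets to see."""
    for hop in chain:
        print(f"machine {hop} reads and forwards: {message!r}")
        # A hop that wanted to could quietly log the message here.
    return message

relay("meet me at six", ["B", "C", "D"])
```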
Most of us think probabilistically about this, at least informally. Rather than talking in terms of absolute permission or prohibition, you figure out the probability that someone will access your message and weigh it against the downside risk of their doing so. In other words: how likely is it that my message will go public, and how much damage would it do?
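Put crudely, and with purely illustrative numbers, the informal sum is just expected loss: the probability of exposure multiplied by the damage exposure would do.

```python
# Purely illustrative numbers: an informal expected-loss calculation.
p_exposure = 0.001  # my guess at the chance the message is ever dug out
damage = 100.0      # my guess at the harm if it is (arbitrary units)

expected_loss = p_exposure * damage
print(expected_loss)  # 0.1 -- low enough that most of us would hit send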
We are all aware that our online activity is part of a vast amount of similar, almost identical activity by others. So we modify our behaviour to some extent, but not as if we were being broadcast to the nation. Suppose I live in an authoritarian society and hold a critical opinion of the president. I may still express this in an email to a like-minded friend, because I judge that the effort required for some secret policeman to dig it out of the whole pile of online communications is high enough to make the risk of being caught badmouthing the great leader acceptably low.
The problem is, we're rubbish at judging risk.
The Guardian piece at the top demonstrates this in one direction. Henry Porter writes:
The two countries [Britain and the US] are rapidly perfecting a surveillance system that will allow them to capture and analyse a large quantity of international traffic consisting of emails, texts, phone calls, internet searches, chat, photographs, blogposts, videos and the many uses of Google.
Are they? Are they both capturing and analysing it? Capturing it may be easy (if expensive), but analysing it is much harder. In particular, analysing it down to the level of individual users' behaviours is extremely hard, since you are effectively trying to run very granular searches against some of the largest datasets imaginable. I suspect the author is overestimating the risk to individual liberty by underestimating the cost and complexity of the operations he imagines. This is the conflation of what's plausible with what's possible.
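A back-of-envelope sketch, with figures I have simply made up for illustration, shows the shape of the problem: even one brute-force pass over a day's captured traffic is slow, and that is before any per-user analysis has begun.

```python
# Back-of-envelope, with purely illustrative figures, of why analysis
# lags capture: a single brute-force pass over one day's captured
# traffic, before any per-user analysis has even begun.
captured_per_day_gb = 1_000_000  # guessed: ~1 PB of traffic captured per day
scan_rate_gb_per_s = 10          # guessed: sustained scan rate of one job

seconds = captured_per_day_gb / scan_rate_gb_per_s
print(f"~{seconds / 3600:.0f} hours per pass")  # ~28 hours: slower than the data arrives
```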
And yet... when it comes to making this sort of judgement we also underestimate the risk, because we tend to think in terms of what we believe is possible now. That is unwise when we're talking about permanent records of our online activity. Given time, it is perfectly legitimate to worry about what's merely plausible (any logically feasible kind of analysis), because it may yet become possible (thanks, Moore's Law). We also need to be aware that whole categories of data analysis are possible now that were impossible a few years ago.

I started my career as a dictionary editor, and when the dictionary I edited was first published in the late 19th century, there was no way to search its text except by the alphabetical ordering of its headwords. Now you can call up the results for any word in the dictionary; run regular expression queries to find words and phrases containing fragments that interest you; and even mine the whole structure of the text for patterns.
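Here is a toy version of the kind of fragment search I mean (the entries are invented, and this is just one way of doing it): something trivially easy against a digitised text, and flatly impossible against the printed, alphabetically ordered original.

```python
import re

# A toy version of the dictionary example (entries invented): the kind
# of fragment search that a printed, alphabetically ordered text could
# never support.
entries = {
    "aardvark": "a nocturnal burrowing mammal",
    "lexicon": "the vocabulary of a language",
    "wordbook": "a book listing the words of a language",
}

# Find every entry whose definition contains a word beginning "lang".
pattern = re.compile(r"\blang\w*")
for headword, definition in entries.items():
    if pattern.search(definition):
        print(headword, "->", definition)
```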
In short: people with access to your data can probably do less with it now than you assume, but will probably be able to do more with it in future than you imagine. Any serious debate about online privacy should start from that assumption.
# Alex Steer (21/06/2013)