Alex Steer

Better communication through data

Search and economy


Just reading a case study by the Natural Language Processing (NLP) team at PARC on improving search string parsing in Powerset. It's good, but it includes the following very common question:

CEO Barney Pell observes... 'Why do we have to translate our intelligence into a grunting pidgin language in order to interact with computers?' To address this gap, Powerset's founders decided to create a consumer search engine based on natural language processing technologies, which enable people to interact more naturally with computers through normal language expressions instead of forced computer jargon.

This classic defence of NLP for search (which scarcely needs defending, as its benefits are so obvious) assumes that people want to type full natural language expressions into search engines. It starts from a correct insight: severe restrictions on string formation (think Boolean operators, or Google's inurl: and filetype: operators) can be offputting. But it overextends that insight, and ends up assuming that people's time has no value. Compared to computer scientists, who will do almost anything to save an unnecessary keystroke, general searchers are pretty forgiving. Still, it would surely be unwise to pour too many NLP resources into the high-end sentence processing needed to work out the information demand in a string like 'Which is the most expensive restaurant in New York where a fish course is not served?', if most search strings are not formed like that. Better to spend time, money and effort understanding how searchers economise - in other words, to treat search strings as a corpus and analyse the features which over-index in that corpus compared to a corpus of general written English. Click-through rates could then be used as a register of semantic intent: which results people actually click for a given string is evidence of what they meant by it.
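To make the over-indexing idea concrete, here is a rough sketch in Python. It compares how often each token turns up in a set of search strings with how often it turns up in a reference corpus, and scores the ratio so that 100 means "equally common in both". Everything here is an assumption for illustration: the function names, the whitespace tokenisation, the add-one smoothing and the toy data are all invented, and a real analysis would want proper query logs, a large reference corpus and a significance measure such as log-likelihood.

```python
from collections import Counter


def token_counts(texts):
    """Count lowercase, whitespace-separated tokens across a list of strings."""
    return Counter(tok for text in texts for tok in text.lower().split())


def over_index(query_texts, reference_texts, smoothing=1.0):
    """Return {token: index}. 100 means the token is equally common in both
    corpora; above 100 means it over-indexes in the search strings.

    Additive smoothing (an assumption, not part of the original argument)
    stops tokens missing from one corpus causing a division by zero.
    """
    q = token_counts(query_texts)
    r = token_counts(reference_texts)
    vocab = set(q) | set(r)
    q_total = sum(q.values()) + smoothing * len(vocab)
    r_total = sum(r.values()) + smoothing * len(vocab)
    return {
        tok: 100 * ((q[tok] + smoothing) / q_total) / ((r[tok] + smoothing) / r_total)
        for tok in vocab
    }


# Toy data, made up purely for illustration.
queries = [
    "cheap flights new york",
    "new york restaurants no fish",
    "weather new york",
]
reference = [
    "the restaurant in new york does not serve a fish course",
    "which is the most expensive restaurant in the city",
]

scores = over_index(queries, reference)

# Print the five tokens that over-index most strongly in the search strings.
for tok in sorted(scores, key=scores.get, reverse=True)[:5]:
    print(f"{tok}: {scores[tok]:.0f}")
```

In this toy data, content words such as 'new' and 'york' score above 100, while function words such as 'the' score well below it. That is the kind of economising the paragraph above has in mind, though two sentences and three queries obviously prove nothing.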

Of course, higher-level sentence processing is useful for answering the queries of users who do search using full sentences, but search strings probably count as a sociolect in themselves whose grammar is worth understanding.

# Alex Steer (16/05/2010)