I felt like I had to carry on with the thought about data & interpreting it, partly because I’ve had replies on Twitter about the entire school of thought claiming that innovative companies ignore the data (Fast Company: “To innovate, you have to stop being a slave to data”) and make things they think people will want anyway (e.g. Apple and others who shun pre-testing and focus groups). This is very true, and there’s a lot to learn from things that fail focus groups yet do brilliantly nevertheless.
Not every place is good at innovating (for various reasons), so the rest become, to some degree, slaves to the data: they have to work more cleverly with the information they have on hand.
So, innovation aside and back to the data: Eli Pariser wrote a (slightly long-winded) book called ‘The Filter Bubble’, which I just finished. It’s full of useful anecdotes and stories, especially about self-serving biases, and it references other recently popular books. I’d say it’s worth reading, even though many people reading my blurbs are probably already aware of the subject and have heard reviews or seen it summarised in magazines. The gist of it is probably in the TED talk:
“Your filter bubble is your own personal, unique universe of information that you live in online. And what’s in your filter bubble depends on who you are, and it depends on what you do. But the thing is that you don’t decide what gets in.”
Problem with so much data: what you get to see is decided by an algorithm and all sorts of other forces at play. The internet you see and experience might be different from what the person next to you sees, and you’d be none the wiser. Eli’s beef seems to be that Google, Facebook and all these other places are no longer just social networks or search engines.
They’re trying very hard to be human and recommend things the way a human would. A real search engine should serve what you ask for, not ‘popular stuff in the category‘ or ‘popular with your friends right now‘ just because, after looking at your search string and analysing users with similar interests or (online) behaviour, they think that’s what you’ll be interested in. Remember that Bing (no longer with JWT) was sold as a ‘decision engine‘ (Bing & decide):
Data analysis is imperfect: websites will never be human and algorithms will fail when they bump into outliers. There’s the famous ‘Napoleon Dynamite problem‘ people working on the Netflix recommendation algorithm stumbled upon:
It is maddeningly hard to determine how much people will like Napoleon Dynamite. When [you run] algorithms on regular hits like “Lethal Weapon” or “Miss Congeniality” and try to predict how any given Netflix user will rate them, it’s usually within eight-tenths of a star. But with films like “Napoleon Dynamite,” it’s off by an average of 1.2 stars.
It’s very weird and very polarizing. It contains a lot of arch, ironic humour […]. It’s the type of quirky entertainment that tends to be either loved or despised. The movie has been rated more than two million times in the Netflix database, and the ratings are disproportionately one or five stars. Worse (sic), close friends who normally share similar film aesthetics often heatedly disagree.
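That polarization effect is easy to reproduce. Here’s a toy sketch (the ratings below are invented, not Netflix data): an algorithm that simply predicts the average rating for everyone lands close on a consensus film, but is roughly two stars off on a love-it-or-hate-it one.

```python
import statistics

def rmse_of_mean_prediction(ratings):
    """Error when the 'algorithm' just predicts the average rating for everyone."""
    mean = statistics.mean(ratings)
    return (sum((r - mean) ** 2 for r in ratings) / len(ratings)) ** 0.5

# Hypothetical ratings: a broad-appeal film clusters around 3-4 stars...
lethal_weapon = [3, 4, 4, 3, 4, 3, 4, 4, 3, 4]
# ...while a polarizing film is disproportionately 1s and 5s.
napoleon_dynamite = [1, 5, 1, 5, 5, 1, 1, 5, 1, 5]

print(rmse_of_mean_prediction(lethal_weapon))      # ≈ 0.49 stars off
print(rmse_of_mean_prediction(napoleon_dynamite))  # 2.0 stars off
```

The more the ratings split into opposite camps, the worse any single prediction can do; averaging over a bimodal crowd satisfies nobody.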
Algorithms (in the desire to sell you something) would rather recommend something that you will definitely* like than something you could hate. Hate isn’t good for Netflix and others. In order to recommend and sell, they need to know you. And I like the analogy with the storyteller from this anecdote:
“A Peace Corps volunteer (or perhaps it was an anthropologist) was in an African village when satellite TV made its debut there. For a period of time, normal village life came to a halt as people watched (slack-jawed, I imagine). Then slowly, things began to return to some semblance of normality. When asked why people were not watching as much TV, a villager replied, “We have our storyteller.” “I understand,” said the volunteer, “but your storyteller knows a hundred stories. The television knows thousands of stories.”
With a gleam in his eye, the man quickly responded. “That is true, but the storyteller knows me!”
Likewise, there are millions of links, products or articles, but we know you. Planners and market researchers are already aware of this. One of my favourite quotes:
“The trouble with market research is that people don’t think how they feel, they don’t say what they think and they don’t do what they say.”
So Google does have thousands of sites, and Facebook friends post hundreds of links – but they’re trying to understand you. Thus, you (and they) start wondering about what you do on Google: are you searching because you don’t know what you’re actually looking for (research) or are you searching because you know what you’re looking for, just not where it’s located (information)?
Back to Eli’s book, there’s an interesting piece on match.com’s algorithm. It’s worth remembering that this is people search they’re talking about, not blenders on Amazon:
“Codenamed “Synapse”, the Match algorithm uses a variety of factors to suggest possible mates. While taking into account a user’s stated preferences, such as desired age range, hair colour and body type, it also learns from their actions on the site. So, if a woman says she doesn’t want to date anyone older than 26, but often looks at profiles of thirty-somethings, Match will know she is in fact open to meeting older men. Synapse also uses “triangulation”. That is, the algorithm looks at the behaviour of similar users and factors in that information, too.”
The way the Match algorithm learns, he says, is similar to the way the human brain learns. “When you give it stimuli, it forms neural pathways,” he says. “If you stop liking something, those shut off. It’s learning as you go.”
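Synapse itself isn’t public, but the stated-versus-revealed-preference idea in that quote can be sketched in a few lines. Everything here (the function name, the 0.5 behaviour weight, the ages) is made up purely for illustration:

```python
def effective_age_ceiling(stated_max_age, viewed_ages, weight=0.5):
    """Blend a stated preference with revealed behaviour: if a user keeps
    viewing profiles above her stated maximum age, relax the ceiling toward
    what she actually looks at. `weight` is how much behaviour counts
    relative to the stated preference (0 = trust words, 1 = trust actions)."""
    if not viewed_ages:
        return stated_max_age            # no behaviour yet: take her word for it
    observed_max = max(viewed_ages)      # crude; a percentile would be more robust
    if observed_max <= stated_max_age:
        return stated_max_age            # behaviour agrees with stated preference
    return round(stated_max_age + weight * (observed_max - stated_max_age))

# Stated: no one over 26. Observed: she often opens thirty-something profiles.
print(effective_age_ceiling(26, [24, 31, 33, 29, 34]))  # → 30
```

The point of the sketch is only the shape of the logic: actions quietly override words, which is exactly the filter-bubble mechanism Eli describes.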
The same principles are powering the recommendation engines at popular sites around the web. Amazon uses similar technology to recommend new products for people to buy, Pandora learns from likes and dislikes to customise its internet radio stations, and Netflix famously offered $1m to anyone who could improve the effectiveness of its algorithm by 10 per cent.
[+] $1 million would be a bargain price for an improved recommendation engine, which would increase customer satisfaction and generate more movie rental business.
Mix the reality of “don’t think how they feel, don’t say what they think and don’t do what they say” with online behaviour, and it turns out that you actually have decided (through your behaviour versus your more or less stated intent) what goes in: you went from being a blank slate the website knows nothing about to providing them with information on what you like, so they can improve the quality of results from the millions of things out there. It’s similar to deciding with your wallet that you like one shop’s offering over another’s, then being surprised when the choice gets narrowed down (i.e. high street shops disappearing).
If you tell Google you’re ‘searching’ for Egypt, it won’t know whether you want a holiday, you happen to live near a business called ‘Egypt plc’ or want to know more about the Arab Spring. If your behaviour revolves around looking for holidays, then why are we so surprised that we get burnt when we throw ourselves onto the fire?
A few posts back, from an interview with a Pinterest investor, something sounded familiar:
95% of the time we use Google, we’re doing things that are completely unmonetizable. If I Google you, for example, I’m not doing anything that’s going to lead to a purchase. So most of the time, it’s about research and information.
But that’s fine. Google serves all those needs. Because if it didn’t, we would get in the habit of searching somewhere else. But in the 5% of instances, where we’re searching for something commercial, an airline ticket or a Valentine’s Day present, for example, Google monetizes the heck out of those opportunities, and their business model works incredibly well. Those searches are monetized so well that Google’s more than happy to play a public service role the other 95% of the time to help us find the information we’re looking for.
Google’s mission was to catalogue the world’s information and make it available – and, in the process, to try to understand which part of it you want to access (the 95%) while figuring out what ads to serve you (the 5%). To perfect the algorithm, everyone practices information extraction: recovering structured data from formatted text, identifying fields and the relationships between them, taking into account the content itself, the text just before and just after an item, and many other signals. This is a huge computational effort and involves a lot of probability.
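As a toy illustration of that kind of extraction (the listing text, field names and patterns below are all invented, and real systems use probabilistic models rather than three regexes):

```python
import re

# Recover structured fields from formatted text, using the text just
# before each item ("Price:", "Condition:", ...) as the cue for what
# the item means -- a drastically simplified sketch of field extraction.
listing = "Canon EOS 600D - Price: £399.99, Condition: used, Ships from: London"

patterns = {
    "price":     r"Price:\s*£?([\d.]+)",
    "condition": r"Condition:\s*(\w+)",
    "location":  r"Ships from:\s*([\w ]+)",
}

record = {field: m.group(1) for field, pat in patterns.items()
          if (m := re.search(pat, listing))}
print(record)  # {'price': '399.99', 'condition': 'used', 'location': 'London'}
```

At web scale the cues are noisy and ambiguous, which is where the “lot of probability” comes in: instead of a hard-coded pattern, you score candidate interpretations and keep the most likely one.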
Or, if you look at things like Zite, +1 buttons and likes, you can just give them that information yourself – tell them what you’re interested in. It’s not exactly mirror, mirror on the wall reflecting your own interests and biases back at you, but there’s simply too much stuff out there to sift through unaided.
“The larger lesson is that the increasing complexity of human knowledge, coupled with the escalating difficulty of those remaining questions, means that people must either work together or fail alone”
Throw yourself onto the fire, and be shocked at the existence of a ‘filter bubble’. Or perhaps not.
So, what I don’t quite get yet:
- It seems that it’s not the filter bubble that bothers people most – it’s the fact that there’s no reset button for everything that websites know about you. If you think about making money, then why should there be?
- It doesn’t sound like a moral beef so much as a cry for competition: a place that doesn’t want to know you. But there are browsers that don’t track you, and there’s always the possibility of buying from a physical shop; we’ve just decided that’s not important for us right now (internet sales were 12% of all UK retail sales in January 2012).
- Another interesting question which someone raised somewhere, but can’t remember where: If you’re testing your algorithm on the same data you used to train it, how could you possibly know if it’s any good?
- Is it really better to be asked for permission first (e.g. Zite and the like) instead of being opted in by default? Apparently so.
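On the testing-on-training-data question above, the short answer is: you can’t know. A minimal sketch (the parity-labelling task is invented for illustration): a “model” that simply memorises its training data looks flawless when evaluated on that same data, which is exactly why the score is meaningless.

```python
import random

random.seed(0)

# Toy task: label each number with its parity, then "train" a model
# that is nothing but a lookup table of the examples it has seen.
data = [(x, x % 2) for x in range(100)]
random.shuffle(data)
train, test = data[:80], data[80:]

memorised = {x: y for x, y in train}        # the "model": pure memorisation

def predict(x):
    return memorised.get(x, 0)              # guess 0 for anything unseen

train_acc = sum(predict(x) == y for x, y in train) / len(train)
test_acc  = sum(predict(x) == y for x, y in test) / len(test)
print(train_acc)  # 1.0 -- perfect on the data it was trained on
print(test_acc)   # noticeably worse on data it has never seen
```

Holding out data the algorithm never saw during training is the only honest measure, which is why the Netflix Prize was judged on a hidden test set rather than the ratings contestants trained on.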
*more or less