Richard Howell-Peak

My first exposure to ‘big data’, although we didn’t call it that, was in the insurance industry, where we worked with very large data sets containing millions of records. Insurance is an interesting product, because you don’t know the cost of the product before you sell it. That makes it unlike almost any other product. Say you’re baking cakes. You know how much the flour, the eggs, the sugar and so on cost. You know how many cakes you can make with that amount, you know your fixed costs for running the business (rent, rates, insurance etc.), and you know the price of the gas or electricity for running the ovens.

You can add this all up, divide it by how many cakes you’re making, and you know what each cake costs. You know all this before you sell a single cake, so you can price each cake above this amount and guarantee a profit.

An insurance policy, however, is entirely different. It’s an agreement to cover potential future costs for a fee: the premium. So when you sell an insurance policy, its cost is unknown at the time of sale. You may find out later that you sold it far too cheaply and lost a lot of money, or that you were selling it far too expensively and putting off customers unnecessarily. This is where the actuaries and pricing teams came in. The actuaries would pore over tens of thousands of historic claims, building models to predict the average cost of a customer. It was surprisingly accurate. Sure, from individual to individual, mistakes would be made. But over the long term, things would even out quite well and we’d often be close to our predictions. Each year we could feed the new data back into the model and refine it, getting better and better over time.

We would also bring in new data sources constantly, and some interesting things popped up. One funny example is that people who said they kept their cars in garages were much more likely to make expensive claims for their cars being damaged while parked at the roadside. At first glance this seems paradoxical. Surely a car kept in a garage is much less likely to be damaged at the roadside. That’s the entire point of keeping it in a garage. Except people know this, and so they lie on the insurance form and select the ‘I keep my car in a locked garage’ option to try to game the system and get a cheaper premium. And so the majority of the policies with ‘cars in the garage’ actually belonged to people in high-risk areas who, when first quoted, were rightly being charged higher premiums, and so lied to try to save some money.

Legality aside, there is a moral issue here as well. Claims are paid not by the insurance company, but by the premiums of the policyholders who do not claim. If the premiums are high, it’s often because the money is needed to pay for the added risk of that type of policy. Insurance companies just calculate that risk, add a few percentage points for profit, and then sell the insurance. So when people cheat and try to defraud insurance companies, they are really just defrauding other members of the community, who are then forced to pay higher premiums to cover the additional costs.
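The pricing logic described above (work out the risk, then add a few percentage points for profit) can be sketched in a few lines of Python. This is only an illustration of the idea, not how any real pricing team does it: the function names, the 5% margin and the example figures are all assumptions invented for the sketch.

```python
def expected_claim_cost(claim_probability, average_claim_cost):
    """Risk cost of one policy: chance of a claim times its average size."""
    return claim_probability * average_claim_cost


def price_policy(claim_probability, average_claim_cost, profit_margin=0.05):
    """Premium = expected claim cost plus a small profit loading.

    The 5% default margin is a made-up figure for illustration only.
    """
    risk_cost = expected_claim_cost(claim_probability, average_claim_cost)
    return round(risk_cost * (1 + profit_margin), 2)


# Hypothetical example: a 10% chance of a claim averaging 2,000.
honest_premium = price_policy(0.10, 2000)

# If the customer lies and the model believes the risk is only 5%,
# the premium falls, and the shortfall has to be recovered from
# everyone else's premiums.
understated_premium = price_policy(0.05, 2000)
shortfall = round(honest_premium - understated_premium, 2)
```

The shortfall in the last line is the amount that honest policyholders end up covering when someone understates their risk.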

Another interesting finding from an innovative data source was that people who buy their insurance more than one week before the previous policy expires are much less likely to claim than people who buy in the final week, or especially on the final day. So let’s say your policy runs out on the 15th of May. If you buy well in advance, say on the 1st of May, you are significantly less likely to claim than if you buy on the 14th or 15th of May. The reason for this may not be immediately obvious, until you realise that insurance is a people industry, and that one of the biggest risk factors in selling an insurance policy is the person you are selling to. Someone who purchases insurance at the last minute is quite likely to be disorganised and irresponsible compared to someone who purchases well in advance. The last-minute buyer is more likely to leave the gas on, less likely to get electrical checks done on their property, and more likely to leave repairs outstanding that could lead to water damage. These are all things that the insurance company will need to pay for in the future.

And as I said, these methods were surprisingly accurate. The European Union brought in an anti-discrimination law that prevented insurance from being priced differently for men and women. Personally, I wasn’t in favour of this. My first reason was that it is completely fair to charge men and women differently, because the risk is actually different and pricing according to risk is the entire point of insurance. It is not sexist or discriminatory to do this; it is just cold, hard statistical fact. Insurance pricing is by nature discriminatory, and if you want to take the objection to its logical conclusion, the correct thing would be to charge everyone the same fee. Regardless of how responsible someone is and how unlikely they are to claim, they would be charged the same as someone who is extremely reckless and doesn’t take the same care as they do.

The second, and more practical, reason I was against it was that it was a total waste of time. The law stated that we were not allowed to use the person’s gender to price the insurance, and we didn’t. The interesting thing was, we didn’t need it. Men’s and women’s behaviour is so consistent that we could infer a person’s gender by looking at other factors, such as make and model of car, age of car, colour of car, occupation and so on. We were able to predict someone’s gender with 80% accuracy, based on what they did, what car they drove, where they lived and so on. And so we transferred the price loadings onto these other variables, and charged everyone basically the same premiums as before. It was a two-year project that ultimately achieved nothing.

Still, it just goes to show the power of the law of large numbers, and the benefits insurance companies were reaping from it before it was given the buzzword ‘big data’. More recently, companies like Facebook, Google and Amazon have been applying similar techniques, and in some ways more advanced ones, to answer the age-old question: “What will this person get their wallet out for and buy from me?” It is the question every marketer and salesperson on planet Earth wants answered. By using similar statistical analysis, and by leveraging more modern techniques such as machine learning, these companies harvest vast quantities of data about their potential customers and use it to predict what they are likely to buy. It’s the exact same approach insurance companies have used for centuries. The insurer asks: what is the probability this person will claim, and how much is that claim likely to be? The marketer asks: what is the probability this person will buy, and how much are they likely to spend?

And they hold a frightening amount of data about you. By tracking phone activity, Google can infer the times when you go to sleep and wake up, and build a model of your sleep schedule. They can also monitor what you do and spend on the internet, thanks to the genius that is Google Analytics (code that runs on a huge number of websites and tracks much of what you do on them). They might realise that when you are tired late at night, you are more likely to purchase electrical goods. They will then bombard you with these advertisements late in the evening. They might realise that people in your demographic are more likely to book holidays when it’s raining, and so show you pictures of sunny beaches and so forth when your local weather station is reporting rain.

The amount of data being collected, and being published, is vast. There are APIs everywhere: APIs for tracking weather, shipping movements, housing data, government statistics, the stock market, customer behaviour. There are APIs for controlling remote computer programs, so that vast networks of automated bots can be built to harvest and monitor things, and execute commands in response. This is the ‘internet of things’, where big data and portable hardware meet to create a dystopian future in which everything is tracked, logged, analysed, then automatically processed to try to make some more money. In recent years, large corporations have cottoned on to this fact, and are investing large sums of money in harnessing the power of ‘big data’. The big problem for small companies, then, is how to compete. They don’t have a multi-million dollar budget to splurge on researching new technologies, or experimenting with exciting new APIs or data sources. What they need is plug-and-play solutions that level the playing field, giving them access to big data and machine learning, and letting them harness the power of data-driven, artificially intelligent systems.

There is huge potential for small businesses to start leveraging this technology as it becomes available. Wouldn’t you like to know which of your customers are most likely to stay and pay more money? Wouldn’t it be great if you found a common denominator among the customers who never come back? Maybe they are all a certain type of customer, or a certain age. Maybe it turns out that most of the people who try your product and don’t come back are middle-aged men making small purchases. Once you know this, you can think about what is driving it. Maybe this type of person is looking for something else, maybe they need more time or a longer trial, maybe they actually wanted something more expensive but you didn’t offer it. You can now target your actions at finding and solving the problem for this customer segment, and reap the rewards. Maybe you set up an automatic system that detects this type of customer and emails them three days later with a special discount.
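Finding that kind of common denominator does not need a multi-million dollar budget. Below is a minimal Python sketch of the idea: group customers into segments and compare how often each segment fails to come back. Everything in it is an assumption made for illustration, including the field names (`age`, `gender`, `spend`, `returned`), the age bands, the 20-unit threshold for a ‘small’ purchase and the six made-up customer records.

```python
from collections import defaultdict


def segment_of(customer):
    """Assign a customer to a (age band, gender, basket size) segment.

    The bands and the 20-unit 'small purchase' threshold are arbitrary.
    """
    if customer["age"] < 40:
        age_band = "under 40"
    elif customer["age"] < 60:
        age_band = "40-59"
    else:
        age_band = "60+"
    basket = "small" if customer["spend"] < 20 else "large"
    return (age_band, customer["gender"], basket)


def churn_rate_by_segment(customers):
    """Return the share of customers in each segment who never came back."""
    totals = defaultdict(int)
    lost = defaultdict(int)
    for customer in customers:
        segment = segment_of(customer)
        totals[segment] += 1
        if not customer["returned"]:
            lost[segment] += 1
    return {segment: lost[segment] / totals[segment] for segment in totals}


# Made-up records, purely to show the shape of the data.
customers = [
    {"age": 45, "gender": "M", "spend": 8.0, "returned": False},
    {"age": 52, "gender": "M", "spend": 5.0, "returned": False},
    {"age": 48, "gender": "M", "spend": 60.0, "returned": True},
    {"age": 29, "gender": "F", "spend": 12.0, "returned": True},
    {"age": 33, "gender": "F", "spend": 7.0, "returned": False},
    {"age": 31, "gender": "F", "spend": 9.0, "returned": True},
]

rates = churn_rate_by_segment(customers)

# List the segments with the worst churn first.
worst_first = sorted(rates.items(), key=lambda item: item[1], reverse=True)
```

With real data you would want far more than a handful of records per segment before trusting the rates, for the same law-of-large-numbers reason the insurance models needed thousands of claims.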

Big Data for Small Businesses