Blog

Big Data for Small Businesses

My first exposure to ‘big data’, although we didn’t call it that, was in the insurance industry. We did however work with very large data sets, with millions of records. Insurance is an interesting product, because you don’t know the cost of the product before you sell it. It’s unlike any other product. Say you’re baking cakes, you know how much the flour cost, the eggs, the sugar etc. You know how many cakes you can make with that amount, you know your fixed costs for running the business (rent, rates, insurance etc.), and you know the price of the gas or electricity for running the ovens.

You can add this all up and divide it by how many cakes you’re making, and you know each cake cost a certain amount. You know all this before you even sell any cakes, and then you can price the cake to be higher than this amount and guarantee a profit.

An insurance policy is entirely different however. It’s an agreement to cover potential future costs for a fee – the premium. So when you sell an insurance policy the cost of it is unknown at the time of sale. You may find out later that you sold it far too cheaply, and lose a lot of money, or that you were selling it far too expensively, and putting off customers unnecessarily. This is where the actuaries and pricing teams came in. The actuaries would pore over tens of thousands of historic claims, building models to predict the average cost of the customer. It was surprisingly accurate. Sure, individual to individual, mistakes would be made. But over the long-term, things would even out quite well and we’d often be close to our predictions. Each year we could feed the new data back into the model and refine it, and get better and better over time.
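To make the "things even out" point concrete, here is a minimal simulation sketch. The claim probability (10%) and the average claim size (2000) are made-up numbers for illustration, not figures from any real insurer, and the exponential claim-size distribution is simply a convenient assumption.

```python
import random

def simulate_average_cost(n_policies, claim_prob=0.1, mean_claim=2000.0, seed=0):
    """Average claim cost per policy across a simulated book of policies.

    claim_prob and mean_claim are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_policies):
        # Most policies never claim; the few that do cost a random amount.
        if rng.random() < claim_prob:
            total += rng.expovariate(1.0 / mean_claim)
    return total / n_policies

# The true expected cost per policy under these assumptions.
expected_cost = 0.1 * 2000.0

for n in (100, 10_000, 200_000):
    avg = simulate_average_cost(n)
    print(f"{n:>7} policies: average cost {avg:8.2f} (expected {expected_cost:.2f})")
```

Any single policy costs either nothing or a lot, but the average over a large book should sit close to the expected figure, which is what makes pricing possible at all.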

We would also bring in new data sources constantly, and some interesting things popped up. One funny example: we found that people who said they kept their cars in garages were much more likely to make expensive claims for their cars being damaged while parked by the roadside. At first glance this seems paradoxical. Surely a car kept in a garage is much less likely to be damaged by the roadside. That’s the entire point of keeping it in a garage. Except people know this, and so they lie on the insurance form and select the ‘I keep my car in a locked garage’ option to try and game the system and get a cheaper premium. And so the majority of the policies with ‘cars in the garage’ were actually held by people in high-risk areas who, when first quoted, were rightly being charged higher premiums, and so lied to try and save some money.

Legality aside, there is a moral issue here as well. Claims are paid not by the insurance company, but by the premiums of the policyholders who do not claim. If the premiums are high, it’s often because the money is needed to pay for the added risk of that type of policy. Insurance companies just calculate that risk, add a few percentage points for profit, and then sell the insurance. So when people cheat and try to defraud insurance companies, they are really just defrauding other members of the community, who are then forced to pay higher premiums to cover the additional costs.

Another interesting example of an innovative data source was that people who buy their insurance more than one week before the previous policy expires are much less likely to claim than people who buy in the same week, or especially on the same day. So let’s say your policy runs out on 15th May: if you buy before the 1st May, you are significantly less likely to claim than if you buy on the 14th or 15th of May. The reason for this may not be immediately obvious, until you realise that insurance is a people-industry, and that one of the biggest risk factors in selling an insurance policy is the person you are selling to. Someone who is purchasing last-minute insurance is more likely to be disorganised and irresponsible than someone who is purchasing well in advance. This person is more likely to leave the gas on, less likely to get electrical checks done on their property, and more likely to leave repairs outstanding that could lead to water damage. These are all things that the insurance company will need to pay for in the future.

And like I said, these methods were surprisingly accurate. The European Union brought in an anti-discrimination law that prevented insurance being priced differently for men and women. Personally, I wasn’t in favour of this. Partly this was because it is completely fair to charge men and women differently: the risk is actually different, and pricing according to risk is the entire point of insurance. It is not sexist or discriminatory to do this, it is just cold, hard statistical fact. Insurance pricing is by nature discriminatory, and if you take the anti-discrimination argument to its logical conclusion, the correct thing would be to charge everyone the same fee. Regardless of how responsible someone is and how unlikely they are to claim, they would be charged the same as someone who is extremely reckless and doesn’t take the same care as they do.

The second, and more practical, reason I was against it was that it was a total waste of time. The law stated that we were not allowed to use the person’s gender to price the insurance, and we didn’t. The interesting thing was, we didn’t need it. Men’s and women’s behaviour is so consistent that we could infer a person’s gender by looking at other factors, such as make and model of car, age of car, colour of car, occupation etc. We were able to predict someone’s gender, to an accuracy of 80%, based on what they did, what car they drove, where they lived and so on. And so we transferred the price loadings onto these other variables, and charged everyone basically the same premiums as before. It was a 2 year project that ultimately achieved nothing.

Still, it just goes to show the power of the law of large numbers, and the benefits insurance companies were reaping from this before it was given the buzz word of ‘big data’. More recently, companies like Facebook, Google and Amazon have been applying similar techniques, and in some ways more advanced techniques, to answer the age old question - “what will this person get their wallet out for and buy from me”. The question every marketer and salesperson on planet Earth wants to know the answer to. By using similar statistical analysis, and also by leveraging more modern techniques such as machine-learning, these companies harvest vast quantities of data about their potential customers, and use that to predict what they are likely to buy. It’s the exact same approach used by insurance companies for centuries – what is the probability this person will claim, and how much is that likely to be? Conversely, what is the probability this person will buy? And how much are they likely to spend?

And they hold a frightening amount of data about you. By tracking phone activity, Google can infer the times when you go to sleep and wake up. They can build a model of your sleep schedule. They can also monitor what you spend on the internet, thanks to the genius that is Google Analytics (code that exists on almost every website on the internet and tracks everything you do). They might realise that when you are tired late at night, you are more likely to purchase electrical goods. They will then bombard you with these advertisements late in the evening. They might realise that people in your demographic are more likely to book holidays when it’s raining, and so show you pictures of sunny beaches and so forth when your local weather station is reporting rain.

The amount of data being collected, and being published, is vast. There are APIs everywhere: APIs for tracking weather, shipping movements, housing data, government statistics, the stock market, customer behaviour. There are APIs for controlling remote computer programs, so that vast networks of automated bots can be built to harvest and monitor things, and execute commands in response. This is the ‘internet of things’, where big data and portable hardware meet to create a dystopian future where everything is tracked, logged, analysed, then automatically processed to try and make some more money. In recent years, large corporations have cottoned on to this fact, and are investing large sums of money in harnessing the power of ‘big data’. The big problem then for small companies is, how do they compete? They don’t have a multi-million dollar budget to splurge on researching new technologies, or experimenting with exciting new APIs or data sources. What they need is plug-and-play solutions that can level the playing field, giving them access to big data and machine learning, and letting them harness the power of data-driven artificial intelligence.

There is huge potential for small businesses to start leveraging this technology as it becomes available. Wouldn’t you like to know which of your customers are most likely to stay and pay more money? Wouldn’t it be great if you found a common denominator among the customers that never come back? Maybe they are all a certain type of customer, or a certain age. Maybe it turns out that most of the people that try your product and don’t come back are middle-aged men making small purchases. Once you know this, you can think about what is driving it. Maybe this type of person is looking for something else, maybe they need more time or a longer trial, maybe they actually wanted something more expensive but you didn’t offer it. You can now target your actions at finding and solving the problem for this customer segment, and reap the rewards of it. Maybe you put automatic systems in place that detect this type of customer, and automatically email them 3 days later to offer them a special discount.
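As a rough sketch of what "finding the common denominator" could look like in practice, the snippet below groups a handful of customers into segments and works out what share of each segment never came back. The customer records, the age bands and the cut-off of 20 for a "small" first purchase are all invented for illustration; a real business would load its own sales data instead.

```python
from collections import defaultdict

# Hypothetical records: (age_band, gender, first_order_value, came_back)
customers = [
    ("40-59", "M", 8.0, False),
    ("40-59", "M", 12.0, False),
    ("40-59", "M", 15.0, False),
    ("40-59", "M", 9.0, True),
    ("40-59", "M", 60.0, True),
    ("40-59", "M", 85.0, True),
    ("18-39", "F", 10.0, True),
    ("18-39", "F", 14.0, False),
    ("18-39", "F", 18.0, True),
    ("18-39", "M", 45.0, True),
    ("18-39", "M", 30.0, False),
    ("40-59", "F", 25.0, True),
    ("40-59", "F", 70.0, True),
]

def segment_of(record, small_order_limit=20.0):
    """Label a customer by age band, gender and size of first purchase."""
    age_band, gender, first_order, _ = record
    size = "small" if first_order < small_order_limit else "large"
    return (age_band, gender, size)

def churn_by_segment(records):
    """Return {segment: share of customers who never came back}."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [lost, total]
    for record in records:
        segment = segment_of(record)
        counts[segment][1] += 1
        if not record[3]:
            counts[segment][0] += 1
    return {seg: lost / total for seg, (lost, total) in counts.items()}

rates = churn_by_segment(customers)
worst = max(rates, key=rates.get)
print("Segment losing the most customers:", worst, f"({rates[worst]:.0%} lost)")
```

With only a few customers per segment the percentages are noisy, so a result like this is a prompt to investigate rather than a conclusion.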

(Published on 27 Dec 2019)


Compounding - 8th Wonder of the Universe

I’ve recently become very fascinated with this idea of compounding. It reminds me of the classic school maths question of simple interest vs compound interest. You know how it goes. Stuffy exam hall, sun beaming in through the windows, the guy next to you with the squeaky desk that makes you want to get up and throw him out of his chair every time he starts rubbing out one of his answers:

14) Bill has £100 to invest. He gets two offers from different banks. One is for 10% per year simple interest over 10 years. The other is for 5% compound interest over 10 years. Which should he go for?
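For anyone who wants to check Bill's options, here is a small sketch of the two calculations. It assumes the compound interest is added once a year, which the exam question doesn't actually state.

```python
def simple_interest(principal, rate, years):
    """Final balance when interest is paid on the original amount only."""
    return principal * (1 + rate * years)

def compound_interest(principal, rate, years):
    """Final balance when interest is added once a year and earns interest itself."""
    return principal * (1 + rate) ** years

def first_year_compound_wins(simple_rate, compound_rate, max_years=200):
    """First whole year in which the compound offer overtakes the simple one."""
    for year in range(1, max_years + 1):
        if (1 + compound_rate) ** year > 1 + simple_rate * year:
            return year
    return None

simple_total = simple_interest(100, 0.10, 10)
compound_total = compound_interest(100, 0.05, 10)
print(f"10% simple for 10 years:   £{simple_total:.2f}")
print(f"5% compound for 10 years:  £{compound_total:.2f}")
print("Compound overtakes simple in year", first_year_compound_wins(0.10, 0.05))
```

Over 10 years the simple offer wins comfortably, £200 against roughly £162.89. The compound offer only pulls ahead in year 27, which is the slow start that makes compounding so easy to underestimate.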

It’s kind of an artificial question, because they’re not including other options like stocks, or taking into account how much short-term debt Bill may have built up on his credit cards chasing women and living the high-life. But this deceptively simple question touches upon one of the most interesting things about living itself. The forces of compounding.

Another way to look at it, is what are called feedback loops. The term comes from the unpleasant sound you get by holding a microphone too close to a loudspeaker. The microphone picks up sound energy, converts it to electrical signals, feeds that signal to the speaker, which amplifies it and then converts it back to sound energy, which feeds the microphone etc. Etc. We have A feeding B, and B feeding A. This feedback loop then compounds and rapidly escalates the noise to the ‘who the bloody hell is doing that?’ level.

Ok so back to the maths. Simply put, this is the difference between an arithmetic (or linear) progression and a geometric (or exponential) progression. A linear progression comes from adding the same amount over and over again. Something like this:

3 + 3 + 3 + 3 + 3 + 3...

And if you plot this on a graph you get a straight line. A geometric progression is repeatedly multiplying by the same amount, so something like

3 × 3 × 3 × 3...

And if you plot this on a graph you get a nice upwards curve. But this curve starts off slow, then gets faster and faster and faster. The effects feed back on each other, and they compound to produce a VERY large effect.
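A quick sketch of the two progressions side by side, using the same number 3 as above. The function names are just illustrative.

```python
def arithmetic(start, step, n):
    """First n terms when you keep adding the same step."""
    return [start + step * i for i in range(n)]

def geometric(start, ratio, n):
    """First n terms when you keep multiplying by the same ratio."""
    return [start * ratio ** i for i in range(n)]

linear_terms = arithmetic(3, 3, 6)
exponential_terms = geometric(3, 3, 6)

print("Adding 3 each time:     ", linear_terms)
print("Multiplying by 3 each time:", exponential_terms)
```

After six steps the adding version has reached 18 while the multiplying version has reached 729, and the gap widens with every step after that.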

Ok that’s the end of the maths I promise. But the maths is important nonetheless, to understand the principle we then apply to other areas. Another comment that has stuck in my mind recently is one from Grant Cardone – “middle class people learn how to add, wealthy people learn how to multiply”. And I have a feeling he’s driving at the same principle here.

Let’s look at some interesting feedback loops I’ve observed or even experienced myself. I was actually pretty decent at maths when I was at school (and went on to get a Masters Degree as well!) but I was always a bit miffed at how I was perceived as ‘gifted’, or ‘lucky’ by other people in my class. They couldn’t understand how I often knew the right answers. What they didn’t know is in my free time at home, I would read the textbooks we were given, or try things and be actively pursuing the knowledge in my own free time.

I did, basically, work harder than they did, and ultimately got the rewards of that. But I think compounding is an important factor. You see, when you first start learning a subject, it is difficult. The difficulty of it makes you feel stupid, serves to demotivate you, and that leads to people investing less time and effort into understanding the subject.

This lack of time and effort invested only serves to cement the incompetence, which then produces bad results and reinforces the lower levels of motivation. It’s a negative feedback loop, and one that can be hard to get out of. I should know, I’ve tutored numerous students in this situation with maths. There is a way out however, and the way out is to begin a positive feedback loop that will counter the negative one. The solution is one I got from L. Ron Hubbard’s book, “The Problems of Work”, in which he says to find ONE thing, no matter how simple, that someone can do, and stick with that until they can do it.

That sense of accomplishment they get from mastering one task, serves to motivate them to try something else. And you then stick with that one thing until they get that. Then the confidence builds, and the person is on the road to recovery. They are on a positive feedback loop that will pick up momentum and carry them over the line.

I think life is like this. I think everything we do has feedback loops. Things start slow, so people don’t feel there is any progress. What they don’t understand is they are starting off at a 1% growth rate on a tiny principal of $25.

Small multipliers of small amounts don’t result in much change. But, over time, feedback loops develop and start to build powerful forces that cannot be stopped. Let’s take a simple example. Let’s say there is someone down on their luck, working a minimum wage job with no opportunities.

One response to this problem is to get down about it, lose hope, start drinking and smoking, and give up on themselves. Ten years down the line and this person could be a total mess. Drinking is expensive, and it impacts your health and mental health, which in turn makes it harder to take advantage of opportunities. The added expense means they haven’t saved any money and can’t invest in things that could create opportunities for themselves. Etc. Etc. They can quickly get into a negative feedback loop that traps them in a hell-hole of a life.

I’m actually writing this article sat in a public library right now surrounded by books. Each and every one of them is free, all you have to do is walk in and start reading.

Imagine this same person, instead of going for a drink every evening after work, came to the library to read for 1 hour. That’s all, 1 hour a day. And then let’s say 5 hours over the weekend. Doesn’t cost any money, doesn’t cost anything they don’t have, just a bit of their time. Well that’s 10 hours a week, or roughly 500 hours a year. So over that same decade that would be 5000 hours of time invested in their own mind and knowledge.

I would be very surprised if that person hadn’t learned something new and valuable they can take into the market place and make more money with it. Very surprised. And all that money they didn’t spend on drink, could be invested in a nice stock portfolio worth $50,000-$100,000 earning them passive dividend income.

Gary Vee always talks about patience, and I think this is also what he might be driving at. Patience to recognise that feedback loops take time to develop. I think a vital part of success in life is to recognise negative feedback loops when they develop and nip them in the bud before they grow out of control. Another key ingredient is to plant seeds and invest in positive feedback loops, and have the patience to wait until they bear fruit, then reinvest the results back into the feedback loop.

I think the ingredients to success are well known. Most people know they should exercise more, drink less, eat better, control their expenses better, learn more, work harder, be kinder to people, improve their attitude. What I suspect happens to a lot of people is they get 1 month into a project like that, don’t see any results, and throw in the towel. And that is why they are not successful. They do not understand compounding and how it works. It’s like a child planting a seed, then coming back every ten minutes to see how it’s getting along. And at the end of the day concluding that the seeds were no good and throwing them, and the soil, in the bin.

Now, we have to guard against one potential trap. And that is being SO patient with something that actually has very little positive benefit. Don’t invest $1000 in a bank account paying 1.00000001% interest. It will take a million years before you see the benefits of that investment! Luckily for us, we don’t have to take those kinds of risks. Like I said, the ingredients to success are well known. Educate yourself, push yourself, keep good company, eat well, take care of yourself, exercise discipline and self-control, work hard, keep your spirits up, help other people. Keep doing these things and in 10 years’ time you are almost guaranteed success. The compounding effect will be too strong to be undone.

(Published on 14 Nov 2019)


Why Sky Sports News is wrong about Cristiano Ronaldo

In this article here- SKY SPORTS

Mr Balague makes a pretty hasty claim that Ronaldo's days as a goal-scoring machine are behind him. In this article we'll take a closer look at that claim and see if it stacks up to mathematical scrutiny. This is an actual piece of maths I did with a student recently in a statistics lesson. Ok, so first of all, let's get the numbers. To simplify matters we'll just look at league goals over his entire career:

Which if you graph looks like this:

Pretty decent. Now we can see that from 2010 up until 2015, he's been pretty consistent in his goal scoring capabilities. Let's take that then to be the mean rate of goals, which works out to roughly 1.164 goals per game. We model his goals as a Poisson distribution with a mean rate of 1.164 per game, and take this to be the null hypothesis. The alternative hypothesis then is that the mean is less than that. Let's do a standard one-tailed test at the 5% significance level. Now let's look at the most recent season gone. In this he's played 28 league games, which means if he was scoring at the mean rate we would expect roughly 32 goals. So to test this we need to establish the probability of him scoring a measly 27 goals or fewer when his supposed average is 32. In mathematical language then, what is P(X <= 27) when X~Po(32)?

Since the Poisson tables don't go up to 32, we approximate using a Normal distribution N(32, 32). The second number is the variance, so the standard deviation is √32, or about 5.66. To look this up in the tables we must translate it to the standard Normal distribution, so z = (27 - 32) / √32 = -0.88 roughly. Looking up 0.88 in the Normal tables gives a probability of 81%. But remember the tables only give us the value for the positive side. As in the diagram below we've essentially worked out the probability it is less than +0.88, which is the orange region, but we want the probability it is less than -0.88, which would be the blue region.

Hopefully it's easy to see that they both add up to one, so we must take the result to be 19%. So what have we established? Well according to the null hypothesis, we would have a 19% chance of seeing a goal scoring record at least as poor as the one we have seen from Ronaldo this season. This is well above the 5% significance level we allowed ourselves. So under these assumptions, we are forced to conclude that we do not have enough evidence to support Mr. Balague's claim that Ronaldo is past his best. I'm quite disappointed to see such a lack of mathematical rigour in our sports analysis. In the future I'd like to see football pundits backing up their wild claims with some sound statistical analysis, rather than what has now been made clear is just opinion and nothing else. With role models like these football pundits, it's no wonder mathematical standards in schools are dropping at such an alarming rate.
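If you'd rather not rely on tables, the same test can be sketched in a few lines of Python. This computes P(X <= 27) for X~Po(32) exactly by summing the Poisson probabilities, and also via the Normal approximation with standard deviation √32. Like the working above, the approximation skips the continuity correction, so the two answers will differ a little. The function names are my own.

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Po(lam), built up term by term."""
    term = math.exp(-lam)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        total += term
    return total

def normal_cdf(z):
    """P(Z <= z) for the standard Normal distribution."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

expected_goals = 32   # mean under the null hypothesis
observed_goals = 27   # goals actually scored this season
significance = 0.05

z = (observed_goals - expected_goals) / math.sqrt(expected_goals)
p_normal = normal_cdf(z)
p_exact = poisson_cdf(observed_goals, expected_goals)

print(f"z = {z:.2f}")
print(f"Normal approximation: P(X <= 27) = {p_normal:.3f}")
print(f"Exact Poisson:        P(X <= 27) = {p_exact:.3f}")
print("Reject the null hypothesis?", p_exact < significance)
```

Both probabilities come out far above 5%, so the conclusion doesn't hinge on which method you use.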

(Published on 15 Mar 2016)