If the law of averages holds good in the day-to-day cases we see, should we not name it something else! :)
Long ago, the curriculum in Indian intermediate schools introduced us to the 'average' for the very first time. Given five persons weighing 68kg, 71kg, 75kg, 73kg and 70kg, we learnt to calculate their 'average' weight as 71.4kg. Close enough! We kept drawing a sphere of averages around every dataset we saw back then, and we were never less than proud of the outcome.
A few years on, we came across the terms 'mean', 'median' and 'mode'. 'Mean' had nothing mean about it. It was the common man's average with a glittering new name. We learnt to adapt to this new term quickly and used it in discussions just to bring an air of intellect along with it.
We learnt that the mode was nothing but the most frequently occurring value in a set, while the median was the middle term of a sorted set (or the average of the two middle terms when the count is even). We were no less than statisticians now that we knew the 'trinity'.
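The three definitions above can be sketched in a few lines of Python using the standard `statistics` module. The weights are the ones from the schoolbook example; the `shoe_sizes` list is a made-up sample, added only because the weights contain no repeated value to act as a mode.

```python
import statistics

# The five weights (in kg) from the schoolbook example.
weights = [68, 71, 75, 73, 70]

mean_weight = statistics.mean(weights)      # sum of the values / number of values
median_weight = statistics.median(weights)  # middle term of the sorted values

# Hypothetical sample (not from the post): the weights have no repeated
# value, so the mode is illustrated on an invented list of shoe sizes.
shoe_sizes = [7, 8, 8, 9, 8, 10, 7]
mode_size = statistics.mode(shoe_sizes)     # most frequently occurring value

print("mean weight:", mean_weight)
print("median weight:", median_weight)
print("mode of shoe sizes:", mode_size)
```

For the weights, the mean and the median land almost on top of each other, which is why the choice between them never seemed to matter in school.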
It always troubled me that MEAN is not the 'real average' in most practical scenarios. I also knew that the median fit the role better (of defining the average) in most of the datasets I had handled till then. However, the why of it was what I had always wondered about. It was not until I started playing with numbers professionally that I had an answer.
I had started rejecting the notion of taking the mean and would rather go with the median in most of the analyses across the many projects I was involved in, and it never backfired. Was I just lucky, or was there something going on that I couldn't figure out?
I realized it was the variance in the data that called the shots. Real-world marketing data tends to have a lot of variance, and the plain mean fails miserably there. Take the example of a dataset holding the revenue of the merchants in a region. The variety (or variance, as statisticians call it) is huge. Let me help you imagine. Consider this region to have around fifty small merchants with revenues of a few thousand each. Additionally, there are a couple of mega merchants in the vicinity with revenues close to a million. If we find the mean of the group, we get an average close to four times what it would be without the two mega merchants. Do you realize how disturbed the mean value is? What happens if we take the middle term of this dataset instead? The two mega merchants barely shift it, so you move much closer to the 'real' value. Bang on. But don't just start applying MEDIAN in all business situations. Let's look at one more scenario before deciding the pick among the trinity.
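To see the effect in numbers, here is a minimal sketch of the merchant scenario. Every revenue figure below is an assumption invented for illustration (fifty small merchants spread between 3,000 and about 7,000, two mega merchants near a million), so the exact multiple by which the mean jumps depends entirely on the figures chosen.

```python
import statistics

# Assumed figures, not real data: fifty small merchants with revenues
# spread evenly from 3,000 up to 6,920.
small_merchants = [3000 + 80 * i for i in range(50)]

# Assumed figures: two mega merchants with revenues close to a million.
mega_merchants = [950_000, 1_000_000]

all_merchants = small_merchants + mega_merchants

mean_without = statistics.mean(small_merchants)
mean_with = statistics.mean(all_merchants)

median_without = statistics.median(small_merchants)
median_with = statistics.median(all_merchants)

print("mean without mega merchants:  ", round(mean_without))
print("mean with mega merchants:     ", round(mean_with))
print("median without mega merchants:", median_without)
print("median with mega merchants:   ", median_with)
```

Two extra rows out of fifty-two drag the mean far away from what a typical merchant earns, while the median moves by only one position in the sorted list.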
Consider a service company that uses service tickets to track all its services. The time taken to close a ticket is found to range between 3 and 48 hours. We already understand that with such high variation in the data, it's safer to ignore the mean. Had we considered the median, the middle term would have been nine hours. When we closely observe the durations, however, we see that three hours is the most common closure time: more tickets are closed in three hours than in any other duration. Let's assume that the business wants to benchmark its closure standards to measure performance. What would it consider: 9 hours or 3 hours? It would do well to set three hours as its benchmark closure time, wouldn't it? Hence, MODE gets the upper hand in this scenario.
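The ticket scenario can be sketched the same way. The durations below are hypothetical, chosen only so that they match the description: a range of 3 to 48 hours, a median of nine hours, and three hours as the most common closure time.

```python
import statistics

# Hypothetical closure times in hours (invented to fit the description:
# range 3 to 48, middle term 9, most frequent value 3).
closure_hours = [3, 3, 3, 3, 5, 7, 9, 12, 18, 24, 30, 36, 48]

mean_hours = statistics.mean(closure_hours)      # pulled upwards by the slow tickets
median_hours = statistics.median(closure_hours)  # middle term of the sorted durations
mode_hours = statistics.mode(closure_hours)      # most common closure time

print("range:", min(closure_hours), "to", max(closure_hours), "hours")
print("mean:", round(mean_hours, 1), "hours")
print("median:", median_hours, "hours")
print("mode:", mode_hours, "hours")
```

If several durations tied for most frequent, `statistics.multimode` would return all of them, which is worth checking before committing to a single benchmark.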
I haven't got into the quartiles & the range - looking at which we decide whether there is a need to 'clean' the data before we proceed with any analysis. This, I will take on some other day.
I like to draw an analogy between the trio of averages and the Hindu Trinity. We have the Lord of Creation, Brahma. He has created everything, is full of knowledge and is respected by all, and yet has hardly any temples to Himself. He is like the mean. It all started here. And yet it is often best not to choose the mean in real life!
The remaining duo of Vishnu & Mahesh, the Lord of Sustenance and the Destroyer of Maya respectively, are all-powerful, followed by millions, feared by many and loved by all. And yet you make a fool of yourself if you choose One over the Other. It's like the duo of Median and Mode. We need to carefully understand the data and the business problem we are trying to solve before jumping in to pick one of the duo (or, in some cases, of the trio).
I will end this post by reminding readers that Analysis is more Art than Science - here, nothing is absolutely right or absolutely wrong. It all depends on what you want to achieve from the analysis. Statistics, on the other hand, is hard numbers. The irony is that numbers do lie!
Happy Analysis!! :)