The batting average is perhaps the most discussed number in cricket: a single figure that seeks to encapsulate the overall ability of a batsman, used by pundits and watercooler debaters alike to make their point, and immortalised by Don Bradman's 99.94.
However, the same ubiquitous metric has long been debated among professional and amateur statisticians for its unusual definition: it is the average runs per dismissal, a definition that leaves open the question of how to treat not-outs.
Not-outs in cricket signify loose ends: open stories that could go anywhere. An unbeaten innings is taken as "incomplete" for the batting average; the runs get added to the numerator, but the denominator remains the same. How should such innings be dealt with statistically?
In 1993, a paper titled "A Statistical Analysis of Batting in Cricket" was published in the Journal of the Royal Statistical Society. Its authors proposed handling not-outs using survival analysis, a set of tools used in medical studies to deal with "censored" data. In a typical study, the time to the occurrence of a given event (usually the death of a patient) is observed for a large group of patients. An observation is censored when the study ends before the natural death of the patient. To me this was analogous to the runs scored by a batsman till dismissal, with a censored observation being a not-out, since the innings ends before a dismissal can take place.
In the realm of survival analysis, I found three related measures that could be applied to batting careers with enlightening results: the survival function, the hazard rate, and the mean lifetime, or what we commonly call life expectancy.
An oft-quoted metric for gauging batsmen is the rate at which they convert fifties to hundreds. Technically, this number is the conditional probability that a batsman scores a hundred given that he has already scored 50 runs. In the same spirit, I could define 20 runs as a "start", and then ask what the conversion rate of "starts" to hundreds is. You could define your own barrier of a certain score, and find the conversion rate from that barrier to another score.
To generalise this paradigm of conversion between any two scores for a batsman, we can look at the survival curve. This simply plots the probability that a batsman will survive past a certain score, for all scores. Rather than plotting the raw distribution of scores, we use a tool from survival analysis: the Kaplan-Meier estimator. Without going into the nuts and bolts, this technique accounts for the not-out innings when counting "survival", rather than treating them as dismissals*. Here is the survival curve for Don Bradman:
The value of the curve tells us the chance of Bradman surviving past that score, accounting for the unbeaten innings. If the curve dips somewhere, it means that the batsman got out more often around that score. Also, the score at which the curve has a value 0.5 is the median score for that batsman.
Moreover, by looking at the curve values at any two scores and dividing them, you can gauge the effective conversion rate between those two scores for that batsman, again, accounting for the uncertainty brought by the not-out innings. An average is a single statistic, a conversion rate is defined between any two points, but this curve splits a batsman's game open through the prism of his scoring tendencies.
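To make the idea concrete, here is a minimal Python sketch of a Kaplan-Meier style curve and the conversion rate read off it. The innings are made up for illustration, the function names are my own, and the curve is indexed by the probability of reaching a score (a convention chosen here so that fifties and hundreds keep their usual meaning); it is not the code behind the graphs in this article.

```python
def reach_curve(innings):
    """Kaplan-Meier style estimate of the chance of reaching each score.

    innings: list of (runs, dismissed) pairs; dismissed is False for a not-out.
    Returns a list where curve[k] is the estimated probability of reaching
    k runs, for k = 0 .. highest score + 1.
    """
    max_score = max(runs for runs, _ in innings)
    curve = [1.0]  # every innings "reaches" 0 runs
    for x in range(max_score + 1):
        # Innings still in progress on x runs (dismissed or not out later on).
        at_risk = sum(1 for runs, _ in innings if runs >= x)
        # Innings that actually ended in a dismissal on exactly x runs.
        dismissed = sum(1 for runs, out in innings if out and runs == x)
        # A not-out on x leaves the at-risk pool without counting as a failure.
        curve.append(curve[-1] * (1 - dismissed / at_risk))
    return curve


def conversion(curve, start, target):
    """Chance of reaching `target`, given the batsman has reached `start`."""
    return curve[target] / curve[start]


# Hypothetical career: (runs, dismissed). 20*, 70* and 100* are not-outs.
toy_innings = [(5, True), (20, False), (60, True),
               (70, False), (100, False), (120, True)]

toy_curve = reach_curve(toy_innings)
fifty_to_hundred = conversion(toy_curve, 50, 100)
```

For this toy career the raw count says two of the four fifties became hundreds, a conversion rate of 0.5. The curve-based figure is 0.75, because the unbeaten 70 is treated as an innings that had not yet failed rather than as a fifty that fell short.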
Here are the survival curves for four prominent batsmen dominating cricket currently:
Steven Smith outshines his peers at very early or very high scores: he is likelier to survive past a significant number of given points in a batting innings compared to the other three. Virat Kohli lags behind the others at lower scores, but is second only to Smith once he passes a score of around 70 runs, unleashing his penchant for high scores. Joe Root's much-documented Achilles heel of getting out post 50 is clearly depicted by the steep fall of his graph past that score.
Here are the survival curves of the four leading batsmen of the previous generation, all pretty close to each other at most points:
As is well known to statisticians, a batsman's scores roughly follow an exponential distribution. However, we saw in the survival curves how the tendency to get out at some scores leads to fluctuations from a perfectly smooth exponential curve. For instance, Kohli's curve above flattens out after he reaches about 75 runs.
To better visualise these changes, we look at another closely related graph: the hazard rate. Simply put, the hazard rate describes a batsman's relative chance of getting out at a particular score. Another way of looking at it is the following: if a batsman has already reached a given score, what is his chance of getting out at that point? The hazard curve simply takes the survival curve, changes its point of view, and then tracks how quickly the survival falls as the run values change**.
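One discrete way to read the hazard can be sketched as follows: for each band of scores, estimate the chance of being dismissed inside the band, given that the batsman got to the start of it. The band width, the function name and the innings are assumptions made for illustration; the graphs in this article may well use a different binning or smoothing.

```python
def band_hazards(innings, width=10):
    """Chance of dismissal within each band of scores, given the band was reached.

    innings: list of (runs, dismissed) pairs; dismissed is False for a not-out.
    Returns a dict mapping the first score of each band to its hazard.
    """
    max_score = max(runs for runs, _ in innings)
    hazards = {}
    for start in range(0, max_score + 1, width):
        survive_band = 1.0
        for x in range(start, min(start + width, max_score + 1)):
            # Innings still in progress on x runs.
            at_risk = sum(1 for runs, _ in innings if runs >= x)
            # Innings dismissed on exactly x runs; not-outs are never counted here.
            dismissed = sum(1 for runs, out in innings if out and runs == x)
            survive_band *= 1 - dismissed / at_risk
        hazards[start] = 1 - survive_band
    return hazards


# Hypothetical career: (runs, dismissed). 20*, 70* and 100* are not-outs.
toy_innings = [(5, True), (20, False), (60, True),
               (70, False), (100, False), (120, True)]

toy_hazards = band_hazards(toy_innings, width=50)
```

With bands of 50 runs, this toy batsman has a hazard of about 0.17 between 0 and 49, 0.25 between 50 and 99, and 1.0 from 100 onwards, the last only because his highest score ended in a dismissal. Tall bars in such a chart mark the scores at which he is most vulnerable.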
Let's look at it using our previous example. We know that Root gets out frequently after scoring a fifty; we could see shades of that in his survival curve. Let's now look at the hazard rates of the "Fab Four":
Voila! We see Root's curve rising high, confirming the perception of his weakness in the 50-100 zone. The hazard rates mirror the little deviations in the survival curve to better convey a batsman's propensity for getting out at different times in his innings. The minutiae of the hazards of batting are spread out into a telling Manhattan graph, from which we can infer the likelihood of dismissal at any point in the innings. Is he likely to get out once he goes past 50? Is he weak in the 90s?
Studying the above graph:
In the first five runs, Smith has an extremely low hazard rate. Well begun is half done, and it is reflected in his superior numbers.
Kohli has conspicuous peaks at around 20 and 40, but becomes solid as he approaches a fifty.
Once he is past 60, Kohli's innings is devoid of much hazard. Corroborated by his high conversion rate of fifties to tons, he is tough to evict from the crease once he is past that initial barrier.
Kane Williamson is prone to getting out very early in the innings, perhaps because he plays in swinging conditions in New Zealand.
Remember how the survival curves for the previous generation's batsmen were close to each other? A look at their hazard rates will zoom in and prise the differences open. Let's look at the hazard rates for Tendulkar, Lara, Ponting and Kallis:
The first observation, underlining popular perception, is that Tendulkar was fairly likely to get dismissed in the 90s. He had all those 90s because he played a lot of innings, but once he got to 90, he was also more likely to get out than Kallis or Ponting. Lara was the most susceptible in terms of the relative chance of getting out just before a ton. Like his successor in the Australian batting pantheon, Ponting rarely got out after just arriving at the crease.
With all the talk of the chance of survival, it makes sense to ask what the expected lifetime of an object under observation is. The answer lies in a commonly known number: life expectancy. The oft-quoted figure of 80 years or so is the life expectancy at birth. This concept can be extended to any age. At a given age, the life expectancy is the expected number of years left to live, given that an individual has already survived to that age.
This has an exact cricketing analogue: if a batsman has survived to a given score, what is the expected score he might make before getting out? If I could compute the "life expectancy" for a batsman at a given score, I could predict the expected score, were that innings to be allowed to reach its natural conclusion: a dismissal.
The life expectancy can be straightforwardly obtained from the survival curve***. In cricketing terms, it gives us the expected extra runs to be scored if a batsman is not out at a given score. Here are the values for the "Fab Four":
Smith and Kohli have established themselves as high scorers in Tests recently, and the graph shows that clearly. Three curves start out close together, but Kohli then joins Smith: they are expected to construct big innings if you let them pass 40. At the 30-run mark, one expects both of them to score close to a hundred, on average. Kohli's career has towering scores punctuated by lean spells, and one can see that: once he goes past 45, his expected returns are higher than those of the other three, a statistical confirmation of high-scoring impact. On the other hand, Root dips after fifty, but is expected to score high if he crosses that barrier.
Let's look at the same for the older batsmen. The graph below brings forth Lara's capacity for mammoth innings:
The survival curve and the life expectancy come together to deliver crucial information: how often does a batsman go past a score, and how high does he go if he goes past it?
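The life expectancy calculation can be sketched in a few lines of Python. The sketch assumes a curve indexed by whole runs, where curve[k] is the probability of reaching k runs, and replaces the integral with a sum over scores. The function name and the example batsman (one with the same chance of getting out on every run) are hypothetical.

```python
def expected_extra_runs(curve, score):
    """Expected further runs for a batsman who has reached `score`.

    curve[k] is the probability of reaching k runs. The sum of the curve
    beyond `score`, divided by its value at `score`, is the discrete
    counterpart of integrating the survival curve from that score onwards.
    """
    return sum(curve[score + 1:]) / curve[score]


# Hypothetical batsman who survives each run with probability 0.98.
# The curve is cut off at 2000 runs, where it is effectively zero.
steady_curve = [0.98 ** k for k in range(2001)]

extra_at_start = expected_extra_runs(steady_curve, 0)
extra_at_thirty = expected_extra_runs(steady_curve, 30)
```

Both values come out at about 49 runs: a batsman with a constant hazard is expected to add the same number of runs no matter how many he already has. Real curves are not that smooth, which is why the life expectancy of actual batsmen rises and falls with the score. One caveat with this sketch: if the curve never drops to zero (for instance, when the highest score in the career is a not-out), the sum stops at the last observed score and understates the expectation.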
To cap this analysis off, let's go back to the paper mentioned at the start of this article and indulge in an academic exercise that formed the primary motivation for this line of thought. The original issue handled in the paper was to furnish an adjusted average for each batsman, accounting for not-outs. The method followed uses the life expectancy curves for each batsman.
We take all the unbeaten scores in a batsman's career, and add to each of them the life expectancy value at that score. Let's remind ourselves what this means. The life expectancy at a score tells you the expected extra runs to be scored from that point in the innings. So, if a batsman is left unbeaten at a certain score, adding the life expectancy at that value would project the innings to a "natural" conclusion, i.e. were his innings to progress normally, when would he get out, on average?
With this set of modified scores, we now simply take an arithmetic mean to calculate the "Adjusted Average". Here is the table of the top 15 batsmen sorted by adjusted averages, with the filter being 4000 Test runs.
The top rankers mostly stay the same, on account of being top-order batsmen whose averages have little to do with not-outs. The rise in average for Wally Hammond, Garry Sobers, Kumar Sangakkara, Virat Kohli and Javed Miandad speaks of their penchant for big innings: their life expectancies are very high at their unbeaten scores.
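The adjusted average procedure described above can be sketched end to end in Python. This is an illustration of the steps as laid out in this article, with made-up innings and function names of my own; it is not the paper's code or the code used to build the table.

```python
def reach_curve(innings):
    """curve[k] = Kaplan-Meier style estimate of the chance of reaching k runs."""
    max_score = max(runs for runs, _ in innings)
    curve = [1.0]
    for x in range(max_score + 1):
        at_risk = sum(1 for runs, _ in innings if runs >= x)
        dismissed = sum(1 for runs, out in innings if out and runs == x)
        curve.append(curve[-1] * (1 - dismissed / at_risk))
    return curve


def expected_extra_runs(curve, score):
    """Life expectancy: expected further runs once `score` has been reached."""
    return sum(curve[score + 1:]) / curve[score]


def conventional_average(innings):
    """Runs per dismissal, the usual batting average."""
    dismissals = sum(1 for _, out in innings if out)
    return sum(runs for runs, _ in innings) / dismissals


def adjusted_average(innings):
    """Project each not-out to its expected final score, then take a plain mean.

    Assumes the highest score ended in a dismissal; otherwise the curve
    never reaches zero and the projections are understated.
    """
    curve = reach_curve(innings)
    projected = [runs if out else runs + expected_extra_runs(curve, runs)
                 for runs, out in innings]
    return sum(projected) / len(projected)


# Hypothetical career: (runs, dismissed). The 20* is the only not-out.
toy_innings = [(10, True), (20, False), (30, True), (50, True)]

toy_conventional = conventional_average(toy_innings)
toy_adjusted = adjusted_average(toy_innings)
```

In this toy career the unbeaten 20 is projected to 40, since the two innings that went past 20 ended on 30 and 50. The adjusted average is 32.5, against a conventional average of about 36.7. The adjustment can move the figure in either direction: it rises when the not-outs sit at scores with a high life expectancy, as with the batsmen named above.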
To end: who are the biggest gainers under our new method?
VVS Laxman, with his many unbeaten knocks batting with the tail, gains ten ranks. His adjusted average is much higher than his normal average, which means his not-out innings are on track to be high scores if uninterrupted.
Singular statistics might be compact, but a game as deep as cricket calls for deeper inspection. We know that getting out is an occupational hazard of batting. This analysis sheds light upon where exactly these hazards lurk. After all, your time at the crease is a metaphor for life. It only makes sense to analyse survival.
*The exact calculation of the Kaplan-Meier curve takes place through the construction of life tables, which list the number of innings ongoing, dismissed, and "at risk" at a given score. An estimator for the survival function is then computed by successively multiplying, score by score, the proportion of "at risk" innings that were not dismissed at that score.
**The hazard rate is the negative of the logarithmic derivative of the survival curve. Effectively, it measures the drop in survival between successive run values in logarithmic space. Therefore, it is the relative rate at which survival decreases.
***The life expectancy at any run value is the integral of the survival curve from that value, divided by the survival function at that value.
Player numbers as of May 1, 2018.