June 14, 2014

# Consistency of Test batsmen - Part 1

Which batsmen have been the most consistent in Test cricket?

A couple of years back I did a two-part analysis on Test player consistency. You can access the batsmen-specific article here. You have to move to the top of the page to view the article. Overall, it was well received. The analysis was based on a "slice concept". I split the careers of Test batsmen into slices of ten innings and looked at consistency across these slices. As many readers had expressed therein, this went past the unit of innings, which is the most important measurable contribution of a batsman. It also allowed a batsman to be very inconsistent within a slice but come out with acceptable numbers for the slice.

I realised that I have to do the batsmen consistency work with innings as the base, not even a Test. Based on Tests, a batsman could come out roses in the consistency stakes by scoring a 100 and 0. Perfect for the Test but way off as far as innings are concerned.

Let me remind the readers that I will not do any article which is not understood by 90% of the readers. These articles may not come through the statistical validations test but have to be based on common sense and understood by most of the readers. So there will not be any Z-factors or skewness coefficients, or whatever else it is that statisticians look for. Do not look for these in this article and complain about the absence of the same.

First, let me say that the score distribution for almost all batsmen is lopsided, with the scores bunched to the left (statisticians would call this a right-skewed distribution, because the long tail stretches to the right). An established batsman's lowest score is 0 and the highest score could be anything from, say, 200 to 400. His mean score is around 50. This means that he would have more scores below the mean than above. This is what I mean by bunched to the left. For the selected population of 200 batsmen, the average percentage of scores above the mean is only 35%. The highest is for Bruce Mitchell with 44.9% and the lowest is for Marvan Atapattu with 29.1%. So this is a long way from a normal distribution and we have to adopt special methods to analyse the scores.
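To make the "more scores below the mean than above" point concrete, here is a minimal sketch. The ten scores are invented purely for illustration and the function name `share_above_mean` is my own; nothing here comes from the actual database.

```python
from statistics import mean


def share_above_mean(scores):
    """Return the fraction of scores strictly greater than the mean score."""
    avg = mean(scores)
    return sum(1 for s in scores if s > avg) / len(scores)


# Hypothetical ten-innings sample, made up for illustration only.
sample = [0, 4, 12, 18, 25, 33, 41, 56, 110, 201]

print(f"mean = {mean(sample):.1f}")
print(f"share above mean = {share_above_mean(sample):.0%}")
```

A couple of big scores pull the mean up to 50, so only three of the ten innings end up above it.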

What is consistency? OED says: The quality of achieving a level of performance that does not vary greatly in quality over time. DicCom says: Agreement or accordance with facts, form, or characteristics previously shown or stated. FreeDic says: Reliability or uniformity of successive results or events. So what we are looking at is uniformity of performance, absence of surprises, reduction in number of outliers and probably clustering of performances towards the central positions.

Taking a pair of scores, it is clear and obvious that a 100 and 0 is woefully inconsistent, an 85 and 15, quite inconsistent, a 70 and 30, reasonably consistent, a 60 and 40 quite consistent and a 50 and 50 the pinnacle of consistency. For this analysis it does not matter if the 100 was scored master-minding a successful 150 for 9 chase or part of a 700 for 3 score in Faisalabad. Let us see how we can move forward on this premise.

Let us assume that this is a three-Test series and the eight batsmen below have played five innings each. All these batsmen have scored 250 runs in the series and are averaging 50. Let us get a handle on their consistency by perusing the scores, rather than through any mathematical methods.

```
A:  25@  45@  50@   60@   70@  (5)
B:  10   45@  55@   65@   75@  (4)
C:  25@  30@  40@   55@  100   (4)
D:   5   30@  45@   75@   95   (3)
E:   0   10   40@   60@  140   (2)
F:   0   30@  40@   80   100   (2)
G:   5   10   20    50@  165   (1)
H:   0    0   10   110   130   (0)
```

A is the epitome of consistency and can be called Mr Consistent (with apologies to Michael Hussey, the original Mr C). No really low or high score.
B and C can be called very consistent. B has got one low score and C, one high score. The other four are in the consistency zone.
D is consistent. There are two outliers: one on each side. Three are in the zone.
E and F can be called somewhat inconsistent. Only two of the five scores are in the consistency zone, i.e. in the middle.
G is quite unpredictable. Four of his scores are outliers. Tough to predict what his next score would be.
H is so inconsistent that we have no clue what he will do. A duck or 100 might come off his bat next.
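The counts in brackets can be reproduced with a few lines of code. This is only a sketch: it assumes the "@" marks follow a band of 25 to 75 inclusive (half to one-and-a-half times the series mean of 50), which is what the marked scores suggest. The names `SERIES` and `zone_count` are mine.

```python
# Scores of the eight batsmen, copied from the table above.
SERIES = {
    "A": [25, 45, 50, 60, 70],
    "B": [10, 45, 55, 65, 75],
    "C": [25, 30, 40, 55, 100],
    "D": [5, 30, 45, 75, 95],
    "E": [0, 10, 40, 60, 140],
    "F": [0, 30, 40, 80, 100],
    "G": [5, 10, 20, 50, 165],
    "H": [0, 0, 10, 110, 130],
}


def zone_count(scores, low_frac=0.5, high_frac=1.5):
    """Count scores between low_frac and high_frac times the mean, inclusive.

    The 0.5 / 1.5 fractions and the inclusive boundaries are assumptions
    inferred from which scores carry an '@' in the table.
    """
    rpi = sum(scores) / len(scores)
    low, high = low_frac * rpi, high_frac * rpi
    return sum(1 for s in scores if low <= s <= high)


for name, scores in SERIES.items():
    print(name, sum(scores), zone_count(scores))
```

Every batsman totals 250 runs, yet the in-zone count falls from 5 for A to 0 for H, matching the visual inspection.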

Even though I used only a visual inspection while determining the consistency levels of these batsmen, we are beginning to get a handle on what analytical method can be used to determine the consistency of a batsman. The key phrase is "consistency zone", which I used a couple of times in these sentences.

Let me make a brace of somewhat sweeping statements and justify these later.

Define a consistency zone for each batsman and check how many of his innings are within this zone. The higher the percentage of innings within the consistency zone, the more consistent the batsman was.

There is nothing intrinsically wrong with this statement. There is no attempt to define a consistency zone across batsmen. This postulate accepts that the basis for consistency determination for Don Bradman would be totally different to the same for Habibul Bashar. It is dynamic and will accommodate significant changes across the career of batsmen. It could be applicable to selected parts of a batsman's career. So we seem to be on a very nice wicket.

The only problem seems to be to define a valid consistency zone, hereafter called Con_Zone. There is no mathematical solution. If one exists, I would not understand it myself and cannot explain the same in simple words to the readers. So I have to use common sense and the cricketing knowledge acquired over the years.

The one point I am certain of is that for this exercise, the batting average cannot be used as the basis. Especially when I am going to say that 400* or 257* are two of the greatest outliers ever, what is the point of adding these runs but not the innings played? I have to use a Runs per innings (RpI), but a slightly modified one, RpxI, after taking care of the next bone of contention, the not-outs. I will come to this later, after explaining the basis for Con_Zone.

After days of trials and evaluating aggregates of various measures, I have defined Con_Zone as the range of scores that falls between 50% and 150% of RpxI. It is dynamic and varies according to the batsman's career performance. It gives me a range of scores exactly one RpxI wide, enough to give a very high confidence level while proclaiming a batsman's consistency or lack of it.

Three examples: Bradman's Con_Zone ranges between 44.4 and 133.3. Ken Barrington's Con_Zone ranges between 26.7 and 80.1. Habibul Bashar's ranges between 15.3 and 45.8. While looking at these examples, do not forget that a 365 or 293 is as much of an outlier as a 0 or 1.
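The Con_Zone calculation itself is a one-liner. The sketch below uses the adjusted runs-per-innings figures from the tables later in this article (53.37 for Barrington, 48.73 for Nourse, 51.16 for Sobers). The function name `con_zone` and the rounding to one decimal are my own choices for illustration.

```python
def con_zone(rpxi):
    """Return the (low, high) ends of the Con_Zone: 50% and 150% of RpxI.

    Rounding to one decimal is only for display, to match the tables.
    """
    return round(0.5 * rpxi, 1), round(1.5 * rpxi, 1)


# Adjusted RpI values as listed in the tables in this article.
for name, rpxi in [("Barrington", 53.37), ("Nourse", 48.73), ("Sobers", 51.16)]:
    low, high = con_zone(rpxi)
    print(f"{name}: {low} to {high}")
```

Because the zone is defined as a fraction of the batsman's own mean, a higher-scoring batsman automatically gets a wider zone in absolute runs.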

Now for the not-outs. My first article in the Cordon was called "The vexed question of not outs in Test cricket". Unfortunately, I could not view the comments and respond to those because of certain technical issues. But I knew that there were arguments for and against my suggestion of extending the not-out innings by his recent-form runs. A revolutionary idea it was but some of the respondents felt that there was really no problem and I was trying to solve a non-existent problem. They were probably correct. Some felt that the RpFI, described below, was an arbitrary number.

It is clear that the not-outs have to be addressed properly. Let us take Garry Sobers with a basic RpI value of around 50. His 178* or 365* are clear outliers and have to be considered as valid innings. His 50* has to be considered as a perfect innings, along with his 50. His 33* is considered since this is within the Con_Zone. His 5 or 8 are clear outliers and cannot be ignored. But what about the 16*? It is not fair to Sobers if we take this innings as one falling outside the Con_Zone (25.6 to 76.7). He could have scored 34 more runs or 134 more. On the other hand we cannot certify that this falls within the Con_Zone. He could have been out next ball.

In the article I have referred to, I also developed an alternative and simpler concept of considering only fulfilled innings (FI). These are all dismissals plus the not-outs above 50% of the RpI. It was an elegant and simple method.

Incidentally Milind has tackled the question of not-outs in his excellent blog, which takes cricket analysis to a higher level. He has tweaked the RpFI, which I had created for the said article and created a further adjusted RpI, called µ, by mapping all not-out innings based on their values. It is a lovely idea and the reader could get the complete information on this tweak and other fascinating analyses. Once you are there his earlier articles on Geometric Mean, Bradman's innings and the like can be viewed.

However, I have decided to stick to my RpFI concept since it is simpler and this is only a Batsman Consistency analysis. Like a perfect Lego block fitting, the beginning of the Con_Zone is pegged at 50% of the RpI value. So I have come to a (hopefully Solomonic and not Tughlaqian) decision for this analysis. I will ignore all not-outs that are below the low-end of the Con_Zone (50% of RpI). These will be excluded from the innings count, RpI determination and consistency determination.

I can hear those knives being sharpened. Before you draw them from the scabbard, look at it carefully. No batsman loses out. Sobers' 16* would be outside the consistency calculations, that is all. He will neither benefit nor be hampered. No assumption of any sort has been made regarding his innings. There are no magic numbers. The RpI, if anything, will only be slightly boosted. So any reader who is offended by this, if he takes a minute to think laterally, will see the soundness behind this tweak. And let us not forget, it is uniform but customised and dynamic treatment for all batsmen.

The final justification. For the 200 batsmen considered, there are 26,172 innings and of these the excluded special not-outs are just 642, a mere 2.4%. So there is a negligible impact on the numbers but a considerable improvement in the soundness of calculations.
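Putting the pieces together, the whole procedure might be sketched as follows. Two things are assumptions on my part: that the cut-off for dropping a not-out is 50% of the raw RpI (runs divided by all innings, before any exclusion), and that the zone boundaries are inclusive. The ten-innings career and the name `consistency_summary` are invented for illustration.

```python
def consistency_summary(innings):
    """Summarise consistency for a list of (runs, not_out) tuples.

    Assumption: a not-out is dropped when it is below 50% of the raw RpI,
    i.e. total runs divided by all innings before any exclusion.
    """
    total_runs = sum(runs for runs, _ in innings)
    raw_rpi = total_runs / len(innings)
    cutoff = 0.5 * raw_rpi

    # Fulfilled innings: every dismissal, plus not-outs at or above the cutoff.
    fulfilled = [runs for runs, not_out in innings
                 if not (not_out and runs < cutoff)]

    rpfi = sum(fulfilled) / len(fulfilled)
    low, high = 0.5 * rpfi, 1.5 * rpfi
    in_zone = sum(1 for runs in fulfilled if low <= runs <= high)

    return {
        "excluded": len(innings) - len(fulfilled),
        "fulfilled": len(fulfilled),
        "rpfi": rpfi,
        "zone": (low, high),
        "in_zone": in_zone,
        "index_pct": 100 * in_zone / len(fulfilled),
    }


# Hypothetical career: (runs, not_out). The 16* is the only low not-out.
career = [(0, False), (16, True), (33, True), (50, False), (50, True),
          (5, False), (8, False), (178, True), (72, False), (41, False)]

summary = consistency_summary(career)
print(summary["excluded"], "innings excluded")
print(f"RpFI = {summary['rpfi']:.2f}")
print(f"Con_Zone = {summary['zone'][0]:.1f} to {summary['zone'][1]:.1f}")
print(f"Consistency Index = {summary['index_pct']:.1f}%")
```

In this made-up career only the 16* is dropped. The 33*, 50* and 178* stay in, the last one as an outlier, exactly as described for Sobers above.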

The cut-off is 2684 runs. What? Such an odd number! Before anyone says that I have done this to exclude or include any specific player, let me say that my initial cut-off was 3000 Test runs. Two-thousand, I felt, was too low since only around 30-40 innings would have been played. Three-thousand meant that a reasonable number of innings, well over 50, would have been played.

However, when I did a run with 2500, I suddenly found out that a new batsman started dominating the tables. That was Dudley Nourse. His numbers were way out and I felt that his inclusion would set a benchmark for other batsmen and would validate the approach taken very effectively. But he had scored only 2960 runs. Hence I lowered the cut-off to 2950 Test runs. After all it is my analysis. Finally I decided that instead of having runs as cut-off, I would select the top 200 run scorers. So the population size determined the cut-off. Hence the number 2684. Mark Burgess was the last batsman to get in. In the bargain, Glenn Turner, MAK Pataudi, Norman O'Neill, Stan McCabe and Keith Miller got in. Not a bad lot to look at.
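Letting the population size decide the cut-off is simple to express in code. The sketch below uses a handful of career run totals taken from the tables in this article and picks the top four rather than the top 200, purely to keep the example small. The function name `top_n_cutoff` is mine.

```python
def top_n_cutoff(career_runs, n):
    """Return the n highest run scorers and the runs of the last one selected."""
    ranked = sorted(career_runs.items(), key=lambda item: item[1], reverse=True)
    selected = ranked[:n]
    return selected, selected[-1][1]


# A tiny subset of career run totals, as listed in the tables in this article.
career_runs = {
    "AD Nourse": 2960,
    "GM Turner": 2991,
    "NS Sidhu": 3202,
    "HW Taylor": 2936,
    "MN Samuels": 2983,
    "Mansur Ali Khan": 2793,
}

selected, cutoff = top_n_cutoff(career_runs, 4)
print([name for name, _ in selected])
print("cut-off:", cutoff)
```

The cut-off is simply whatever the last selected batsman scored, which is how an odd-looking number such as 2684 comes about.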

Let us move on to the tables. I have also plotted the graphs for five interesting batsmen to give a visual idea of how the Consistency Index works.

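Test Batsmen Consistency analysis: 30 most consistent batsmen

| No | Batsman | LHB | Ctry | Tests | Inns | NOs | Runs | Avge | AdjInns | AdjRuns | AdjRpi | Cons-Zone Range | Cons-Zone Inns | Cons-Index |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | AD Nourse | | Saf | 34 | 62 | 7 | 2960 | 53.82 | 60 | 2924 | 48.73 | 24.4 to 73.1 | 31 | 51.7% |
| 2 | WW Armstrong | | Aus | 50 | 84 | 10 | 2863 | 38.69 | 81 | 2833 | 34.98 | 17.5 to 52.5 | 38 | 46.9% |
| 3 | BF Butcher | | Win | 44 | 78 | 6 | 3104 | 43.11 | 77 | 3097 | 40.22 | 20.1 to 60.3 | 34 | 44.2% |
| 4 | H Sutcliffe | | Eng | 54 | 84 | 9 | 4555 | 60.73 | 82 | 4541 | 55.38 | 27.7 to 83.1 | 36 | 43.9% |
| 5 | VL Manjrekar | | Ind | 55 | 92 | 10 | 3208 | 39.12 | 90 | 3208 | 35.64 | 17.8 to 53.5 | 39 | 43.3% |
| 6 | JB Hobbs | | Eng | 61 | 102 | 7 | 5410 | 56.95 | 98 | 5348 | 54.57 | 27.3 to 81.9 | 42 | 42.9% |
| 7 | CC Hunte | | Win | 44 | 78 | 6 | 3245 | 45.07 | 75 | 3223 | 42.97 | 21.5 to 64.5 | 32 | 42.7% |
| 8 | WR Hammond | | Eng | 85 | 140 | 16 | 7249 | 58.46 | 137 | 7234 | 52.80 | 26.4 to 79.2 | 58 | 42.3% |
| 9 | Imran Khan | | Pak | 88 | 126 | 25 | 3807 | 37.69 | 119 | 3741 | 31.44 | 15.7 to 47.2 | 50 | 42.0% |
| 10 | ER Dexter | | Eng | 62 | 102 | 8 | 4502 | 47.89 | 100 | 4497 | 44.97 | 22.5 to 67.5 | 42 | 42.0% |
| 11 | CC McDonald | | Aus | 47 | 83 | 4 | 3107 | 39.33 | 81 | 3099 | 38.26 | 19.1 to 57.4 | 34 | 42.0% |
| 12 | IJL Trott | | Eng | 49 | 87 | 6 | 3763 | 46.46 | 86 | 3746 | 43.56 | 21.8 to 65.3 | 36 | 41.9% |
| 13 | RB Richardson | | Win | 86 | 146 | 12 | 5949 | 44.40 | 140 | 5930 | 42.36 | 21.2 to 63.5 | 58 | 41.4% |
| 14 | SR Watson | | Aus | 52 | 97 | 3 | 3408 | 36.26 | 97 | 3408 | 35.13 | 17.6 to 52.7 | 40 | 41.2% |
| 15 | IR Redpath | | Aus | 66 | 120 | 11 | 4737 | 43.46 | 119 | 4725 | 39.71 | 19.9 to 59.6 | 49 | 41.2% |
| 16 | RC Fredericks | L | Win | 59 | 109 | 7 | 4334 | 42.49 | 107 | 4328 | 40.45 | 20.2 to 60.7 | 44 | 41.1% |
| 17 | ND McKenzie | | Saf | 58 | 94 | 7 | 3253 | 37.39 | 90 | 3218 | 35.76 | 17.9 to 53.6 | 37 | 41.1% |
| 18 | AB de Villiers | | Saf | 92 | 154 | 16 | 7168 | 51.94 | 149 | 7114 | 47.74 | 23.9 to 71.6 | 61 | 40.9% |
| 19 | RB Kanhai | | Win | 79 | 137 | 6 | 6227 | 47.53 | 133 | 6188 | 46.53 | 23.3 to 69.8 | 54 | 40.6% |
| 20 | DI Gower | L | Eng | 117 | 204 | 18 | 8231 | 44.25 | 201 | 8184 | 40.72 | 20.4 to 61.1 | 81 | 40.3% |
| 21 | PJL Dujon | | Win | 81 | 115 | 11 | 3322 | 31.94 | 113 | 3310 | 29.29 | 14.6 to 43.9 | 45 | 39.8% |
| 22 | GS Sobers | L | Win | 93 | 160 | 21 | 8032 | 57.78 | 156 | 7981 | 51.16 | 25.6 to 76.7 | 62 | 39.7% |
| 23 | TW Graveney | | Eng | 79 | 123 | 13 | 4882 | 44.38 | 121 | 4872 | 40.26 | 20.1 to 60.4 | 48 | 39.7% |
| 24 | GM Turner | | Nzl | 41 | 73 | 6 | 2991 | 44.64 | 71 | 2968 | 41.80 | 20.9 to 62.7 | 28 | 39.4% |
| 25 | AJ Strauss | L | Eng | 100 | 178 | 6 | 7037 | 40.91 | 175 | 7017 | 40.10 | 20.0 to 60.1 | 69 | 39.4% |
| 26 | KF Barrington | | Eng | 82 | 131 | 15 | 6806 | 58.67 | 127 | 6778 | 53.37 | 26.7 to 80.1 | 50 | 39.4% |
| 27 | L Hutton | | Eng | 79 | 138 | 15 | 6971 | 56.67 | 134 | 6916 | 51.61 | 25.8 to 77.4 | 52 | 38.8% |
| 28 | GC Smith | L | Saf | 117 | 204 | 12 | 9266 | 48.26 | 201 | 9248 | 46.01 | 23.0 to 69.0 | 78 | 38.8% |
| 30 | AW Greig | | Eng | 58 | 93 | 4 | 3599 | 40.44 | 93 | 3599 | 38.70 | 19.3 to 58.0 | 36 | 38.7% |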

Most consistent batsmen: When readers peruse the tables they will realise why I was so enthused about Dudley Nourse. Let me present his career numbers. 62 innings. The mean score was 48.7 allowing the Con_Zone range of 24.4 to 73.1. This entire range is indicative of acceptable scores. Two scores, 17* and 19*, are ignored. Nourse has 31 scores in the Con_Zone. He is the only batsman to have more scores inside the Con_Zone than outside it. If this is not consistency, that too across 16 years, I am not sure what is. He has two double-hundreds but the next highest score is 149. That explains his excellent Con_Index.
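For readers who want to check the last column of the table, the Consistency Index is just the in-zone innings as a percentage of the adjusted innings. The sketch below recomputes it for five rows, using the AdjInns and Cons-Zone Inns figures from the table. The names `ROWS` and `con_index` are mine.

```python
# (batsman, adjusted innings, innings inside the Con_Zone), from the table above.
ROWS = [
    ("AD Nourse", 60, 31),
    ("WW Armstrong", 81, 38),
    ("H Sutcliffe", 82, 36),
    ("JB Hobbs", 98, 42),
    ("WR Hammond", 137, 58),
]


def con_index(zone_innings, adj_innings):
    """Consistency Index: in-zone innings as a percentage of adjusted innings."""
    return 100 * zone_innings / adj_innings


for name, adj_innings, zone_innings in ROWS:
    inside = zone_innings
    outside = adj_innings - zone_innings
    flag = "more inside than outside" if inside > outside else ""
    print(f"{name}: {con_index(zone_innings, adj_innings):.1f}% {flag}")
```

Of these five, only Nourse, with 31 innings inside the zone against 29 outside, crosses the halfway mark.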

Herbert Sutcliffe and Jack Hobbs are almost inseparable even in this analysis, as they were on the field. For Sutcliffe, two unbeaten innings, viz., 1* and 13*, are excluded. For Hobbs, four innings, viz., 9*, 11*, 19* and 23*, are removed. Otherwise, look at how close their numbers are. Very similar Con_Zone ranges (roughly 27 to 83). Con_Index coming in above 42% for both. These are their individual numbers. How well they would have performed together. Right at the top, as far as opening pairs are concerned.

Wally Hammond, who followed Hobbs and Sutcliffe, has similar figures. His Consistency Index is also above 42%. The top 20 of the table features batsmen who have Consistency Index values above 40%. This includes some unlikely batsmen. Who would have expected the flamboyant Kanhai to have a fairly high value of 40.6%? David Gower is another surprise 40+% batsman featured here. Sobers and Barrington are two top-level batsmen standing at just below 40%.

Contemporary batsmen: For all the problems he has faced recently, Trott is the most consistent of the contemporary batsmen. Thirty-six of his 86 qualifying innings are within the Con_Zone, giving him an index value of 41.9%. Watson might not have scored many hundreds but he is certainly high on the Consistency Index value table, with 41.2%. His Con_Zone range is, of course, lower at 18-53. He is expected to deliver at lower levels.

Since Watson and Trott have played fewer matches, AB de Villiers lays claim to be the most consistent current batsman. This is borne out by his recent record-breaking form. His exclusions are 4*, 4*, 8*, 19* and 19*. He has 61 innings within the Con_Zone range of 24-72, out of 149 qualifying innings. This gives him a high Consistency Index of 40.9%. Any number above 35% is very good and anything above 40% is outstanding.

Strauss, with 39.4%, and Langer, with 38.2%, are in the top 40.

Summary of a few top batsmen: Many top batsmen are not even in the top 50 of the table. Hence I have summarised the Consistency Index of a few top batsmen. Bradman is way down the table with a barely acceptable index of 30.8%. This is understandable since 15% of his innings are above 200 and there have to be compensating low scores.

Sachin Tendulkar's index value is a fairly low 31.2%, Brian Lara's is slightly better at 33%, Rahul Dravid's is a relatively high 37.2%, Kumar Sangakkara is similarly placed at 36.8%, Ricky Ponting has a low index value of 32.9%, Jacques Kallis a moderate 34.4%, and finally Sunil Gavaskar a very low 30.5%. To those who are surprised at the last figure, let me remind readers that Gavaskar was a poor starter and had 55 single-digit dismissals. And these have been balanced by twelve 150-plus scores.

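Test Batsmen Consistency analysis: 10 most inconsistent batsmen

| No | Batsman | LHB | Ctry | Tests | Inns | NOs | Runs | Avge | AdjInns | AdjRuns | AdjRpi | Cons-Zone Range | Cons-Zone Inns | Cons-Index |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 191 | NS Sidhu | | Ind | 51 | 78 | 2 | 3202 | 42.13 | 78 | 3202 | 41.05 | 20.5 to 61.6 | 21 | 26.9% |
| 192 | MJ Clarke | | Aus | 105 | 180 | 20 | 8240 | 51.50 | 176 | 8182 | 46.49 | 23.2 to 69.7 | 47 | 26.7% |
| 193 | DL Amiss | | Eng | 50 | 88 | 10 | 3612 | 46.31 | 84 | 3569 | 42.49 | 21.2 to 63.7 | 22 | 26.2% |
| 194 | TT Samaraweera | | Slk | 81 | 132 | 20 | 5462 | 48.77 | 127 | 5407 | 42.57 | 21.3 to 63.9 | 33 | 26.0% |
| 195 | HW Taylor | | Saf | 42 | 76 | 4 | 2936 | 40.78 | 74 | 2908 | 39.30 | 19.6 to 58.9 | 19 | 25.7% |
| 196 | C Hill | ~ | Aus | 49 | 89 | 2 | 3412 | 39.22 | 88 | 3402 | 38.66 | 19.3 to 58.0 | 22 | 25.0% |
| 197 | JR Reid | | Nzl | 58 | 108 | 5 | 3428 | 33.28 | 108 | 3428 | 31.74 | 15.9 to 47.6 | 27 | 25.0% |
| 198 | MN Samuels | | Win | 51 | 90 | 6 | 2983 | 35.51 | 88 | 2968 | 33.73 | 16.9 to 50.6 | 22 | 25.0% |
| 199 | Mansur Ali Khan | | Ind | 46 | 83 | 3 | 2793 | 34.91 | 82 | 2779 | 33.89 | 16.9 to 50.8 | 19 | 23.2% |
| 200 | Ijaz Ahmed | | Pak | 60 | 92 | 4 | 3315 | 37.67 | 90 | 3287 | 36.52 | 18.3 to 54.8 | 18 | 20.0% |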

Now for the other end. The most interesting in this lot is Michael Clarke, with a really low index value of 26.7%. That means that just about one in four fulfilled innings have been within the Con_Zone range of 23 to 70. His exclusions are 6*, 14*, 17* and 21*. The fact that there are 16 other not-outs has also contributed to this. His 47 single-digit dismissals and ten 150-plus scores do not help.

Dennis Amiss, Clem Hill and Mansur Ali Khan are three prominent batsmen in this group. Let us look at the most inconsistent batsman amongst the selected 200: Ijaz Ahmed. Look at the Consistency Index. It is a very low 20%, which means only one in five innings is within the Con_Zone of 18-55. He has only 18 innings in the zone, out of a total of 90. Not surprising considering the fact that 33 innings out of 90, a whopping 37%, are single-digit dismissals. No doubt compensated by 12 hundreds.

Now for a few graphs. The graphs are plotted in increasing order of scores. Only the fulfilled innings are plotted. Also the Con_Zone and mean are shown.
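The layout of each graph can be sketched in a few lines: sort the fulfilled innings, then find where the Con_Zone starts and ends on the horizontal axis. The nine scores below are invented for illustration, and I am assuming that the "width" of the Con_Zone on a graph corresponds to the share of innings that fall inside it. The name `graph_layout` is mine.

```python
from bisect import bisect_left, bisect_right


def graph_layout(fulfilled_scores):
    """Sort the scores and locate the Con_Zone along the horizontal axis.

    Returns the ordered scores, the mean, and the half-open index range
    [start, end) of the innings that fall inside the zone (inclusive bounds).
    """
    ordered = sorted(fulfilled_scores)
    rpfi = sum(ordered) / len(ordered)
    low, high = 0.5 * rpfi, 1.5 * rpfi
    start = bisect_left(ordered, low)   # first innings at or above the low end
    end = bisect_right(ordered, high)   # one past the last innings at or below the high end
    return ordered, rpfi, start, end


# Hypothetical fulfilled innings, made up for illustration only.
scores = [0, 33, 50, 50, 5, 8, 178, 72, 41]

ordered, rpfi, start, end = graph_layout(scores)
for position, score in enumerate(ordered):
    marker = "in zone" if start <= position < end else ""
    print(f"{position + 1:>2} {score:>4} {marker}")
print(f"mean = {rpfi:.1f}, zone share = {(end - start) / len(ordered):.1%}")
```

The wider the marked stretch in the middle of the sorted list, the higher the Consistency Index.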

Let us look at the graphs of three batsmen. Bradman is the king, albeit an inconsistent one, Nourse is the most consistent and Ijaz, the least consistent.

In Bradman's case, the reason for the inconsistency is very clear. Look at those seven zeroes and seven single-digit dismissals. At the other end, we have huge peaks relating to those 18 150-plus scores. All pointing to numerous innings of total domination or dismissals within the first hour. Perfect candidate for a high degree of inconsistency.

Look at Nourse's graph. Look at the way the graph moves up quickly and the width of the Con_Zone. He has had 13 single-digit dismissals but many intermediate scores. There are not many peaks. Confirmation of a very high degree of consistency. All this leads to a Consistency Index of over 50%. Very few innings are below 10.

Now for the other end. Look at the width of the Con_Zone of Ijaz. Especially look at the number of low scores. More than the peaks on the right-hand side of the Con_Zone, it is the number of low scores which leads to a wholly inconsistent career. There are many innings below 10.

This graph depicts the career of Clarke, the most inconsistent current batsman. This is in a way similar to Ijaz's graph. A very high number of innings to the left-hand side of the Con_Zone and a significant number of innings to the right-hand side. The width of the Con_Zone on the graph, which is the share of innings falling inside it, is quite low at only 26.7%. Look at how many innings are below 10. This is borne out by the fact that Clarke has scored four hundreds, three huge ones, in his last 11 Tests. The other 18 innings are 30 or lower.

Hammond has been a very consistent player with an index value of 42.3%. Let us look at the career graph of Hammond. Look at the width of the Con_Zone. The number of innings within this zone is quite high. On either side there are no great tail-offs. Look at how few innings are below 10.

Finally, the graph for de Villiers. It is far closer to the Hammond graph than to the Clarke graph. A fairly wide Con_Zone, covering just over 40% of his innings. All those recent fifties helped. There are not many innings below 10.

The common view is that the sedate, defensive batsmen are more consistent than the attacking batsmen. This unfounded adage has been given a serious jolt in this analysis. When attacking batsmen like Ted Dexter, Roy Fredericks, Rohan Kanhai, Gower, Sobers et al are in the top 40 and defensive stalwarts like Gavaskar, Shivnarine Chanderpaul, Kallis, Mohammad Yousuf, Hanif Mohammad et al are in the lower half of the table, there seems no justification for this axiom.

A couple of messages to my readers.

If you perceive this to be some sort of batsman-ranking table and come out with comments such as: "xyz is ranked too high or low", it is your problem, not mine. This is not a ranking list at all. It is an indication of how consistent a batsman was, relative to his own mean. That is all. If this table does not conform to your subjective perception of a batsman's consistency, maybe it is time to change that to an objective perception and not find fault.

If anyone tries to hijack this article into a xyz-lauding or pqr-bashing exercise, I will be quite ruthless, cutting off such comments right at the top.