On Generalized Geometric Distributions: Application to Modeling Scores in Cricket
In cricket, the batting average is the most common and basic measure of a batsman's performance, whether over a short period such as a series or a calendar year or over a longer span such as a career. It is considered either in isolation or in combination with other measures such as strike rate, depending at times on the form of the game. In either case, the treatment of runs scored in not-out innings poses a particular challenge to adopting the batting average as a measure of true performance. The conventional way of computing the batting average attracts both favour and criticism on intuitive grounds, but it can be justified as the maximum likelihood estimate if the scores come from an Exponential or Geometric distribution. Either of these distributions is quite unreasonable as a model for a batsman's scores, because the hazard, that is, the propensity to get out, is obviously not constant across scores. We therefore discuss the role of the Kaplan-Meier estimator, treating the scores from not-out innings as right-censored data. We show that while it provides a vast conceptual improvement over the traditional average, it has some associated problems as well. The first arises from its nonparametric nature, especially when it is meant to reflect true average performance over a short duration such as a tournament or a series; the second is its inability to produce a finite-valued estimate when the largest score comes from a not-out innings.
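The two estimators discussed above can be illustrated with a minimal Python sketch. It assumes scores are non-negative integers, that a not-out innings on score c is right-censored (the true score is at least c), and that censored innings remain at risk at their own score. The function names and the sample scores are hypothetical and are not taken from the paper.

```python
import math


def traditional_average(scores, dismissed):
    """Conventional batting average: total runs divided by number of dismissals."""
    outs = sum(dismissed)
    if outs == 0:
        raise ValueError("average is undefined without a dismissal")
    return sum(scores) / outs


def geometric_mle_hazard(scores, dismissed):
    """MLE of the constant per-run hazard p of a Geometric model on {0, 1, 2, ...}.

    A dismissal on score x contributes (1 - p)**x * p to the likelihood and a
    not-out on score x contributes (1 - p)**x, so p_hat = outs / (outs + runs).
    """
    outs = sum(dismissed)
    return outs / (outs + sum(scores))


def kaplan_meier_mean(scores, dismissed):
    """Mean of the Kaplan-Meier survival estimate, treating not-outs as censored.

    Returns math.inf when the estimated survival never reaches zero, which
    happens when a not-out innings sits at the largest observed score.
    """
    surv = 1.0   # running estimate of P(X > x)
    mean = 0.0   # E[X] = sum over x >= 0 of P(X > x)
    for x in range(max(scores) + 1):
        at_risk = sum(1 for s in scores if s >= x)
        outs = sum(1 for s, d in zip(scores, dismissed) if d and s == x)
        surv *= 1.0 - outs / at_risk
        mean += surv
    return mean if surv == 0.0 else math.inf


# Hypothetical innings: 10 (out), 20 (not out), 30 (out)
scores = [10, 20, 30]
dismissed = [True, False, True]
p_hat = geometric_mle_hazard(scores, dismissed)
print(traditional_average(scores, dismissed), (1 - p_hat) / p_hat)
print(kaplan_meier_mean(scores, dismissed))
```

Under these assumptions the mean of the fitted Geometric distribution, (1 - p) / p, equals the traditional average, and the Kaplan-Meier mean reduces to the ordinary sample mean when every innings ends in a dismissal.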
To address these concerns, we propose a generalized class of Geometric distributions (GGD) as a model for the runs scored by individual batsmen. The generalization takes the form of a hazard of getting out that changes from one score to another. We treat the change points as known or specified parameters and derive general expressions for the restricted maximum likelihood estimators of the hazard rates under this generalized structure. Given the domain context, we propose ten different variations of the GGD model and test across the nested models, using the asymptotic distribution of the likelihood ratio statistic, to determine the best possible model. This family of GGDs subsumes the traditional average as well as the Kaplan-Meier based estimate: the one-parameter GGD is the simple Geometric distribution, while the infinite-order GGD corresponds to the nonparametric Kaplan-Meier based survival function. Finally, to estimate the true batting average, we propose two methods: the first takes the simple mean of the fitted GGD, and the second replaces each not-out score by its conditional mean under the fitted GGD before averaging. We show that while the two methods coincide for the two extreme GGDs (simple Geometric and nonparametric), they do not coincide in general. We also discuss the different approaches for estimating the average over a short or a long time horizon. We then compute batting averages by the different methods for all top players, in both forms of the game, and study the rank correlation. We also present results from numerical computations carried out using the scores of all opening batsmen as well as No. 11 batsmen in one-day cricket matches, to illustrate the model selection procedures. These results also establish that no single model in the family need be appropriate for all situations. We also focus on the batting averages of two players.
In particular, we show that Bradman's true average was quite possibly greater than 100, while Bevan has been a distinct beneficiary of the prevalent way of computing the average, as his one-day average appears to be an overestimate by a fair degree.
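As a rough illustration of the GGD idea, the following Python sketch fits a piecewise-constant hazard with known change points by maximum likelihood and then computes the two batting-average estimates described above. It is a sketch under stated assumptions rather than the estimators derived in the paper: the hazard is assumed constant between consecutive change points, the first change point is assumed to be 0, no further restriction is placed on the hazards, and all function names and sample scores are hypothetical.

```python
import math
from bisect import bisect_right


def fit_ggd(scores, dismissed, change_points):
    """MLE of the hazard in each segment [c_j, c_{j+1}); change_points[0] must be 0."""
    k = len(change_points)
    bounds = list(change_points) + [math.inf]
    outs = [0] * k       # dismissals falling in each segment
    survived = [0] * k   # scores in each segment passed without dismissal
    for s, d in zip(scores, dismissed):
        for j in range(k):
            lo, hi = bounds[j], bounds[j + 1]
            survived[j] += max(0, min(s, hi) - lo)
            if d and lo <= s < hi:
                outs[j] += 1
    if any(o + v == 0 for o, v in zip(outs, survived)):
        raise ValueError("a segment has no observations")
    # Each segment is a sequence of Bernoulli trials: hazard = outs / trials.
    return [o / (o + v) for o, v in zip(outs, survived)]


def conditional_mean(c, hazards, change_points):
    """E[X | X >= c] under the fitted GGD; inf if the last hazard is zero."""
    last_cp, last_h = change_points[-1], hazards[-1]
    if last_h == 0:
        return math.inf
    surv, extra = 1.0, 0.0   # surv tracks P(X > x | X >= c)
    for x in range(c, last_cp):
        surv *= 1.0 - hazards[bisect_right(change_points, x) - 1]
        extra += surv
    # Beyond the last change point the tail is Geometric with hazard last_h.
    extra += surv * (1.0 - last_h) / last_h
    return c + extra


def ggd_mean(hazards, change_points):
    """Method 1: unconditional mean of the fitted GGD."""
    return conditional_mean(0, hazards, change_points)


def imputed_average(scores, dismissed, hazards, change_points):
    """Method 2: replace each not-out score by its conditional mean, then average."""
    total = sum(s if d else conditional_mean(s, hazards, change_points)
                for s, d in zip(scores, dismissed))
    return total / len(scores)


# Hypothetical innings: 10 (out), 20 (out), 30 (not out); one change point at 15
scores, dismissed = [10, 20, 30], [True, True, False]
hazards = fit_ggd(scores, dismissed, [0, 15])
print(hazards)
print(ggd_mean(hazards, [0, 15]), imputed_average(scores, dismissed, hazards, [0, 15]))
```

With the single change point `[0]`, the sketch reduces to the simple Geometric model, and both methods return the traditional average, consistent with the coincidence noted above for that extreme case.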
Keywords: Average, Censored, Hazard, Kaplan-Meier estimator, (restricted) maximum likelihood.