AStA-Advances in Statistical Analysis

Clustering of extreme values: estimation and application
Ferreira M
Extreme value theory (EVT) encompasses a set of methods for inferring the risk inherent in various phenomena in economics, finance, actuarial science, environmental science, hydrology, and climatology, as well as in various areas of engineering. In many situations the clustering of high values may affect the risk of occurrence of extreme phenomena. Examples include extreme temperatures that persist over time and result in drought, persistent intense rain leading to floods, and stock markets in successive falls with consequent catastrophic losses. The extremal index is an EVT measure of the degree of clustering of extreme values. In many situations, and under certain conditions, it corresponds to the reciprocal of the average size of high-value clusters. The estimation of the extremal index generally entails two sources of uncertainty: the level above which observations are considered high and the identification of clusters. There are several contributions in the literature on the estimation of the extremal index, including methodologies to overcome the aforementioned sources of uncertainty. In this work we revisit several existing estimators, apply automatic choice methods, both for the threshold and for the clustering parameter, and compare the performance of the methods. We end with an application to meteorological data.
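As an illustration of the "reciprocal of the average cluster size" interpretation, here is a minimal sketch of a runs-declustering estimate. It is not necessarily one of the estimators compared in the paper. The function name, the fixed `threshold`, and the `run_length` parameter are assumptions made for this example; both would need to be chosen (for instance automatically) in practice.

```python
def runs_extremal_index(series, threshold, run_length):
    """Runs-declustering sketch: number of clusters / number of exceedances.

    Two exceedances belong to different clusters when at least `run_length`
    consecutive non-exceedances lie between them (an assumed convention).
    """
    exceed_positions = [i for i, value in enumerate(series) if value > threshold]
    if not exceed_positions:
        raise ValueError("no observation exceeds the threshold")

    clusters = 1
    for previous, current in zip(exceed_positions, exceed_positions[1:]):
        gap = current - previous - 1  # non-exceedances between the two
        if gap >= run_length:
            clusters += 1

    # clusters / exceedances is the reciprocal of the mean cluster size
    return clusters / len(exceed_positions)


if __name__ == "__main__":
    toy_series = [0, 5, 6, 0, 0, 0, 7, 0, 8, 0, 0, 0, 9]
    print(runs_extremal_index(toy_series, threshold=4, run_length=2))
```

In the toy series, five exceedances fall into three clusters, so the mean cluster size is 5/3 and the estimate is its reciprocal. A value close to 1 would indicate little clustering of extremes.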
A dynamic causal modeling of the second outbreak of COVID-19 in Italy
Bilancia M, Vitale D, Manca F, Perchinunno P and Santacroce L
While the vaccination campaign against COVID-19 is having its positive impact, we retrospectively analyze the causal impact of some decisions made by the Italian government on the second outbreak of the SARS-CoV-2 pandemic in Italy, when no vaccine was available. First, we analyze the causal impact of reopenings after the first lockdown in 2020. In addition, we analyze the impact of reopening schools in September 2020. Our results provide an unprecedented opportunity to evaluate the causal relationship between the relaxation of restrictions and the transmission in the community of a highly contagious respiratory virus that causes severe illness in the absence of prophylactic vaccination programs. We present a purely data-analytic approach based on a Bayesian methodology and discuss possible interpretations of the results obtained and implications for policy makers.
Control charts for measurement error models
Golosnoy V, Hildebrandt B, Köhler S, Schmid W and Seifert MI
We consider a linear measurement error model (MEM) with an AR(1) process in the state equation, which is widely used in applied research. This MEM can be equivalently rewritten as an ARMA(1,1) process, where the MA(1) parameter is related to the variance of the measurement errors. As the MA(1) parameter is of essential importance for these linear MEMs, it is highly relevant to provide instruments for online monitoring in order to detect possible changes in it. In this paper we develop control charts for online detection of such changes, i.e., from AR(1) to ARMA(1,1) and vice versa, as soon as they occur. For this purpose, we elaborate on both cumulative sum (CUSUM) and exponentially weighted moving average (EWMA) control charts and investigate their performance in a Monte Carlo simulation study. The empirical illustration of our approach is based on time series of daily realized volatilities.
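The link between the measurement-error variance and the MA(1) parameter can be made concrete by matching autocovariances. The sketch below assumes the textbook form y_t = x_t + e_t with x_t = phi * x_{t-1} + eta_t and mutually independent white-noise errors. Then y_t - phi * y_{t-1} = eta_t + e_t - phi * e_{t-1}, which has the autocovariance structure of an MA(1) process. The function name and arguments are illustrative and not taken from the paper.

```python
import math


def implied_ma_parameter(phi, state_var, noise_var):
    """Invertible MA(1) parameter of the ARMA(1,1) form of AR(1) plus noise.

    Assumes state_var > 0 (variance of eta_t) and noise_var >= 0
    (variance of the measurement error e_t).
    """
    # Autocovariances of w_t = eta_t + e_t - phi * e_{t-1}
    gamma0 = state_var + (1.0 + phi ** 2) * noise_var
    gamma1 = -phi * noise_var
    if gamma1 == 0.0:
        return 0.0  # no measurement error (or phi = 0): no MA(1) term

    # An MA(1) process has lag-1 autocorrelation theta / (1 + theta^2);
    # solve for the root with |theta| < 1.
    rho1 = gamma1 / gamma0
    return (1.0 - math.sqrt(1.0 - 4.0 * rho1 ** 2)) / (2.0 * rho1)


if __name__ == "__main__":
    for noise_var in (0.0, 0.5, 1.0, 5.0):
        print(noise_var, implied_ma_parameter(0.5, 1.0, noise_var))
```

With zero measurement-error variance the MA(1) parameter is zero and the observed series is a pure AR(1) process. This corresponds to the AR(1) versus ARMA(1,1) change that the control charts are designed to detect.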
Prediction model-based kernel density estimation when group membership is subject to missing
He H, Wang W and Tang W
The density function is a fundamental concept in data analysis. When a population consists of heterogeneous subjects, it is often of great interest to estimate the density functions of the subpopulations. Nonparametric methods such as kernel smoothing estimates may be applied to each subpopulation to estimate the density functions if there are no missing values. In situations where the membership for a subpopulation is missing, kernel smoothing estimates using only subjects with membership available are valid only under missing completely at random (MCAR). In this paper, we propose new kernel smoothing methods for density function estimates by applying prediction models of the membership under the missing at random (MAR) assumption. The asymptotic properties of the new estimates are developed, and simulation studies and a real study in mental health are used to illustrate the performance of the new estimates.
Longitudinal Dynamic Analyses of Cognition in the Health and Retirement Study Panel
McArdle JJ
Authors' response: on the role of data, statistics and decisions in a pandemic
Jahn B, Friedrich S, Behnke J, Engel J, Garczarek U, Münnich R, Pauly M, Wilhelm A, Wolkenhauer O, Zwick M, Siebert U and Friede T
Comment on: On the role of data, statistics and decisions in a pandemic statistics for climate protection and health-dare (more) progress!
Radermacher WJ
In the Corona pandemic, it became clear with burning clarity how much good quality statistics are needed, and at the same time how unsuccessful we are at providing such statistics despite the existing technical and methodological possibilities and diverse data sources. It is therefore more than overdue to get to the bottom of the causes of these issues and to learn from the findings. This defines a high aspiration, namely that firstly a diagnosis is carried out in which the causes of the deficiencies with their interactions are identified as broadly as possible. Secondly, such a broad diagnosis should result in a therapy that includes a coherent strategy that can be generalised, i.e. that goes beyond the Corona pandemic.
Estimating the change in soccer's home advantage during the Covid-19 pandemic using bivariate Poisson regression
Benz LS and Lopez MJ
In the wake of the Covid-19 pandemic, 2019-2020 soccer seasons across the world were postponed and eventually made up during the summer months of 2020. Researchers from a variety of disciplines jumped at the opportunity to compare the rescheduled games, played in front of empty stadia, to previous games, played in front of fans. To date, most of this post-Covid soccer research has used linear regression models, or versions thereof, to estimate potential changes to the home advantage. However, we argue that leveraging the Poisson distribution would be more appropriate and use simulations to show that bivariate Poisson regression (Karlis and Ntzoufras in J R Stat Soc Ser D Stat 52(3):381-393, 2003) reduces absolute bias when estimating the home advantage benefit in a single season of soccer games, relative to linear regression, by almost 85%. Next, with data from 17 professional soccer leagues, we extend bivariate Poisson models to estimate the change in home advantage due to games being played without fans. In contrast to current research that suggests a drop in the home advantage, our findings are mixed; in some leagues, evidence points to a decrease, while in others, the home advantage may have risen. Altogether, this suggests a more complex causal mechanism for the impact of fans on sporting events.
Robust fitting of mixtures of GLMs by weighted likelihood
Greco L
Finite mixtures of generalized linear models are commonly fitted by maximum likelihood and the EM algorithm. The estimation process and subsequent inferential and classification procedures can be badly affected by the occurrence of outliers. Actually, contamination in the sample at hand may lead to severely biased fitted components and poor classification accuracy. In order to take into account the potential presence of outliers, a robust fitting strategy is proposed that is based on the weighted likelihood methodology. The technique exhibits a satisfactory behavior in terms of both fitting and classification accuracy, as confirmed by some numerical studies and real data examples.
Estimation of final standings in football competitions with a premature ending: the case of COVID-19
Gorgi P, Koopman SJ and Lit R
We study an alternative approach to determine the final league table in football competitions with a premature ending. For several countries, a premature ending of the 2019/2020 football season has occurred due to the COVID-19 pandemic. We propose a model-based method as a possible alternative to the use of the incomplete standings to determine the final table. This method measures the performance of the teams in the matches of the season that have been played and predicts the remaining non-played matches through a paired-comparison model. The main advantage of the method compared to the incomplete standings is that it takes account of the bias in the performance measure due to the schedule of the matches in a season. Therefore, the resulting ranking of the teams based on our proposed method can be regarded as more fair in this respect. A forecasting study based on historical data of seven of the main European competitions is used to validate the method. The empirical results suggest that the model-based approach produces more accurate predictions of the true final standings than those based on the incomplete standings.
The Probabilistic Final Standing Calculator: a fair stochastic tool to handle abruptly stopped football seasons
Van Eetvelde H, Hvattum LM and Ley C
The COVID-19 pandemic has left its marks in the sports world, forcing the full stop of all sports-related activities in the first half of 2020. Football leagues were suddenly stopped, and each country was hesitating between a relaunch of the competition and a premature ending. Some opted for the latter option and took as the final standing of the season the ranking from the moment the competition got interrupted. This decision has been perceived as unfair, especially by those teams who had remaining matches against easier opponents. In this paper, we introduce a tool to calculate in a fairer way the final standings of domestic leagues that have to stop prematurely: our Probabilistic Final Standing Calculator (PFSC). It is based on a stochastic model taking into account the results of the matches played and simulating the remaining matches, yielding the probabilities for the various possible final rankings. We have compared our PFSC with state-of-the-art prediction models, using previous seasons which we pretend to stop at different points in time. We illustrate our PFSC by showing how a probabilistic ranking of the French Ligue 1 in the stopped 2019-2020 season could have led to alternative, potentially fairer, decisions on the final standing.
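The general idea of simulating the remaining matches to obtain probabilities for each final rank can be sketched as follows. This is not the PFSC model itself: the independent Poisson goal counts, the per-team `scoring_rate` values, the 3-1-0 points rule, and the random tie-breaking are all assumptions made for illustration, and every name in the code is hypothetical.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw a Poisson variate with Knuth's method (fine for small rates)."""
    limit = math.exp(-rate)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def final_rank_probabilities(points, remaining, scoring_rate, n_sims=2000, seed=1):
    """Estimate P(team finishes at rank k) by simulating unplayed matches."""
    rng = random.Random(seed)
    teams = sorted(points)
    rank_counts = {team: [0] * len(teams) for team in teams}

    for _ in range(n_sims):
        table = dict(points)  # start from the interrupted standings
        for home, away in remaining:
            home_goals = sample_poisson(scoring_rate[home], rng)
            away_goals = sample_poisson(scoring_rate[away], rng)
            if home_goals > away_goals:
                table[home] += 3
            elif home_goals < away_goals:
                table[away] += 3
            else:
                table[home] += 1
                table[away] += 1
        # Rank by points; equal points are ordered at random (an assumption)
        order = sorted(teams, key=lambda team: (-table[team], rng.random()))
        for rank, team in enumerate(order):
            rank_counts[team][rank] += 1

    return {team: [c / n_sims for c in counts] for team, counts in rank_counts.items()}


if __name__ == "__main__":
    standings = {"A": 50, "B": 48, "C": 30}
    unplayed = [("A", "B")]
    rates = {"A": 1.4, "B": 1.2, "C": 0.9}
    print(final_rank_probabilities(standings, unplayed, rates))
```

Each returned list gives the estimated probability of a team finishing first, second, and so on. A probabilistic ranking of this kind is what allows alternative decisions on the final standing to be weighed.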
Regional now- and forecasting for data reported with delay: toward surveillance of COVID-19 infections
De Nicola G, Schneble M, Kauermann G and Berger U
Governments around the world continue to act to contain and mitigate the spread of COVID-19. The rapidly evolving situation compels officials and executives to continuously adapt policies and social distancing measures depending on the current state of the spread of the disease. In this context, it is crucial for policymakers to have a firm grasp on what the current state of the pandemic is, and to envision how the number of infections is going to evolve over the next days. However, as in many other situations involving compulsory registration of sensitive data, cases are reported with delay to a central register, with this delay deferring an up-to-date view of the state of things. We provide a stable tool for monitoring current infection levels as well as predicting infection numbers in the immediate future at the regional level. We accomplish this through nowcasting of cases that have not yet been reported as well as through predictions of future infections. We apply our model to German data, for which our focus lies in predicting and explaining infectious behavior by district.
A spatial randomness test based on the box-counting dimension
Caballero Y, Giraldo R and Mateu J
Statistical modelling of a spatial point pattern often begins by testing the hypothesis of spatial randomness. Classical tests are based on quadrat counts and distance-based methods. Alternatively, we propose a new statistical test of spatial randomness based on the fractal dimension, calculated through the box-counting method, which provides an inferential perspective in contrast to the more common descriptive use of this method. We also develop a graphical test based on the log-log plot used to calculate the box-counting dimension. We evaluate the performance of our methodology by conducting a simulation study and analysing a COVID-19 dataset. The results confirm the good performance of the method, which arises as an alternative to the more classical distance-based strategies.
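For readers unfamiliar with the box-counting method, the following sketch shows the descriptive calculation only: count the occupied boxes at several box sizes and take the slope of the log-log relationship. It does not implement the proposed randomness test. The function name, the choice of box sizes, and the plain least-squares fit are assumptions for this example.

```python
import math


def box_counting_dimension(points, box_sizes):
    """Slope of log(occupied boxes) against log(1 / box size)."""
    log_inverse_size, log_count = [], []
    for size in box_sizes:
        # Identify each point's box by its integer grid coordinates
        occupied = {(math.floor(x / size), math.floor(y / size)) for x, y in points}
        log_inverse_size.append(math.log(1.0 / size))
        log_count.append(math.log(len(occupied)))

    # Ordinary least-squares slope of the log-log plot
    n = len(box_sizes)
    mean_x = sum(log_inverse_size) / n
    mean_y = sum(log_count) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_inverse_size, log_count))
    sxx = sum((a - mean_x) ** 2 for a in log_inverse_size)
    return sxy / sxx


if __name__ == "__main__":
    sizes = [1 / 2, 1 / 4, 1 / 8, 1 / 16]
    grid = [(i / 64, j / 64) for i in range(64) for j in range(64)]
    line = [(i / 1024, i / 1024) for i in range(1024)]
    print(box_counting_dimension(grid, sizes))  # pattern filling the square
    print(box_counting_dimension(line, sizes))  # pattern concentrated on a line
```

A pattern that fills the unit square yields a slope near 2, while points concentrated on a line yield a slope near 1. This contrast is the kind of departure from spatial randomness that a dimension-based test can exploit.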
Editorial special issue: Statistics in sports
Groll A and Liebl D
Triggered by advances in data gathering technologies, the use of statistical analyses, predictions and modeling techniques in sports has attracted rapidly growing interest over recent decades. Today, professional sports teams have access to precise player positioning data, and sports scientists design experiments involving non-standard data structures such as movement trajectories. This special issue on statistics in sports is dedicated to further fostering the development of statistics and its applications in sports. The contributed articles address a wide range of statistical problems such as statistical methods for prediction of game outcomes, for prevention of sports injuries, for analyzing sports science data from movement laboratories, for measurement and evaluation of player performance, etc. Finally, SARS-CoV-2 pandemic-related impacts on the framework of sports are also investigated.
On the role of data, statistics and decisions in a pandemic
Jahn B, Friedrich S, Behnke J, Engel J, Garczarek U, Münnich R, Pauly M, Wilhelm A, Wolkenhauer O, Zwick M, Siebert U and Friede T
A pandemic poses particular challenges to decision-making because of the need to continuously adapt decisions to rapidly changing evidence and available data. For example, which countermeasures are appropriate at a particular stage of the pandemic? How can the severity of the pandemic be measured? What is the effect of vaccination in the population and which groups should be vaccinated first? The process of decision-making starts with data collection and modeling and continues to the dissemination of results and the subsequent decisions taken. The goal of this paper is to give an overview of this process and to provide recommendations for the different steps from a statistical perspective. In particular, we discuss a range of modeling techniques including mathematical, statistical and decision-analytic models along with their applications in the COVID-19 context. With this overview, we aim to foster the understanding of the goals of these modeling approaches and the specific data requirements that are essential for the interpretation of results and for successful interdisciplinary collaborations. A special focus is on the role played by data in these different models, and we incorporate into the discussion the importance of statistical literacy and of effective dissemination and communication of findings.
Describing a landscape we are yet discovering
Contreras S, Dehning J and Priesemann V
Discussion on : by Jahn et al (2022)
Berger U, Kauermann G and Küchenhoff H
The authors make an important contribution by presenting a comprehensive and thoughtful overview of the many different aspects of data, statistics and data analyses in times of the recent COVID-19 pandemic, discussing all relevant topics. The paper certainly provides a very valuable reflection on what has been done, what could have been done and what needs to be done. We contribute here with a few comments and some additional issues. We do not discuss all chapters of Jahn et al. (AStA Adv Stat Anal, 2022. 10.1007/s10182-022-00439-7), but focus on those where our personal views and experiences might add some additional aspects.
Having a ball: evaluating scoring streaks and game excitement using in-match trend estimation
Ekstrøm CT and Jensen AK
Many popular sports involve matches between two teams or players where each team has the possibility of scoring points throughout the match. While the overall match winner and result are interesting, they convey little information about the underlying scoring trends throughout the match. Modeling approaches that accommodate a finer granularity of the score difference throughout the match are needed to evaluate in-game strategies, discuss scoring streaks, team strengths, and other aspects of the game. We propose a latent Gaussian process to model the score difference between two teams and introduce the Trend Direction Index as an easily interpretable probabilistic measure of the current trend in the match as well as a measure of post-game trend evaluation. In addition, we propose the Excitement Trend Index, the expected number of monotonicity changes in the running score difference, as a measure of overall game excitement. Our proposed methodology is applied to all 1143 matches from the 2019-2020 National Basketball Association season. We show how the trends can be interpreted in individual games and how the excitement score can be used to cluster teams according to how exciting they are to watch.
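To make "monotonicity changes in the running score difference" concrete, the sketch below counts direction changes in an observed sequence of score differences. This is only a raw-count analogue: the index in the paper is an expected number defined through the latent Gaussian process, which is not modeled here. The function name and the convention of ignoring flat steps are assumptions for this example.

```python
def count_monotonicity_changes(score_difference):
    """Count switches between increasing and decreasing runs.

    Steps with no change in the score difference are ignored
    (an assumed convention for this sketch).
    """
    changes = 0
    last_direction = 0  # 0 means no direction has been observed yet
    for earlier, later in zip(score_difference, score_difference[1:]):
        step = later - earlier
        if step == 0:
            continue
        direction = 1 if step > 0 else -1
        if last_direction != 0 and direction != last_direction:
            changes += 1
        last_direction = direction
    return changes


if __name__ == "__main__":
    running_difference = [0, 2, 4, 3, 1, 1, 2, 5, 4]
    print(count_monotonicity_changes(running_difference))
```

In the toy sequence the difference rises, falls, rises again, and finally falls, giving three changes of direction. A match dominated by one team throughout would give a count near zero.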
Comment "On the role of data, statistics and decisions in a pandemic" by Jahn et al
Höhle M
We comment on the paper by Jahn et al. (On the role of data, statistics and decisions in a pandemic, 2022).
Integration of model-based recursive partitioning with bias reduction estimation: a case study assessing the impact of Oliver's four factors on the probability of winning a basketball game
Migliorati M, Manisera M and Zuccolotto P
In this contribution, we investigate the importance of Oliver's Four Factors, proposed in the literature to identify a basketball team's strengths and weaknesses in terms of shooting, turnovers, rebounding and free throws, as success drivers of a basketball game. In order to investigate the role of each factor in the success of a team in a match, we applied the MOdel-Based recursive partitioning (MOB) algorithm to real data concerning 19,138 matches of 16 National Basketball Association (NBA) regular seasons (from 2004-2005 to 2019-2020). MOB, instead of fitting one global Generalized Linear Model (GLM) to all observations, partitions the observations according to selected partitioning variables and estimates several ad hoc local GLMs for subgroups of observations. The manuscript's aim is twofold: (1) in order to deal with (quasi) separation problems leading to convergence problems in the numerical solution of Maximum Likelihood (ML) estimation in MOB, we propose a methodological extension of GLM-based recursive partitioning from standard ML estimation to bias-reduced (BR) estimation; and (2) we apply the BR-based GLM trees to basketball analytics. The resulting models are very easy to interpret and can provide useful support for the coaching staff's decisions.