GEV ML Inference with Adjustment for Missing Data

Fits a GEV distribution to block maxima using maximum likelihood estimation, with the option to make an adjustment for the numbers of non-missing raw values in each block. The GEV location and scale parameters are adjusted to reflect the proportion of raw values that are missing.

Usage

gev_mle(
  data,
  block_length,
  block,
  adjust = TRUE,
  discard = 0,
  init = "quartiles",
  ...
)

Arguments

data

Either

a numeric vector containing a time series of raw data,
an object returned from block_maxima, a list with components maxima, notNA and n,
a data frame or named list containing the same information, that is, the variables maxima, notNA and n, as an object returned from block_maxima, such as the data frame BrestSurgeMaxima.

block_length

A numeric scalar. Used calculate the maxima of disjoint blocks of block_length contiguous values in the vector data. If length(data) is not an integer multiple of block_length then the values at the end of data that do not constitute a complete block of length block_length are discarded, without warning.

block

A numeric vector with the same length as data. The value of block[i] indicates the block into which data[i] falls. For example, block could provide the year in which observation i was observed.

adjust

A logical scalar or a numeric scalar in [0, 100].

If adjust = TRUE then the adjustment, described in Details, for the numbers of non-missing values underlying each block maximum is performed.
If adjust = FALSE then no adjustment is made, that is, the block maxima are treated as if the underlying raw data have no missing values.

discard

A numeric scalar. Any block maximum for which greater than discard percent of the underlying raw values were missing is discarded. Whether or not an adjustment for missingness is made for the block maxima that remain is determined by adjust.

init

Either a character scalar, one of "quartiles" or "moments", or a numeric vector of length 3 giving initial estimates of the GEV location, scale and shape parameters: $\mu$, $\sigma$ and $\xi$. If init = "quartiles" then initial estimates of $\mu$ and $\sigma$ are based on sample quartiles of block maxima, ignoring the underlying numbers of non-missing raw data, and a value of 0 for $\xi$. If init = "moments" then instead we use the sample mean and variance of these maxima and an initial value of 0.1 for $\xi$.

...

Further arguments to be passed to stats::optim.

Value

A list, returned from stats::optim (the MLEs are in the component par), with the additional components:

loglik: value of the maximised log-likelihood.
vcov: estimated variance-covariance matrix of the parameters.
se: estimated standard errors of the parameters.
maxima: the vector of block maxima used to fit the model.
notNA: the number of non-missing raw values on which the maxima in maxima are based.
n: the maximal block length, that is, the largest number of values that could have been observed in each of these blocks.
adjust,discard : the values of these input arguments.

The call to gev_mle is provided in the attribute "call".

The class of the returned object is c("evmissing", "mle", "list").

Objects inheriting from class "evmissing" have coef, logLik, nobs, summary, vcov and confint methods. See evmissing_methods.

Details

If data is numeric vector then exactly one of the arguments block_length or block must be supplied. The parameters are fitted using maximum likelihood estimation.

The adjustment for the numbers of non-missing values underlying the block maxima is based on the strong assumption that missing values occur completely at random. We suppose that a block maximum $M_n$ based on a full (no missing raw values) block of length $n$ has a $\text{GEV}(\mu, \sigma, \xi)$ distribution, with distribution function $G(x)$. Let $n_i$ be the number of non-missing values in block $i$ and let $M_{n_i}$ denote the block maximum of such a block. We suppose that $M_{n_i}$ has a $\text{GEV}(\mu(n_i), \sigma(n_i), \xi)$ distribution, where $$\mu(n_i) = \mu + \sigma [(n_i/n)^\xi -1] / \xi,$$ $$\sigma(n_i) = \sigma (n_i/n)^\xi.$$

These expressions are based on inferring the parameters of an approximate GEV distribution for $M_{n_i}$ from its approximate distribution function $[G(x)]^{n_i/n}$.

A likelihood is constructed as the product of contributions from the maxima from distinct blocks, under the assumption that these maxima are independent. Let $\theta = (\mu, \sigma, \xi)$ and let $\ell_F(\underline{z}; \theta)$ denote the usual, unadjusted, GEV log-likelihood for the full-data case where there are no missing values. It can be shown that our adjusted log-likelihood $\ell(\theta, \underline{z})$ is given by

$$\ell(\theta, \underline{z}) = \ell_F(\underline{z}; \theta) - \sum_{i=1}^n p_i \log G(z_i; \theta)$$

where $p_i = 1 - n_i / n$ is the proportion of missing values in block $i$.

The negated log-likelihood is minimised using a call to stats::optim with hessian = TRUE. If stats::optim throws an error then a warning is produced and the returned object has NA values for the components par, loglik, vcov and se and an extra component optim_error containing the error message. If the estimated observed information matrix is singular then a warning is produced and the returned object has NA values for the components vcov and se.

Examples

## Simulate raw data from an exponential distribution

set.seed(13032025)
blocks <- 50
block_length <- 365
sdata <- sim_data(blocks = blocks, block_length = block_length)

# sdata$data_full have no missing values
# sdata$data_miss have had missing values created artificially

# Fit a GEV distribution to block maxima from the full data
fit1 <- gev_mle(sdata$data_full, block_length = sdata$block_length)
summary(fit1)
#> 
#> Call:
#> gev_mle(data = sdata$data_full, block_length = sdata$block_length)
#> 
#>       Estimate Std. Error
#> mu     5.82700     0.1769
#> sigma  1.08400     0.1306
#> xi    -0.01449     0.1243

# An identical fit supplying the block indicator instead of block_length
fit1b <- gev_mle(sdata$data_full, block = sdata$block)
summary(fit1b)
#> 
#> Call:
#> gev_mle(data = sdata$data_full, block = sdata$block)
#> 
#>       Estimate Std. Error
#> mu     5.82700     0.1769
#> sigma  1.08400     0.1306
#> xi    -0.01449     0.1243

# Make adjustment for the numbers of non-missing values per block
fit2 <- gev_mle(sdata$data_miss, block_length = sdata$block_length)
summary(fit2)
#> 
#> Call:
#> gev_mle(data = sdata$data_miss, block_length = sdata$block_length)
#> 
#>       Estimate Std. Error
#> mu     5.97300     0.1907
#> sigma  1.07400     0.1157
#> xi    -0.05945     0.1177

# Do not make the adjustment
fit3 <- gev_mle(sdata$data_miss, block_length = sdata$block_length,
                adjust = FALSE)
summary(fit3)
#> 
#> Call:
#> gev_mle(data = sdata$data_miss, block_length = sdata$block_length, 
#>     adjust = FALSE)
#> 
#>       Estimate Std. Error
#> mu     5.66300     0.1785
#> sigma  1.10600     0.1295
#> xi    -0.06951     0.1152

# Remove all block maxima with greater than 25% missing values and
# do not make the adjustment
fit4 <- gev_mle(sdata$data_miss, block_length = sdata$block_length,
                adjust = FALSE, discard = 25)
summary(fit4)
#> 
#> Call:
#> gev_mle(data = sdata$data_miss, block_length = sdata$block_length, 
#>     adjust = FALSE, discard = 25)
#> 
#>       Estimate Std. Error
#> mu     5.67200     0.2454
#> sigma  1.08200     0.1845
#> xi    -0.01368     0.1880

## Plymouth ozone data

# Calculate the values in Table 4 of Simpson and Northrop (2026)
# discard = 50 is chosen to discard data from 2001 and 2006
fit1 <- gev_mle(PlymouthOzoneMaxima)
fit2 <- gev_mle(PlymouthOzoneMaxima, adjust = FALSE)
fit3 <- gev_mle(PlymouthOzoneMaxima, discard = 50)
fit4 <- gev_mle(PlymouthOzoneMaxima, adjust = FALSE, discard = 50)
se <- function(x) return(sqrt(diag(vcov(x))))
MLEs <- cbind(coef(fit1), coef(fit2), coef(fit3), coef(fit4))
SEs <- cbind(se(fit1), se(fit2), se(fit3), se(fit4))
round(MLEs, 2)
#>         [,1]   [,2]   [,3]   [,4]
#> mu    128.77 126.52 129.55 128.59
#> sigma  18.81  25.50  17.65  17.78
#> xi      0.00  -0.28   0.04   0.04
round(SEs, 2)
#>       [,1] [,2] [,3] [,4]
#> mu    4.40 5.53 4.60 4.55
#> sigma 2.63 4.00 3.41 3.61
#> xi    0.16 0.15 0.26 0.27