library(ggplot2)
library(lubridate)
library(dplyr)
library(kableExtra)
library(scales)
library(boot)
library(parallel)
library(purrr)
source("CDF-grapher.R")
source("clopper-pearson-CI.R")
source("log-likelihood.R")
source("mixture-distribution.R")
source("generalized-pareto.R")
set.seed(123)
# Constants
p = 0.8 # Our goal is a confidence interval for the p-quantile.
Our data consists of day-over-day insurance claims made after fire damage. We will assume that our data comprises an independent and identically distributed random sample.
Our data consists of the total loss claimed by day from 2005-01-01 to 2016-12-31 (12 years).
Let’s look at some summary statistics of the total daily loss.
# We use the population quantile estimator recommended by Hyndman and Fan (1996).
summary(total_loss, quantile.type = 8)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 500 5000 27523 23192 3450500
Notice that the range of the total daily loss is in the millions. Let’s naively try making a histogram.
hist(total_loss, main = "Total Daily Loss", xlab = "Total Daily Loss")
The total daily loss looks extremely right-skewed. How often did we have a total daily loss of $0?
num_days = length(total_loss)
num_zero_days = length(total_loss[total_loss == 0])
proportion_zero_days = num_zero_days/num_days
There were 835 days with zero claims. This is 19.05% of the days.
Let’s look at the sum of total daily loss per month.
Do some months have more days with non-zero claims than other months?
It appears that the proportion of non-zero claim days differs from month to month.
Let’s test the null hypothesis that the proportion of non-zero claim days is the same for each month in the timespan that the data was collected vs. the alternative that the proportions are not all the same.
prop.test(x = prop_loss_grouped_2$num_non_zero_days, n = prop_loss_grouped_2$total_days)
##
## 12-sample test for equality of proportions without continuity
## correction
##
## data: prop_loss_grouped_2$num_non_zero_days out of prop_loss_grouped_2$total_days
## X-squared = 24.889, df = 11, p-value = 0.009464
## alternative hypothesis: two.sided
## sample estimates:
## prop 1 prop 2 prop 3 prop 4 prop 5 prop 6 prop 7 prop 8
## 0.7553763 0.7669617 0.8145161 0.8222222 0.8064516 0.8361111 0.8306452 0.8548387
## prop 9 prop 10 prop 11 prop 12
## 0.7805556 0.7903226 0.8444444 0.8091398
It appears likely at the 1% significance level that the proportion of non-zero claim days is not the same per month. In other words, there is an association between month and fire risk.
For the purposes of the present analysis, we will only try to estimate the p-quantile of total daily loss without reference to the month. In future studies, a hierarchical model could perhaps be built that looks at the number of claims that occur per month and uses that information to estimate a p-quantile on a month-by-month basis. Future studies could perhaps make use of the multinomial distribution and quantile regression.
Let’s compute \(G_{1}\), the Adjusted Fisher–Pearson Standardized Moment Coefficient[1] to get an idea of the population skewness.
\[ G_{1} = \frac{\sqrt{n\left(n-1\right)}}{n-2} \cdot \frac{\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{3}}{\left[\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}\right]^{3/2}} \]
## [1] 18.03185
For comparison, a symmetric distribution would have a skewness of 0. An exponential distribution would have a skewness of 2.
Excess kurtosis measures the heaviness of the tails of a distribution in relation to the normal distribution. Distributions with fat tails will have a positive excess kurtosis.[2] We expect our distribution to have positive excess kurtosis because both high-valued properties and accidents are rare. We will estimate the population excess kurtosis by using the statistic
\[ G_{2} = \frac{\left(n+1\right)n\left(n-1\right)} {\left(n-2\right)\left(n-3\right)} \cdot \frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{4}} {\left[\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}\right]^{2}} -3 \cdot \frac{\left( n-1 \right)^{2}} {\left( n-2 \right)\left( n-3 \right)} \]
## [1] 453.0895
This high positive value for the excess kurtosis indicates that our distribution has heavy tails. This is often the case for non-life insurance claims.
Is the total daily loss rounded to the nearest dollar?
is.integer(total_loss)
## [1] TRUE
Is the total daily loss rounded?
Last Digit(s) of Total Loss | Number of Days | Proportion |
---|---|---|
0 | 835 | 0.1905 |
00 | 2971 | 0.6778 |
other | 577 | 0.1316 |
We can see that it is very common for the total daily loss to be rounded to the nearest $100. What happens when we admit the last 3 digits into consideration?
Last Digit(s) of Total Loss | Number of Days | Proportion |
---|---|---|
0 | 835 | 0.1905 |
00 | 0 | 0.0000 |
000 | 1312 | 0.2993 |
500 | 626 | 0.1428 |
other | 1610 | 0.3673 |
Now we see that many total losses are in units of 500 or 1000 USD. For simplicity, we will ignore the possibly discrete nature of the total daily loss.
total_loss_ecdf = ecdf(total_loss)
plot(total_loss_ecdf, main = "Empirical Cumulative Distribution Function\n of Total Daily Loss")
We can zoom in around the 0.8-quantile.
# This finds the p-quantile of our sample data.
p_quantile = quantile(total_loss_ecdf, probs = p, type = 8) # We use the population quantile estimator recommended by Hyndman and Fan (1996).
plot(total_loss_ecdf, main = "Empirical Cumulative Distribution Function\n of Total Loss", xlim = c(2e4,6e4), ylim = c(0.7, 0.9))
abline(h = p, v = p_quantile, col = "red")
plot(total_loss_ecdf, main = "Empirical Cumulative Distribution Function\n of Total Daily Loss", xlim = c(3.1e4,3.2e4), ylim = c(0.798, 0.802))
abline(h = p, v = p_quantile, col = "red")
Our point estimate of the 0.8-quantile is $31,400.00.
The Dvoretzky–Kiefer–Wolfowitz Inequality[3] allows for the construction of simultaneous confidence bands around the Emperical CDF. The inequality is non-parametric so it is not necessary to first fit a distribution to the total daily loss. We will choose \(\alpha = 0.05\). The inequality states:
\[ P\left(\sup_{x \in \mathbb{R}} \left| F_{n} \left( x \right) - F \left( x \right)\right| > \epsilon \right) \le 2e^{-2n\epsilon ^{2}} = \alpha \]
or, with probability \(1 - \alpha\), the interval containing the true CDF is
\[ F_{n} \left(x \right) - \epsilon \leq F \left(x \right) \leq F_{n} \left(x \right) + \epsilon \quad \text{where} \ \epsilon = \sqrt{\frac{\ln \frac{2}{\alpha}}{2n}} \]
alpha = 0.05
epsilon = sqrt(log(2/alpha)/(2*num_days))
For our problem, \(\epsilon =\) 0.0205.
One weakness of the Dvoretzky–Kiefer–Wolfowitz bands is that their coverage probability is lower near the median of the distribution while higher near the tails. However, these bands very easily capture where the true CDF may lie.
From the convergence of the Empirical CDF, [4], it has been shown that
\[ n \hat{F}_{n} \left( x \right) \sim \text{Binom}\left( n, F \left( x \right) \right) \]
We can use our preferred method of constructing a CI for a binomial success probability to obtain a point-wise CI for \(F(x)\). Note that our point-estimate for the true CDF in these intervals will be the Empirical CDF. We will try to construct an exact Clopper-Pearson CI because it is conservative.
Note that the Hyndman and Fan (1996) Type 8 Estimate of the p-quantile uses two order statistics. Therefore, our point-estimate of the p-quantile corresponds to not one, but to two order statistics. There is thus no obvious way to map our estimate back to a single order statistic. We choose the larger of the two order statistics to give an upward bias to our confidence interval. We feel that it is a greater cause for concern if the actual p-quantile is above the upper bound than if it is below the lower bound.
k = floor(num_days*p + (p + 1)/3) + 1 # k is the index for the larger order stat
# Get the upper and lower bounds on the true CDF corresponding to k.
CDF_bounds = get_clopper_pearson(x = k, n = num_days, alpha = 0.05)
# Transform the bounds obtained above into quantiles of daily total loss using the Hyndman and Fan (1996) type 8 quantile estimator.
CI = data.frame(Method = "Binomial CI on ECDF",
Nominal_Coverage_Probability = 1 - alpha,
LB = round(quantile(x = total_loss, probs = CDF_bounds[1], type = 8, names = FALSE), 2),
UB = round(quantile(x = total_loss, probs = CDF_bounds[2], type = 8, names = FALSE), 2),
row.names = NULL
)
We are 95% confident that the 0.8-quantile lies between $30,000 and $34,250.
# This function takes a vector of numeric sample data and finds the estimate of the p-quantile of the corresponding population according to the method of Ivan Frohne and Rob J Hyndman.
find_qp = function(data, i, p) {
quantile(data[i], p, type = 8) # Approximately median-unbiased
}
# Reduce the number of replications to speed up computation time.
R = 1e4
cpus_on_machine = detectCores()
bootstrap_results_1 = boot(data = total_loss, statistic = find_qp, p = p, R = R, sim = "ordinary", parallel = "snow", ncpus = cpus_on_machine)
alpha = 0.05
CI[2, "Method"] = "Percentile Bootstrap"
CI[2, "Nominal_Coverage_Probability"] = 1 - alpha
CI[2, "LB"] = round(quantile(bootstrap_results_1$t[,1], p = alpha/2, type = 8, names = FALSE), 0)
CI[2, "UB"] = round(quantile(bootstrap_results_1$t[,1], p = 1 - alpha/2, type = 8, names = FALSE), 0)
hist(bootstrap_results_1$t[,1], xlim = c(2.8e4, 3.6e4), nclass = 30, xlab = paste0(p, "-Quantile of Resampled Total Loss"), main = paste0("Percentile Bootstrap Estimate of ", p, "-Quantile of Total Loss\nR = ", R))
abline(v = CI[2, "LB"], col = "red")
abline(v = CI[2, "UB"], col = "red")
This compares favorably with the CI obtained using the Empirical CDF.
# We can also get a Basic Bootstrap CI for comparison.
CI[3, "Method"] = "Basic Bootstrap"
CI[3, "Nominal_Coverage_Probability"] = 1 - alpha
CI[3, "LB"] = 2 * p_quantile - round(quantile(bootstrap_results_1$t[,1], p = 1 - alpha/2, type = 8, names = FALSE), 0)
CI[3, "UB"] = 2 * p_quantile - round(quantile(bootstrap_results_1$t[,1], p = alpha/2, type = 8, names = FALSE), 0)
Method | Nominal Coverage Probability | LB | UB | Length | Shape |
---|---|---|---|---|---|
Binomial CI on ECDF | 0.95 | 30000 | 34250 | 4250 | 2.04 |
Percentile Bootstrap | 0.95 | 30000 | 34000 | 4000 | 1.86 |
Basic Bootstrap | 0.95 | 28800 | 32800 | 4000 | 0.54 |