This is an autoregressive classification model for continuous data. It does "direct" forecasting, meaning that it estimates a class at a particular target horizon.
Usage
arx_classifier(
epi_data,
outcome,
predictors,
trainer = logistic_reg(),
args_list = arx_class_args_list()
)
Arguments
- epi_data
An
epi_df
object- outcome
A character (scalar) specifying the outcome (in the
epi_df
). Note that as witharx_forecaster()
, this is expected to be real-valued. Conversion of this data to unordered classes is handled internally based on thebreaks
argument toarx_class_args_list()
. If discrete classes are already in theepi_df
, it is recommended to code up a classifier from scratch usingepi_recipe()
.- predictors
A character vector giving column(s) of predictor variables. This defaults to the
outcome
. However, if manually specified, only those variables specifically mentioned will be used. (Theoutcome
will not be added.) By default, equals the outcome. If manually specified, does not add the outcome variable, so make sure to specify it.- trainer
A
{parsnip}
model describing the type of estimation. For now, we enforcemode = "classification"
. Typical values areparsnip::logistic_reg()
orparsnip::multinom_reg()
. More complicated trainers likeparsnip::naive_Bayes()
orparsnip::rand_forest()
can also be used.- args_list
A list of customization arguments to determine the type of forecasting model. See
arx_class_args_list()
.
Value
A list with (1) predictions
an epi_df
of predicted classes
and (2) epi_workflow
, a list that encapsulates the entire estimation
workflow
Details
The arx_classifier()
is an autoregressive classification model for epi_df
data that is used to predict a discrete class for each case under
consideration. It is a direct forecaster in that it estimates the classes
at a specific horizon or ahead value.
To get a sense of how the arx_classifier()
works, let's consider a simple
example with minimal inputs. For this, we will use the built-in
covid_case_death_rates
that contains confirmed COVID-19 cases and deaths
from JHU CSSE for all states over Dec 31, 2020 to Dec 31, 2021. From this,
we'll take a subset of data for five states over June 4, 2021 to December
31, 2021. Our objective is to predict whether the case rates are increasing
when considering the 0, 7 and 14 day case rates:
jhu <- covid_case_death_rates %>%
filter(
time_value >= "2021-06-04",
time_value <= "2021-12-31",
geo_value %in% c("ca", "fl", "tx", "ny", "nj")
)
out <- arx_classifier(jhu, outcome = "case_rate", predictors = "case_rate")
out$predictions
#> # A tibble: 5 x 4
#> geo_value .pred_class forecast_date target_date
#> <chr> <fct> <date> <date>
#> 1 ca (-Inf,0.25] 2021-12-31 2022-01-07
#> 2 fl (-Inf,0.25] 2021-12-31 2022-01-07
#> 3 nj (-Inf,0.25] 2021-12-31 2022-01-07
#> 4 ny (-Inf,0.25] 2021-12-31 2022-01-07
#> 5 tx (-Inf,0.25] 2021-12-31 2022-01-07
The key takeaway from the predictions is that there are two prediction
classes: (-Inf, 0.25]
and (0.25, Inf)
. This is because for our goal of
classification the classes must be discrete. The discretization of the
real-valued outcome is controlled by the breaks
argument, which defaults
to 0.25
. Such breaks will be automatically extended to cover the entire
real line. For example, the default break of 0.25
is silently extended to
breaks = c(-Inf, .25, Inf)
and, therefore, results in two classes:
[-Inf, 0.25]
and (0.25, Inf)
. These two classes are used to discretize
the outcome. The conversion of the outcome to such classes is handled
internally. So if discrete classes already exist for the outcome in the
epi_df
, then we recommend to code a classifier from scratch using the
epi_workflow
framework for more control.
The trainer
is a parsnip
model describing the type of estimation such
that mode = "classification"
is enforced. The two typical trainers that
are used are parsnip::logistic_reg()
for two classes or
parsnip::multinom_reg()
for more than two classes.
workflows::extract_spec_parsnip(out$epi_workflow)
#> Logistic Regression Model Specification (classification)
#>
#> Computational engine: glm
From the parsnip model specification, we can see that the trainer used is
logistic regression, which is expected for our binary outcome. More
complicated trainers like parsnip::naive_Bayes()
or
parsnip::rand_forest()
may also be used (however, we will stick to the
basics in this gentle introduction to the classifier).
If you use the default trainer of logistic regression for binary
classification and you decide against using the default break of 0.25, then
you should only input one break so that there are two classification bins
to properly dichotomize the outcome. For example, let's set a break of 0.5
instead of relying on the default of 0.25. We can do this by passing 0.5 to
the breaks
argument in arx_class_args_list()
as follows:
out_break_0.5 <- arx_classifier(
jhu,
outcome = "case_rate",
predictors = "case_rate",
args_list = arx_class_args_list(
breaks = 0.5
)
)
#> Warning: glm.fit: algorithm did not converge
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
out_break_0.5$predictions
#> # A tibble: 5 x 4
#> geo_value .pred_class forecast_date target_date
#> <chr> <fct> <date> <date>
#> 1 ca (-Inf,0.5] 2021-12-31 2022-01-07
#> 2 fl (-Inf,0.5] 2021-12-31 2022-01-07
#> 3 nj (-Inf,0.5] 2021-12-31 2022-01-07
#> 4 ny (-Inf,0.5] 2021-12-31 2022-01-07
#> 5 tx (-Inf,0.5] 2021-12-31 2022-01-07
Indeed, we can observe that the two .pred_class
are now (-Inf, 0.5] and
(0.5, Inf). See help(arx_class_args_list)
for other available
modifications.
Additional arguments that may be supplied to arx_class_args_list()
include
the expected lags
and ahead
arguments for an autoregressive-type model.
These have default values of 0, 7, and 14 days for the lags of the
predictors and 7 days ahead of the forecast date for predicting the
outcome. There is also n_training
to indicate the upper bound for the
number of training rows per key. If you would like some practice with using
this, then remove the filtering command to obtain data within "2021-06-04"
and "2021-12-31" and instead set n_training
to be the number of days
between these two dates, inclusive of the end points. The end results
should be the same. In addition to n_training
, there are forecast_date
and target_date
to specify the date that the forecast is created and
intended, respectively. We will not dwell on such arguments here as they
are not unique to this classifier or absolutely essential to understanding
how it operates. The remaining arguments will be discussed organically, as
they are needed to serve our purposes. For information on any remaining
arguments that are not discussed here, please see the function
documentation for a complete list and their definitions.
Examples
tiny_geos <- c("as", "mp", "vi", "gu", "pr")
jhu <- covid_case_death_rates %>%
filter(time_value >= as.Date("2021-11-01"), !(geo_value %in% tiny_geos))
out <- arx_classifier(jhu, "death_rate", c("case_rate", "death_rate"))
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
out <- arx_classifier(
jhu,
"death_rate",
c("case_rate", "death_rate"),
trainer = parsnip::multinom_reg(),
args_list = arx_class_args_list(
breaks = c(-.05, .1), ahead = 14,
horizon = 14, method = "linear_reg"
)
)