flocker is an R package for fitting occupancy models. To date, software for occupancy modeling has required users either to work directly with probabilistic programming languages like Stan or JAGS, or to restrict themselves to simple effects structures in packages like unmarked. flocker changes that, providing occupancy modelers with a simple formula-based syntax for sophisticated model structures. Based on highly optimized Stan code, flocker is also fast, especially for large models.

flocker is built on R package brms, which in turn is a front-end for Stan. Thus, mastering flocker is mostly a matter of mastering the formula syntax available in brms.

In the remainder of this vignette, we

• define some useful terms
• explain how to format data for use with flocker
• provide an overview of brms formula syntax, with links to additional documentation for advanced topics
• illustrate how users can (and should!) specify their own priors
• review flocker’s functionality for posterior prediction and model comparison.

## Installation and feedback

Installation instructions are available here. To request features or report bugs (much appreciated!), please open an issue on GitHub.

## Terms and defintions

The following terms feature importantly in this vignette. Some are not standard in the literature (but we think maybe they should be):

closure-unit: The groupings of observations over which closure is assumed. In single-species models, a closure-unit corresponds to a “site” or “point”. In multi-species models, a closure-unit is a species-site combination. In single-species dynamic models (not yet implemented in flocker), a closure-unit is a site-season or site-year combination.

Z: The (unobserved) true occupancy state of each closure-unit. We can represent Z as a vector of ones and zeros with one element for each closure-unit: a one if occupied; a zero if unoccupied.

$$\boldsymbol{\psi}$$, $$\boldsymbol{\theta}$$: The occupancy ($$\psi$$) and detection ($$\theta$$) probabilities. In many models, both $$\psi$$ and $$\theta$$ will vary across closure-units. In some models $$\theta$$ will additionally vary across repeated sampling events within a closure-unit.

Q The (observed) detection/nondetection state of each closure-unit (i.e. does the unit have at least one detection in the data or not). As for Z, we represent Q as a vector of ones and zeros.

rep-constant, rep-varying: We refer to models where $$\theta$$ is constant across repeated sampling events within closure-units as rep-constant models, as contrasted with rep-varying models that incorporate event-specific detection covariates. It turns out that rep-constant models enable a more efficient parametrization of the likelihood than rep-varying models.

unit covariates, event covariates: We refer to any covariate that does not vary across sampling events within closure-units as a “unit covariate”. This includes covariates that are intrinsically properties of single closure-units (e.g. the elevations of sites in a single-species model), covariates that are intrinsically properties of groups of closure units (e.g. elevations of sites in a multispecies model), and covariates that are intrinsically properties of sampling events but happen to be constant within all closure-units (e.g. observer in a sampling design where every site is visited by exactly one observer). We refer to any covariate that varies across sampling events within covariates as an “event covariate”. Note that while unit covariates may appear in either the occupancy or the detection formula, event covariates are restricted to the detection formula. Models that incorporate event covariates are rep-varying (see above); those that do not are rep-constant.

## Data formatting

The main function in flocker for fitting occupancy models, called flock(), expects a highly specific and somewhat peculiar data format. The function make_flocker_data() formats data for use with flock() automatically. At a minimum, make_flocker_data() expects a matrix or dataframe of detection/non-detection data. Rows represent closure-units, columns represent repeated sampling events within closure-units, and entries must be 0 (nondetection), 1 (detection), or NA (no corresponding sampling event). The data must be formatted so that all NAs are trailing within their rows. For example, if some units were sampled four times and other three times, the three sampling events must be treated as events 1, 2, and 3 (with the fourth event NA) rather than as events 1, 3, and 4 (with the second event NA) or any other combination.

Many occupancy models also include covariates that influence occupancy or detection probabilities. Unit covariates (see Terms and definitions above) can be passed to make_flocker_data() as a dataframe with the same number of rows as the observation matrix and data in the same order as the rows of the observation matrix. Columns are covariates, and we recommend using informative column names. Event covariates (see Terms and definitions above) can be passed as a named list of matrices whose elements [i, j] are the covariate values for the sampling event represented by the corresponding position of the observation matrix. Again, we recommend using informative names for the list elements. If the corresponding observation is NA, then the value of the event covariate does not matter.

Here’s an example of how we format data, using example data provided via the example_flocker_data() function:

library(flocker)
ex_data <- example_flocker_data()
names(ex_data)
#> [1] "obs"        "unit_covs"  "event_covs"
names(ex_data$event_covs) #> [1] "ec1" "ec2" head(ex_data$obs) # observation matrix
#>      [,1] [,2] [,3] [,4]
#> [1,]    0    0    0    0
#> [2,]    0    0    0    0
#> [3,]    0    0    0    0
#> [4,]    0    0    0    0
#> [5,]    0    0    0    1
#> [6,]    0    0    0    0
head(ex_data$unit_covs) # observation matrix #> uc1 uc2 grp species #> 1 -0.5604756 0.4264642 7 sp_1 #> 2 -0.5604756 0.4264642 3 sp_2 #> 3 -0.5604756 0.4264642 15 sp_3 #> 4 -0.5604756 0.4264642 5 sp_4 #> 5 -0.5604756 0.4264642 8 sp_5 #> 6 -0.5604756 0.4264642 19 sp_6 head(ex_data$event_covs$ec1) #> [,1] [,2] [,3] [,4] #> [1,] 0.3215200 0.1058816 -0.8922459 1.7166491 #> [2,] 0.3215200 1.7166491 0.1058816 -0.8922459 #> [3,] 0.3215200 -0.8922459 1.7166491 0.1058816 #> [4,] 0.3215200 0.1058816 -0.8922459 1.7166491 #> [5,] 0.1058816 0.3215200 -0.8922459 1.7166491 #> [6,] 0.3215200 0.1058816 1.7166491 -0.8922459 flocker_data <- make_flocker_data(obs = ex_data$obs,
unit_covs = ex_data$unit_covs, event_covs = ex_data$event_covs)

## Model formulas

Once we’ve formatted data with make_flocker_data(), we are ready to fit an occupancy model using the flock() function. Internally, flock calls brms::brm(), and the key to mastering flock() is to master the formula synax from brms. We supply formulas for both occupancy and detection. Simple formulas follow the same syntax as R’s lm() function. For example:

flock(f_occ = ~ uc1,
f_det = ~ 1,
flocker_data = flocker_data)

### Random effects

Simple random effects follow lme4 syntax, including advanced lme4 syntax is supported, including || for uncorrelated effects and / and : for expansion of multiple grouping terms. Here’s a simple example:

flock(f_occ = ~ uc1 + (1 | species),
f_det = ~ 1,
flocker_data = flocker_data)

When a model includes multiple random effects with the same grouping term, by default they are modeled as correlated within the occupancy or detection formulas, but as uncorrelated between formulas. For example, the code below estimates a single correlation for the intercept and slope in the occupancy sub-model.

flock(f_occ = ~ uc1 + (1 + uc1 | species),
f_det = ~ ec1 + (1 | species),
flocker_data = flocker_data)

However, this assumption can easily be relaxed using the |<ID>| syntax from brms. The ID is an arbitrary character string representing a group of terms to model as correlated. The below code, for example, models correlated intercepts in the occupancy and detection sub-models, and correlated effects of sc1 on occupancy and vc1 on detection, but no correlations between the intercepts and the slopes in either sub-model:

flock(f_occ = ~ uc1 + (1 |g1| species) + (0 + uc1 |g2| species),
f_det = ~ ec1 + (1 |g1| species) + (0 + ec1 |g2| species),
flocker_data = flocker_data)

For more on brms syntax for random effects syntax, see the documentation here.

Via brms, flocker supports mgcv syntax for thin-plate regression splines (brms::s()) and tensor product smooths (brms::t2()). For example:

flock(f_occ = ~ s(uc1),
f_det = ~ 1,
flocker_data = flocker_data)

brms is capable of fitting a variety of additional effects structures. We believe that the following structures should translate directly to flocker, but these remain untested. As we test them and verify adequate performance, we will update this vignette with examples.

#### Phylogenetic models

Phylogenetic effects can be included by providing a covariance matrix as a data2 argument and using the brms::gr() function to link species identities in flocker_data with the supplied covariance matrix. Note that phylogenetic effects can be included in either the occupancy component, the detection component, or both!

# simulate an example phylogeny
phylogeny <- ape::rtree(30, tip.label = paste0("sp_", 1:30))

# calculate covariance matrix
A <- ape::vcv.phylo(phylogeny)

ff1 <- flock(f_occ = ~ 1 + (1|gr(species, cov = A)),
f_det = ~  1 + ec1 + (1|species),
flocker_data = flocker_data,
data2 = list(A = A))

ff2 <- flock(f_occ = ~ 1 + (1|gr(species, cov = A)),
f_det = ~  1 + ec1 + (1|gr(species, cov = A)),
flocker_data = flocker_data,
data2 = list(A = A))

See here for further details about specifying phylogenetic effects in brms.

#### Spatial autoregressive models

See here for details about conditional autoregressive (CAR) models in brms. Note that if the spatial effect is applied to occupancy, it is essential closure-units be grouped such that many groups contain more than one unit. With just one unit per group (the brms default if no grouping is supplied), the logit-scale residual is not identified. Note that flock() directly accepts a data2 argument that it can pass to brms as necessary.

#### Monotonic effects

See here for relevant brms documentation.

#### Measurement error

See here for relevant brms documentation.

flock will pass any relevant parameters forward to brms::brm(), giving the user important control over the algorithmic details of how the model is fit. See ?brms::brm for details. To speed up the execution, we recommend supplying the argument backend = "cmdstanr". This requires the cmdstanr package and a working installation of cmdstan; see here for instructions to get started and further details.

## Prior specification

Priors can be implemented as they would with any brms model. Priors can be specified using set_prior(), with priors specified for groups of parameters (via class) or individual parameters (via coef). The priors used for a particular model can be retrieved using prior_summary()

user_prior <- c(brms::set_prior("normal(0, 3)", class="b"),
brms::set_prior("normal(0, 2)", class="Intercept"),
brms::set_prior("normal(0, 1)", coef="ec1"))

ff <- flock(f_occ = ~ uc1 + uc2 + (1|species),
f_det = ~  ec1 + ec2 + (1|species),
flocker_data = flocker_data,
prior = user_prior)

brms::prior_summary(ff)

Note that if there are parameters shared between both the occupancy and detection model formulas, e.g.

ff <- flock(f_occ = ~ uc1 + (1|species),
f_det = ~ uc1 + ec2 + (1|species),
flocker_data = flocker_data, prior=user_prior)

then there will be two entries for each of the shared parameters in the prior table (uc1 in this example). Specifying a prior for each parameter individually can be done with reference to the dpar column, e.g.:

user_prior <- c(brms::set_prior("normal(0, 1)", coef = "uc1"),
brms::set_prior("normal(0, 3)", coef = "uc1", dpar = "occ"))

where the uc1 parameter in the occupancy component is specified by the addition of the dpar argument, and the uc1 parameter in the detection component is specified without reference to dpar.

For more on priors in brms, see ?brms::set_prior.

## Post-processing

flocker provides functions for three main types of post-processing. get_Z() provides the posterior distribution for the latent occupancy state. predict_flocker() provides posterior predictions at the observed points (e.g. for use in posterior predictive checking) or for new data. loo_flocker() and loo_compare_flocker() both provide functionality for model comparison. See below for details on all three types of post-processing. Both posterior predictions and model comparison rely on subtle aspects of the occupancy model likelihood that we explain in more detail here.

### Posterior Z

The function get_Z() returns the posterior distribution of occupancy probabilities across the closure-units. The output is a matrix where rows are posterior iterations, columns are closure-units, and values are draws from the posterior distribution of occupancy probabilities. For example:

ff <- flock(f_occ = ~ uc1,
f_det = ~ 1,
flocker_data = flocker_data)
get_Z(ff)

get_Z() accepts several additional arguments that control the way that posterior is obtained and the values of returned. See ?get_Z for details.

### Posterior predictions

The funtion predict_flocker() provides posterior predictions. By default, predictions are provided for the covariate data to which the model were fit, but predictions to new data are also possible via the new_data argument. The output differs for rep-constant and rep-varying models. For rep-constant models, a matrix where rows are iterations, columns are units, and values are the number of detections. For rep-varying models, an array whose first dimension is units, second dimension is sampling events, third dimension is iterations, and values are 1, 0, or NA, representing detection, nondetection, and no corresponding sampling event. For example:

ff <- flock(f_occ = ~ uc1,
f_det = ~ 1,
flocker_data = flocker_data)
predict_flocker(ff)

predict_flocker() accepts several additional arguments that control the way that posterior is obtained and the values of returned. See ?predict_flocker for details.

### Model comparison

flocker supports computationally efficient approximate leave-one-out cross-validation via R package loo using a method commonly known as PSIS-LOO. This method generally provides superior performance to other computationally efficient model performance criteria (e.g. WAIC), and it comes with diagnostics that signal when the approximation is unreliable. Importantly, when diagnostics indicate that PSIS-LOO is unreliable, other model-comparison metrics such as WAIC are even less reliable. For details about PSIS-LOO, see ?loo::loo package documentation and:

Vehtari, A., Gelman, A., & Gabry J. (2016). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. In Statistics and Computing, doi:10.1007/s11222-016-9696-4. arXiv preprint arXiv:1507.04544.

Vehtari, A., Simpson, D., Gelman, A., Yao, Y., and Gabry, J. (2019). Pareto smoothed importance sampling. preprint arXiv:1507.02646

The most straightforward way to compare models fit with flocker is the function loo_compare_flocker(). This function takes a list of flocker_fit objects as its argument and returns a model comparison table based on the difference in the expected log predictive density (elpd) between models. This table is a compare.loo object from loo::loo_compare(). The “leave-one-out” holdouts consist of entire closure-units, not single sampling events (see here for details of why).

ff1 <- flock(f_occ = ~ uc1,
f_det = ~ 1,
flocker_data = flocker_data)

ff2 <- flock(f_occ = ~ uc1,
f_det = ~ ec1 + ec2,
flocker_data = flocker_data)

loo_compare_flock(list(ff1 = ff1, ff2 = ff2))

Flocker also provides the function loo_flocker() to return a table of elpd_loo, p_loo, and looic estimates from loo::loo() or brms::loo() (the former for rep-varying models, the latter for rep-constant models).

For more about PSIS-LOO with flocker_fit objects, see flocker’s LOO vignette.