Occupancy model likelihoods: advanced topics

Occupancy models are an important class of models for biological survey data. Here, I discuss some subtleties of the likelihood function for occupancy models and the implications for posterior predictive checking and cross-validation. I assume basic familiarity with occupancy models in both their “familiar” and marginalized parametrizations. See here for basic background on occupancy models and here for an explanation of the marginalized parametrization.

Mind your Z’s and Q’s

We can think of an occupancy model as a pair of latent logistic regressions, one for occupancy and one for detection, that are united by a likelihood term that depends on both linear predictors.¹ In the “familiar” parametrization, we introduce a latent quantity \(Z_i\) that is a binary indicator for the unknown true occupancy state within a closure-unit \(i\) (i.e. a site, or in a multispecies model a species-site, or in a dynamic model a site-year). The familiar parametrization looks something like \[ \begin{eqnarray} Z_i &\sim& \mathrm{Bernoulli}(\psi_i) \\ y_{ij} &\sim& \mathrm{Bernoulli}(Z_i*\theta_{ij}) \\ \end{eqnarray} \] To be very explicit about how the model involves a pair of logistic regressions, we can re-express the model as \[ \begin{eqnarray} Z_i &\sim& \mathrm{Bernoulli}(\psi_i) \\ y_{ij} &\sim& \mathrm{Bernoulli}(\theta_{ij}) \\ D_{ij} &=& Z_i*D_{ij} \end{eqnarray} \] where \(\psi_i\) and \(\theta_{ij}\) are probabilities derived from the occupancy regression and detection regression, respectively. Note that in the absence of covariates that vary across repeated sampling events \(j\) within a closure-unit \(i\), we can use a binomial sufficient statistic in place of the repeated Bernoulli term for each sampling event. In R package flocker, we call these binomial models “rep-constant”, as opposed to “rep-varying” models that do not admit a binomial sufficient statistic. Regardless of whether the model is rep-constant or rep-varying, the structure as a pair of logistic regressions remains.

To marginalize and eliminate the unobserved \(Z\) from the likelihood, we introduce an observed indicator variable \(Q\) to indicate whether the closure-unit had at least one detection. I explain this marginalization in detail and provide examples in both JAGS and Stan here ². The marginalized likelihood is easier to express in mathematical form as an explict function rather than via JAGS-style sampling statements (though it can be coded in JAGS using sampling statements!). Below, \(\mathrm{BernoulliP}(a | b)\) is the Bernoulli probability mass function for observed data \(a\) given probability parameter \(b\): \[ \begin{equation} \mathcal{L_i} = \begin{cases} \mathrm{BernoulliP}(1|\psi_i)*\prod_{j=1}^n\left(\mathrm{BernoulliP}(y_{ij}|\theta_{ij})\right) & \text{if}\ Q_i=1 \\ \mathrm{BernoulliP}(0 | \psi_i) + \mathrm{BernoulliP}(1|\psi_i)*\prod_{j=1}^n\left(\mathrm{BernoulliP}(y_{ij}|\theta_{ij})\right) & \text{if}\ Q_i=0 \end{cases} \end{equation} \] The marginalized likelihood lays bare three crucial subtleties of occupancy model likelihoods that I believe are underappreciated by practitioners.

Subtleties I: the pointwise likelihood

The pointiwse likelihood of an occupancy model intrinsically exists at the level of closure-units, not at the level of repeated sampling events. The reason for this is simply that the likelihood does not factor to independent multiplicative contributions at the level of individual sampling events. See for yourself; attempt to factor \(\mathcal{L_i}\) above. You can’t.

As a result, cross-validation (even leave-one-out cross-validation) is properly performed by withholding data at the level of closure-units, not at the level of repeated sampling events.

Subtleties II: likelihood, parameters, and “unobserved data”

The likelihood in an occupancy model does not condition on \(Z\). Based on the “familiar” parametrization in JAGS, it is tempting to view \(Z\) as a parameter, and indeed the literature often refers to \(Z\) as a parameter. But \(Z\) is not a parameter in the traditional sense, and the likelihood does not condition on \(Z\). The easiest way to see that conditioning the likelihood on \(Z\) leads to problems is to consider what would happen in cross-validation at closure-units were \(Q_i=1\). At these closure-units, the posterior distribution for \(Z_i\) is always a point mass on 1, and so conditioning the likelihood on \(Z_i\) does not penalize at all for estimating arbitrarily low occupancy probabilities. Therefore, cross-validation based on a “likelihood” that conditions on \(Z\) would favor models that estimate the lowest occupancy probabilities possible (irrespective of the true occupancy probabilities), because those models would lead to higher “likelihoods” at points with no detections, with no penalty at points with detections.

So if \(Z\) isn’t a parameter, then what is it? Some people call \(Z\) unobserved data, and I like this convention. We can distinguish between the so-called full data likelihood \(P(y, Z | \psi, \theta)\) (which is unobserved and unknown when \(Q_i=0\)), the Z-conditioned likelihood \(P(y|Z,\psi,\theta)\), and the plain old likelihood \(\mathcal{L_i} = P(y | \psi, \theta)\). The full data likelihood is generally unknown, the Z-conditioned likelihood is generally not useful, and only the plain old likelihood is needed for evaluating occupancy models.

Subtleties III: Posterior predictive distributions

A crucial point is that the posterior predictive distribution for the observed data \(y\) uses the likelihood, not the Z-conditioned likelihood. This should make intuitive sense; a posterior predictive check ought not to have access to absolutely certain information about the observed data (in this case \(Q\), which can be directly inferred from the posterior distribution for \(Z\) but is clearly observed data, known only a posteriori). The notion that the posterior predictive distribution must not use the Z-conditioned likelihood is standard for zero-inflated models (Poissons, binomials, etc). Here’s an example zero-inflated binomial (which is exactly equivalent to a rep-constant occupancy model) using brms

library(brms)

## Warning: package 'brms' was built under R version 4.1.1

## Loading required package: Rcpp

## Loading 'brms' package (version 2.16.1). Useful instructions
## can be found by typing help('brms'). A more detailed introduction
## to the package is available through vignette('brms_overview').

## 
## Attaching package: 'brms'

## The following object is masked from 'package:stats':
## 
##     ar

N <- 100 # number of closure-units
nrep <- 15 # number of repeat sampling events (a large number is useful for illustration)
p <- .8 # detection probability
zi <- .5 # 1 minus occupancy probability

nzi <- rbinom(1, N, zi) # number of unoccupied units
nBinomial <- N - nzi # number of occupied units

mydata <- data.frame(y = c(rbinom(nBinomial, nrep, p), rep(0, nzi)),
                     trials = nrep)

fit <- brm(y | trials(trials) ~ 1, family = zero_inflated_binomial(), data = mydata,
                 backend = 'cmdstanr', refresh = 0)

## Start sampling

## Running MCMC with 4 sequential chains...
## 
## Chain 1 finished in 0.2 seconds.
## Chain 2 finished in 0.1 seconds.
## Chain 3 finished in 0.1 seconds.
## Chain 4 finished in 0.1 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.1 seconds.
## Total execution time: 2.0 seconds.

head(posterior_predict(fit)[, 1:10])

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    0   14   13   12    0    0    0   11     9
## [2,]    0    0    0    0   14   14   13    0   12     0
## [3,]    0    0   13   14    0   14   13    0   11    12
## [4,]   12   13   12    0   14    0    0    0    0     0
## [5,]    0   13    0    0    0   10    0    0    0    11
## [6,]    0    0    0   12   13   13   11   14   13     0

tail(posterior_predict(fit)[, 1:10])

##         [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [3995,]   12   14   14   14   12    0   14   13   10    11
## [3996,]   10   12    0    0   14   13   12    0   10    12
## [3997,]   13   10    0    0    0    0   11   13   12    12
## [3998,]    0    0   12   13   12   13    0   14   12     0
## [3999,]    0    0   12   12   11   10   10   13   12    14
## [4000,]    0   11    0   13    0   12   12    0   12     0

If brms were conditioning the posterior predictive distribution on the latent zero-inflation state, we should see nonzero predictions for the first half of the posterior predictions followed by exclusively zero predictions for the second half. But brms does not condition the posterior predictive distribution on \(Z\), and we see zero and non-zero predictions scattered at random.

The posterior distribution for Z

The posterior distribution for \(Z\) remains well defined, and it encodes information about \(Q\) with certainty. In the non-marginalized model, the posterior recovered for the latent state \(Z\) directly from model fitting is correct (provided that the MCMC fitting converges properly). In the marginalized model, the posterior distribution for \(Z_i\) is one with probability \[ \begin{equation} P(Z_i = 1) = \begin{cases} 1 & \text{if}\ Q_i=1 \\ \frac{\psi_i * \prod_{j = 1}^{n} (1 - \theta_{ij})}{(1 - \psi_i) + \psi_i * \prod_{j = 1}^{n} (1 - \theta_{ij})} & \text{if}\ Q_i=0 \end{cases} \end{equation} \]

Note that we cannot obtain the posterior distribution for \(Z\) by simply simulating occupancy based on \(\psi\). The posterior distribution for \(Z\) conditions on \(\psi\), \(\theta\), and \(Q\).

The pair of regressions might optionally also be united by shared parameters, such as covariance parameters for multivariate random effects that span both sub-models.↩︎
Note that in the JAGS code at my explanation of marginalization, \(Q\) shows up implicitly; it’s encoded via the ordering of the closure-units in the data, rather than as an explicit integer vector.↩︎