## Introduction

Stan doesn’t allow discrete unknown parameters in its models1. This limitation2 prevents us from fitting occupancy models under the familiar parameterization that we might have first learned from one of Marc Kery’s excellent books. The issue is that the unknown occupancy states (the Z-matrix in community models) are discrete, and this is not allowed in Stan.

The solution is to marginalize out the discrete occupancy states. There are a few pages floating around the web that show us how to do this. I initially found those pages frustrating to understand, which is my motivation for writing this up. The Fukami Lab page doesn’t really explain where the marginalization happens or how it works. Bob Carpenter’s translation of the Dorazio data-augmented multi-species model does the marginalization very quickly in heavy math symbology, and moreover relies on a lack of visit-specific detectability covariates to do so. To be fair, neither of these pages’ main purpose is to explain marginalization for beginners, so no complaint!

What I really wanted was a satisfying explanation of the marginalization written in JAGS/BUGS language first, and translated to Stan second. That’s what I’ve tried to produce here. We’ll start with the likelihood for a single-species, single-season model in JAGS. Then we’ll marginalize, still in JAGS. Then we’ll translate to Stan3. We’ll write the Stan implementation in such a way that it’s pretty clear how to generalize to a multi-species model. Once the reader is comfortable with the multi-species model in Stan, she should be able to follow Bob Carpenter’s data-augmented version, even if she requires visit-specific detection covariates.

## The familiar JAGS/BUGS model

The familiar parameterization of the single-species, single-season occupancy model has a likelihood of the following form:

for(i in 1:n.site){
Z[i] ~ dbern(psi[i])                                       # Line A
logit(psi[i]) <- a0 + a1*cov1[i] + a2*cov2[i]
for(j in 1:n.visit){
det.data[i,j] ~ dbern(Z[i]*theta[i,j])                   # Line B
logit(theta[i,j]) <- b0 + b1*cov1[i] + b2*cov3[i,j]
}
}

This particular specification contains two occupancy covariates and two detection covariates, one of which is site-specific and one of which is visit-specific.

## The marginalized JAGS/BUGS model

Our goal is to reparameterize the model to remove the latent occupancy state $$Z$$. To do so, we need to keep track of sites with at least one detection versus sites with none. For now, we can simply sort the sites in our data such that all $$N$$ sites with at least one detection come first, followed by the $$n.site - N$$ sites with no detections. We then split the likelihood into two parts—one for sites with at least one detection, and another for sites with no detections.

### Sites with a detection

We know that $$Z == 1$$. Thus, we can write lines A and B of the likelihood above as:

ones[i] ~ dbern(psi[i])             # Line A; ones is simply a vector of ones.
det.data[i,j] ~ dbern(theta[i,j])   # Line B

This formulation has no $$Z$$ in it and thus no discrete unknowns!

### Sites with no detection

Things are a bit trickier, because we don’t know whether $$Z$$ is one or zero. We do know, however, that $$Z$$ is either one or zero, but not both. Thus, the general strategy will be to write down the likelihood when $$Z == 0$$, then to write down the likelihood when $$Z == 1$$, and then to add those likelihoods together4.

One way to observe the data is via a pathway where $$Z == 0$$. When $$Z == 0$$, line B does not matter, and only line A contributes to the likelihood. So the likelihood associated with this pathway corresponds to the statement:

zeros[i] ~ dbern(psi[i])            # Line A

or equivalently

ones[i] ~ dbern(1 - psi[i])         # Line A

The other way to observe the data is via a pathway where $$Z == 1$$. In that case, both lines matter. Note that because the site has no detections, $$det.data[i,j]$$ is always zero, so we can write:

ones[i] ~ dbern(psi[i]).        # Because Z=1;            line A
zeros[i] ~ dbern(theta[i,j])    # Because x[i,j] is zero    line B

We rewrite the second line as:

ones[i] ~ dbern(1 – theta[i,j])         # line B

Since detection (conditioned on covariates) on successive visits is independent5, we can multiply the probabilities6. Note that we eliminate the loop over j and replace it with an explicit term for each visit.

ones[i] ~ dbern(psi[i] * (1 – theta[i,1]) * (1 - theta[i,2]) * ... * (1 - theta[i,n.visit]))            # line B

Finally, we can combine the two cases ($$Z == 1$$ and $$Z == 0$$) in a single sampling statement by adding the probabilities7.

ones[i] ~ dbern(1-psi[i] + psi[i] * (1-theta[i,1]) * (1-theta[i,2]) * ... * (1-theta[i,n.visit]))  # lines A & B

Putting it all together yields a full marginalized JAGS likelihood, shown here for a model with two occupancy covariates, two detection covariates, and four visits to each site.

for(i in 1:N){
ones[i] ~ dbern(psi[i])
logit(psi[i]) <- a0 + a1*cov1[i] + a2*cov2[i]
for(j in 1:n.visit){
det.data[i,j] ~ dbern(theta[i,j])
logit(theta[i,j]) <- b0 + b1*cov1[i] + b2*cov3[i,j]
}
}
for(i in (N+1):n.site){
ones[i] ~ dbern(1-psi[i] + psi[i] * (1-theta[i,1]) * (1-theta[i,2]) * (1-theta[i,3]) * (1-theta[i,4]))
logit(psi[i]) <- a0 + a1*cov1[i] + a2*cov2[i]
for(j in 1:n.visit){
logit(theta[i,j]) <- b0 + b1*cov1[i] + b2*cov3[i,j]
}
}

Note that there are no $$Z$$’s in this likelihood!

In my view, the key trick in the marginalization is splitting the likelihood at sites without detections into separate cases for $$Z == 1$$ and $$Z == 0$$, and then writing down a single statement that subsumes both paths towards observing a zero. Syntactically, this shows up as the elimination of sampling statements from the for-loop over the visits (the j-indexed for-loop over visits exists only to compute theta)8. The crux of the marginalization is to remove $$Z$$ while nevertheless ensuring that the likelihood of the data for a single site cannot receive simultaneous contributions from $$Z == 0$$ on one visit and $$Z == 1$$ on another visit.

Once we fit the model, we can still recover our posterior Z-matrix. Posterior $$Z$$ is one at every site with at least one detection. At every other site, the likelihood for $$Z == 1$$ is $$\psi_i * \prod_{j = 1}^{n.visit} (1 - \theta_{ij})$$, whereas the likelihood for $$Z == 0$$ is $$1 - \psi_i$$. Thus, posterior Z is one with probability $$\frac{\psi_i * \prod_{j = 1}^{n.visit} (1 - \theta_{ij})}{(1 - \psi_i) + \psi_i * \prod_{j = 1}^{n.visit} (1 - \theta_{ij})}$$ and zero otherwise. So we just monitor psi and theta and compute the Z-matrix post-hoc.

## The Stan model

### A very quick Stan primer

This section assumes that the reader knows how to run a Stan program on their computer (e.g. via rstan), a bit of Stan syntax, the general structure of a Stan program, and how to search the Stan manual if you encounter an unfamiliar function like log1m(x). However, because the code below uses a couple of fundamental features of Stan that might be unfamiliar to a novice user who has simply been porting models over from JAGS, I’ll quickly review some key points.

The crux of Stan’s functionality is to compute the log-likelihood function and its derivatives. The crux of a Stan program is to tell Stan what the log-likelihood function is. Each line of the ‘model block’ in a Stan program is interpreted as a way to increment the log-likelihood. So for example, if we write

1 ~ bernoulli(p1);
0 ~ bernoulli(p2);

Stan interprets that to mean that it should increment the log-likelihood by $$log(p1) + log(1-p2)$$.

Stan also contains functionality to directly increment the log-likelihood using the nifty function target +=. So we could rewrite the lines above as

target += log(p1);
target += log(1-p2);

Or equivalently

target += log(p1) + log(1-p2);

Or we can mix and match, for example

target += log(p1);
0 ~ bernoulli(p2);

All of these alternatives tell Stan exactly the same thing9.

Another thing to know about Stan is that it contains an ‘if’ statement. The ‘if’ statement isn’t necessary to translate the marginalized single-species JAGS model to Stan, but it will be quite useful for a multi-species model10. Therefore, we’ll use the ‘if’ statement from the very beginning.

### The Stan likelihood

To take advantage of the ‘if’ statement, we generate a new data variable Q such that $$Q[i] = 1$$ if there is at least one detection at site i, and $$Q[i] = 0$$ otherwise. (We generate Q external to Stan, in our R session or whatever we’re using). Then the likelihood looks like this:

for(i in 1:n_site){
if(Q[i] == 1) {
target += log(psi[i]);
target += bernoulli_lpmf(det_data[i] | theta[i]); //vectorized over visits j
}
if(Q[i] == 0) {
target += log_sum_exp(log(psi[i]) + log1m(theta[i,1]) + log1m(theta[i,2]) + log1m(theta[i,3]) + log1m(theta[i,4]),
log1m(psi[i]));
}
}

Above, the first target += statement is the 1 ~ dbern(psi) statement; the second target += is the det.data[i,j] ~ dbern(theta[i,j]) statement; and the third target += is the full contribution to the likelihood for a site where $$Q[i] == 0$$. The function log_sum_exp() takes two arguments. Here, the first argument captures the contribution of the $$Z == 1$$ possibility, and the second argument captures the contribution of the $$Z == 0$$ possibility.

## Testing and benchmarking the parameterizations

It doesn’t really make sense to compare the Stan runtime against JAGS, because most of the Stan runtime is the one-time model compilation. However, it is quite interesting to compare the marginalized and non-marginalized JAGS models (and we show the full Stan model too, and make sure it gives the same parameter estimates!).

### Simulating covariates and data

set.seed(3)
inv.logit <- function(x){exp(x)/(1+exp(x))}
n.site <- 200
n.visit <- 4

# generate covariate values
site.covars <- data.frame(cov1 = rnorm(n.site), cov2 = rnorm(n.site))
visit.covars <- as.data.frame(matrix(rnorm(n.site * n.visit), nrow = n.site))

# specify coefficient values
occ.int <- 0
occ.beta1 <- .5     # to multiply by site.covars$cov1 occ.beta2 <- -.5 # to multiply by site.covars$cov2
det.int <- -.7
det.beta1 <- .3     # to multiply by site.covars$cov1 det.beta2 <- .3 # to multiply by visit.covars # get linear predictors psi <- inv.logit(occ.int + occ.beta1*site.covars$cov1 + occ.beta2*site.covars$cov2) theta <- matrix(NA, nrow = n.site, ncol = n.visit) for(j in 1:4){ theta[, j] <- inv.logit(det.int + det.beta1*site.covars$cov1 + det.beta2*visit.covars[, j])
}

# simulate detection histories
Z <- vector()
for(i in 1:n.site){
Z[i] <- rbinom(1, 1, psi[i])
}
det.data <- matrix(NA, nrow = n.site, ncol = n.visit)
for(i in 1:n.site){
for(j in 1:n.visit){
det.data[i, j] <- Z[i]*rbinom(1,1,theta[i, j])
}
}

### JAGS analysis: not marginalized

library(rjags)
library(dclone)
set.seed(3)
occu.model <- function() {
# Priors
a0 ~ dnorm(0,0.001)
a1 ~ dnorm(0,0.001)
a2 ~ dnorm(0,0.001)
b0 ~ dnorm(0,0.001)
b1 ~ dnorm(0,0.001)
b2 ~ dnorm(0,0.001)

# Likelihood
for(i in 1:n.site){
Z[i] ~ dbern(psi[i])
logit(psi[i]) <- a0 + a1*cov1[i] + a2*cov2[i]
for(j in 1:n.visit){
det.data[i,j] ~ dbern(Z[i]*theta[i,j])
logit(theta[i,j]) <- b0 + b1*cov1[i] + b2*cov3[i,j]
}
}
}

jags.data <- list(n.site = n.site, n.visit = n.visit, det.data = det.data,
cov1 = site.covars$cov1, cov2 = site.covars$cov2,
cov3 = visit.covars)

params <- c('a0', 'a1', 'a2', 'b0', 'b1', 'b2')

inits <- function() {
a0 <- rnorm(1, 0, .5)
a1 <- rnorm(1, 0, .5)
a2 <- rnorm(1, 0, .5)
b0 <- rnorm(1, 0, .5)
b1 <- rnorm(1, 0, .5)
b2 <- rnorm(1, 0, .5)
Z <- as.numeric(rowSums(det.data) > 0)
return(list(a0 = a0, a1 = a1, a2 = a2, b0 = b0, b1 = b1, b2 = b2, Z = Z))
}

# run model
nc <- 4
n.burn <- 1000
n.iter <- 3000
thin <- 1

occu.start.time = proc.time()
cl <- makePSOCKcluster(nc)
tmp <- clusterEvalQ(cl, library(dclone))
parListModules(cl)
occu.samples <- jags.parfit(cl, jags.data, params, occu.model, inits=inits, n.chains=nc,
n.adapt=n.adapt, n.update = n.burn, thin = thin, n.iter = n.iter)
stopCluster(cl)
occu.end.time = proc.time()
occu.dtime = occu.end.time - occu.start.time

### JAGS analysis: marginalized

# First we sort the data so that sites with a detection come first
N <- sum(rowSums(det.data) > 0)
neworder <- rev(order(rowSums(det.data)))
det.data2 <- det.data[neworder, ]
site.covars2 <- site.covars[neworder, ]
visit.covars2 <- visit.covars[neworder, ]

# Now we fit the model
margin.model <- function() {
# Priors
a0 ~ dnorm(0,0.001)
a1 ~ dnorm(0,0.001)
a2 ~ dnorm(0,0.001)
b0 ~ dnorm(0,0.001)
b1 ~ dnorm(0,0.001)
b2 ~ dnorm(0,0.001)

# Likelihood
for(i in 1:N){
ones[i] ~ dbern(psi[i])
logit(psi[i]) <- a0 + a1*cov1[i] + a2*cov2[i]
for(j in 1:n.visit){
det.data[i,j] ~ dbern(theta[i,j])
logit(theta[i,j]) <- b0 + b1*cov1[i] + b2*cov3[i,j]
}
}
for(i in (N+1):n.site){
ones[i] ~ dbern(1-psi[i] + psi[i] * (1-theta[i,1]) * (1-theta[i,2]) * (1-theta[i,3]) * (1-theta[i,4]))
logit(psi[i]) <- a0 + a1*cov1[i] + a2*cov2[i]
for(j in 1:n.visit){
logit(theta[i,j]) <- b0 + b1*cov1[i] + b2*cov3[i,j]
}
}
}

jags.data = list(n.site = n.site, n.visit = n.visit, det.data = det.data2,
cov1 = site.covars2$cov1, cov2 = site.covars2$cov2,
cov3 = visit.covars2, N = N, ones = rep(1, n.site))

params = c('a0', 'a1', 'a2', 'b0', 'b1', 'b2')

inits <- function() {
a0 <- rnorm(1, 0, .5)
a1 <- rnorm(1, 0, .5)
a2 <- rnorm(1, 0, .5)
b0 <- rnorm(1, 0, .5)
b1 <- rnorm(1, 0, .5)
b2 <- rnorm(1, 0, .5)
return(list(a0 = a0, a1 = a1, a2 = a2, b0 = b0, b1 = b1, b2 = b2))
}

nc <- 4
n.burn <- 1000
n.iter <- 3000
thin <- 1

margin.start.time = proc.time()
cl <- makePSOCKcluster(nc)
tmp <- clusterEvalQ(cl, library(dclone))
parListModules(cl)
margin.samples <- jags.parfit(cl, jags.data, params, margin.model, inits=inits, n.chains=nc,
n.adapt=n.adapt, n.update = n.burn, thin = thin, n.iter = n.iter)
stopCluster(cl)
margin.end.time = proc.time()
margin.dtime = margin.end.time - margin.start.time

### Stan analysis

library(rstan)
stan.model <- 'data {
int<lower=0> n_site; //number of sites
int<lower=0> n_visit; //number of visits
int<lower=0, upper=1> det_data[n_site, n_visit]; //detection history
real<lower=0, upper=1> Q[n_site]; //at least one detection
real cov1[n_site]; //site covariate 1
real cov2[n_site]; //site covariate 2
real cov3[n_site, n_visit]; //visit covariate
}
parameters {
real a0;
real a1;
real a2;
real b0;
real b1;
real b2;
}
transformed parameters{
real psi[n_site];
real theta[n_site, n_visit];
for(i in 1:n_site){
psi[i] = inv_logit(a0 + a1 * cov1[i] + a2 * cov2[i]);
for(j in 1:n_visit){
theta[i,j] = inv_logit(b0 + b1 * cov1[i] + b2 * cov3[i,j]);
}
}
}
model {
for(i in 1:n_site){
if(Q[i] == 1) {
target += log(psi[i]);
target += bernoulli_lpmf(det_data[i] | theta[i]); //vectorized over visits j
}
if(Q[i] == 0) {
target += log_sum_exp(log(psi[i]) + log1m(theta[i,1]) + log1m(theta[i,2]) + log1m(theta[i,3]) + log1m(theta[i,4]),
log1m(psi[i]));
}
}
}'

stan.data <- list(n_site = n.site, n_visit = n.visit, det_data = det.data,
cov1 = site.covars$cov1, cov2 = site.covars$cov2,
cov3 = visit.covars, Q = as.numeric(rowSums(det.data)>0))

nc <- 4

stan.samples <- stan(model_code = stan.model, data = stan.data, iter = 4000, chains = nc, cores = nc)

### Comparison and benchmarks

The fitted posterior distributions are essentially identical:

summary(occu.samples)
##
## Iterations = 2001:5000
## Thinning interval = 1
## Number of chains = 4
## Sample size per chain = 3000
##
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
##
##       Mean     SD Naive SE Time-series SE
## a0 -0.1668 0.2063 0.001883       0.003850
## a1  0.6328 0.2157 0.001969       0.003913
## a2 -0.5895 0.1957 0.001786       0.003139
## b0 -0.4777 0.1527 0.001394       0.003165
## b1  0.1584 0.1382 0.001261       0.002494
## b2  0.2778 0.1113 0.001016       0.001395
##
## 2. Quantiles for each variable:
##
##        2.5%      25%     50%      75%   97.5%
## a0 -0.54459 -0.30352 -0.1739 -0.03844  0.2594
## a1  0.22950  0.48830  0.6253  0.77018  1.0803
## a2 -0.98997 -0.71513 -0.5830 -0.45857 -0.2211
## b0 -0.78593 -0.57850 -0.4743 -0.37446 -0.1852
## b1 -0.10987  0.06448  0.1592  0.25013  0.4295
## b2  0.06033  0.20233  0.2770  0.35199  0.4972
summary(margin.samples)
##
## Iterations = 2001:5000
## Thinning interval = 1
## Number of chains = 4
## Sample size per chain = 3000
##
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
##
##       Mean     SD Naive SE Time-series SE
## a0 -0.1685 0.2032 0.001855       0.002887
## a1  0.6367 0.2113 0.001929       0.002969
## a2 -0.5859 0.1946 0.001776       0.002444
## b0 -0.4763 0.1531 0.001397       0.002402
## b1  0.1538 0.1377 0.001257       0.002028
## b2  0.2775 0.1126 0.001028       0.001353
##
## 2. Quantiles for each variable:
##
##        2.5%      25%     50%      75%   97.5%
## a0 -0.55208 -0.30487 -0.1736 -0.03722  0.2488
## a1  0.25017  0.48922  0.6310  0.77579  1.0736
## a2 -0.99127 -0.70953 -0.5773 -0.45270 -0.2257
## b0 -0.77638 -0.58045 -0.4734 -0.37179 -0.1781
## b1 -0.11182  0.06061  0.1541  0.24688  0.4219
## b2  0.05996  0.20123  0.2753  0.35137  0.5041
summary(stan.samples, pars = c('a0', 'a1', 'a2', 'b0', 'b1', 'b2'))\$summary
##          mean     se_mean        sd        2.5%         25%        50%
## a0 -0.1626243 0.002193053 0.2014183 -0.54666278 -0.29701201 -0.1694955
## a1  0.6379032 0.002294689 0.2148382  0.23349121  0.48960223  0.6285395
## a2 -0.5892978 0.002067211 0.1963884 -0.99680819 -0.71557615 -0.5812635
## b0 -0.4811580 0.001730436 0.1532074 -0.78024178 -0.58245710 -0.4805816
## b1  0.1567425 0.001437112 0.1373855 -0.10839888  0.06061739  0.1549959
## b2  0.2790928 0.001118686 0.1119941  0.06210818  0.20410113  0.2771369
##            75%      97.5%     n_eff      Rhat
## a0 -0.03783685  0.2574724  8435.282 0.9999507
## a1  0.77843557  1.0819310  8765.473 1.0000718
## a2 -0.45405548 -0.2259494  9025.314 0.9999146
## b0 -0.38085255 -0.1850518  7838.775 0.9998840
## b1  0.24996941  0.4251791  9139.045 1.0001419
## b2  0.35418718  0.5005714 10022.441 0.9998465

The effective sample sizes and r-hat (Gelman-Rubin) diagnostics are generally better in the marginalized JAGS model verus the “familiar” parameterization:

effectiveSize(occu.samples)
##       a0       a1       a2       b0       b1       b2
## 2911.311 3137.325 3888.757 2356.468 3092.461 6570.295
effectiveSize(margin.samples)
##       a0       a1       a2       b0       b1       b2
## 5000.131 5090.954 6364.942 4082.330 4644.523 6928.157
gelman.diag(occu.samples)
## Potential scale reduction factors:
##
##    Point est. Upper C.I.
## a0          1       1.01
## a1          1       1.00
## a2          1       1.00
## b0          1       1.01
## b1          1       1.00
## b2          1       1.00
##
## Multivariate psrf
##
## 1
gelman.diag(margin.samples)
## Potential scale reduction factors:
##
##    Point est. Upper C.I.
## a0          1          1
## a1          1          1
## a2          1          1
## b0          1          1
## b1          1          1
## b2          1          1
##
## Multivariate psrf
##
## 1

The execution time is faster in the marginalized JAGS model versus the “familiar” parameterization. On different runs and different machines, I’ve seen speedup between 5 and 19 percent, which means that for some parameters the “familiar” JAGS model takes twice as long per effective sample as the marginalized model.

occu.dtime
##    user  system elapsed
##   0.045   0.012  14.304
margin.dtime
##    user  system elapsed
##   0.017   0.010  12.363
(margin.dtime - occu.dtime)/occu.dtime
##       user     system    elapsed
## -0.6666667  0.2222222 -0.1356963

1. To be clear, data may be discrete, but unknowns (including missing data) must not be.

2. The requirement to avoid discrete unknowns turns out not to be a limitation in most cases. As we’ll see below, we actually get better performance out of JAGS/BUGS by avoiding discrete unknowns (i.e. marginalizing), which according the Stan development team allows for better exploration of the tails of the posterior. Thus, forcing us to marginalize might be more of a feature than a limitation.

3. This part will assume rudimenary familiarity with the structure of a Stan program.

4. We can add the probabilities of mutually exclusive events to find the probability that exactly one of the events occurs. In this case, two mutually exclusive events are 1) [observed data AND $$Z == 0$$] and 2) [observed data AND $$Z == 1$$]. Note that in addition to being mutually exclusive, these events are the only two ways that our observed data could have arisen, so the sum of these probabilities equals the likelihood of the observed data.

5. This is true by assumption in all “garden-variety” occupancy models.

6. It’s weird to talk about occupancy being independent of detection, but remember that in these models the detection probabilities are conditional on occupancy. And indeed $$p(det) = p(det|occ)*p(occ)$$, so we see we’re on safe footing to multiply the occupancy and conditional detection probabilities.

7. The probabilities of mutually exclusive events add. In this case, $$Z == 1$$ and $$Z == 0$$ are indeed mutually exclusive.

8. In Bob Carpenter’s translation of the Dorazio model, this is achieved by modeling the sum of detections over all visits as a sample from a binomial distribution–a nifty trick if there are no visit-specific covariates to worry about!

9. Or close enough. For all I know there might be efficiency gains to be had by doing everything on one line (incrementing the log-likelihood once instead of multiple times), but if these exist they will be inconsequential for present purposes.

10. It should be obvious why we don’t want to rely on sorting the sites for the multi-species model, where different species will be detected at different sites. Yes, we could pass a complex data structure where every species has its own covariate tables that are properly sorted for that species’ detection history, but the ‘if’ statement makes things so much easier.