## Saturday, February 24, 2018

### Compose JAGS model statements for human readability

There's a great new book by Farrell and Lewandowsky, Computational Modeling of Cognition and Behavior (at the publisher, at Amazon.com), that includes some chapters on Bayesian methods. Each chapter includes a little "in vivo" commentary by an outside contributor. My commentary accompanies their chapter regarding JAGS. The commentary is posted here in a succession of three blog posts; this is 2 of 3. (Part 1 is here.) Do check out their book!

Compose JAGS model statements for human readability

All mathematical models are designed to describe structure in data. Logically, to comprehend a model, we must first know what the data are that the model is supposed to describe. We begin with describing how the data are probabilistically distributed according to some likelihood function. The likelihood function has parameters, which typically describe some trend or relation in the data. The parameters might be expressed in terms of higher-level parameters. Finally, the parameters have uncertainty, expressed as prior distributions on the parameters. The JAGS model-specification language lets us write models in this logical and comprehensible way: Start with the data, write the likelihood function, then write any dependencies among parameters, and finish with the prior distribution on the parameters. This makes it easy to write the model, and, importantly, easy for readers of the model specification to make sense of the model.

For example, consider a JAGS model specification for describing a set of data with a normal distribution (cf. Listing 8.3 [in Farrell and Lewandowsky's book]):
 model { for ( i in 1:N ) { y[i] ~ dnorm( mu , 1/sigma^2 ) } mu ~ dunif( -100 , 100 ) sigma ~ dunif( 0 , 100 ) }  (Listing 8.11. Describe data with a normal distribution in JAGS.)
The model specification (above) is easy to comprehend sequentially in reading order.

JAGS does not execute the lines of the model specification as if they were procedural R commands, but instead JAGS examines the overall model statement for structural consistency. The three lines in the model specification (above) could be put in any order and JAGS would not care. For example, JAGS would also allow the following:
 model { sigma ~ dunif( 0 , 100 ) mu ~ dunif( -100 , 100 ) for ( i in 1:N ) { y[i] ~ dnorm( mu , 1/sigma^2 ) } }  (Listing 8.12. Alternative JAGS description of a normal distribution.)
In terms of information content, it does not matter if you say “the knee bone’s connected to the thigh bone, and the thigh bone’s connected to the hip bone,” or instead say “the thigh bone’s connected to the hip bone, and the knee bone’s connected to the thigh bone.”

But for human readers trying to comprehend the statements, order does matter. Especially for complicated models with unfamiliar or arbitrary parameter names, it can be very difficult to understand model specifications that begin by specifying priors on parameters before specifying what distributions those parameters play a role in, and what the relation of the data to the parameters is. Therefore, be kind to your readers, and to your future self who will look back on your code months later. Specify JAGS models starting with the data likelihood then working through the parameters and their priors. These ideas are expressed with more examples on p. 199 and p. 414 of Kruschke (2015).

## Friday, February 23, 2018

### Run MCMC to achieve effective sample size (ESS) of 10,000

There's a great new book by Farrell and Lewandowsky, Computational Modeling of Cognition and Behavior (at the publisher, at Amazon.com), that includes some chapters on Bayesian methods. Each chapter includes a little "in vivo" commentary by an outside contributor. My commentary accompanies their chapter regarding JAGS. The commentary is posted here in a succession of three blog posts; this is 1 of 3. Do check out their book!

Run MCMC to achieve effective sample size (ESS) of 10,000

Bayesian analysis of complex models is possible only by virtue of modern software that takes an abstract model specification and returns a representation of the posterior distribution. In software that uses Markov-chain Monte Carlo (MCMC) methods, such as JAGS, the representation is inherently noisy. The random noise from MCMC tends to cancel out as the chain gets longer and longer. But different aspects of the posterior distribution are differently affected by noise. A relatively stable aspect is the median value of the chain. The median tends to stabilize relatively quickly, that is, with relatively shorter chains, because the median is usually in a high-density region of the posterior and the value of the median does not depend on the distance to outliers (unlike the mean). But other crucial aspects of the posterior distribution tend to need longer chains to achieve stable values.

In particular, a crucial aspect of a parameter distribution is its width. Narrower distributions connote more certainty in the estimate of the parameter. A very useful indicator of the width of a distribution is its 95% highest density interval (HDI). Parameter values within the 95% HDI have higher probability density than parameter values outside the HDI, and the parameter values inside the 95% HDI have a total probability of 95%. An example of an HDI is illustrated in Figure 1.

 Figure 1. Example of a 95% highest density interval (HDI). On the axes of the graph, θ denotes a parameter in the model, and p(θ|D) denotes the posterior distribution of that parameter. The limits of the HDI are marked by the ends of the double-headed arrow. Any value of θ within the HDI has higher probability density than any value outside the HDI. The mass within the 95% HDI, shaded by gray in the figure, is 95%.
Because the limits of an HDI are usually in the low-density tails of the distribution, there are relatively few steps in the MCMC chain near the limits. Therefore it takes a long chain to generate sufficiently many representative values of the parameter to stabilize the estimate of the HDI limits.

How long of a chain is needed to produce stable estimates of the 95% HDI? One useful heuristic answer is 10,000 independent steps. The rationale for the heuristic is explained in Section 7.5.2 of Kruschke (2015). Note that the requirement is 10,000 independent steps. Unfortunately, most MCMC chains are strongly autocorrelated, meaning that successive steps are near each other, and are not independent. Therefore we need a measure of chain length that takes into account the autocorrelation of the chain. Such a measure is called the effective sample size (ESS), for which a formal definition is provided in Section 7.5.2 of Kruschke (2015).

ESS is computed in R by the effectiveSize function (which is in the coda package, which in turn is part of the rjags package for JAGS). For example, suppose we have generated an MCMC chain using the rjags function, coda.samples, and the resulting object is called mcmcfin. Then we can find the ESS of the parameters by typing effectiveSize(mcmcfin).

It is crucial to realize that (i) the ESS will usually be much less than the number of steps in the MCMC chain, and (ii) every parameter in a multi-parameter model has a different ESS. Some parameters might have large ESS while others have small ESS. Moreover, combinations of parameters, such as a difference of two means, can have quite different ESS than the separate parameters. Therefore it is important to check the ESS of every parameter of interest, and the ESS of any interesting parameter combinations.

Stay tuned for Parts 2 and 3...

# Psychonomic Bulletin & Review

Scroll down on the page linked here for the articles in the special issue on Bayesian data analysis. The list of links includes these:

#### Articles

Bayesian data analysis for newcomers [<-clickable link]
John K. Kruschke and Torrin M. Liddell
The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective [<-clickable link]
John K. Kruschke and Torrin M. Liddell

Starting February 19, there will be a series of commentaries on the articles. I will update the blog when more information is available.

## Sunday, January 28, 2018

### Don't treat ordinal data as metric -- update of movie ratings

In a previous post I applied a Bayesian ordered-probit model to movie ratings and showed how the results differ from treating the data as if they were metric. The metric model used frequentist t tests (because that's what most applied researchers would do). In this post, I re-analyze that data as if they metric but using a Bayesian model that has the same hierarchical structure as the Bayesian hierarchical ordered-probit model I used before. Here we can compare ordered-probit to metric treatments with all else held constant. Spoiler: Same conclusions, of course. The ordered-probit model fits the data much better than the metric model. Don't treat ordinal data as metric.

Please see the previous post for details about the data and the models. In particular, note that there is hierarchical structure on the standard deviations across movies, but not on the means across movies. In other words, there is no direct shrinkage on the means.

Here is a repeat of the data, from 30 movies, fit by the ordered-probit model:
The pink bars (above) are frequency histograms of the data from each movie; the blue dots (with vertical blue whiskers) are the posterior predictions and 95% HDIs for the ordered-probit model. A very good fit, all in all.

Here is some news. The data fit by the metric model. Here a 1-star rating is considered to be a score of 1.0 on a metric scale, a 2-star rating is considered to be a score of 2.0 on a metric scale, and so on. Each histogram is fit by a normal distribution (as is assumed by t tests, ANOVA, etc.). The result:
The superimposed blue curves (above) are a smattering from the posterior distribution. Not a very good fit to the data distributions.

More news. The means of the ordered-probit model plotted against the means of the metric model, with 95% HDI's:
You can see (above) that the rank ordering of the movies is quite different for the two models!

Case in point (and more news):
You can see (above) that the ordered-probit puts movie 10 well above movie 26, but treating the data as metric yields the opposite conclusion. Which conclusion is more appropriate? Clearly the ordered-probit describes the data much more accurately than treating the data as metric. Watch movie 10 before movie 26.

And don't be a free rider on the rating system. If you use the rating system, give a rating.

## Monday, December 25, 2017

### Which movie is rated better? (Don't treat ordinal ratings as metric)

When deciding what movie to watch online, have you ever considered the star ratings provided by previous viewers? For example, Amazon.com has a 5-star rating system, in which reviewers can give a movie an ordinal rating from 1 star to 5 stars. Here are frequency histograms of 30 movies listed under "drama":

 Frequency histograms of star ratings from 30 movies (shown as pink bars). Posterior predictions of an ordered-probit model are shown by blue dots with blue vertical segments indicating the 95% HDIs.

Usually people analyze rating data as if the data were metric, that is, people pretend that 1 star is 1.0 on a metric scale, and 2 stars is 2.0 on the metric scale, and 3 stars is 3.0, and so forth. But this is not appropriate because all we know about the star ratings is their order, not their interval separation. The ordinal data should instead be described with an ordinal model. For more background, see Chapter 23 of DBDA2E, and this manuscript.

Here I used an ordered-probit model to describe the data from the 30 movies. I assumed the same response thresholds across the movies because the response scale is presented to everyone the same way, for all movies; this is a typical assumption. Each movie was given its own latent mean (mu) and standard deviation (sigma). I put no hierarchical structure on the means, as I didn't want the means of small-N movies to be badly shrunken toward enormous-N movies. But I did put hierarchical structure on the standard deviations, because I wanted some constraint on the sigma's of movies that show extreme ceiling effects in their data; it turns out the sigma's were estimated to vary quite a lot anyway.

Below is a graph of the resulting latent means (mu's) of the movies plotted against the means of their ordinal ratings treated as metric:
 Each point is a movie. Vertical axis is posterior mean (mu) of ordered probit model, with 95% HDI displayed as blue segment. Horizontal axis is mean of the star ratings treated as metric values.
In the scatter plot above, notice the many non-monotonicities; that is, as the means of ordinal-as-metric values increase along the horizontal axis, the latent mu's do not consistently increase on the vertical axis. In other words, the latent mu's are telling a different story than the ordinal-as-metric means.

Two movies with nearly equal ordinal-as-metric means, but with very different latent means in the ordered-probit model:

 Upper row shows ordered-probit fit; lower row shows t test with unequal variances. Notice the blue dots from the ordered-probit model fit the data much better than the blue normal distributions of the ordinal-as-metric model. (Case 19 is Ekaterina: The rise of Catherine the Great, and Case 26 is John Grisham's The Rainmaker.)
Do we conclude that the movies (above) are rated about the same, or that movie 19 is rated much better than movie 26? I think we have to conclude that movie 19 is rated much better than movie 26 because the ordered-probit model is a much better description of the data.

Two movies with ordinal-as-metric means that are significantly different in one direction but the latent means in the ordered-probit model are quite different in the opposite direction:

 Upper row shows ordered-probit fit; lower row shows t test with unequal variances. Notice the blue dots from the ordered-probit model fit the data much better than the blue normal distributions of the ordinal-as-metric model. (Case 10 is Miss Sloane, and Case 26 is John Grisham's The Rainmaker.)
Do we conclude that movie 26 is rated better than movie 10, or the other way around? I think we have to conclude that movie 10 is rated better than movie 26 because the ordered probit model is a much better description of the data.

This isn't (only) about movies: The point is that ordinal data from any source should not be treated as metric. Pretending that a rating of "1" is numeric 1.0, and rating "2" is 2.0, and rating "3" is 3.0, and so forth, is usually nonsensical because it's assuming metric information in the data that simply is not there. Treating the data as normally-distributed metric values is often a terrible description of the data. Instead, use an ordinal model for ordinal data. The ordinal model will describe the data better, and sometimes yield rather different implications than the ordinal-as-metric description.

For more information, see Chapter 23 of DBDA2E, and this manuscript titled, "Analyzing ordinal data with metric models: What could possibly go wrong?"

## Sunday, December 17, 2017

### The skew-normal distribution in JAGS

For a project I'm working on, I decided to try the skew-normal distribution to describe residual noise. JAGS does not have the skew-normal built in, so I used the Bernoulli ones trick to express the skew-normal in a JAGS model specification. This blog post shows how, and also demonstrates that when skew is near zero the autocorrelation can be severe and the posterior distribution has an interesting boomerang shape.

First, an example of simulated data from a skew-normal distribution, along with the recovered parameter values from Bayesian inference.
 Gray histogram shows the data; blue curves are a smattering of skew-normals from the posterior.
The generating parameters had values location=1, scale=2, and skew=3. The data had fairly large N=2,000 so that the estimated parameter values might have reasonably high precision. Here are the MCMC diagnostics, along with a pairwise plot of the parameters in the posterior distribution:

These graphs (above) were the result from 12,000 MCMC steps thinned by 5. They have not yet achieved the heuristic desired ESS of 10,000 but at least it's clear that letting the chains run longer would do the job. It's also clear that the data-generating parameter values were recovered fairly well, but interestingly the location is pulled in the direction of the skew, which lets the scale and skew be a bit smaller than their generating values.

Now another example: The same as before but this time with skew=0, that is, normally distributed data being fit with a skew-normal distribution. In this case, the autocorrelation in the MCMC chains is nasty, and the pairwise posterior distribution shows a boomerang shape:

You can see in the diagnostic plots that the ESS is tiny, despite 12,000 steps thinned by 5. It looks like the chains have settled into the correct parameter values -- after all the posterior-predictive curves superimposed on the data look good -- but it will take a ton more steps to get a smooth representation of the posterior distribution.

Pursuing this topic showed me yet again how Present Self reinvents the wheels of Past Self. I didn't recall ever trying to do this particular application before, so I searched the web to see if anyone had previously posted something about the skew normal distribution in JAGS. One of the first links in the search results was this one, a discussion thread. I skimmed through it and thought it was just the sort of thing I could adapt for my present purposes. I thought the name of the initial poster looked familiar, and then I realized the name of the responder was very familiar -- it was mine. Sigh.

The JAGS model specification, using the Bernoulli ones trick. The Bernoulli ones trick is explained in DBDA2E in Section 8.6.1, pp. 214-215. This script will take several minutes to run for N=2,000; perhaps reduce N on a first try.

# housekeeping
graphics.off() # This closes all of R's graphics windows.
rm(list=ls())  # Careful! This clears all of R's memory!

if ( !("sn" %in% installed.packages()[,"Package"]) ) { install.packages("sn") }
library("sn")
source("DBDA2E-utilities.R") # for diagMCMC(), openGraph(), etc.!

# Generate data:

# Specify generating parameter values:
locat = 1 # location
scale = 2 # scale
skew = 3 # skew

# Construct file name root based on generating parameter values:
fileNameRoot = paste0("SkewNormalPlay","-",locat,"-",scale,"-",skew)
saveType = "png"

# Create random data from skew-normal distribution:
N = 2000
set.seed(47405)
y = rsn( N , xi=locat , omega=scale , alpha=skew )
# Scaling constant for use in JAGS Bernoulli ones trick:
dsnMax = 1.1*max(dsn( seq(-10,10,length=1001) , xi=locat,omega=scale,alpha=skew ))
# Assemble data for JAGS:
dataList = list(
y = y ,
N = N ,
ones = rep(1,length(y)) ,
C = dsnMax # constant for keeping scaled dsn < 1
)

# Define the JAGS model using Bernoulli ones trick
# as explained in DBDA2E Section 8.6.1 pp. 214-215.
modelString = "
model {
for ( i in 1:N ) {
dsn[i] <- ( (2/scale)
* dnorm( (y[i]-locat)/scale , 0 , 1 )
* pnorm( skew*(y[i]-locat)/scale , 0 , 1 ) )
spy[i] <- dsn[i] / C
ones[i] ~ dbern( spy[i] )
}
scale ~ dgamma(1.105,0.105)
locat ~ dnorm(0,1/10^2)
skew ~ dnorm(0,1/10^2)
}
" # close quote for modelString
writeLines( modelString , con="TEMPmodel.txt" )

# Run the chains:
runJagsOut <- run.jags( method="parallel" ,
model="TEMPmodel.txt" ,
monitor=c("scale","locat","skew") ,
data=dataList ,
#inits=initsList ,
n.chains=3 ,
burnin=500 ,
sample=ceiling(12000/3) ,
thin=5 ,
summarise=FALSE ,
plots=FALSE )
codaSamples = as.mcmc.list( runJagsOut ) # from rjags package
save( codaSamples , file=paste0(fileNameRoot,"Mcmc.Rdata") )
mcmcMat = as.matrix( codaSamples )

# Examine the chains:
# Convergence diagnostics:
diagMCMC( codaObject=codaSamples , parName="scale" ,
saveName=fileNameRoot , saveType=saveType )
diagMCMC( codaObject=codaSamples , parName="locat" ,
saveName=fileNameRoot , saveType=saveType )
diagMCMC( codaObject=codaSamples , parName="skew" ,
saveName=fileNameRoot , saveType=saveType )

# Examine correlation of parameters in posterior distribution:
openGraph()
pairs( mcmcMat , col="skyblue" )
saveGraph( file=paste0(fileNameRoot,"-Pairs") , type=saveType )

# plot data with posterior predictions
openGraph(height=4,width=7)
histInfo = hist( dataList$y , probability=TRUE , breaks=31 , xlab="Data Value" , main="Data with Posterior Pred." , col="gray" , border="white" ) nCurves = 30 curveIdxVec = round(seq(1,nrow(mcmcMat),length=nCurves)) xComb = seq( min(histInfo$breaks) , max(histInfo\$breaks) , length=501 )
for ( curveIdx in curveIdxVec ) {
lines( xComb , dsn( xComb ,
xi=mcmcMat[curveIdx,"locat"] ,
omega=mcmcMat[curveIdx,"scale"] ,
alpha=mcmcMat[curveIdx,"skew"] ) ,
col="skyblue" )
}
saveGraph( file=paste0(fileNameRoot,"-Data-PostPred") , type=saveType )

Perhaps this is a case in which Stan would work more efficiently. Read about Stan in Chapter 14 of DBDA2E. Not only would Hamiltonian Monte Carlo drastically reduce autocorrelation, but Stan also has the skew-normal built in. Perhaps a future blog post... if Future Self remembers Present Self...