Chapter 11 Check All That Apply (CATA)

In its raw form, Check All That Apply (CATA) data is binary: each value indicates whether a judge finds a product to have an attribute (1) or not (0).

Usually, such data is organized in a matrix where each row corresponds to the evaluation of one product by one judge, and the columns are the attributes.

Say, for instance, that you have 26 judges/consumers and 4 products, and that every judge evaluates every product once on 13 attributes. Your data matrix would then have 26 × 4 = 104 rows and 13 columns of responses, plus additional columns indicating judge, product, record id, date, etc.
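The layout described above can be mocked up in a few lines of base R. This is a minimal sketch with simulated responses; the column names (judge, product, attr_01 to attr_13) are hypothetical and not taken from any real dataset.

```r
# Toy CATA layout: 26 judges x 4 products, 13 binary attributes.
# All names and the response probability are made up for illustration.
set.seed(1)
design <- expand.grid(judge = sprintf("J%02d", 1:26),
                      product = paste0("P", 1:4),
                      stringsAsFactors = FALSE)
n_attr <- 13
responses <- matrix(rbinom(nrow(design) * n_attr, size = 1, prob = 0.3),
                    nrow = nrow(design),
                    dimnames = list(NULL, sprintf("attr_%02d", 1:n_attr)))
cata_toy <- cbind(design, as.data.frame(responses))

dim(cata_toy)          # 104 rows; 2 id columns + 13 attribute columns
head(cata_toy[, 1:5])  # one row per judge-product evaluation
```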

11.1 An example from Beer profiling

Six different commercial beers from Danish craft brewers were evaluated by 160 consumers on a range of different questions:

  • Background information: a range of questions, including appropriateness ratings for 27 sensory descriptors on a 7-point scale (e.g. how appropriate do you think it is for a beer to be bitter?). The two semantic anchors were 1 = not at all appropriate and 7 = extremely appropriate. This dataset is called beerdemo.
  • CATA: check-all-that-apply responses for each beer on the sensory descriptors (columns S_Flowers to S_Vinous). This dataset is called beercata.
  • Hedonics: the consumers' liking/hedonic responses to each beer on a 7-point Likert scale (1-7). This dataset is called beerliking.

11.2 Two versions of the data

  • Raw data, with each row being the responses from one evaluation
  • Agglomerated to counts, with each row being one product

The agglomerated version is computed by:

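library(tidyverse)

beercatasum <- beercata %>% 
  gather(attrib, val, S_Flowers:S_Vinous) %>%              # long format: one row per response
  group_by(Beer,attrib) %>%
  dplyr::summarize(n = sum(val), .groups = 'drop') %>%     # number of checks per beer and attribute
  spread(attrib, n)                                        # back to wide: one row per beer

… and visualized by, for instance, a barplot.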

# summary counts per attribute
beercatasum %>% 
  gather(attrib, n, -Beer) %>%   # every column except Beer is an attribute
  ggplot(data = ., aes(attrib,n, fill = Beer)) + 
  geom_bar(stat = 'identity', position = position_dodge())
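As a dependency-free cross-check of such counts, the same aggregation can be sketched with base R's rowsum(). The toy data frame and its column names (S_x, S_y) are hypothetical.

```r
# Toy raw CATA data: 3 beers, 10 evaluations each, 2 binary attributes
set.seed(2)
toy <- data.frame(Beer = rep(c("A", "B", "C"), each = 10),
                  S_x  = rbinom(30, 1, 0.5),
                  S_y  = rbinom(30, 1, 0.2))

# Sum the 0/1 responses within each beer: one row per beer, one column per attribute
counts <- rowsum(toy[, c("S_x", "S_y")], group = toy$Beer)
counts
```

Each count is bounded by the number of evaluations per beer (10 in this toy case), which is a quick sanity check on real data as well.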

11.3 PCA

A PCA on the agglomerated counts reveals the attributes associated with the individual products:

mdlPCA <- prcomp(beercatasum[,-1], scale. = TRUE)  # drop the Beer column and autoscale the counts
ggbiplot::ggbiplot(mdlPCA, labels = beercatasum$Beer)

The attributes Bean, Caramel, Warming, Aromatic etc. are associated with the beer Brown Ale, while Berrie, Dessert, Pungent, etc. are characteristic of Wheat IPA.
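When reading a biplot it helps to know how much of the variation each component carries. A minimal sketch on a simulated count matrix (the beer and attribute names are made up) shows how this can be computed from a prcomp object:

```r
# Simulated 6 x 8 count matrix standing in for the agglomerated CATA counts
set.seed(3)
counts <- matrix(rpois(6 * 8, lambda = 20), nrow = 6,
                 dimnames = list(paste0("beer_", 1:6), paste0("attr_", 1:8)))

pca <- prcomp(counts, scale. = TRUE)

# Proportion of total variance per component, from the component standard deviations
expl_var <- pca$sdev^2 / sum(pca$sdev^2)
round(100 * expl_var, 1)
```

With only six products there are at most five informative components, so the last proportion is essentially zero.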

11.4 Cochran’s Q test

Cochran’s Q test is a statistical test for comparing several products when the response is binary and there are repeated responses across several judges. We need a package for it (RVAideMemoire).

For one response variable: S_Flowers

library(RVAideMemoire)

m <- cochran.qtest(S_Flowers ~ Beer | Consumer.ID,
                   data = beercata)
m

##  Cochran's Q test
The p-value is strongly significant, indicating that we cannot assume the same level of S_Flowers in all beers, i.e. the beers seem to differ on this characteristic. This is in agreement with the barplot above, where S_Flowers is high in NY Lager and really low for Brown Ale.
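To see what the test statistic measures, Cochran’s Q can be computed by hand from a judges-by-products 0/1 table. The sketch below uses the textbook formula and a small made-up table; the helper name cochran_q is hypothetical and is not part of any package.

```r
# Cochran's Q from a 0/1 matrix: rows = judges (blocks), columns = products
cochran_q <- function(x) {
  k <- ncol(x)               # number of products
  col_tot <- colSums(x)      # checks per product
  row_tot <- rowSums(x)      # checks per judge
  N <- sum(x)                # total number of checks
  Q <- (k - 1) * (k * sum(col_tot^2) - N^2) / (k * N - sum(row_tot^2))
  # Under H0, Q is approximately chi-square distributed with k - 1 df
  c(Q = Q, df = k - 1, p.value = pchisq(Q, df = k - 1, lower.tail = FALSE))
}

# Made-up responses from 6 judges on 3 products
x <- rbind(c(1, 1, 0),
           c(1, 0, 0),
           c(1, 1, 0),
           c(1, 0, 0),
           c(1, 1, 1),
           c(0, 0, 0))
res <- cochran_q(x)
res   # Q = 6 on 2 df, p-value = exp(-3), about 0.0498
```

Judges who check the attribute for all products or for none (rows 5 and 6 here) do not help to separate the products; only judges who discriminate between products drive the statistic.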

11.4.1 Post hoc contrasts

As we observe differences for this attribute, we pursue the question of which products stick out, and whether some products are similar. This is done by pairwise comparisons:

# pairwiseMcnemar() comes from the rcompanion package
library(rcompanion)

PT <- pairwiseMcnemar(S_Flowers ~ Beer | Consumer.ID,
                      data   = beercata,
                      test   = "permutation",
                      method = "fdr",
                      digits = 3)
PT$Pairwise %>% 
  arrange(-abs(as.numeric(Z)))   # largest absolute test statistic first

The table is sorted with the most different pairs at the top and the least different at the bottom. Most products differ from each other, while Porse Bock and Ravnsborg Red are fairly alike.
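The idea behind the pairwise comparisons can also be sketched with base R only. Note that this swaps in the asymptotic McNemar test from stats::mcnemar.test() instead of the permutation version used above, and that the toy data and product names (P1 to P3) are hypothetical.

```r
# Toy 0/1 responses: rows = judges, columns = products
set.seed(4)
x <- cbind(P1 = rbinom(40, 1, 0.8),
           P2 = rbinom(40, 1, 0.5),
           P3 = rbinom(40, 1, 0.2))

pairs <- combn(colnames(x), 2)   # all product pairs, one per column

# McNemar test on the paired 2 x 2 table of each product pair
p_raw <- apply(pairs, 2, function(pr) {
  tab <- table(factor(x[, pr[1]], levels = 0:1),
               factor(x[, pr[2]], levels = 0:1))
  mcnemar.test(tab)$p.value
})

res <- data.frame(comparison = paste(pairs[1, ], pairs[2, ], sep = " - "),
                  p.value    = p_raw,
                  p.adjust   = p.adjust(p_raw, method = "fdr"))  # control the false discovery rate
res[order(res$p.adjust), ]
```

The FDR adjustment matters because the number of pairs grows quickly: six beers already give 15 comparisons per attribute.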

11.4.2 For all Attributes

We use tidyverse and broom for this, but need a function capable of handling the output of Cochran’s Q test.

library(broom)

# tidy() method for the RVtest objects returned by cochran.qtest()
tidy.RVtest <- function(m){
  r <- data.frame(statistic = m$statistic,df = m$parameter,
                  p.value= m$p.value,
                  method = m$method.test)
  return(r)
}

tb_cochran <- beercata %>% 
  gather(attrib, val, S_Flowers:S_Vinous) %>% 
  group_by(attrib) %>%
  do(cochran.qtest(val ~ Beer | Consumer.ID,
                   data = .) %>% tidy)

tb_cochran %>% 
  ungroup() %>% 
  arrange(desc(statistic))   # most discriminating attribute first

This output indicates that S_Beans is the most discriminatory attribute, while S_Pungent is the least.

11.4.3 PLSDA

This section is still a draft and needs more work.

# plsda() is assumed to be caret::plsda(); g1 + g2 needs the patchwork package
library(caret)
library(patchwork)

mdl <- plsda(data.frame(beercata[,3:29]),factor(beercata$Beer),ncomp = 3)

scores <- mdl$scores %>% 
  unclass %>% as.data.frame %>% 
  mutate(Beer = beercata$Beer)   # attach the class label used for colouring

loadings <- mdl$loadings %>% 
  unclass %>% as.data.frame %>% 
  rownames_to_column('attrib') %>% 
  mutate(attrib2 = substr(attrib,3,50)) # lets remove the S_

g1 <- ggplot(data = loadings, aes(`Comp 1`, `Comp 2`, label = attrib2)) + 
  # geom_point() + 
  geom_text()

g2 <- ggplot(data = scores, aes(`Comp 1`, `Comp 2`, color = Beer)) + 
  # geom_point() + 
  stat_ellipse(level = 0.5)

g1 + g2

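X <- beercata[,3:29]
clss <- factor(beercata$Beer)
judge <- beercata$Consumer.ID
k <- 3   # number of PLS components
A <- 30  # number of resampling rounds

# reference model on all data; loadings in long format (attrib, cmp, val0)
mdl0 <- plsda(X,clss,ncomp = k)
lds0 <- mdl0$loadings %>% 
  unclass %>% as.data.frame %>% 
  rownames_to_column('attrib') %>% 
  gather(cmp,val0,-attrib)

unjudge <- unique(judge)
nindiv <- length(unjudge)

LOADS <- data.frame()
for (i in 1:A){
  # refit the model on a random half of the judges
  ic <- judge %in% sample(unjudge)[1:round(nindiv/2)]
  mdlSH <- plsda(X[ic,],clss[ic],ncomp = k)
  # components are only defined up to sign: align each one with the reference model
  df_flip <- data.frame(sng = sign(diag(t(unclass(mdl0$loadings)) %*% unclass(mdlSH$loadings)))) %>% 
    rownames_to_column('cmp')
  lds <- mdlSH$loadings %>% 
    unclass %>% as.data.frame %>% 
    rownames_to_column('attrib') %>% 
    gather(cmp,val,-attrib) %>% 
    left_join(df_flip, by = 'cmp') %>% 
    mutate(SHiter = i, 
           val = val*sng)  
  LOADS <- bind_rows(LOADS,lds)
}

# spread of the resampled loadings around the reference loadings
# (note: computed as a scaled sum of squares, without taking the square root)
fc <- (1 / A)*((A - 1)/A)
sdloads <- LOADS %>% 
  left_join(lds0, by = c('attrib','cmp')) %>% 
  group_by(attrib,cmp) %>% 
  dplyr::summarise(sd = sum((val-val0)^2) *fc, .groups = 'drop') %>% 
  mutate(cmp = paste('sd',cmp,sep = '')) %>% 
  spread(cmp,sd)

loadsSH <- lds0 %>% 
  spread(cmp,val0) %>% 
  left_join(sdloads, by = 'attrib')

# geom_ellipse() comes from the ggforce package
ggplot(data = loadsSH, aes(x0 = `Comp 1`,y0 = `Comp 2`,a = `sdComp 1`,b = `sdComp 2`,angle = 0)) + 
  ggforce::geom_ellipse()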

11.5 Exercises

Take 5-10 minutes to look at the publication to get an overview.

11.5.1 Exercise 1: PCA on consumer background

From this exercise you should be able to describe who your consumers are.

Make the data available:


Calculate a PCA model including variables 7 (Interest in food) to 39 (App_Vinous). Remember to standardize/scale the variables.

mdlPCA <- prcomp(beerdemo[,7:39],scale. = T)

Plot the scores and loadings in a biplot and look for groupings of the consumers in the scores.

Group and color according to the background information not used in the model (Gender, Age,..)

ggbiplot(mdlPCA, groups = beerdemo$Gender, ellipse = T)

Describe what you find.

11.5.2 Exercise 2: PCA on CATA counts

From this exercise you should be able to describe your samples (beers) from the CATA counts, i.e. the CATA responses collated (summed) over all consumers for each beer.

Setup the collated version as described above.

beercatasum <- beercata %>% 
  gather(attrib, ...

Calculate a PCA model including all Variables and all Objects.

PCAmdl <- prcomp(beercatasum[,-1], scale. = T)  # the first column (Beer) is a label, not a variable

Plot the scores and describe the groupings of the samples. Plot the loadings and describe the correlations between the variables.


Use this biplot to find out which samples are described by which words.

11.5.3 Exercise 3: PCA on liking

From this exercise you should be able to describe the liking of the beer samples and see how the consumers do this.

Calculate a PCA model including all Variables and all Objects.

include_these <- complete.cases(beerliking)
PCAliking <- prcomp(beerliking[include_these,-1], scale. = T)

Plot a biplot or loading plot, and use the loadings to describe the correlations between the variables (liking of beers in this case).


Plot the scores and describe the groupings of the samples by colouring the score plot according to the consumer background variables. Note that the 160 rows in both datasets match each other, so we can glue the demo information directly onto the liking model. If that were not the case, matching using left_join() or inner_join() would be necessary before analysis.

ggbiplot(PCAliking,groups = beerdemo$Age[include_these], ellipse = T)
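If the rows of the two tables did not line up, the matching step mentioned above could look like the following sketch. The toy tables and the key column id are hypothetical; in the beer data the key would be the consumer identifier.

```r
library(dplyr)

# Hypothetical toy tables whose rows are NOT in the same order
demo_toy   <- data.frame(id  = c("c3", "c1", "c2"),
                         Age = c("18-25", "26-35", "36-45"))
liking_toy <- data.frame(id  = c("c1", "c2", "c3"),
                         lik = c(5, 3, 6))

# left_join() keeps every row (and the row order) of the first table
# and looks up the matching background information by key
matched <- left_join(liking_toy, demo_toy, by = "id")
matched
```

inner_join() would instead drop consumers that are missing from either table, which is the safer choice when the grouping variable must be known for every plotted point.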

Any trends? For instance, how is liking related to the individual consumer diversity of beer (Beer types/month)?

The code below plots the scores coloured by each of the 7-point-scale background variables, one panel per variable. You may want to export the figure and view it in a pdf viewer for zooming etc.

gall <- cbind(PCAliking$x[,1:2], beerdemo[include_these,]) %>% 
  gather(var,val,`Interest in food`:App_Vinous) %>% 
  ggplot(data = ., aes(PC1,PC2, color = factor(val))) + 
  geom_point() + 
  stat_ellipse() + 
  facet_wrap(~var)   # one panel per background variable
ggsave(filename = 'anicebigfigure.pdf',gall, height = 20, width = 20)

11.5.4 Exercise 4: PLS on CATA counts and liking

From this exercise you should be able to conclude what drives the liking of your samples (beers).

For each beer, the collated CATA counts are the predictors, and the averaged liking is the response.

likingsum <- beerliking %>% 
  gather(Beer, liking, -`Consumer ID`) %>% 
  group_by(Beer) %>% 
  dplyr::summarise(lik = mean(liking, na.rm = T))

Check that the rows are ordered in the same way:

## [1] "Brown Ale"     "NY Lager"      "Porse Bock"    "Ravnsborg Red" "River Beer"    "Wheat IPA"

CATAlik <- list()
CATAlik$CATA <- scale(as.matrix(beercatasum[,-1]))
CATAlik$lik <- scale(likingsum$lik)
rownames(CATAlik$lik) <- rownames(CATAlik$CATA) <- beercatasum$Beer

Calculate a PLS model where CATA features are predictors and liking is response for all Objects.

catalik.pls <- plsr(lik ~ CATA, ncomp = 2, data = CATAlik, validation = "LOO")
corrplot(catalik.pls, labels = colnames(beercatasum)[-1])


Plot the loadings and study which X variables are important for the liking score. Advanced: Plot the Regression coefficients (scaled) and try to interpret the meaning of this plot (Hint: use your findings from the loadings plot).

11.5.5 Exercise 5: Mixed modelling on the liking

Dataset: Beer_XYZmatrix.xlsx, sheet “Z and Y liking”

  • Import the datasheet into RStudio. Check that all variables have the correct description/denomination (factor, numerical, etc.).
  • Are there any significant product differences for the liking? If so, what does the Tukey test tell us? How does this fit with what you have done in the PCA/PLS exercises?
  • Is the liking in general affected by age, gender, household size or beer knowledge? What is the effect? Try to think of a plot that can show the significant differences.
  • Do men and women score the samples significantly differently in liking? Calculate the sample/gender differences in averages; try to use Pivot Tables in Excel.
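As a starting point for the product comparison, the sketch below uses base R only. It is a fixed-effects stand-in (consumer entered as a blocking factor in aov()), not a mixed model; with a mixed-model package the consumer would be a random effect. The data are simulated and all names and effect sizes are hypothetical.

```r
# Simulated liking data: 30 consumers each score 3 beers
set.seed(5)
toy <- expand.grid(Consumer = factor(1:30),
                   Beer     = factor(c("A", "B", "C")))
beer_effect     <- c(A = 0, B = 1, C = 2)   # made-up product differences
consumer_effect <- rnorm(30, sd = 1)        # made-up consumer levels
toy$liking <- 4 +
  beer_effect[as.character(toy$Beer)] +
  consumer_effect[as.integer(toy$Consumer)] +
  rnorm(nrow(toy), sd = 0.5)

# Consumer as block, Beer as the effect of interest
fit <- aov(liking ~ Consumer + Beer, data = toy)
summary(fit)

# Tukey's honest significant differences between the beers
tuk <- TukeyHSD(fit, which = "Beer")
tuk$Beer   # columns: diff, lwr, upr, p adj
```

Pairs whose interval (lwr, upr) excludes zero differ significantly in liking; the same table can be compared with the groupings seen in the PCA and PLS plots.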

11.5.6 Exercise 6: Comparing CATA binary data and counts

Dataset: Beer_XYZmatrix.xlsx, sheets “X CATA collated, counts” and “Z + Y + X unfolded”

If time… Calculate two PCA models: one on the X unfolded matrix (CATA answers in binary codes; it has more columns, so just choose the ones you need) and one on the CATA counts. Compare the outcome of the two models:

  • Evaluate explained variance.
  • Evaluate loadings plots.
  • Is this expected when looking at counts and “raw” data?
  • What type of information is lost by looking at the CATA counts?

11.5.7 Exercise 7: Cochran’s Q test on CATA binary data

Dataset: Beer_XYZmatrix.xlsx, sheet “X unfolded”

If time… Import the datasheet into RStudio using the CSV format (save the file as CSV in Excel). Choose 4 relevant CATA attributes (based on your previous results today), run a Cochran’s Q test for each, and comment on the results (i.e. the sample differences).