This document contains errata (and non-error important notes) for the textbook Statistics for Linguists: An Introduction Using R (Bodo Winter, 2019, Routledge). Please feel free to suggest other errata by creating a GitHub issue.


Page Text Comment
xv The following R packages need to be installed to be able to execute all code in all chapters Some packages may be unnecessary depending on what you plan to do:
  • swirl is only used for a single exercise in Chapter 1
  • pscl is only used in Chapter 13
  • lme4 isn’t used until Chapter 14
  • afex and MuMIn aren’t used until Chapter 15
  • brms isn’t actually used for code, just referenced in Appendix B

Chapter 1: Introduction to R

Page Text Comment
14 (Code output for mydf) The alignment of this output apparently got messed up in the book publication process, but your output should be nicely lined up between the column headings (like participant) and the values (like louis)
15 Notice one curiosity: the participant column is indicated to be a factor vector, even though you only supplied a character vector! The data.frame() function secretly converted your character vector into factor vector As of R version 4.0, functions that create dataframes (e.g., data.frame(), read.csv()) default to leaving character vectors as-is, rather than converting to factor vectors. When you run str(mydf), the second line will instead read $ participant: chr "louis" "paula" "vincenzo"
This change in R’s default behavior also affects other outputs in this chapter.
16 mydf[mydf$participant == 'vincenzo',] $score Extra space before $. This doesn’t actually affect R’s output (try it yourself both ways!), but typically we write $ without a leading space.
17 nettle <- read.csv('nettle_1999_climate.csv') As Winter mentions, this code only works if your working directory is the same as wherever you’ve downloaded nettle_1999_climate.csv. This is not always a trivial or easy thing for students to navigate! Instead, I typically direct students to load datasets directly from the OSF repository. In the files tab on that site (or on, click materials > data, then right-click the dataset you want and copy the URL; then you can run read.csv() (or read_csv() in Chapter 2) with the URL plus /download/, with quotation marks around the full URL. For example, to load the Nettle (1999) dataset, you can run nettle <- read.csv(''). Of course, this only works if you’re connected to the internet!
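As a rough illustration of the page 15 note, the sketch below builds a small data frame and checks the class of the participant column. The score values are made up for illustration (they are not taken from the book), and the first result assumes R 4.0 or later.

```r
# Hypothetical data: the participant names follow the book's example,
# but the scores are invented for illustration.
participant <- c("louis", "paula", "vincenzo")
score <- c(67, 85, 32)

# In R >= 4.0, data.frame() leaves character vectors as-is by default
mydf <- data.frame(participant, score)
class(mydf$participant)

# Requesting the pre-4.0 behavior explicitly
mydf_fct <- data.frame(participant, score, stringsAsFactors = TRUE)
class(mydf_fct$participant)

# Converting a single column to a factor after the fact
mydf$participant <- factor(mydf$participant)
levels(mydf$participant)
```

If you want your output to match the book's factor-based output, either of the last two approaches should get you there.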

Chapter 2: The Tidyverse and Reproducible R Workflows

Page Text Comment
28 (Code output for nettle, showing <fct> under Country) For users of R version 4.0 or later, this will show as <chr>, short for character vector. This is because R now defaults to leaving character vectors as-is, rather than converting to factor vectors.
29 But wait, didn’t I just tell you that tibbles default to character vectors? Why is the Country column coded as a factor? The culprit here is the base R function read.csv(), which automatically interprets any text column as factor. So, before the data frame was converted into a tibble, the character-to-factor conversion has already happened. Again, this discussion is moot, since now both read.csv() and read_csv() interpret text columns as characters, not factors.
29 Output of nettle <- read_csv('nettle_1999_climate.csv') The output looks different if you’re using the most recent version of readr. The only substantive difference is that readr now parses Langs as a double, not an integer.
38–39 geom_histogram(fill = 'peachpuff3') The book is in black and white, so the color doesn’t show up in the book’s version. It should if you run the code yourself.
44–45 In addition, there are code chunks, which always begin with three ''' (backward ticks, or the grave accent symbol).

# R code goes in here
The wrong character apparently got substituted in the publishing process. The symbol is ```, which is on the key to the left of 1. R Markdown won’t know what to do with '''.

Chapter 3: Descriptive statistics

Page Text Comment
53 The corresponding histogram is shown in Figure 3.1a (for an explanation of histograms, see Chapter 1.12). Figure 3.1a is a barplot, not a histogram.
56 (Footnote 2) This book uses N and n interchangeably (including in this footnote). In other texts, N refers to the size of a population and n to the size of a sample of that population.
64 war <- read_csv(' warriner_2013_emotional_valence.csv') There is an extra space after the first quotation mark. This space needs to be removed or else the code will yield an error.
65 (Code output at the bottom of the page) Content warning: This code output contains the most negative words in the dataframe, and there are some potentially triggering words in here.

Chapter 4: Introduction to the linear model

Page Text Comment
74 E4.7: \(y = b_{0} + b_{1} * x + e\) The error term is sometimes notated as epsilon (\(\epsilon\)), but notated as \(e\), it shouldn’t be confused with the natural logarithm base (also notated \(e\)), which is discussed in section 5.4.
75 Figure 4.6 shows the SSE as a function of different slope values. Should be Figure 4.5b, not Figure 4.6
77 Conversely, 32% of the variation in response durations is due to chance Should be 28% (100% − 72%), not 32%.
78 plot(x., y, pch = 19) Extra dot. Should be
plot(x, y, pch = 19)
81–2 # A tibble: 50 x 2
# ... with 40 more rows
The preceding code will yield a tibble with 61 rows, not 50. So your output should be
# A tibble: 61 x 2
# ... with 50 more rows
83 (Code output starting with r.squared) The alignment of this output apparently got messed up in the book publication process, but your output should be nicely lined up between the labels (like r.squared) and the quantities (like 0.9283634)
83 The resultant plot will look similar to Figure 4.9 There is no Figure 4.9! Your plot will look similar to the following (but not identical, because we didn’t set a random seed):

Chapter 5: Correlation, linear, and nonlinear transformations

Page Text Comment
89–90 (All of section 5.3) The biggest question I usually get about this chapter is “wait, what’s the connection between correlation and transformations?” And honestly…I don’t think it makes sense to smush these two concepts into a single chapter, because they don’t really make sense together (beyond the fact that they’re both statistical concepts). So if it’s less confusing to you, treat section 5.3 as a kind of mini-separate-chapter within the larger chapter.
92 The log10 of 1000 is 3, which is a difference of 998 [between the logarithm and the raw number]. Should be 997, not 998.
97 (Code output for tidy(ELP_mdl)) Your output will probably have a lot fewer decimal places.
100 First, compare xmdl to xmdl_c. There is no change in the slope, but the intercept is different in the centered model. In both models, the intercept is the prediction for \(x\) = 0, but \(x\) = 0 corresponds to the average frequency in the centered model. Second, compare xmdl_c and xmdl_z. The intercepts are the same because, for both models, the predictor has been centered. However, the slope has changed because a change in one unit is now a change in 1 standard deviation. All the instances of xmdl should be ELP_mdl (including with the _c suffix)
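To see the page 100 comparison in action, here is a minimal sketch using simulated data rather than the ELP dataset. The variable and model names (x, y, mdl, mdl_c, mdl_z) and the simulated values are assumptions made for illustration only.

```r
# Simulated stand-in data (not the ELP data from the book)
set.seed(42)
x <- rnorm(100, mean = 5, sd = 2)
y <- 900 - 70 * x + rnorm(100, sd = 20)

# Centered and standardized (z-scored) versions of the predictor
x_c <- x - mean(x)
x_z <- x_c / sd(x)

mdl   <- lm(y ~ x)    # raw predictor
mdl_c <- lm(y ~ x_c)  # centered predictor
mdl_z <- lm(y ~ x_z)  # standardized predictor

# Centering changes the intercept but not the slope
coef(mdl)
coef(mdl_c)

# Centered and standardized models share an intercept;
# the slope is now per 1 standard deviation of x
coef(mdl_z)
```

In this sketch the standardized slope equals the centered slope multiplied by sd(x), and the intercept of both centered models equals mean(y).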

Chapter 6: Multiple regression

Page Text Comment
103 E6.1: \(y = b_{0} + b_{1} * x + b_{2} * x + e\) This equation suggests that \(b_{1}\) and \(b_{2}\) are multiplied by the same predictor \(x\). In the case of multiple regression, each predictor has its own coefficient (as in E6.3, bottom of page), so a more accurate form would be \(y = b_{0} + b_{1} * x_{1} + b_{2} * x_{2} + e\)
106 (Code output for tidy(icon_mdl)) Your output will probably have a lot fewer decimal places.
106–7 Footnote 2 discussion of rounding Two points of clarification about rounding: (1) Rounding should only be done at the end of an analysis; any earlier, and you’re losing precision in your calculations, running the risk of a rounding error compounding over the course of many calculations. (2) Rounding should only be done when you want to display numbers as numbers, not when feeding numbers into a plot.
107 (Footnote 3) A log frequency of 0 corresponds to a raw word frequency of 1, since 100=1 Missing superscript. Should be
A log frequency of 0 corresponds to a raw word frequency of 1, since \(10^{0} = 1\)
111 for (i in 1:9) plot(rnorm(50), rnorm(50)) If your Plots pane in RStudio is too small, running this code will yield "Error in : figure margins too large". If so, just make the Plots pane wider or taller.

Chapter 7: Categorical predictors

Page Text Comment
120 (Code output at the bottom of the page) In more recent dplyr versions, we also get a warning message:
`summarise()` ungrouping output (override with `.groups` argument)

Chapter 8: Interactions and nonlinear effects

Page Text Comment
140 # Groups: Phon, Sem [4] Your output should omit this line
142 However, these are not the average row-wise or column-wise differences, as one also has to include the interaction term (highlighted in bold [in Table 8.1]) in calculating these averages. The interaction term isn’t bolded in Table 8.1. This is the interaction term that’s being referred to:
78.4 + 8.0 + (−7.8)+(−4.6)

Chapter 9: Inferential statistics 1

Page Text Comment
161 (Figure 9.2) According to the equation for Cohen’s d (E9.1), the d values in this figure should be 1, 2, and 6 (there’s probably some rounding in the M values that we’re not shown)
162 Cohen's d
d estimate: 1.037202 (large)
95 percent confidence interval:
inf sup
0.5142663 1.5601377
Using the latest version of the effsize package (0.8.1), I get the following instead:
Cohen's d
d estimate: -1.070784 (large)
95 percent confidence interval:
lower upper
-1.5955866 -0.5459824
163 (Equation E9.4) This equation oversimplifies things a little too much. A more general form would be the following:
\(CI = [estimate-CV*SE, estimate+CV*SE]\)
  • \(estimate\) is some sample estimate—which could be a mean (\(\bar{x}\)), a regression coefficient (\(b\)), or some other estimate of a population parameter.
    • Equation E9.4 uses the sample mean (\(\bar{x}\)) here, but you can use CIs to estimate population parameters other than the population mean.
  • \(CV\) is the critical value: the value of the sampling distribution corresponding to the confidence level you specify.
    • Equation E9.4 uses 1.96 here, but this is the critical value specifically for a normally-distributed sampling distribution and a 95% confidence level (because 95% of z-scores fall within 1.96 SDs of the mean)
  • \(SE\) is the standard error, or \(\frac{sd}{\sqrt{n}}\) (\(sd\) being the sample standard deviation)
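The general form above can be sketched in a few lines of R. The data vector x is hypothetical, and the choice between a normal and a t critical value is shown side by side as an illustration, not as the book's own procedure.

```r
# Hypothetical sample (values invented for illustration)
x <- c(4.1, 5.3, 6.0, 4.8, 5.5, 5.9, 4.4, 5.1)

n   <- length(x)
est <- mean(x)          # the sample estimate (here: the mean)
se  <- sd(x) / sqrt(n)  # standard error of the mean

# Critical values for a 95% confidence level
cv_z <- qnorm(0.975)          # normal sampling distribution (about 1.96)
cv_t <- qt(0.975, df = n - 1) # t-distribution with n - 1 degrees of freedom

# CI = [estimate - CV * SE, estimate + CV * SE]
ci_z <- c(est - cv_z * se, est + cv_z * se)
ci_t <- c(est - cv_t * se, est + cv_t * se)

ci_z
ci_t
```

With a sample this small, the t-based interval is noticeably wider than the normal-based one; it should match what t.test(x)$conf.int reports.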
167 For t = 1.5, the p-value is p = 0.14 This is true with a sample size of 100. The t-distribution changes shape with different sample sizes (technically, with different “degrees of freedom”, which are closely related to sample size). Smaller samples mean the t-distribution has ‘heavier tails’, which translates to greater p-values for smaller sample sizes (holding t constant). As sample sizes approach infinity, the t-distribution approaches the shape of the normal distribution.
168 The critical value turns out to be t = 1.98 in this case. Again, “this case” refers to a sample size of 100 (see above comment). For sample sizes of 10, 30, 100, 300, and 1000, the critical values for an \(\alpha\) level of 0.05 are roughly 2.26, 2.05, 1.98, 1.97, and 1.96 (respectively). Note that the corresponding critical value for a normal distribution is 1.96, underscoring the point that larger sample sizes make the t-distribution more like the normal distribution.
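The numbers in the two notes above can be reproduced with R's built-in t-distribution functions. This sketch assumes a two-tailed test with n − 1 degrees of freedom, which is an assumption on my part about the setup the book has in mind.

```r
# Two-tailed p-value for t = 1.5 with a sample size of 100 (df = 99)
p_val <- 2 * pt(-1.5, df = 99)
round(p_val, 2)

# Two-tailed critical values at alpha = 0.05 for several sample sizes
n <- c(10, 30, 100, 300, 1000)
crit <- qt(0.975, df = n - 1)
round(crit, 2)
# 2.26 2.05 1.98 1.97 1.96

# The normal-distribution counterpart
round(qnorm(0.975), 2)
# 1.96
```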

Chapter 10: Inferential statistics 2

No errata found.

Chapter 11: Inferential statistics 3

Page Text Comment
180 (Footnote 1)
When p-values focus on the SER column are very small numbers
Typo. Should be:
When p-values are very small numbers
182 For the SER predictor, this interval is [0.53−1.96*0.04, 0.53+1.96*0.04], which yields the 95% confidence interval [45, 61] (with a little rounding). The latter confidence interval should be [0.45, 0.61]
191 CI_tib <- mutate(sense_preds,
LB = fits – 1.96 * SEs, # lower bound
UB = fits + 1.96 * SEs)
sense_preds should be CI_tib.
The ‘minus sign’ is actually an en dash (–), which R can’t parse; retype it as a plain hyphen-minus (-) if you copy the code.
193 These mappings assign the y-minimum of the error bar to the lower-bound column from the sense_preds tibble (LB), and the y-maximum of the error bar to the upper bound (UB) LB and UB should be lwr and upr, respectively
194 levels >= sense_order >= should be =
196 (Code output at the top of the page) The Log10Freq column shouldn’t be duplicated
196 geom_ribbon(aes(ymin = LB, ymax = UB), LB and UB should be lwr and upr, respectively

Chapter 12: Generalized linear models 1

Page Text Comment
202 Even if you supply very extreme values such as 10,000 or –10,000 to plogis(), it will always return a number between 0 and 1. Because of floating-point precision, plogis() doesn’t always return a number strictly between 0 and 1 in the version of R we’re using (4.0.3). plogis(17) prints as 1 but is still slightly below it, so 1 - plogis(17) is nonzero; plogis(37), however, is exactly 1, so 1 - plogis(37) returns 0, and qlogis(1 - plogis(37)) returns -Inf.
203 The whole point of talking about log odds is that this puts probabilities onto a continuous scale. This is confusing because probabilities are also continuous. Better stated: Log-odds put probabilities onto an unbounded scale, which is more amenable to regression.
203 The logistic function is the inverse of the log odd function. FYI, in R, the logistic function is plogis(); its inverse, the logit or log-odds function, is qlogis(). So for example, plogis(0) is 0.5, and qlogis(0.5) is 0.
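A short sketch of the plogis()/qlogis() relationship, including the floating-point edge cases mentioned for page 202. The probability values in p are arbitrary examples, and the edge-case results assume standard double-precision arithmetic.

```r
# The logistic function and its inverse (the logit / log-odds function)
plogis(0)    # 0.5
qlogis(0.5)  # 0

# Round trip: probabilities -> log odds -> probabilities
p <- c(0.1, 0.5, 0.9)
log_odds <- qlogis(p)
all.equal(plogis(log_odds), p)  # TRUE

# Log odds are unbounded, but plogis() saturates for extreme inputs
plogis(17) < 1           # TRUE: prints as 1, but is slightly below it
plogis(37) == 1          # TRUE: indistinguishable from 1 in double precision
qlogis(1 - plogis(37))   # -Inf
```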
209 levels(dative$RealizationOfRecipient) levels(df$myCol) returns NULL rather than the category labels if myCol is a character vector, which you can resolve by first converting to a factor: levels(factor(df$myCol)). There’s no such problem with the dative dataframe because it’s stored in the languageR package with factor columns.
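A small check of how levels() behaves on character versus factor vectors. The vector below is a hypothetical stand-in (it does not load the dative data), so treat the labels as placeholders.

```r
# Hypothetical character vector standing in for a text column
realization <- c("NP", "PP", "NP", "NP")

# A character vector has no levels attribute
levels(realization)  # NULL

# Converting to a factor first gives the category labels
levels(factor(realization))  # "NP" "PP"

# unique() is an alternative that works on character vectors directly
sort(unique(realization))
```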
214 (Footnote 4) This footnote alludes to the distinction between different levels of measurement, in particular the difference between ordinal scales and interval/ratio scales.