This document contains errata (and non-error important notes) for the textbook Statistics for Linguists: An Introduction Using R (Bodo Winter, 2019, Routledge). Please feel free to suggest other errata by creating a GitHub issue.

Preface

Page	Text	Comment
xv	The following R packages need to be installed to be able to execute all code in all chapters	Some packages may be unnecessary depending on what you plan to do: `swirl` is only used for a single exercise in Chapter 1; `pscl` is only used in Chapter 13; `lme4` isn’t used until Chapter 14; `afex` and `MuMIn` aren’t used until Chapter 15; and `brms` isn’t actually used for code, just referenced in Appendix B.

Chapter 1: Introduction to R

Page	Text	Comment
14	(Code output for `mydf`)	The alignment of this output apparently got messed up in the book publication process, but your output should be nicely lined up between the column headings (like `participant`) and the values (like `louis`)
15	Notice one curiosity: the `participant` column is indicated to be a factor vector, even though you only supplied a character vector! The `data.frame()` function secretly converted your character vector into factor vector	As of R version 4.0, functions that create dataframes (e.g., `data.frame()`, `read.csv()`) default to leaving character vectors as-is, rather than converting them to factor vectors. When you run `str(mydf)`, the second line will instead read `$ participant: chr "louis" "paula" "vincenzo"`. This change in R’s default behavior also affects other outputs in this chapter.
16	`mydf[mydf$participant == 'vincenzo',] $score`	Extra space before `$`. This doesn’t actually affect R’s output (try it yourself both ways!), but typically we write `$` without a leading space.
17	`nettle <- read.csv('nettle_1999_climate.csv')`	As Winter mentions, this code only works if your working directory is the same as wherever you’ve downloaded `nettle_1999_climate.csv`. This is not always a trivial or easy thing for students to navigate! Instead, I typically direct students to load datasets directly from the OSF repository, https://osf.io/34mq9/. In the files tab on that site (or on https://osf.io/34mq9/files/osfstorage), click materials > data, then right-click the dataset you want and copy the URL; then you can run `read.csv()` (or `read_csv()` in Chapter 2) with the URL plus `/download`, with quotation marks around the full URL. For example, to load the Nettle (1999) dataset, you can run `nettle <- read.csv('https://osf.io/ptq7u/download')`. Of course, this only works if you’re connected to the internet!
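
The factor-versus-character change noted for page 15 can be checked directly. The following is a minimal sketch: the `score` values are made up for illustration and are not taken from the book, and `stringsAsFactors` is set explicitly so that the result doesn't depend on your R version.

```r
# Illustrative data: only the participant names follow the note above;
# the scores are placeholder values.
participant <- c('louis', 'paula', 'vincenzo')
score <- c(67, 85, 32)

# Default from R 4.0 onwards: character columns stay character
mydf_chr <- data.frame(participant, score, stringsAsFactors = FALSE)
str(mydf_chr)

# Default before R 4.0 (what the book describes): conversion to factor
mydf_fct <- data.frame(participant, score, stringsAsFactors = TRUE)
str(mydf_fct)

# The subsetting from page 16 works either way
mydf_chr[mydf_chr$participant == 'vincenzo', ]$score
```

If you want to reproduce the book's output exactly in a recent version of R, passing `stringsAsFactors = TRUE` to `data.frame()` or `read.csv()` is one way to do it.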

Chapter 2: The Tidyverse and Reproducible R Workflows

Page	Text	Comment
28	(Code output for `nettle`, showing `<fct>` under `Country`)	For users of R version 4.0 or later, this will show as `<chr>`, short for character vector. This is because R now defaults to leaving character vectors as-is, rather than converting to factor vectors.
29	But wait, didn’t I just tell you that tibbles default to character vectors? Why is the Country column coded as a factor? The culprit here is the base R function `read.csv()`, which automatically interprets any text column as factor. So, before the data frame was converted into a tibble, the character-to-factor conversion has already happened.	Again, this discussion is moot, since now both `read.csv()` and `read_csv()` interpret text columns as characters, not factors.
29	Output of `nettle <- read_csv('nettle_1999_climate.csv')`	The output looks different if you’re using the most recent version of `readr`. The only substantive difference is that `readr` now parses `Langs` as a double, not an integer.
38–39	`geom_histogram(fill = 'peachpuff3')`	The book is in black and white, so the color doesn’t show up in the book’s version. It should if you run the code yourself.
44–45	In addition, there are code chunks, which always begin with three `'''` (backward ticks, or the grave accent symbol). … `'''{r}` `# R code goes in here` `'''`	The wrong character apparently got substituted in the publishing process. The correct symbol is the backtick (`` ` ``), which is on the key to the left of `1`; a chunk opens and closes with three of them. R Markdown won’t know what to do with `'''`.

Chapter 3: Descriptive statistics

Page	Text	Comment
53	The corresponding histogram is shown in Figure 3.1a (for an explanation of histograms, see Chapter 1.12).	Figure 3.1a is a barplot, not a histogram.
56	(Footnote 2)	This book uses N and n interchangeably (including in this footnote). In other texts, N refers to the size of a population and n to the size of a sample of that population.
64	`war <- read_csv(' warriner_2013_emotional_valence.csv')`	There is an extra space after the first quotation mark. This space needs to be removed or else the code will yield an error.
65	(Code output at the bottom of the page)	Content warning: This code output contains the most negative words in the dataframe, and there are some potentially triggering words in here.

Chapter 4: Introduction to the linear model

Page	Text	Comment
74	E4.7: \(y = b_{0} + b_{1} * x + e\)	The error term is sometimes written as epsilon (\(\epsilon\)). When it is written as \(e\), as here, it shouldn’t be confused with the base of the natural logarithm (also written \(e\)), which is discussed in section 5.4.
75	Figure 4.6 shows the SSE as a function of different slope values.	Should be Figure 4.5b, not Figure 4.6
77	Conversely, 32% of the variation in response durations is due to chance	Should be 28% (100% − 72%), not 32%.
78	`plot(x., y, pch = 19)`	Extra dot. Should be `plot(x, y, pch = 19)`
81–2	`# A tibble: 50 x 2` (…) `# ... with 40 more rows`	The preceding code will yield a tibble with 61 rows, not 50. Since the first 10 rows are printed, your output should be `# A tibble: 61 x 2` (…) `# ... with 51 more rows`
83	(Code output starting with `r.squared`)	The alignment of this output apparently got messed up in the book publication process, but your output should be nicely lined up between the labels (like `r.squared`) and the quantities (like `0.9283634`)
83	The resultant plot will look similar to Figure 4.9	There is no Figure 4.9! Your plot will look similar to the following (but not identical, because we didn’t set a random seed): [Figure: example plot, not reproduced here]
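
Two of the notes above are easy to verify in base R. The sketch below assumes a grid of x-values running from -3 to 3 in steps of 0.1 (an assumption about the book's code, chosen because it yields 61 values). The seed of 42 and the intercept, slope, and noise values are arbitrary choices, used only to show how `set.seed()` makes a simulated plot reproducible.

```r
# Assumed grid: from -3 to 3 in steps of 0.1 gives 61 values, not 50
xvals <- seq(from = -3, to = 3, by = 0.1)
length(xvals)

# Arbitrary seed: any fixed number makes the random draws repeatable
set.seed(42)
x <- rnorm(50)
y <- 10 + 3 * x + rnorm(50)  # hypothetical intercept, slope, and noise

# Re-running with the same seed reproduces exactly the same x-values
set.seed(42)
x_again <- rnorm(50)
identical(x, x_again)

# Scatterplot with the fitted regression line
xmdl <- lm(y ~ x)
plot(x, y, pch = 19)
abline(xmdl)
```

Without the `set.seed()` call, every run draws new random numbers, which is why your plot and model estimates won't match anyone else's exactly.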

Chapter 5: Correlation, linear, and nonlinear transformations

Page	Text	Comment
89–90	(All of section 5.3)	The biggest question I usually get about this chapter is “wait, what’s the connection between correlation and transformations?” And honestly…I don’t think it makes sense to smush these two concepts into a single chapter, because they don’t really make sense together (beyond the fact that they’re both statistical concepts). So if it’s less confusing to you, treat section 5.3 as a kind of mini-separate-chapter within the larger chapter.
92	The log₁₀ of 1000 is 3, which is a difference of 998 [between the logarithm and the raw number].	Should be 997, not 998.
97	(Code output for `tidy(ELP_mdl)`)	Your output will probably have a lot fewer decimal places.
100	First, compare `xmdl` to `xmdl_c`. There is no change in the slope, but the intercept is different in the centered model. In both models, the intercept is the prediction for \(x\) = 0, but \(x\) = 0 corresponds to the average frequency in the centered model. Second, compare `xmdl_c` and `xmdl_z`. The intercepts are the same because, for both models, the predictor has been centered. However, the slope has changed because a change in one unit is now a change in 1 standard deviation.	All the instances of `xmdl` should be `ELP_mdl` (including those with the `_c` and `_z` suffixes)
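
The page 100 comparison can be reproduced on simulated data. This is a sketch, not the book's analysis: the data, the seed, and the object names below (`freq`, `rt`, `mdl`, `mdl_c`, `mdl_z`) are hypothetical stand-ins for the ELP data and models.

```r
set.seed(1)  # arbitrary seed, for reproducibility only
freq <- rnorm(100, mean = 3, sd = 1)         # stand-in for log frequency
rt <- 900 - 70 * freq + rnorm(100, sd = 50)  # stand-in for response times

freq_c <- freq - mean(freq)  # centered predictor
freq_z <- freq_c / sd(freq)  # standardized (z-scored) predictor

mdl <- lm(rt ~ freq)
mdl_c <- lm(rt ~ freq_c)
mdl_z <- lm(rt ~ freq_z)

# Centering changes the intercept but not the slope
coef(mdl)
coef(mdl_c)

# Standardizing keeps the centered intercept; the slope is now per 1 SD
coef(mdl_z)

# The log10 arithmetic from page 92
1000 - log10(1000)
```

The standardized slope equals the centered slope multiplied by `sd(freq)`, which is just another way of saying that one unit now means one standard deviation.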

Chapter 6: Multiple regression

Page	Text	Comment
103	E6.1: \(y = b_{0} + b_{1} * x + b_{2} * x + e\)	This equation suggests that \(b_{1}\) and \(b_{2}\) are multiplied by the same predictor \(x\). In the case of multiple regression, each predictor has its own coefficient (as in E6.3, bottom of page), so a more accurate form would be \(y = b_{0} + b_{1} * x_{1} + b_{2} * x_{2} + e\)
104	In this model, 900ms is the prediction for a word with 0 log frequency and 0 word length.	The prediction should be 750ms, not 900ms. (900ms was the prediction for the model excluding the word length coefficient.)
106	(Code output for `tidy(icon_mdl)`)	Your output will probably have a lot fewer decimal places.
106–7	Footnote 2 discussion of rounding	Two points of clarification about rounding: (1) Rounding should only be done at the end of an analysis; any earlier, and you’re losing precision in your calculations, running the risk of a rounding error compounding over the course of many calculations. (2) Rounding should only be done when you want to display numbers as numbers, not when feeding numbers into a plot.
107	(Footnote 3) A log frequency of 0 corresponds to a raw word frequency of 1, since 100=1	Missing superscript. Should be: A log frequency of 0 corresponds to a raw word frequency of 1, since 10⁰=1
111	`for (i in 1:9) plot(rnorm(50), rnorm(50))`	If your Plots pane in RStudio is too small, running this code will yield `"Error in plot.new() : figure margins too large"`. If so, just make the Plots pane wider/taller.
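
The first rounding point can be illustrated with a toy example. The numbers are invented for illustration; the point is only that rounding before a calculation can give a different answer than rounding after it.

```r
# 100 hypothetical small values
x <- rep(0.004, 100)

# Rounding each value first throws the information away
sum(round(x, 2))

# Rounding once, at the end, keeps it
round(sum(x), 2)

# Footnote 3's corrected arithmetic:
# a log10 frequency of 0 corresponds to a raw frequency of 1
10^0
log10(1)
```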

Chapter 7: Categorical predictors

Page	Text	Comment
120	(Code output at the bottom of the page)	In more recent `dplyr` versions, we also get a warning message: `summarise()` ungrouping output (override with `.groups` argument)
127	(Equation E7.3)	taste valence should be smell valence, and smell valence should be taste valence. That is, smell valence, as the first ordered factor level, should correspond to +1 in the model coefficients, and taste valence should correspond to -1.
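
The sign assignment described for E7.3 follows from how R builds sum-coded contrasts. A minimal sketch, assuming a two-level factor whose levels are ordered `Smell`, `Taste` (the vector below is a made-up example, not the book's data):

```r
modality <- factor(c('Smell', 'Taste', 'Smell', 'Taste'))
levels(modality)  # alphabetical order by default

# Switch from the default treatment coding to sum coding
contrasts(modality) <- contr.sum(2)

# The first level is coded +1, the last level -1
contrasts(modality)
```

Reordering the factor levels (for example with `factor(x, levels = c('Taste', 'Smell'))`) would flip which category gets +1.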

Chapter 8: Interactions and nonlinear effects

Page	Text	Comment
140	`# Groups: Phon, Sem [4]`	Your output should omit this line
142	However, these are not the average row-wise or column-wise differences, as one also has to include the interaction term (highlighted in bold [in Table 8.1]) in calculating these averages.	The interaction term isn’t bolded in Table 8.1. This is the interaction term that’s being referred to: 78.4 + 8.0 + (−7.8)+(−4.6)

Chapter 9: Inferential statistics 1

Page	Text	Comment
161	(Figure 9.2)	According to the equation for Cohen’s d (E9.1), the d values in this figure should be 1, 2, and 6 (there’s probably some rounding in the M values that we’re not shown)
162	`Cohen's d` `d estimate: 1.037202 (large)` `95 percent confidence interval:` `inf sup` `0.5142663 1.5601377`	Using the latest version of the `effsize` package (0.8.1), I get the following instead: `Cohen's d` `d estimate: -1.070784 (large)` `95 percent confidence interval:` `lower upper` `-1.5955866 -0.5459824`
163	(Equation E9.4)	This equation oversimplifies things a little too much. A more general form would be the following: \(CI = [estimate - CV * SE, estimate + CV * SE]\). Here, \(estimate\) is some sample estimate, which could be a mean (\(\bar{x}\)), a regression coefficient (\(b\)), or some other estimate of a population parameter; Equation E9.4 uses the sample mean (\(\bar{x}\)), but you can use CIs to estimate population parameters other than the population mean. \(CV\) is the critical value: the value of the sampling distribution corresponding to the confidence level you specify; Equation E9.4 uses 1.96, but this is the critical value specifically for a normally-distributed sampling distribution and a 95% confidence level (because 95% of z-scores fall within 1.96 SDs of the mean). \(SE\) is the standard error, or \(\frac{sd}{\sqrt{n}}\) (\(sd\) being the sample standard deviation).
167	For t = 1.5, the p-value is p = 0.14	This is true with a sample size of 100. The t-distribution changes shape with different sample sizes (technically, with different “degrees of freedom”, which are closely related to sample size). Smaller samples mean the t-distribution has ‘heavier tails’, which translates to greater p-values for smaller sample sizes (holding t constant). As sample sizes approach infinity, the t-distribution approaches the shape of the normal distribution.
168	The critical value turns out to be t = 1.98 in this case.	Again, “this case” refers to a sample size of 100 (see above comment). For sample sizes of 10, 30, 100, 300, and 1000, the critical values for an \(\alpha\) level of 0.05 are roughly 2.26, 2.05, 1.98, 1.97, and 1.96 (respectively). Note that the corresponding \(\alpha\) critical value for a normal distribution is 1.96, underscoring the point that larger sample sizes make the t-distribution more like the normal distribution.
169	`x <- rep(c('A', 'B'), eac= n)`	`eac=` should be `each=`
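
The critical values quoted for page 168 and the general confidence interval formula can be reproduced with `qt()` and `qnorm()`. The sketch below assumes a one-sample setting in which the degrees of freedom are \(n - 1\); the sample `x` is simulated with an arbitrary seed and is purely illustrative.

```r
# Two-tailed critical values for alpha = 0.05, assuming df = n - 1
n <- c(10, 30, 100, 300, 1000)
round(qt(0.975, df = n - 1), 2)

# The normal-distribution critical value that the t-values approach
round(qnorm(0.975), 2)

# Two-tailed p-value for t = 1.5 at n = 100 (same df assumption)
2 * pt(-1.5, df = 99)

# General CI: estimate +/- CV * SE, here for a sample mean
set.seed(123)                        # arbitrary seed
x <- rnorm(100, mean = 50, sd = 10)  # hypothetical sample
estimate <- mean(x)
SE <- sd(x) / sqrt(length(x))
CV <- qt(0.975, df = length(x) - 1)
c(lower = estimate - CV * SE, upper = estimate + CV * SE)
```

The interval computed by hand this way should match what `t.test(x)$conf.int` reports, since that function uses the same t-based critical value.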

Chapter 10: Inferential statistics 2

No errata found.

Chapter 11: Inferential statistics 3

Page	Text	Comment
180	(Footnote 1) When p-values focus on the SER column are very small numbers	Typo. Should be: When p-values are very small numbers
182	For the SER predictor, this interval is [0.53 − 1.96 * 0.04, 0.53 + 1.96 * 0.04], which yields the 95% confidence interval [45, 61] (with a little rounding).	The latter confidence interval should be [0.45, 0.61]
191	`CI_tib <- mutate(sense_preds,` `LB = fits – 1.96 * SEs, # lower bound` `UB = fits + 1.96 * SEs)`	`sense_preds` should be `CI_tib`. The ‘minus sign’ is actually an en dash (–), which R cannot parse; replace it with a plain hyphen-minus (`-`).
193	These mappings assign the y-minimum of the error bar to the lower-bound column from the sense_preds tibble (`LB`), and the y-maximum of the error bar to the upper bound (`UB`)	`LB` and `UB` should be `lwr` and `upr`, respectively
194	`levels >= sense_order`	`>=` should be `=`
196	(Code output at the top of the page)	The `Log10Freq` column shouldn’t be duplicated
196	`geom_ribbon(aes(ymin = LB, ymax = UB),`	`LB` and `UB` should be `lwr` and `upr`, respectively
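
The corrected interval for the SER predictor follows directly from the numbers quoted on page 182 (estimate 0.53, standard error 0.04, critical value 1.96). A quick check, with variable names chosen here for illustration:

```r
estimate <- 0.53  # SER coefficient, as quoted in the text
SE <- 0.04        # its standard error, as quoted in the text

# Use a plain hyphen-minus for subtraction; an en dash will not parse
LB <- estimate - 1.96 * SE  # lower bound
UB <- estimate + 1.96 * SE  # upper bound
round(c(LB, UB), 2)
```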

Chapter 12: Generalized linear models 1

Page	Text	Comment
202	Even if you supply very extreme values such as 10,000 or –10,000 to `plogis()`, it will always return a number between 0 and 1.	Because of floating-point precision, `plogis()` doesn’t always return a number strictly between 0 and 1 in the version of R we’re using (4.0.3). `plogis(17)` prints as `1` even though `1 - plogis(17)` is still nonzero, while `1 - plogis(37)` returns `0`, and `qlogis(1 - plogis(37))` returns `-Inf`.
203	The whole point of talking about log odds is that this puts probabilities onto a continuous scale.	This is confusing because probabilities are also continuous. Better stated: Log-odds put probabilities onto an unbounded scale, which is more amenable to regression.
203	The logistic function is the inverse of the log odd function.	FYI, in R, the logistic function is `plogis()`; its inverse, the logit or log-odds function, is `qlogis()`. So for example, `plogis(0)` is 0.5, and `qlogis(0.5)` is 0.
209	`levels(dative$RealizationOfRecipient)`	`levels(df$myCol)` will return `NULL` instead of the category labels if `myCol` is a character vector, which you can resolve by first converting to a factor: `levels(factor(df$myCol))`. There’s no problem with the `dative` dataframe because it’s stored in the `languageR` package with factor columns.
214	(Footnote 4)	This footnote alludes to the distinction between different levels of measurement, in particular the difference between ordinal scales and interval/ratio scales.
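
The notes on `plogis()`, `qlogis()`, and `levels()` can all be checked at the console. The character vector below is a made-up example, not a column from `dative`.

```r
# The logistic function and its inverse, the logit
plogis(0)    # probability corresponding to log odds of 0
qlogis(0.5)  # log odds corresponding to a probability of 0.5

# Floating-point limits: large inputs eventually become exactly 1
1 - plogis(17)          # tiny but nonzero
1 - plogis(37)          # exactly 0
qlogis(1 - plogis(37))  # -Inf

# levels() on a character vector vs. on a factor
my_col <- c('NP', 'PP', 'NP')  # hypothetical values
levels(my_col)
levels(factor(my_col))
```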

Chapter 15: Mixed models 2

Page	Text	Comment
259	If this were true, this would mean that higher intercepts always (deterministically) go together with higher frequency slopes (the correlation is indicated to be negative).	Negative should be positive.