This document contains errata (and non-error important notes) for the textbook Statistics for Linguists: An Introduction Using R (Bodo Winter, 2019, Routledge). Please feel free to suggest other errata by creating a GitHub issue.
Page | Text | Comment |
---|---|---|
xv | The following R packages need to be installed to be able to execute all code in all chapters | Some packages may be unnecessary depending on what you plan to do:
|
Page | Text | Comment |
---|---|---|
14 | (Code output for mydf ) |
The alignment of this output apparently got messed up in the book
publication process, but your output should be nicely lined up between
the column headings (like participant ) and the values (like
louis ) |
15 | Notice one curiosity: the participant column is
indicated to be a factor vector, even though you only supplied a
character vector! The data.frame() function secretly
converted your character vector into factor vector |
As of R version 4.0, functions that create dataframes (e.g.,
data.frame(), read.csv()) default to leaving
character vectors as-is, rather than converting to factor vectors. When
you run str(mydf), the second line will instead read
`$ participant: chr "louis" "paula" "vincenzo"`. This change in R’s default behavior also affects other outputs in this chapter. |
16 | mydf[mydf$participant == 'vincenzo',] $score |
Extra space before $ . This doesn’t actually affect R’s
output (try it yourself both ways!), but typically we write
$ without a leading space. |
17 | nettle <- read.csv('nettle_1999_climate.csv') |
As Winter mentions, this code only works if your working directory
is the same as wherever you’ve downloaded
nettle_1999_climate.csv . This is not always a trivial or
easy thing for students to navigate! Instead, I typically direct
students to load datasets directly from the OSF repository, https://osf.io/34mq9/. In
the files tab on that site (or on https://osf.io/34mq9/files/osfstorage), click
materials > data, then right-click the dataset you
want and copy the URL; then you can run read.csv() (or
read_csv() in Chapter 2) with the URL plus
/download/ , with quotation marks around the full URL. For
example, to load the Nettle (1999) dataset, you can run
nettle <- read.csv('https://osf.io/ptq7u/download') . Of
course, this only works if you’re connected to the internet! |
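
The notes for pages 15 and 16 can be checked with a minimal base R sketch. The score values are placeholders chosen for illustration, not necessarily the ones in the book, and `mydf_fct` is a name introduced here. The `stringsAsFactors = TRUE` argument requests the older character-to-factor conversion explicitly.

```r
# Placeholder data: the scores are illustrative assumptions.
mydf <- data.frame(
  participant = c("louis", "paula", "vincenzo"),
  score = c(67, 85, 32)
)

# In R >= 4.0 this reports "character"; older versions report "factor".
class(mydf$participant)

# Asking for the pre-4.0 behavior explicitly.
mydf_fct <- data.frame(
  participant = c("louis", "paula", "vincenzo"),
  score = c(67, 85, 32),
  stringsAsFactors = TRUE
)
class(mydf_fct$participant)

# A space before $ does not change the result.
with_space <- mydf[mydf$participant == "vincenzo", ] $score
no_space <- mydf[mydf$participant == "vincenzo", ]$score
identical(with_space, no_space)
```
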
Page | Text | Comment |
---|---|---|
28 | (Code output for nettle , showing
<fct> under Country ) |
For users of R version 4.0 or later, this will show as
<chr>, short for character vector. This
is because R now defaults to leaving character vectors as-is, rather
than converting to factor vectors. |
29 | But wait, didn’t I just tell you that tibbles default to character
vectors? Why is the Country column coded as a factor? The culprit here
is the base R function read.csv() , which automatically
interprets any text column as factor. So, before the data frame was
converted into a tibble, the character-to-factor conversion has already
happened. |
Again, this discussion is moot, since now both
read.csv() and read_csv() interpret text
columns as characters, not factors. |
29 | Output of
nettle <- read_csv('nettle_1999_climate.csv') |
The output looks different if you’re using the most recent version
of readr . The only substantive difference is that
readr now parses Langs as a double, not an
integer. |
38–39 | geom_histogram(fill = 'peachpuff3') |
The book is in black and white, so the color doesn’t show up in the book’s version. It should if you run the code yourself. |
44–45 | In addition, there are code chunks, which always begin with three
''' (backward ticks, or the grave accent
symbol).… '''{r} # R code goes in here ''' |
The wrong character apparently got substituted in the publishing
process. The correct delimiter is three backticks (```); the backtick is
on the key to the left of 1. R Markdown won’t know what to do with
''' |
Page | Text | Comment |
---|---|---|
53 | The corresponding histogram is shown in Figure 3.1a (for an explanation of histograms, see Chapter 1.12). | Figure 3.1a is a barplot, not a histogram. |
56 | (Footnote 2) | This book uses N and n interchangeably (including in this footnote). In other texts, N refers to the size of a population and n to the size of a sample of that population. |
64 | war <- read_csv(' warriner_2013_emotional_valence.csv') |
There is an extra space after the first quotation mark. This space needs to be removed or else the code will yield an error. |
65 | (Code output at the bottom of the page) | Content warning: This code output contains the most negative words in the dataframe, and there are some potentially triggering words in here. |
Page | Text | Comment |
---|---|---|
74 | E4.7: \(y = b_{0} + b_{1} * x + e\) | The error term is sometimes notated as epsilon (\(\epsilon\)), but notated as \(e\), it shouldn’t be confused with the natural logarithm base (also notated \(e\)), which is discussed in section 5.4. |
75 | Figure 4.6 shows the SSE as a function of different slope values. | Should be Figure 4.5b, not Figure 4.6 |
77 | Conversely, 32% of the variation in response durations is due to chance | Should be 28% (100% − 72%), not 32%. |
78 | plot(x., y, pch = 19) |
Extra dot. Should be plot(x, y, pch = 19) |
81–2 | # A tibble: 50 x 2 (…) # ... with 40 more rows |
The preceding code will yield a tibble with 61 rows, not 50. So your
output should
be # A tibble: 61 x 2 (…) # ... with 51 more rows |
83 | (Code output starting with r.squared ) |
The alignment of this output apparently got messed up in the book
publication process, but your output should be nicely lined up between
the labels (like r.squared ) and the quantities (like
0.9283634 ) |
83 | The resultant plot will look similar to Figure 4.9 | There is no Figure 4.9! Your plot will look similar to the following (but not identical, because we didn’t set a random seed): |
Page | Text | Comment |
---|---|---|
89–90 | (All of section 5.3) | The biggest question I usually get about this chapter is “wait, what’s the connection between correlation and transformations?” And honestly…I don’t think it makes sense to smush these two concepts into a single chapter, because they don’t really make sense together (beyond the fact that they’re both statistical concepts). So if it’s less confusing to you, treat section 5.3 as a kind of mini-separate-chapter within the larger chapter. |
92 | The log10 of 1000 is 3, which is a difference of 998 [between the logarithm and the raw number]. | Should be 997, not 998. |
97 | (Code output for tidy(ELP_mdl) ) |
Your output will probably have a lot fewer decimal places. |
100 | First, compare xmdl to xmdl_c . There is no
change in the slope, but the intercept is different in the centered
model. In both models, the intercept is the prediction for \(x\) = 0, but \(x\) = 0 corresponds to the average
frequency in the centered model. Second, compare xmdl_c and
xmdl_z . The intercepts are the same because, for both
models, the predictor has been centered. However, the slope has changed
because a change in one unit is now a change in 1 standard
deviation. |
All the instances of xmdl should be
ELP_mdl (including with the _c suffix) |
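
The page 100 passage about centering and standardizing can be illustrated with a small sketch. It uses simulated data rather than the ELP dataset, and all object names (`x`, `y`, `mdl`, `mdl_c`, `mdl_z`) are made up for this example.

```r
set.seed(42)

# Simulated predictor and response; the numbers are arbitrary.
x <- rnorm(100, mean = 3, sd = 1.5)
y <- 900 - 70 * x + rnorm(100, sd = 20)

x_c <- x - mean(x)   # centered predictor
x_z <- x_c / sd(x)   # centered and standardized (z-scored) predictor

mdl   <- lm(y ~ x)
mdl_c <- lm(y ~ x_c)
mdl_z <- lm(y ~ x_z)

# Centering changes the intercept but not the slope.
coef(mdl)
coef(mdl_c)

# Centered and z-scored models share the same intercept (the mean of y);
# the z-scored slope is the raw slope multiplied by sd(x).
coef(mdl_z)
```
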
Page | Text | Comment |
---|---|---|
103 | E6.1: \(y = b_{0} + b_{1} * x + b_{2} * x + e\) | This equation suggests that \(b_{1}\) and \(b_{2}\) are multiplied by the same predictor \(x\). In the case of multiple regression, each predictor has its own coefficient (as in E6.3, bottom of page), so a more accurate form would be \(y = b_{0} + b_{1} * x_{1} + b_{2} * x_{2} + e\) |
104 | In this model, 900ms is the prediction for a word with 0 log frequency and 0 word length. | The prediction should be 750ms, not 900ms. (900ms was the prediction for the model excluding the word length coefficient.) |
106 | (Code output for tidy(icon_mdl) ) |
Your output will probably have a lot fewer decimal places. |
106–7 | Footnote 2 discussion of rounding | Two points of clarification about rounding: (1) Rounding should only be done at the end of an analysis; any earlier, and you’re losing precision in your calculations, running the risk of a rounding error compounding over the course of many calculations. (2) Rounding should only be done when you want to display numbers as numbers, not when feeding numbers into a plot. |
107 | (Footnote 3) A log frequency of 0 corresponds to a raw word frequency of 1, since 100=1 | Missing superscript. Should be A log frequency of 0 corresponds to a raw word frequency of 1, since \(10^{0} = 1\) |
111 | for (i in 1:9) plot(rnorm(50), rnorm(50)) |
If your Plots pane in RStudio is too small, running this code will
yield "Error in plot.new() : figure margins too large". If
so, just make the Plots pane wider/taller. |
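
Point (1) of the rounding note for pages 106–7 can be seen in a toy example. The values are invented for illustration; they show how rounding before a calculation can shift the final result.

```r
# Three made-up measurements.
x <- c(0.14, 0.14, 0.14)

# Rounding each value first, then summing.
rounded_early <- sum(round(x, 1))

# Summing first, then rounding once at the end.
rounded_late <- round(sum(x), 1)

rounded_early
rounded_late
```

Here the early-rounded total comes out at about 0.3, while rounding only at the end gives 0.4.
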
Page | Text | Comment |
---|---|---|
120 | (Code output at the bottom of the page) | In more recent dplyr versions, we also get a warning
message:
`summarise()` ungrouping output (override with `.groups` argument) |
127 | (Equation E7.3) | taste valence should be smell valence, and smell valence should be taste valence. That is, smell valence, as the first ordered factor level, should correspond to +1 in the model coefficients, and taste valence should correspond to -1. |
Page | Text | Comment |
---|---|---|
140 | # Groups: Phon, Sem [4] |
Your output should omit this line |
142 | However, these are not the average row-wise or column-wise differences, as one also has to include the interaction term (highlighted in bold [in Table 8.1]) in calculating these averages. | The interaction term isn’t bolded in Table 8.1. This is the
interaction term that’s being referred to: 78.4 + 8.0 + (−7.8)+(−4.6) |
Page | Text | Comment |
---|---|---|
161 | (Figure 9.2) | According to the equation for Cohen’s d (E9.1), the d values in this figure should be 1, 2, and 6 (there’s probably some rounding in the M values that we’re not shown) |
162 | Cohen's d d estimate: 1.037202 (large) 95 percent confidence interval: inf sup 0.5142663 1.5601377 |
Using the latest version of the effsize package
(0.8.1), I get the following instead: Cohen's d d estimate: -1.070784 (large) 95 percent confidence interval: lower upper -1.5955866 -0.5459824 |
163 | (Equation E9.4) | This equation oversimplifies things a little too much. A more
general form would be the following: \(CI = [estimate-CV*SE, estimate+CV*SE]\)
|
167 | For t = 1.5, the p-value is p = 0.14 | This is true with a sample size of 100. The t-distribution changes shape with different sample sizes (technically, with different “degrees of freedom”, which are closely related to sample size). Smaller samples mean the t-distribution has ‘heavier tails’, which translates to greater p-values for smaller sample sizes (holding t constant). As sample sizes approach infinity, the t-distribution approaches the shape of the normal distribution. |
168 | The critical value turns out to be t = 1.98 in this case. | Again, “this case” refers to a sample size of 100 (see above comment). For sample sizes of 10, 30, 100, 300, and 1000, the critical values for an \(\alpha\) level of 0.05 are roughly 2.26, 2.05, 1.98, 1.97, and 1.96 (respectively). Note that the corresponding \(\alpha\) critical value for a normal distribution is 1.96, underscoring the point that larger sample sizes make the t-distribution more like the normal distribution. |
169 | x <- rep(c('A', 'B'), eac= n) |
eac= should be each= |
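
The critical values and the p-value quoted in the notes for pages 167 and 168 can be reproduced with base R's distribution functions. This sketch assumes the degrees of freedom are the sample size minus 1, which may not match the exact test used in the book. The `ci()` helper is a hypothetical function written for this example, reading CV in the general form of E9.4 as the critical value and SE as the standard error.

```r
# Two-sided critical values at alpha = 0.05,
# assuming df = n - 1 (an assumption of this sketch).
n <- c(10, 30, 100, 300, 1000)
crit_t <- qt(0.975, df = n - 1)
round(crit_t, 2)
#> [1] 2.26 2.05 1.98 1.97 1.96

# The corresponding critical value for the normal distribution.
crit_z <- qnorm(0.975)
round(crit_z, 2)
#> [1] 1.96

# Two-sided p-value for t = 1.5 with a sample size of 100 (df = 99).
p_val <- 2 * pt(-1.5, df = 99)
round(p_val, 2)
#> [1] 0.14

# Hypothetical helper for the general form
# CI = [estimate - CV * SE, estimate + CV * SE].
ci <- function(estimate, se, cv) {
  c(lower = estimate - cv * se, upper = estimate + cv * se)
}
ci(estimate = 0, se = 1, cv = crit_z)
```
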
No errata found.
Page | Text | Comment |
---|---|---|
180 | (Footnote 1) When p-values focus on the SER column are very small numbers |
Typo. Should be: When p-values are very small numbers |
182 | For the SER predictor, this interval is [0.53−1.96*0.04, 0.53+1.96*0.04], which yields the 95% confidence interval [45, 61] (with a little rounding). | The latter confidence interval should be [0.45, 0.61] |
191 | CI_tib <- mutate(sense_preds, LB = fits – 1.96 * SEs, # lower bound UB = fits + 1.96 * SEs) |
sense_preds should be CI_tib. The ‘minus sign’ is actually an en dash (–), which R cannot parse; replace it with a regular hyphen-minus (-) |
193 | These mappings assign the y-minimum of the error bar to the
lower-bound column from the sense_preds tibble (LB ), and
the y-maximum of the error bar to the upper bound (UB ) |
LB and UB should be lwr and
upr , respectively |
194 | levels >= sense_order |
>= should be = |
196 | (Code output at the top of the page) | The Log10Freq column shouldn’t be duplicated |
196 | geom_ribbon(aes(ymin = LB, ymax = UB), |
LB and UB should be lwr and
upr , respectively |
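
The corrected interval for page 182 follows directly from the numbers quoted there. The object names in this short check are made up for the example.

```r
# Estimate and standard error for the SER predictor, as quoted from page 182.
estimate <- 0.53
se <- 0.04

# 95% confidence interval using the normal critical value 1.96.
ci_ser <- estimate + c(-1, 1) * 1.96 * se
round(ci_ser, 2)
#> [1] 0.45 0.61
```
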
Page | Text | Comment |
---|---|---|
202 | Even if you supply very extreme values such as 10,000 or –10,000 to
plogis() , it will always return a number between 0 and
1. |
plogis() doesn’t always return a
number strictly between 0 and 1 in the version of R we’re using (4.0.3).
plogis(17) prints as 1, although
1 - plogis(17) is still nonzero; 1 - plogis(37), however,
returns exactly 0, and qlogis(1 - plogis(37)) returns
-Inf. |
203 | The whole point of talking about log odds is that this puts probabilities onto a continuous scale. | This is confusing because probabilities are also continuous. Better stated: Log-odds put probabilities onto an unbounded scale, which is more amenable to regression. |
203 | The logistic function is the inverse of the log odd function. | FYI, in R, the logistic function is plogis() ; its
inverse, the logit or log-odds function, is qlogis() . So
for example, plogis(0) is 0.5, and qlogis(0.5)
is 0. |
209 | levels(dative$RealizationOfRecipient) |
levels(df$myCol) will return NULL if
myCol is a character vector, which you can resolve by first
converting to a factor: levels(factor(df$myCol)). There’s
no such problem with the dative dataframe because it’s stored in
the languageR package with factor columns. |
214 | (Footnote 4) | This footnote alludes to the distinction between different levels of measurement, in particular the difference between ordinal scales and interval/ratio scales. |
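
The behavior described in the notes for pages 202, 203, and 209 can be checked in base R. The results depend on double-precision floating point, so they are stated here for standard R builds. The character vector `realization` is a made-up stand-in for a dataframe column.

```r
# The logistic function and its inverse.
plogis(0)     # 0.5
qlogis(0.5)   # 0

# plogis(17) is displayed as 1 but is still slightly below 1.
plogis(17) < 1

# plogis(37) is exactly 1 in floating point, so the difference is 0 ...
1 - plogis(37)

# ... and converting that back to log odds gives -Inf.
qlogis(1 - plogis(37))

# levels() on a character vector returns NULL;
# converting to a factor first gives the levels.
realization <- c("NP", "PP", "NP")
levels(realization)
levels(factor(realization))
```
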
Page | Text | Comment |
---|---|---|
259 | If this were true, this would mean that higher intercepts always (deterministically) go together with higher frequency slopes (the correlation is indicated to be negative). | Negative should be positive. |