Learning resources

These resources are here to complement and supplement the learning you do in class and through the textbook. Please make liberal use of them! I will update this page with more resources as the semester progresses.

General support

Textbooks

R for data science

The 1st edition of R4DS by Hadley Wickham and Garrett Grolemund, published 2017
- Crowdsourced exercise solutions for the 1st edition
The 2nd edition of R4DS by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund, published 2023 (the “upstream” version not edited by Dan)
- Work-in-progress exercise solutions for 2nd edition, by Mine Çetinkaya-Rundel

Topics

Tidyverse

Posit cheatsheets: 1–2-page reference cards for various packages created by Posit, including tidyverse packages (even more here)

Git & GitHub

Git download & installation
LSA 2019 Reproducible Research Workshop tutorials by Pitt’s Na-Rae Han (forked and lightly adapted by Dan): Part 1 Intro to Git, Part 2 Linking Git with GitHub
YouTube video How to get started with Git and GitHub
Git - the simple guide
Git cheatsheet

Markdown

Quarto

Quarto cheatsheet
Quarto tutorials: Step-by-step tutorials on Quarto
R Markdown: The predecessor to Quarto

Data visualization

Data wrangling

Hadley Wickham’s Journal of Statistical Software paper Tidy data: a more theoretical approach to the problem of data tidiness
Project-oriented workflow: A blog post by Jenny Bryan, famous for the line:

If the first line of your R script is

setwd("C:\Users\jenny\path\that\only\I\have")

I will come into your office and SET YOUR COMPUTER ON FIRE 🔥
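
One way to avoid the hard-coded setwd() call in the quote above is to build file paths relative to the project's top-level folder. The following is a minimal sketch in base R, not code from the blog post. It assumes a hypothetical project containing a data/survey.csv file, and it assumes R was started in the project's top-level folder (for example, by opening an RStudio project). The file and folder names are illustrative only.

```r
# Hypothetical layout: <project root>/data/survey.csv (names are illustrative).
# Build the path relative to the project root instead of calling setwd().
data_path <- file.path("data", "survey.csv")

# Read the file only if it exists, so the sketch runs even without the data.
survey <- if (file.exists(data_path)) read.csv(data_path) else NULL
```

Because the path names no user-specific folder, the same script can run unchanged on a collaborator's computer, as long as they also start R from the project folder.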

Regular expressions

Regular expression tutorials
Regex crossword
stringr cheatsheet, with regex on page 2

Open Access, data publishing

Justin Kitzes. 2018. The Basic Reproducible Workflow Template. In Justin Kitzes, Daniel Turek, Fatma Deniz (Eds.) The Practice of Reproducible Research.
Pitt LibGuides Copyright and Intellectual Property Toolkit
D-Scholarship @ Pitt: Institutional Repository at the University of Pittsburgh

Web scraping

rvest package
SelectorGadget: Tool for helping choose CSS selectors
- Article on how to use SelectorGadget in conjunction with rvest
CSS diner: Fun website for learning/practicing CSS selectors

Text processing

Text Mining with R textbook covers the “tidy text” data structure, sentiment analysis, term frequency (TF-IDF), n-grams and correlations, topic modeling, and 3 case studies
- Companion package: tidytext. Uses “tidy data” principles and plays nice with tidyverse packages.
quanteda package for “quantitative analysis of textual data”. Fairly widely used, and also works well with “tidy data” principles
spacyr wraps the popular and powerful Python spaCy library
- More powerful than quanteda, but relies on installing a bunch of stuff (incl. Python!)
- Parsing, lemmatizing, dependencies, entity extraction
- Interfaces nicely with tidytext
Older-school R packages for doing text analysis: tm and openNLP.
- These don’t use “tidy data” principles, but might have more community support resources (e.g., StackOverflow)

Machine learning

James et al. 2021 textbook An introduction to statistical learning: with Applications in R (2nd ed.). The website includes the full text (!), R Markdown files for each chapter, and other resources
- Compared to Boehmke & Greenwell 2020, this is a more comprehensive text with more mathematical foundations. (This is probably the one to cite.)
Boehmke & Greenwell 2020 textbook Hands-on machine learning with R
- Compared to James et al. 2021, this is more accessible.
caret package

Data formats

TEI: A Gentle Introduction to XML
json.org resources:
- Introducing JSON
- JSON example (vs. XML)

Software tools

Text editors

Sublime Text (any OS)
Notepad++ (Windows only)
- User defined languages for custom syntax highlighting

Speech data

Praat: Speech analysis and synthesis software
Will Styler’s popular Using Praat for Linguistic Research handbook
Praat scripting tutorials:
- Original
- Phonetics on speed
- Daniel Riggs’s tutorial (has files for download!)
- Section 11 (starting on p. 55) of Will Styler’s Using Praat for Linguistic Research handbook (files can be downloaded here)
Notepad++ syntax highlighting file for Praat scripting language (somewhat out of date, but still very helpful)
Praat script collections:
Montreal Forced Aligner: Automatic segmental alignment (GitHub)
Kaldi ASR toolkit
LaBB-CAT corpus analysis tool
- Demo corpus with worksheets for self-directed exploration of LaBB-CAT’s functionality
- R package and Python library for interacting with corpora hosted on LaBB-CAT

Datasets

Directories

Pitt LibGuides for linguistics data
Linguistic Data Consortium (LDC)
Linguistic Linked Open Data
General-purpose data repositories:
- Open Science Framework (OSF)
- Kaggle
Increasingly, authors are publishing their data alongside peer-reviewed articles, so the articles themselves can be a good place to look for data.

Text corpora

Natural Language Toolkit (NLTK) corpora index (link, GitHub repo)
Pitt English Language Institute Corpus (PELIC)
EnronSent corpus

Speech corpora

Derived data

Datasets from SPeech Across Dialects of English (SPADE) project
World Atlas of Language Structures (WALS) Online
Datasets from published research. Some examples:
- Egger et al. 2020. Improving the robustness of infant lexical processing speed measures. Behavior Research Methods (article, data)
- Koenecke et al. 2020. Racial disparities in automated speech recognition. PNAS (article, data)
- Sonderegger et al. 2017. The medium-term dynamics of accents on reality television. Language (article, data)
- Villarreal et al. 2021. Gender separation and the Speech Community: Rhoticity in early 20th century Southland New Zealand English. Language Variation and Change (article, data)
Datasets from textbook Statistics for linguists: An introduction using R (Bodo Winter, 2019)