Learning resources
These resources are here to complement and supplement the learning you do in class and through the textbook. Please make liberal use of them! I will update this page with more resources as the semester progresses.
General support
Textbooks
R for data science
- The 1st edition of R4DS by Hadley Wickham and Garrett Grolemund, published 2017
- Crowdsourced exercise solutions for the 1st edition
- The 2nd edition of R4DS by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund, published 2023 (the “upstream” version not edited by Dan)
- Work-in-progress exercise solutions for 2nd edition, by Mine Çetinkaya-Rundel
Topics
Tidyverse
- Posit cheatsheets: 1–2-page reference cards for various packages created by Posit, including tidyverse packages (even more here)
Git & GitHub
- Git download & installation
- LSA 2019 Reproducible Research Workshop tutorials by Pitt's Na-Rae Han (forked and lightly adapted by Dan): Part 1 Intro to Git, Part 2 Linking Git with GitHub
- YouTube video How to get started with Git and GitHub
- Git - the simple guide
- Git cheatsheet
Markdown
Quarto
- Quarto cheatsheet
- Quarto tutorials: Step-by-step tutorials on Quarto
- R Markdown: The predecessor to Quarto
Data visualization
- Intro to ggplot2 by Joey Stanley
- Making vowel plots in R by Joey Stanley
- ggplot2 cheatsheet
Data wrangling
- Hadley Wickham's J Stat Soft paper Tidy data: a more theoretical approach to the problem of data tidiness
- Project-oriented workflow: A blog post by Jenny Bryan famous for the line
If the first line of your R script is
setwd("C:\Users\jenny\path\that\only\I\have")
I will come into your office and SET YOUR COMPUTER ON FIRE 🔥
Regular expressions
- Regular expression tutorials
- Regex crossword
- stringr cheatsheet, with regex on page 2
Open Access, data publishing
- Justin Kitzes. 2018. The Basic Reproducible Workflow Template. In Justin Kitzes, Daniel Turek, Fatma Deniz (Eds.) The Practice of Reproducible Research.
- Pitt LibGuides Copyright and Intellectual Property Toolkit
- D-Scholarship @ Pitt: Institutional Repository at the University of Pittsburgh
Web scraping
- rvest package
- SelectorGadget: Tool for helping choose CSS selectors
- Article on how to use SelectorGadget in conjunction with rvest
- CSS Diner: Fun website for learning/practicing CSS selectors
Text processing
- Text Mining with R textbook covers “tidy text” data structure, sentiment analysis, frequency (TF-IDF), ngrams and correlations, topic modeling, and 3 case studies
- Companion package: tidytext. Uses “tidy data” principles and plays nice with tidyverse packages.
- quanteda package for “quantitative analysis of textual data”. Fairly widely used, and also good for “tidy data” principles
- spacyr wraps the popular and powerful Python spaCy library
- More powerful than quanteda, but relies on installing a bunch of stuff (incl. Python!)
- Parsing, lemmatizing, dependencies, entity extraction
- Interfaces nicely with tidytext
- Older-school R packages for doing text analysis: tm and openNLP.
- These don't use “tidy data” principles, but might have more community support resources (e.g., StackOverflow)
Machine learning
- James et al. 2021 textbook An Introduction to Statistical Learning: with Applications in R (2nd ed.). Website includes full text (!), R Markdown files for each chapter, and other resources
- caret package
Data formats
- TEI: A Gentle Introduction to XML
- json.org resources:
- Introducing JSON
- JSON example (vs. XML)
Software tools
Text editors
- Sublime Text (any OS)
- Notepad++ (Windows only)
- User defined languages for custom syntax highlighting
Speech data
- Praat: Speech analysis and synthesis software
- Will Styler's popular Using Praat for Linguistic Research handbook
- Praat scripting tutorials:
- Original
- Phonetics on speed
- Daniel Riggs's tutorial (has files for download!)
- Section 11 (starting on p. 55) of Will Styler's Using Praat for Linguistic Research handbook (files can be downloaded here)
- Notepad++ syntax highlighting file for Praat scripting language (somewhat out of date, but still very helpful)
- Praat script collections:
- Montreal Forced Aligner: Automatic segmental alignment (GitHub)
- Kaldi ASR toolkit
- LaBB-CAT corpus analysis tool
- Demo corpus with worksheets for self-directed exploration of LaBB-CAT's functionality
- R package and Python library for interacting with corpora hosted on LaBB-CAT
Datasets
Directories
- Pitt LibGuides for linguistics data
- Linguistic Data Consortium (LDC)
- Linguistic Linked Open Data
- General-purpose data repositories:
- Increasingly, authors are publishing their data alongside peer-reviewed articles, so that can be a good place to look.
Text corpora
- Natural Language Toolkit (NLTK) corpora index (link, GitHub repo)
- Pitt English Language Institute Corpus (PELIC)
- EnronSent corpus
Speech corpora
Derived data
- Datasets from SPeech Across Dialects of English (SPADE) project
- World Atlas of Language Structures (WALS) Online
- Datasets from published research. Some examples:
- Egger et al. 2020. Improving the robustness of infant lexical processing speed measures. Behavior Research Methods (article, data)
- Koenecke et al. 2020. Racial disparities in automated speech recognition. PNAS (article, data)
- Sonderegger et al. 2017. The medium-term dynamics of accents on reality television. Language (article, data)
- Villarreal et al. 2021. Gender separation and the Speech Community: Rhoticity in early 20th century Southland New Zealand English. Language Variation and Change (article, data)
- Datasets from textbook Statistics for linguists: An introduction using R (Bodo Winter, 2019)