Learning resources

These resources supplement what you learn in class and from the textbook. Please make liberal use of them! I will update this page with more resources as the semester progresses.

On this page
  1. General support
  2. Textbooks
    1. R for data science
  3. Topics
    1. Tidyverse
    2. Git & GitHub
    3. Markdown
    4. Quarto
    5. Data visualization
    6. Data wrangling
    7. Regular expressions
    8. Open Access, data publishing
    9. Web scraping
    10. Text processing
    11. Machine learning
    12. Data formats
  4. Software tools
    1. Text editors
    2. Speech data
  5. Datasets
    1. Directories
    2. Text corpora
    3. Speech corpora
    4. Derived data

General support

Textbooks

R for data science

  • The 1st edition of R4DS by Hadley Wickham and Garrett Grolemund, published 2017
  • The 2nd edition of R4DS by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund, published 2023 (the “upstream” version not edited by Dan)

Topics

Tidyverse

  • Posit cheatsheets: 1–2-page reference cards for various packages created by Posit, including tidyverse packages (even more here)

Git & GitHub

Markdown

Quarto

Data visualization

Data wrangling

  • Hadley Wickham’s Journal of Statistical Software paper “Tidy Data”: a more theoretical approach to the problem of data tidiness
  • Project-oriented workflow: a blog post by Jenny Bryan, famous for the line:

    If the first line of your R script is

    setwd("C:\Users\jenny\path\that\only\I\have")

    I will come into your office and SET YOUR COMPUTER ON FIRE 🔥
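As a rough illustration of the alternative to a hard-coded `setwd()`, the sketch below builds every file path relative to a project root. The folder name `my-project`, the helper `project_path()`, and the file `scores.csv` are all made up for this example, and a temporary directory stands in for a real project folder. Jenny Bryan's post recommends RStudio Projects together with the here package for this; the helper here only imitates that idea in base R.

```r
# Stand-in for a project root; in real use this would be the folder
# that holds your .Rproj file ("my-project" is a hypothetical name)
project_root <- file.path(tempdir(), "my-project")
dir.create(file.path(project_root, "data"), recursive = TRUE, showWarnings = FALSE)

# Hypothetical helper: build a path from pieces relative to the project root,
# so no script ever depends on one person's absolute path
project_path <- function(...) file.path(project_root, ...)

# Write a small made-up data set, then read it back via the relative pieces
scores <- data.frame(id = 1:3, score = c(4, 5, 3))
write.csv(scores, project_path("data", "scores.csv"), row.names = FALSE)
scores_back <- read.csv(project_path("data", "scores.csv"))
```

Because only `project_root` knows where the project lives, the same script should run unchanged on another machine once the project folder is copied there.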

Regular expressions

Open Access, data publishing

Web scraping

Text processing

  • The Text Mining with R textbook covers the “tidy text” data structure, sentiment analysis, term frequency (TF-IDF), n-grams and correlations, topic modeling, and three case studies
    • Companion package: tidytext. Uses “tidy data” principles and plays nice with tidyverse packages.
  • quanteda package for “quantitative analysis of textual data”. Fairly widely used, and also good for “tidy data” principles
  • spacyr wraps the popular and powerful Python spaCy library
    • More powerful than quanteda, but relies on installing a bunch of stuff (incl. Python!)
    • Parsing, lemmatizing, dependencies, entity extraction
    • Interfaces nicely with tidytext
  • Older-school R packages for doing text analysis: tm and openNLP.
    • These don’t use “tidy data” principles, but might have more community support resources (e.g., StackOverflow)
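To give a feel for the “tidy text” approach mentioned above, here is a minimal sketch using tidytext with dplyr. The two-sentence corpus and the object names (`docs`, `word_counts`, `doc_tf_idf`) are invented for illustration, and the sketch assumes both packages are installed.

```r
library(dplyr)
library(tidytext)

# Tiny made-up corpus: two "documents"
docs <- tibble(
  doc  = c("a", "b"),
  text = c("the cat sat on the mat", "the dog chased the cat")
)

# Tidy text format: one row per token, then count tokens within each document
word_counts <- docs %>%
  unnest_tokens(word, text) %>%
  count(doc, word, sort = TRUE)

# Add tf, idf, and tf_idf columns; words that occur in every document
# (such as "the") get an idf of 0 and therefore a tf_idf of 0
doc_tf_idf <- word_counts %>%
  bind_tf_idf(word, doc, n) %>%
  arrange(desc(tf_idf))
```

Since the result is an ordinary data frame, it can go straight into other tidyverse tools such as `filter()` or ggplot2.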

Machine learning

Data formats

Software tools

Text editors

Speech data

Datasets

Directories

Text corpora

Speech corpora

Derived data