Midterm

Every student should master a basic data-science toolkit that will be applicable to a wide array of potential research projects. The midterm gives you an opportunity to practice and apply the skills we’ve discussed so far. It takes the form of an extended exercise using real data collected for a recent project; you’ll read the data into R, perform data-wrangling, report summary information, and create plots. It is rather open-ended, in two senses. First, there may be multiple valid ways to write code to accomplish the same task. Second, beyond some basic requirements, it’s up to you to decide what parts of the data you want to dig further into.

Data and background

Recent research in American dialectology has suggested that regional accents may be diminishing over time, especially in terms of how speakers produce vowels. Some studies have suggested that this may be due to greater mobility. The data you’ll be working with is part of a study that seeks to address these questions:

Are regional accents in the US diminishing over time?
How does mobility affect language variation?

The file, crowdsourced-audio.xlsx, contains real data from a crowdsourced self-recording task, collected by a colleague of mine in collaboration with a US-based news organization. (For the sake of protecting speakers’ confidentiality, I’m leaving both the colleague and the news org anonymous, and I’ve redacted references to the news org in the file.) English speakers from across the US filled out an online survey that took them through several prompts, including recording themselves speaking, and asked them to provide demographic information. 948 speakers completed the survey, and this file is the data export from the survey software.

Your task

Your task is to create code that, when combined with the actual recordings, will allow my colleague to answer these research questions. In particular, you’ll do the following:

Read the data into R
Whip it into shape
- Rename unruly column names
- Whittle down the set of columns to only those that provide meaningful information
- Reformat data values as needed
  - For example, the column “If you moved around a lot growing up, you could list the places here” needs to be better formatted in order to be usable
Write data to an appropriate output format
Generate some summary information about the data that might be helpful for understanding the sample
- This should include at least one plot of some data
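
As a rough illustration of the steps above, here is a minimal sketch on a made-up three-row tibble. The toy data, the short column names (`speaker_id`, `state`, `other_places`), the extra "Internal survey field" column, and the choice to split on semicolons are all assumptions for illustration only. The real export will look different, and this is just one of several reasonable approaches.

```r
library(dplyr)
library(stringr)
library(tibble)
library(ggplot2)

# Toy stand-in for the survey export (hypothetical columns and values)
raw <- tibble(
  `Speaker ID` = c("s01", "s02", "s03"),
  `What state did you grow up in?` = c("Ohio", "Texas", "Ohio"),
  `If you moved around a lot growing up, you could list the places here` =
    c("Akron; Dayton", NA, "Toledo"),
  `Internal survey field` = c("x", "y", "z")
)

wrangled <- raw |>
  # Rename unruly column names (new_name = old_name)
  rename(
    speaker_id = `Speaker ID`,
    state = `What state did you grow up in?`,
    other_places = `If you moved around a lot growing up, you could list the places here`
  ) |>
  # Keep only the columns that carry meaningful information
  select(speaker_id, state, other_places) |>
  # Turn the free-text field into a list-column, one element per place
  mutate(other_places = str_split(other_places, ";\\s*"))

# Summary information: number of speakers per state
state_counts <- count(wrangled, state)

# A simple bar plot of that summary
state_plot <- ggplot(state_counts, aes(x = state, y = n)) +
  geom_col()
```

Note that the list-column created here is exactly the kind of structure that affects which output format is appropriate.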

Assignment

The midterm will be due in ~~two phases~~ one phase:

First, you complete it to the best of your ability and submit it by 12 noon on Tuesday, Oct 21 (week 9)
- Open-notes, open-book, open-lectures, open-internet, open-everything-but-one-another.
- Please ask me questions (via Slack DM) if you get really stuck (= you spend 10+ minutes on an individual issue)
Then, you review and revise your answers based on my comments and the midterm answer key by 12 noon on Tuesday, Nov 4 (week 11)
- ~~What you submit is your revised answers, plus a metacognitive reflection (see below)~~
- New (Oct 21): The answer key is now available (https://github.com/Data-Sci-2025/Midterm/tree/main/answer-key). Note that this is just one possible way to approach the data, and that the person who completed it had much more time and experience, so this shouldn’t be a strict measuring stick.

Submission

Our class GitHub organization has a midterm repository, which you will fork. For both submissions, you will submit the following:

data-wrangling.qmd and data-wrangling.md (the latter is a GitHub-flavored Markdown file rendered from the former)
- This is your primary product: the code that you’ve written to import the data, whip it into shape, generate plots, and write your output file(s)
- Your file should clearly describe what the code is doing, highlighting and defending decisions you’ve made about data-wrangling (e.g., your scheme for renaming columns)
- It should make use of Markdown formatting and layout affordances (rich text, headings) and it should include session info
At least one output .csv or .Rds file, each corresponding to a tibble created in your data-wrangling code
- This is only for the ‘final’ tibble(s) you create, not any intermediary tibbles or other R objects that only exist to get you to the end goal
- If you decide your data is best represented as a relational database, then you should submit a file for each tibble (with file names that make it obvious what info is in what file). Otherwise, submit a file for your single tibble.
- .csv files can’t store all of the information that can be in an R tibble (e.g., list-columns, factors). If your output uses data structures that can’t be stored in a .csv file, create an .Rds file instead.
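
To see why the choice of output format matters, here is a small sketch that writes the same toy tibble to both formats and reads it back. The tibble, its column names, and the factor levels are hypothetical; the files go to temporary paths rather than into a repo.

```r
library(readr)
library(tibble)

# Toy tibble with a factor column (hypothetical data)
speakers <- tibble(
  speaker_id = c("s01", "s02"),
  age_group = factor(
    c("younger", "middle"),
    levels = c("younger", "middle", "older")
  )
)

# Temporary files so the example leaves nothing behind in the project
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".Rds")

write_csv(speakers, csv_path)
write_rds(speakers, rds_path)

from_csv <- read_csv(csv_path, show_col_types = FALSE)
from_rds <- read_rds(rds_path)

class(from_csv$age_group)   # "character": the factor and its levels are gone
class(from_rds$age_group)   # "factor": the levels, including unused ones, survive
```

The .csv round trip keeps the values but not the factor structure, whereas the .Rds round trip returns the object as it was saved.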

Second submission

Update: we’re forgoing the second submission so folks have more time to focus on projects.

For the second submission only, you’ll additionally submit:

reflection.md (and optionally reflection.qmd if you want reflection.md to embed R code)
- This is a metacognitive reflection describing your progress as a learner in this course, using the evidence provided by reviewing your midterm answers
- Aim for 1–2 paragraphs
- There is no set structure, but you may want to think about questions like:
  - What do you understand better about coding itself (both specific R things and also more general coding strategies) now than you did at the start of the semester?
  - What was the hardest thing for you about this midterm?
  - What do you still need to improve on?
  - How will what you’ve learned from this midterm help you approach uncertain data or analysis tasks in your project?

Other submission notes

You will not contribute your fork to the upstream remote, so you don’t need to rename any files.
You are encouraged to create R Notebooks in the process of figuring out your ‘final’ code, as long as your repo has the required files
Because your midterm repo comes from our class GitHub organization repo, it’ll be visible to other students; please do not look at your fellow students’ midterm repositories before you have completed your first submission.
The midterm will be graded for effort, because I know enough about all of you by now that I can safely trust you all to work hard.
- Focus on the code first and foremost. Your narrative (especially defending your decisions) is a distant second in importance, and the presentation (i.e., Markdown formatting) is least important

This is private data made available to us for the educational purpose of practicing data science techniques. Please do not share it beyond this class!

Last words of advice

Assume that some googling might be necessary
You will need one package we haven’t practiced in class for a specific task.
- Part of the skillset you’ll need to develop as a data scientist is finding the right tools
Other than this one package, you probably won’t need to go outside the tidyverse (ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats). However, you might find yourself using functions we haven’t discussed within those packages.
- Each package has a homepage at <package-name>.tidyverse.org (e.g., https://ggplot2.tidyverse.org/)
- To see the full list of functions in a package, run (e.g.) ?purrr in RStudio and click “index” at the bottom of the help pane
- Don’t forget the cheatsheets!
I highly recommend using a separate “scratchpad” Quarto document to fiddle around with different ways of writing code! It’s really useful to look at what you’ve already tried and what that code yielded.
- Once you’re confident about a piece of code, ‘lock it in’ by adding it to your main data-wrangling.qmd document.
- The Midterm repository’s .gitignore already includes scratchpad.qmd and scratchpad.md
Again, please ask me questions (via Slack DM) if you get really stuck (= you spend 10+ minutes on an individual issue)
You’ve got this!!!