Midterm
Every student should master a basic data-science toolkit that will be applicable to a wide array of potential research projects. The midterm gives you an opportunity to practice and apply the skills we've discussed so far. It takes the form of an extended exercise using real data collected for a recent project; you'll read the data into R, perform data-wrangling, report summary information, and create plots. It is rather open-ended, in two senses. First, there may be multiple valid ways to write code to accomplish the same task. Second, beyond some basic requirements, it's up to you to decide what parts of the data you want to dig further into.
Data and background
Recent research in American dialectology has suggested that regional accents may be diminishing over time, especially in terms of how speakers produce vowels. Some studies have suggested that this may be due to greater mobility. The data you'll be working with is part of a study that seeks to address these questions:
- Are regional accents in the US diminishing over time?
- How does mobility affect language variation?
The file, crowdsourced-audio.xlsx, contains real data from a crowdsourced self-recording task, collected by a colleague of mine in collaboration with a US-based news organization. (For the sake of protecting speakers' confidentiality, I'm leaving both the colleague and the news org anonymous, and I've redacted references to the news org in the file.) English speakers from across the US filled out an online survey that took them through several prompts, including recording themselves speaking, and asked them to provide demographic information. 948 speakers completed the survey, and this file is the data export from the survey software.
Your task
Your task is to create code that, when combined with the actual recordings, will allow my colleague to answer these research questions. In particular, you'll do the following:
- Read the data into R
- Whip it into shape
- Rename unruly column names
- Whittle down the set of columns to only those that provide meaningful information
- Reformat data values as needed
- For example, the column "If you moved around a lot growing up, you could list the places here" needs to be better formatted in order to be usable
- Write data to an appropriate output format
- Generate some summary information about the data that might be helpful for understanding the sample
- This should include at least one plot of some data
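The wrangling steps listed above can be sketched on a toy tibble. This is only an illustration under stated assumptions, not a solution: every column name, value, and the semicolon separator below are hypothetical and are not taken from crowdsourced-audio.xlsx, and reading the spreadsheet itself is left out.

```r
library(dplyr)
library(stringr)
library(ggplot2)

# Toy stand-in for the survey export. All names and values are hypothetical
# and are NOT taken from crowdsourced-audio.xlsx.
raw <- tibble::tibble(
  `Response ID` = c("r1", "r2", "r3"),
  `What state did you grow up in?` = c("Ohio", "Texas", "Ohio"),
  `Places lived (free text)` = c("Akron; Dayton", NA, "Toledo"),
  `Internal status flag` = c("ok", "ok", "ok")
)

tidy <- raw |>
  # Rename unruly column names to short snake_case ones
  rename(
    id = `Response ID`,
    home_state = `What state did you grow up in?`,
    places_raw = `Places lived (free text)`
  ) |>
  # Keep only the columns that carry meaningful information
  select(id, home_state, places_raw) |>
  # Split the free-text answer into a list-column of place names
  # (assumes ";" as the separator, which real answers may not follow)
  mutate(places = str_split(places_raw, "\\s*;\\s*"))

# Summary information: number of speakers per home state
state_counts <- count(tidy, home_state, name = "n_speakers")

# One simple plot of that summary
state_plot <- ggplot(state_counts, aes(x = home_state, y = n_speakers)) +
  geom_col() +
  labs(x = "Home state", y = "Number of speakers")
```

Real free-text answers are likely to be messier than a single consistent separator, so a scheme like this would probably need several passes and some manual inspection.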
Assignment
The midterm will be due in two phases:
- First, you complete it to the best of your ability and submit it by 12 noon on Tuesday, Oct 21 (week 9)
- Open-notes, open-book, open-lectures, open-internet, open-everything-but-one-another.
- Please ask me questions (via Slack DM) if you get really stuck (= you spend 10+ minutes on an individual issue)
- Then, you review and revise your answers based on my comments by 12 noon on Tuesday, Nov 4 (week 11)
- What you submit is your revised answers, plus a metacognitive reflection (see below)
Submission
Our class GitHub organization has a midterm repository, which you will fork. For both submissions, you will submit the following:
- data-wrangling.qmd and data-wrangling.md (the latter is a GitHub-flavored Markdown file rendered from the former)
- This is your primary product: the code that you've written to import the data, whip it into shape, generate plots, and write your output file(s)
- Your file should clearly describe what the code is doing, highlighting and defending decisions you've made about data-wrangling (e.g., your scheme for renaming columns)
- It should make use of Markdown formatting and layout affordances (rich text, headings) and it should include session info
- At least one output .csv or .Rds file, each corresponding to a tibble created in your data-wrangling code
- This is only for the "final" tibble(s) you create, not any intermediary tibbles or other R objects that only exist to get you to the end goal
- If you decide your data is best represented as a relational database, then you should submit a file for each tibble (with file names that make it obvious what info is in what file). Otherwise, submit a file for your single tibble.
- .csv files can't store all of the information that can be in an R tibble (e.g., list-columns, factors). If your output uses data structures that can't be stored in .csv files, create an .Rds file instead.
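To make the .csv versus .Rds point concrete, here is a minimal round-trip sketch. The tibble and its columns are hypothetical, and the files go to temporary paths rather than your repo.

```r
library(tibble)
library(readr)

# Hypothetical tibble with a factor column and a list-column
speakers <- tibble(
  id = c("r1", "r2"),
  region = factor(c("Midwest", "South"), levels = c("South", "Midwest")),
  places = list(c("Akron", "Dayton"), "Austin")
)

# Temporary output paths for this sketch
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".Rds")

# CSV is flat text, so the list-column is dropped before writing
write_csv(speakers[c("id", "region")], csv_path)
from_csv <- read_csv(csv_path, show_col_types = FALSE)

# Rds serializes the R object itself, keeping factors and list-columns
write_rds(speakers, rds_path)
from_rds <- read_rds(rds_path)

class(from_csv$region)   # the factor comes back as plain character
levels(from_rds$region)  # the custom level order survives
```

If a factor's level order or a list-column matters downstream, that is a reason to prefer the .Rds route for that tibble.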
Second submission
For the second submission only, you'll additionally submit:
- reflection.md (and optionally reflection.qmd if you want reflection.md to embed R code)
- This is a metacognitive reflection describing your progress as a learner in this course, using the evidence provided by reviewing your midterm answers
- Aim for 1–2 paragraphs
- There is no set structure, but you may want to think about questions like:
- What do you understand better about coding itself (both specific R things and also more general coding strategies) now than you did at the start of the semester?
- What was the hardest thing for you about this midterm?
- What do you still need to improve on?
- How will what you've learned from this midterm help you approach uncertain data or analysis tasks in your project?
Other submission notes
- You will not contribute your fork to the upstream remote, so you don't need to rename any files.
- You are encouraged to create R Notebooks in the process of figuring out your "final" code, as long as your repo has the required files
- Because your midterm repo comes from our class GitHub organization repo, it'll be visible to other students; please do not look at your fellow students' midterm repositories before you have completed your first submission.
- The midterm will be graded for effort, because I know enough about all of you by now that I can safely trust you all to work hard.
- Focus on the code first and foremost. Your narrative (especially defending your decisions) is a distant second in importance, and the presentation (i.e., Markdown formatting) is least important.
This is private data made available to us for the educational purpose of practicing data science techniques. Please do not share it beyond this class!
Last words of advice
- Assume that some googling might be necessary
- You will need one package we haven't practiced in class for a specific task.
- Part of the skillset you'll need to develop as a data scientist is finding the right tools
- Other than this one package, you probably won't need to go outside the tidyverse (ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats). However, you might find yourself using functions we haven't discussed within those packages.
- Each package has a homepage at <package-name>.tidyverse.org (e.g., https://ggplot2.tidyverse.org/)
- To see the full list of functions in a package, run (e.g.) ?purrr in RStudio and click "index" at the bottom of the help pane
- Don't forget the cheatsheets!
- I highly recommend using a separate "scratchpad" Quarto document to fiddle around with different ways of writing code! It's really useful to look at what you've already tried and what that code yielded.
- Once you're confident about a piece of code, "lock it in" by adding it to your main data-wrangling.qmd document.
- The Midterm repository's .gitignore already includes scratchpad.qmd and scratchpad.md
- Again, please ask me questions (via Slack DM) if you get really stuck (= you spend 10+ minutes on an individual issue)
- You've got this!!!