Data Science for Research in Linguistics (LING 2020), Fall 2025
University of Pittsburgh
Instructor: Dan Villarreal, PhD
Class time: Tues/Thurs 2:30–3:45pm
Class location: Cathedral of Learning 2818 (Linguistics Conference Room)
Office hours: Tues/Thurs 4–5pm (Dan’s office, CL 2806)
Textbook: See below
Overview and course objectives
Data science is a fast-growing and diverse discipline that combines programming, statistics and machine learning, scientific communication, and domain expertise. While most data scientists work in industry, a healthy understanding of data science principles and practices has started to become indispensable in many fields of linguistics research. As a result, we will approach data science in this class with an emphasis on what linguists need to know about working with data: how to get data into shape, how to extract a story from data, and how to communicate that story.
Students who successfully complete this course will be able to:
- Describe data science practices for data wrangling, analysis, and reporting
- Utilize modern software tools for data science as practiced in linguistics research
- Apply data science practices and software tools to a linguistics research question
Assignments
Your grade in this class will be based on the following assignments:
- To-dos (30%): For most class meetings, you will have some out-of-class tasks to prepare for the class meeting (or respond to topics we’ve just discussed). These tasks may include engaging with readings, exploring concepts on your own, and discussing classmates’ projects. I anticipate about 18 to-dos over the course of the semester. To-dos are graded for effort.
- Class participation (20%): Every student is expected to be an active participant in our class meetings and shared online spaces, as we will form a community of learners in this class. Participation is graded for effort. I am generally pretty understanding when it comes to excusing absences for reasonable circumstances (family emergencies, urgent physical or mental health needs, etc.). If you need to miss class (or arrive late/leave early), please message me on Slack with as much advance notice as possible.
- Midterm (10%): Every student should master a basic data-science toolkit that will be applicable to a wide array of potential research projects. The midterm will be take-home and open-everything. It will be a two-phase assignment, with the first submission due 12 noon Tuesday, October 21 (week 9) and your revised answers due 12 noon Tuesday, November 4 (week 11). The midterm will be graded for effort.
- Final project (40%): You will carry out a data science project that pertains to a linguistics research question of your choosing, using data available ‘in the wild’. (This is a necessary constraint.) Ideally, this project will be related to research you intend to do for a thesis/dissertation, comps paper, etc. Along the way, you will complete several project milestones to keep you on track, culminating in a class presentation in week 15 and a polished GitHub repository due in finals week. The project will be graded for proficiency in data wrangling, analysis, and presentation.
Between to-dos, project milestones, and midterm submissions, there will always be some sort of assignment between classes. These assignments will be due by 12 noon the day of class so I can look over your responses in preparation for class. Assignments are linked on the schedule.
This course will move fast and cover a lot of topics (including programming topics) in a short amount of time. As such, you will be expected to spend a large amount of time outside of class practicing on your own; good self-study habits are essential to successfully absorbing the course material.
However, while this course is demanding in breadth of topical familiarity, I do not expect you to necessarily achieve a breadth of topical mastery. This will be increasingly the case as the semester proceeds and we get into more-specific tools that aren’t applicable to your own research. Given the massive range of data science tools out there, learning to prioritize and hone your skills selectively is an essential competency you will need to develop.
Percentage grades will be converted to letter grades using the usual grading scale: A+ ≥ 97 > A ≥ 94 > A−, etc.
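As a rough illustration of how the assignment weights above combine into a percentage grade, here is a minimal R sketch. The category scores are made-up values for a hypothetical student, and the letter conversion covers only the two cutoffs stated above (A+ and A); the rest of the scale is not filled in here.

```r
# Category weights from the Assignments list (they sum to 1).
weights <- c(todos = 0.30, participation = 0.20, midterm = 0.10, project = 0.40)

# Hypothetical category scores (percent) for one student; not real data.
scores <- c(todos = 100, participation = 95, midterm = 100, project = 92)

# Weighted percentage grade: multiply each score by its weight and add up.
final_pct <- sum(weights * scores[names(weights)])

# Only the cutoffs stated in the syllabus are encoded; lower grades are left unspecified.
letter <- if (final_pct >= 97) {
  "A+"
} else if (final_pct >= 94) {
  "A"
} else {
  "below A (cutoffs not listed here)"
}

final_pct
letter
```

With these made-up scores, the weighted grade comes out to roughly 95.8, which falls in the A range.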
Requirements
- Prerequisite: LING 1810/2010 (Stats for Linguistics) or the equivalent.
- Hardware: The software applications below should install and run from your own personal laptop, which you are expected to bring to every class meeting.
  - Your laptop should run one of these OSs: Mac OS X (10.6 or later), Windows (10 or 11), or Linux (any distribution).
  - Mobile and cloud-based OSs are not supported; tablets and Chromebooks are not suitable platforms for this class.
  - Make sure you’ve got ample space on your hard drive to download software and data. Ideally, you should have at least 30 GB free, and at least 25% of your disk space should remain free after you download the required software below.
- Software: We will primarily be working in the statistical programming language R, in particular the tidyverse dialect, through the free RStudio IDE. Please download the following:
- Textbook: In the first part of the semester we will be leaning heavily on the free textbook R for data science, 2nd edition (original by Hadley Wickham, Mine Çetinkaya-Rundel, & Garrett Grolemund, forked and lightly edited by Dan). If you’d like a physical copy of the book, you can order it on Bookshop.org.
Please update your R to version 4.5.1 so you get the latest and greatest features!
If you’ve used RStudio with a previous version of R, you’ll also need to point RStudio to the latest version; go to Tools > Global Options and change the directory under “R Version”.
You don’t need to install any packages right now; the textbook will include instructions on which packages to install when. However, if you want to port over your installed packages from your old version of R, do the following:
- In your old version of R, run `dput(rownames(installed.packages()))` in the console
- Copy the result to your clipboard
- Update R
- In R 4.5.1, run `install.packages(PASTE HERE)`, pasting the copied result in place of PASTE HERE.
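If you would rather not rely on the clipboard, the same idea can be sketched with a file. This is only one possible approach: the helper names `save_package_list()` and `packages_to_install()` and the file name `old_packages.rds` are hypothetical, not part of R or of the course materials.

```r
# Hypothetical helper: save the names of all installed packages to a file.
save_package_list <- function(path) {
  saveRDS(unique(rownames(installed.packages())), path)
  invisible(path)
}

# Hypothetical helper: return the saved package names that are
# not installed in the R version you are currently running.
packages_to_install <- function(path) {
  old_pkgs <- readRDS(path)
  setdiff(old_pkgs, rownames(installed.packages()))
}

# In your old version of R (the file name is an assumption; any writable path works):
# save_package_list("~/old_packages.rds")

# After updating R, in the new version:
# install.packages(packages_to_install("~/old_packages.rds"))
```

Note that `install.packages()` only finds packages in the repositories R is configured to use (typically CRAN), so any packages you originally installed from elsewhere (e.g., GitHub) would need to be reinstalled separately.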
Clever you, finding an HTML easter egg! You might have noticed the green bird on the cover of R for data science, which is also our course’s avatar. That’s a kākāpō, a critically endangered New Zealand parrot; as of the start of the semester, there are only 238 kākāpō left. Please consider making a donation to New Zealand’s Kākāpō Recovery—I’ll match any student donations.
Communication
Email is bad. All communication will take place through our class’s Slack workspace (signup link), which you can access via a browser, a desktop app, and/or a mobile app. (You’ll need to set up a free Slack account if you don’t already have one.) Our Slack workspace is your means for asking questions about course content or logistics, and it’s where I’ll post important info about the class. You are responsible for making sure you are able to receive and read Slack messages in a timely fashion, by doing at least one of the following:
- Downloading the Slack desktop or mobile app
- Setting up desktop notifications for the browser version of Slack
Questions about course content or policies should be posted to the #q-and-a channel so the entire class can benefit from your question. For sensitive information (e.g., family emergencies), DM me.
You can post questions to Slack at any time. (Please note that I generally don’t check Slack at night or on weekends.) I will respond to Slack questions ASAP during office hours (see above). Of course, you can also attend office hours in-person.
Course policies
The openness principle, your work, and privacy
Data science, as a young interdisciplinary field, has fully embraced the principle of an open, transparent, and collaborative mode of scientific inquiry. The overwhelming popularity of GitHub speaks volumes to this ethos. In this class, we too will adopt the principle of openness in most everything we do. However, I recognize that the course is foremost a learning platform and therefore there exist certain expectations of privacy in the work students submit. In order to balance the two considerations, the course will adopt the following tiered privacy policy:
- Fully public: Individual students’ projects will be completely open: projects will be developed and submitted via a GitHub public repository in our class GitHub organization, where they are open to the world to view. Some To-dos may also fall under this designation.
- Private: Our class Canvas site will be utilized for the aspects of course work that should remain strictly private between a student and the instructor. Those include posting of grades and feedback on student performance.
Course materials will largely follow the same format. Most materials will be available through our class GitHub organization, and lecture recordings will be available on Canvas.
Late work
Assignments (To-dos and project milestones) are due by 12 noon the day of class. The nature of this course is that much of our work in class meetings will build directly on these assignments, so punctuality is a must.
- Late To-dos will incur a 50% penalty if submitted up to 24 hours late, with no To-dos accepted after that. Remember that To-dos are graded for effort, so as long as you start early enough, you should have no problem getting full credit. Still, things happen, so I will drop your lowest To-do score (i.e., you have one free lateness or free miss).
- Late project milestones will incur a 20% penalty if submitted up to 24 hours late, with an additional 10% penalty for each late day thereafter.
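To make the late-penalty arithmetic concrete, here is a minimal R sketch. The function name `late_penalty()` is hypothetical, and the sketch assumes that partial late days beyond the first 24 hours round up to a full day and that the total penalty is capped at 100%; the syllabus does not spell out either detail, so check with the instructor if it matters.

```r
# Hypothetical helper: penalty (in percent) for a late submission.
# Assumptions (not stated in the syllabus): partial late days round up,
# and the total penalty never exceeds 100.
late_penalty <- function(hours_late, type = c("todo", "milestone")) {
  type <- match.arg(type)
  if (hours_late <= 0) {
    return(0)  # on time: no penalty
  }
  if (type == "todo") {
    # 50% penalty up to 24 hours late; not accepted after that (100%).
    if (hours_late <= 24) 50 else 100
  } else {
    # 20% penalty up to 24 hours late, plus 10% per late day thereafter.
    extra_days <- max(0, ceiling((hours_late - 24) / 24))
    min(100, 20 + 10 * extra_days)
  }
}

late_penalty(5, "todo")        # To-do submitted 5 hours late
late_penalty(30, "milestone")  # milestone submitted 30 hours late
```

Under these assumptions, the first call gives a 50% penalty, and the second gives 30% (20% for the first 24 hours plus 10% for the partial second day).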
Collaboration on assignments
If done properly, working together on assignments leads to better learning outcomes for all parties involved. Improper collaboration, however, negatively affects learning. In this class, group work is allowed (and even encouraged!)—provided that the following conditions are met:
- Equal contribution: One student’s contribution must not exceed 150% of another’s.
- Individual work before a study group: If you work with other students out of class, do not show up to the study group without having worked on the assignment on your own beforehand.
- Individual work after a study group: Do not write up your assignments while working in a group, which leads to copying others’ answers. Always finish up your answers by yourself afterwards, using your own words.
- Do NOT pass files: Do not, under any circumstances, send or receive script files prior to submitting your work.
- Disclosure: You must disclose any classmates you worked with.
- Not on the midterm: Your initial answers to the midterm must be your own work.
See also the generative AI policy below.
Academic integrity
Students in this course will be expected to comply with the University of Pittsburgh’s Policy on Academic Integrity. Any student suspected of violating this obligation for any reason during the semester will be required to participate in the procedural process, initiated at the instructor level, as outlined in the University Guidelines on Academic Integrity. This may include, but is not limited to, the confiscation of the examination of any individual suspected of violating University Policy. Furthermore, no student may bring any unauthorized materials to an exam, including dictionaries and programmable calculators.
To learn more about Academic Integrity, visit the Academic Integrity Guide for an overview of the topic. For hands-on practice, complete the Academic Integrity Modules.
Generative AI
This class’s assignments have been designed to help you develop as a data scientist, including as a coder. But in the year 2025, what it means to code is changing. As a society, we are all learning about how generative AI coding tools (e.g., ChatGPT) can be used productively, effectively, and ethically to inform our coding process. As such, in this class, the use of generative AI is allowed within specific contexts and only if such use is properly acknowledged. Specifically:
- Generative AI is permitted only…
  - for longer assignments (midterm, project), and
  - if you include an “Acknowledgement of AI Use” statement that:
    - specifies which technology was used (ChatGPT, GPT-3, etc.), and on what date
    - includes a permalink to the relevant chat (instructions for ChatGPT, Google Gemini), so instructors can understand how the information was generated
    - explains how the output was used in your work
You do not have to provide a formal in-text citation for AI systems, and you only need to include an “Acknowledgement of AI Use” if you used generative AI.
The use of AI outside of contexts where the instructor specifies its use, or failure to acknowledge any use of AI technologies in your work, will be considered an academic integrity violation and addressed according to Pitt’s Academic Integrity Policy. You are the author of your work for the course and authorship means you take responsibility for your words and code, regardless of which tools you use.
This policy was informed by Pitt’s Writing Institute. The landscape around generative AI is complex and ever-changing, so please don’t hesitate to ask if you have any questions about this policy.
Disability resources
If you have (or think you may have) a disability for which you are or may be requesting an accommodation, you are encouraged to contact both your instructor and Disability Resources and Services (DRS), 140 William Pitt Union, (412) 648-7890, drsrecep@pitt.edu, (412) 228-5347 for P3 ASL users, as early as possible in the term. DRS will verify your disability and determine reasonable accommodations for this course. Many disabilities are invisible, so do not hesitate to contact DRS if you have questions.
Your well-being matters
College/Graduate school can be an exciting and challenging time for students. Taking time to maintain your well-being and seek appropriate support can help you achieve your goals and lead a fulfilling life. It can be helpful to remember that we all benefit from assistance and guidance at times, and there are many resources available to support your well-being while you are at Pitt. You are encouraged to visit Thrive@Pitt to learn more about well-being and the many campus resources available to help you thrive.
If you or anyone you know experiences overwhelming academic stress, persistent difficult feelings and/or challenging life events, you are strongly encouraged to seek support. In addition to reaching out to friends and loved ones, consider connecting with a faculty member you trust for assistance connecting to helpful resources.
The University Counseling Center is also here for you. You can call 412-648-7930 at any time to connect with a clinician. If you or someone you know is feeling suicidal, please call the University Counseling Center at any time at 412-648-7930. You can also contact Resolve Crisis Network at 888-796-8226. If the situation is life threatening, call Pitt Police at 412-624-2121 or dial 911.
Please take care of yourself and your fellow students.
The only constant is change.
This syllabus is subject to change, because data science tools are always changing and every set of students is a little different. I will be tweaking the course’s content and logistics to best serve our particular group. When (not if!) the syllabus changes, I’ll communicate as much as I can as soon as I can.
Class schedule
In the Reading column:
- Chapter numbers on their own refer to R for data science
- TMwR + chapter numbers refer to Text mining with R
Wk | Date | Unit | Topic | Reading | Due (before class @ 12n): To-do / Project milestone |
---|---|---|---|---|---|
1 | Aug 26 | Intro | Intro, Git | | |
1 | Aug 28 | Intro | Git, GitHub | Intro, 2 | #1 |
2 | Sep 2 | Intro | Data visualization with ggplot2 | 1, 6 | #2 |
2 | Sep 4 | Intro | Quarto | 28 | #3 |
3 | Sep 9 | Data wrangling | Data wrangling with dplyr | 3 | #4 |
3 | Sep 11 | Data wrangling | Tidy data with tidyr | 5 | #5 |
4 | Sep 16 | Data wrangling | Relational data (the dplyr family of join functions) | 19 | |
4 | Sep 18 | Data wrangling | Data import & export with readr | 7 | |
5 | Sep 23 | Data wrangling | Data structures | 23 | |
5 | Sep 25 | Data wrangling | Iteration with purrr | 26 | |
6 | Sep 30 | Data wrangling | Strings with stringr | 14 | |
6 | Oct 2 | Data wrangling | Regular expressions; midterm released | 15 | |
7 | Oct 7 | Intermission | Decolonizing open methods | TBA | |
7 | Oct 9 | Intermission | Review project plans; auto-coding fairness | TBA | |
8 | Oct 14 | Intermission | Large language models | TBA | |
8 | Oct 16 | Intermission | Midterm check-in | | |
9 | Oct 21 | Text data | Tidy text format with tidytext | TMwR 1 | Midterm 1st submission |
9 | Oct 23 | Text data | Word and document frequency, lemmatization | TMwR 3 | |
10 | Oct 28 | Web scraping | HTML format, rvest basics | 24 | |
10 | Oct 30 | Web scraping | Web-scraping use case | TBA | |
11 | Nov 4 | Machine learning | ML concepts | TBA | Midterm 2nd submission |
11 | Nov 6 | Machine learning | ML algorithms | TBA | |
12 | Nov 11 | Flex time | TBD based on final projects | | |
12 | Nov 13 | Flex time | TBD based on final projects | | |
13 | Nov 18 | Flex time | TBD based on final projects | | |
13 | Nov 20 | Flex time | TBD based on final projects | | |
14 | Nov 25 | No class (Thanksgiving break) | 🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃 | | |
14 | Nov 27 | No class (Thanksgiving break) | 🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃 | | |
15 | Dec 2 | Project presentations | Presentations | | |
15 | Dec 4 | Project presentations | Presentations | | |