University of Pittsburgh
Instructor: Dan Villarreal, PhD
Class time: Tues/Thurs 1–2:15pm
Class location: The first two weeks are virtual on Zoom; after that, you may attend class in person in Posvar Hall 5405 or remotely on Zoom.
Office hours: W 3–5pm. You may attend office hours remotely on Zoom or via Slack chat (see Communication). No appointment necessary.
Data science is a fast-growing and diverse discipline that combines programming, statistics and machine learning, scientific communication, and domain expertise. While most data scientists work in industry, a healthy understanding of data science principles and practices has started to become indispensable in many fields of linguistics research. As a result, we will approach data science in this class with an emphasis on the principles and practices that linguists need to know about how to approach data—how to get data into shape, how to extract a story from data, and how to communicate that story.
Students who successfully complete this course will be able to:
Your grade in this class will be based on the following assignments:
Between to-dos and project milestones, there will always be some sort of assignment between classes. These assignments will be due an hour before class (12pm). Assignments will be linked on the schedule.
This course will move fast and cover a lot of topics (including programming topics) in a short amount of time. As such, you will be expected to spend a large amount of time outside of class practicing on your own; good self-study habits are essential to successfully absorbing the course material.
However, while this course is demanding in breadth of topical familiarity, I do not expect you to necessarily achieve a breadth of topical mastery. Or more plainly: Don’t freak out if it feels like the class is moving too fast for you, because everyone will take different things out of it! Given the massive range of data science tools out there, learning to prioritize and hone your skills selectively is an essential competency you will need to develop.
Percentage grades will be converted to letter grades using the usual grading scale: A+ ≥ 97 > A ≥ 94 > A−, etc.
(To install the tidyverse, type install.packages("tidyverse") into your R console.) Please update your R to version 4.1.1 so you get the latest and greatest features!
You can easily reinstall your packages after updating by (1) prior to updating R, running dput(rownames(installed.packages())) in the console; (2) copying the result to your clipboard; (3) updating R itself; and (4) running install.packages() with the list of packages pasted between ( and ).
If you’ve used RStudio with a previous version of R, you’ll also need to point RStudio to the latest version: go to Tools > Global Options and change the directory under “R Version”.
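The package-reinstall steps above can be sketched in R roughly as follows. This is only an illustration under stated assumptions, not part of the official course setup: the helper name packages_to_reinstall and the example package names are hypothetical, and the final install.packages() call is left commented out so that nothing gets installed by accident.

```r
# Step 1 (run in the OLD version of R): print the names of all installed
# packages as R code, then copy the printed c("...", "...") vector.
dput(rownames(installed.packages()))

# Hypothetical helper (not from the course materials): given the saved
# package names, return only those that are not installed yet.
packages_to_reinstall <- function(saved,
                                  installed = rownames(installed.packages())) {
  setdiff(saved, installed)
}

# Step 4 (run in the NEW version of R): paste the copied vector here.
saved_pkgs <- c("tidyverse", "rmarkdown")  # example names only; use your own list
missing_pkgs <- packages_to_reinstall(saved_pkgs)

# Uncomment to actually install whatever is still missing:
# install.packages(missing_pkgs)
```

Filtering with setdiff() is a design choice rather than a requirement: it skips packages that already ship with the new version of R (such as base and stats), which install.packages() would otherwise warn about.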
If you’re reading this, you might have noticed the green bird on the cover of R for Data Science, which is also our course’s avatar. That’s a kākāpō, a critically endangered New Zealand parrot; as of the start of the semester, there are only 201 kākāpō left. Please consider making a donation to New Zealand’s Kākāpō Recovery.
Email is bad. All communication will take place through our class’s Slack workspace, which you can access via a browser, a desktop app, and/or a mobile app. (You’ll need to set up a free Slack account if you don’t already have one.) Our Slack workspace is your means for asking questions about course content or logistics, and it’s where I’ll post important info about the class. If you like, you can set Slack to notify you of new messages via email. I personally find the Slack mobile app quite useful.
Slack does have a DM feature, which you should use if you want to discuss sensitive information with me (e.g., family emergencies). However, for non-sensitive info, I’d much prefer that you post in the #q-and-a channel so the entire class can benefit from your question.
I’ll respond to Slack questions ASAP during office hours (every Wednesday from 3–5pm), which you can also attend via Zoom. However, questions can be posted to Slack at any time, regardless of whether it’s in the “office hours” timeslot.
During our class meetings, you are encouraged to use the Slack #in-class-chat channel to ask questions, respond to discussion prompts, and do anything else you would normally use Zoom chat for. This is especially useful for students who are too shy to speak up or who simply prefer to engage with the class in writing.
Data science, as a young interdisciplinary field, has fully embraced the principle of an open, transparent, and collaborative mode of scientific inquiry. The overwhelming popularity of GitHub speaks volumes about this ethos. In this class, we too will adopt the principle of openness in most everything we do. However, I recognize that the course is first and foremost a learning platform, and students therefore have certain expectations of privacy in the work they submit. To balance the two considerations, the course will adopt the following tiered privacy policy:
Course materials will largely follow the same format. Most materials will be available through our class GitHub organization, and lecture recordings will be available on Canvas.
Assignments (To-dos and project milestones) are due an hour before each class (12pm). The nature of this course is that much of our work in class meetings will build directly on these assignments, so punctuality is a must.
If done properly, working together on assignments leads to better learning outcomes for all parties involved. Improper collaboration, however, negatively affects learning. In this class, group work is allowed (and even encouraged!)—provided that the following conditions are met:
Students in this course will be expected to comply with the University of Pittsburgh’s Policy on Academic Integrity. Any student suspected of violating this obligation for any reason during the semester will be required to participate in the procedural process, initiated at the instructor level, as outlined in the University Guidelines on Academic Integrity. This may include, but is not limited to, the confiscation of the examination of any individual suspected of violating University Policy. Furthermore, no student may bring any unauthorized materials to an exam, including dictionaries and programmable calculators. To learn more about Academic Integrity, visit the Academic Integrity Guide for an overview of the topic. For hands-on practice, complete the Understanding and Avoiding Plagiarism tutorial.
If you have a disability for which you are or may be requesting an accommodation, you are encouraged to contact both your instructor and Disability Resources and Services (DRS), 140 William Pitt Union, (412) 648-7890, drsrecep@pitt.edu, (412) 228-5347 for P3 ASL users, as early as possible in the term. DRS will verify your disability and determine reasonable accommodations for this course.
During this pandemic, it is extremely important that you abide by the public health regulations, the University of Pittsburgh’s health standards and guidelines, and Pitt’s Health Rules. These rules have been developed to protect the health and safety of all of us. Universal face covering is required in all classrooms and in every building on campus, without exceptions, regardless of vaccination status. This means you must wear a face covering that properly covers your nose and mouth when you are in the classroom. If you do not comply, you will be asked to leave class. It is your responsibility to have the required face covering when entering a university building or classroom. For the most up-to-date information and guidance, please visit coronavirus.pitt.edu and check your Pitt email for updates before each class.
If you are required to isolate or quarantine, become sick, or are unable to come to class, contact me ASAP to discuss arrangements.
The University of Pittsburgh does not tolerate any form of discrimination, harassment, or retaliation based on disability, race, color, religion, national origin, ancestry, genetic information, marital status, familial status, sex, age, sexual orientation, veteran status or gender identity or other factors as stated in the University’s Title IX policy. The University is committed to taking prompt action to end a hostile environment that interferes with the University’s mission. For more information about policies, procedures, and practices, see http://diversity.pitt.edu/affirmativeaction/policies-procedures-and-practices.
I ask that everyone in the class strive to help ensure that other members of this class can learn in a supportive and respectful environment. If there are instances of the aforementioned issues, please contact the Title IX Coordinator, by calling 412-648-7860, or e-mailing titleixcoordinator@pitt.edu. Reports can also be filed online. You may also choose to report this to a faculty/staff member; they are required to communicate this to the University’s Office of Diversity and Inclusion. If you wish to maintain complete confidentiality, you may also contact the University Counseling Center (412-648-7930).
College and grad school can be an exciting and challenging time for students. Taking time to care for yourself and seeking appropriate support can help you achieve your academic and professional goals.
It can be helpful to remember that we all benefit from assistance and guidance at times, and there are many resources available to support your well-being while you are at Pitt. If you or anyone you know experiences overwhelming academic stress, persistent difficult feelings and/or challenging life events, you are strongly encouraged to seek support. In addition to reaching out to friends and loved ones, consider connecting with a faculty member you trust for assistance connecting to helpful resources. The University Counseling Center is also here for you. You can call 412-648-7930 at any time to connect with a clinician.
If you or someone you know is feeling suicidal, please call the University Counseling Center at any time at 412-648-7930. You can also contact Resolve Crisis Network at 888-796-8226. If the situation is life threatening, call Pitt Police at 412-624-2121 or dial 911.
Please take care of yourself and your fellow students.
This syllabus is subject to change, because—well, because these are some mighty uncertain times. Regardless of how things are going out in the wider world, every group of students is a little different, so I’ll be tweaking the course’s content and logistics to best serve our particular group. If—when—the syllabus changes, I’ll communicate as much as I can as soon as I can, keeping in mind that the pandemic situation is constantly evolving. In the meantime, let’s all practice patience and understanding, and we’ll all get through this in one piece.
Week | Date | Due (before class @ 12pm): To-do / Project milestone | Unit | Topic |
---|---|---|---|---|
1 | Aug 31 | | Intro | Intro, Git |
1 | Sep 2 | #1 | | GitHub, R for DS |
2 | Sep 7 | #2 | Data wrangling | Data visualization |
2 | Sep 9 | #3 | | Data transformation |
3 | Sep 14 | #4 | | Review: Data visualization & transformation |
3 | Sep 16 | #5 | | Tidy data |
4 | Sep 21 | #6 | | Data import/Relational data |
4 | Sep 23 | #7 | | Strings, regular expressions |
5 | Sep 28 | #8 | | Git conflicts |
5 | Sep 30 | | | More regex |
6 | Oct 5 | Project ideas | | Programming for data science |
6 | Oct 7 | #9 | Data publishing | Open access, data publishing (guest speaker Lauren Collister) |
7 | Oct 12 | Project plan | Speech data | Praat scripting (self-study) |
7 | Oct 14 | #10 | | Praat & FastTrack |
8 | Oct 19 | #11 | | ASR, forced alignment |
8 | Oct 21 | #12 | Text data | Web scraping |
9 | Oct 26 | | | |
9 | Oct 28 | Progress report 1 | | Text analysis |
10 | Nov 2 | #13 | Modeling | Mixed-effects regression |
10 | Nov 4 | | | Model comparison |
11 | Nov 9 | | Machine learning | Classification |
11 | Nov 11 | | | Classification methods |
12 | Nov 16 | Progress report 2 | | More ML |
12 | Nov 18 | | | |
13 | Nov 23 | | No class (Thanksgiving) | 🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃 |
13 | Nov 25 | | | |
14 | Nov 30 | | Flex time | Project workday |
14 | Dec 2 | Progress report 3 | | Project half-workday; ASR-assisted transcription |
15 | Dec 7 | | Project presentations | Presentations: Rossina, Shaohua, Miroo |
15 | Dec 9 | | | Presentations: Yan, Angela, Joe, (Katherine if time) |
Finals | Dec 15 | Projects due @ 2pm | | |