University of Pittsburgh
Instructor: Dan Villarreal, PhD
Class time: Tues/Thurs 1–2:15pm
Class location: Cathedral of Learning 317. (Or remotely on Zoom if I get COVID.)
Office hours: Tues/Thurs 11am–12n. Dan’s office (CL 2806, hang a right if you’ve got your back to Shelome’s office), or remotely on Zoom (ping Dan on Slack first), or via Slack chat (see Communication). No appointment necessary.
Textbook: See below
Data science is a fast-growing and diverse discipline that combines programming, statistics and machine learning, scientific communication, and domain expertise. While most data scientists work in industry, a healthy understanding of data science principles and practices has become indispensable in many fields of linguistics research. As a result, this class will emphasize what linguists need to know about approaching data: how to get data into shape, how to extract a story from data, and how to communicate that story.
Students who successfully complete this course will be able to:
Your grade in this class will be based on the following assignments:
Between to-dos and project milestones, there will always be some sort of assignment between classes. These assignments will be due an hour before class (12pm). Assignments will be linked on the schedule.
This course will move fast and cover a lot of
topics (including programming topics) in a short amount of time. As
such, you will be expected to spend a large amount of time outside
of class practicing on your own; good self-study habits are essential to
successfully absorbing the course material.
However, while this
course demands familiarity with a broad range of topics, I do not expect
you to master all of them. This will be
increasingly the case as the semester proceeds and we get into
more-specific tools that may not be applicable to your own research. Given
the massive range of data science tools out there, learning to
prioritize and hone your skills selectively is an essential competency
you will need to develop.
Percentage grades will be converted to letter grades using the usual grading scale: A+ ≥ 97 > A ≥ 94 > A−, etc.
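As a rough illustration of how the cutoffs above treat boundary cases, here is a minimal R sketch. It encodes only the two thresholds stated in this syllabus (97 and 94); the function name `letter_grade()` and the catch-all label for everything below 94 are assumptions made for illustration, since the rest of the scale is not spelled out here.

```r
# Sketch only: `letter_grade()` is a hypothetical helper, not part of any package.
# Only the two cutoffs stated in the syllabus (97 and 94) are encoded;
# grades below 94 are lumped together because the rest of the scale is not listed.
letter_grade <- function(pct) {
  ifelse(pct >= 97, "A+",
         ifelse(pct >= 94, "A", "A- or lower"))
}

# Each cutoff is inclusive at the bottom: exactly 94 is an A, 93.9 is not.
letter_grade(c(97, 96.9, 94, 93.9))
```

Because the function is vectorized via `ifelse()`, it should work on a whole column of percentages at once.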
install.packages("tidyverse")
into your R console. Please update your R to version
4.2.1 so you get the latest and greatest features!
You can easily
update your packages by (1) running
`dput(rownames(installed.packages()))`
in the console prior to updating R; (2) copying the
result to your clipboard; (3) updating R itself; and (4) running
`install.packages(PASTE HERE)`.
If you’ve used RStudio with a
previous version of R, you’ll also need to point RStudio to the latest
version; go to Tools > Global Options and change the directory under
“R Version”.
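The package-update steps above can also be sketched as a short script. This is a minimal sketch, not the only way to do it: the helper `packages_to_reinstall()` and the example package names are assumptions introduced for illustration. The idea is to compare the package names you recorded before updating R against what is installed now, so that only the missing ones get reinstalled.

```r
# Sketch only: `packages_to_reinstall()` is a hypothetical helper.
# Given the package names recorded before the update and the names installed
# now, it returns the ones that still need to be installed.
packages_to_reinstall <- function(old_pkgs,
                                  current_pkgs = rownames(installed.packages())) {
  setdiff(old_pkgs, current_pkgs)
}

# Replace this example vector with the c(...) output you copied from
# dput(rownames(installed.packages())) before updating R.
old_pkgs <- c("tidyverse", "tidytext", "rvest")  # example names only

to_install <- packages_to_reinstall(old_pkgs)

# Uncomment to actually reinstall (this downloads packages from CRAN):
# install.packages(to_install)
```

Checking against what is already installed avoids re-downloading packages that ship with the new version of R.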
Clever you, finding an HTML easter egg! You might have noticed the green bird on the cover of R for Data Science, which is also our course’s avatar. That’s a kākāpō, a critically endangered New Zealand parrot; as of the start of the semester, there are only 252 kākāpō left. Please consider making a donation to New Zealand’s Kākāpō Recovery. I’ll double-match any student donations (if you donate $10, I’ll donate $20).
Email is bad. All communication will take place through our class’s Slack workspace (signup link), which you can access via a browser, a desktop app, and/or a mobile app. (You’ll need to set up a free Slack account if you don’t already have one.) Our Slack workspace is your means for asking questions about course content or logistics, and it’s where I’ll post important info about the class. If you like, you can set Slack to notify you of new messages via email. I personally find the Slack mobile app quite useful.
Slack does have a DM feature, which you should use if you want to discuss sensitive information with me (e.g., family emergencies). However, for non-sensitive info, I’d much prefer that you post in the #q-and-a channel so the entire class can benefit from your question.
I’ll respond to Slack questions ASAP during office hours (Tuesday/Thursday from 11am–12n), which you can also attend in person (CL 2806, hang a right if you’ve got your back to Shelome’s office). However, questions can be posted to Slack at any time, regardless of whether it’s in the “office hours” timeslot; please note that I generally don’t check Slack on weekends.
During our class meetings, you are encouraged to use the Slack #in-class-chat channel to ask questions, respond to discussion prompts, make a pun to lighten the mood, and anything else you would normally use Zoom chat for. This is especially useful for students who are too shy to speak up or just prefer a written mode to engage with the class.
Data science, as a young interdisciplinary field, has fully embraced an open, transparent, and collaborative mode of scientific inquiry. The overwhelming popularity of GitHub speaks volumes about this ethos. In this class, we too will adopt the principle of openness in most everything we do. However, I recognize that the course is first and foremost a learning platform, and students therefore have certain expectations of privacy in the work they submit. To balance these two considerations, the course will adopt the following tiered privacy policy:
Course materials will largely follow the same format. Most materials will be available through our class GitHub organization, and lecture recordings will be available on Canvas.
Assignments (To-dos and project milestones) are due an hour before each class (12pm). The nature of this course is that much of our work in class meetings will build directly on these assignments, so punctuality is a must.
If done properly, working together on assignments leads to better learning outcomes for all parties involved. Improper collaboration, however, negatively affects learning. In this class, group work is allowed (and even encouraged!)—provided that the following conditions are met:
Students in this course will be expected to comply with the University of Pittsburgh’s Policy on Academic Integrity. Any student suspected of violating this obligation for any reason during the semester will be required to participate in the procedural process, initiated at the instructor level, as outlined in the University Guidelines on Academic Integrity. This may include, but is not limited to, the confiscation of the examination of any individual suspected of violating University Policy. Furthermore, no student may bring any unauthorized materials to an exam, including dictionaries and programmable calculators.
To learn more about Academic Integrity, visit the Academic Integrity Guide for an overview of the topic. For hands-on practice, complete the Understanding and Avoiding Plagiarism tutorial.
If you have a disability for which you are or may be requesting an accommodation, you are encouraged to contact both your instructor and Disability Resources and Services (DRS), 140 William Pitt Union, (412) 648-7890, drsrecep@pitt.edu, (412) 228-5347 for P3 ASL users, as early as possible in the term. DRS will verify your disability and determine reasonable accommodations for this course.
During this pandemic, it is extremely important that you abide by the public health regulations, the University of Pittsburgh’s health standards and guidelines, and Pitt’s Health Rules. These rules have been developed to protect the health and safety of all of us. The University’s requirements for face coverings will at a minimum be consistent with CDC guidance and masks are required indoors (campus buildings and shuttles) on campuses in which COVID-19 Community Levels are High. This means that when COVID-19 Community Levels are High, you must wear a face covering that properly covers your nose and mouth when you are in the classroom. If you do not comply, you will be asked to leave class. It is your responsibility to have the required face covering when entering a university building or classroom. Masks are optional indoors for campuses in which county levels are Medium or Low. Be aware of your Community Level as it changes each Thursday. Read answers to frequently asked questions regarding face coverings. For the most up-to-date information and guidance, please visit the Power of Pitt site and check your Pitt email for updates before each class.
If you are required to isolate or quarantine, become sick, or are unable to come to class, contact me ASAP to discuss arrangements. You don’t necessarily need to test positive to be covered under this policy (e.g., you’re testing negative but you have symptoms and a known exposure). If I get COVID, we’ll move to online instruction.
The University of Pittsburgh does not tolerate any form of discrimination, harassment, or retaliation based on disability, race, color, religion, national origin, ancestry, genetic information, marital status, familial status, sex, age, sexual orientation, veteran status or gender identity or other factors as stated in the University’s Title IX policy. The University is committed to taking prompt action to end a hostile environment that interferes with the University’s mission. For more information about policies, procedures, and practices, visit the Civil Rights & Title IX Compliance web page.
I ask that everyone in the class strive to help ensure that other members of this class can learn in a supportive and respectful environment. If there are instances of the aforementioned issues, please contact the Title IX Coordinator, by calling 412-648-7860, or e-mailing titleixcoordinator@pitt.edu. Reports can also be filed online. You may also choose to report this to a faculty/staff member; they are required to communicate this to the University’s Office of Diversity and Inclusion. If you wish to maintain complete confidentiality, you may also contact the University Counseling Center (412-648-7930).
College/Graduate school can be an exciting and challenging time for students. Taking time to maintain your well-being and seek appropriate support can help you achieve your goals and lead a fulfilling life. It can be helpful to remember that we all benefit from assistance and guidance at times, and there are many resources available to support your well-being while you are at Pitt. You are encouraged to visit Thrive@Pitt to learn more about well-being and the many campus resources available to help you thrive.
If you or anyone you know experiences overwhelming academic stress, persistent difficult feelings and/or challenging life events, you are strongly encouraged to seek support. In addition to reaching out to friends and loved ones, consider connecting with a faculty member you trust for assistance connecting to helpful resources.
The University Counseling Center is also here for you. You can call 412-648-7930 at any time to connect with a clinician. If you or someone you know is feeling suicidal, please call the University Counseling Center at any time at 412-648-7930. You can also contact Resolve Crisis Network at 888-796-8226. If the situation is life threatening, call Pitt Police at 412-624-2121 or dial 911.
Please take care of yourself and your fellow students.
This syllabus is subject to change, because this course is still pretty new and every set of students is a little different. I will be tweaking the course’s content and logistics to best serve our particular group. When (not if!) the syllabus changes, I’ll communicate as much as I can as soon as I can. Fortunately, the pandemic situation seems to have stabilized, so this semester should be less chaotic than we’re used to!
Week | Date | Due (before class @ 12pm): To-do / Project milestone | Unit | Topic |
---|---|---|---|---|
1 | Aug 30 | | Intro | Intro, RStudio, Git |
1 | Sep 1 | #1 | | Git, GitHub |
2 | Sep 6 | #2 | | GitHub |
2 | Sep 8 | #3 | | Markdown, R Markdown |
3 | Sep 13 | #4 | Data wrangling | `ggplot2` & `dplyr` |
3 | Sep 15 | #5 | | Tidy data |
4 | Sep 20 | #6 | | Relational data |
4 | Sep 22 | #7 | | Review; Intro to data structures |
5 | Sep 27 | #8 | | Iteration |
5 | Sep 29 | #9 | | Data import & export: Rectangular data |
6 | Oct 4 | Project ideas | | Data import & export: Hierarchical data |
6 | Oct 6 | #10 | | Strings & regular expressions |
7 | Oct 11 | | | Midterm review; Discuss project plans |
7 | Oct 13 | Project plan | Data ethics | Intro to data ethics; Discuss project plans |
8 | Oct 18 | Midterm 1st submission | | ASR bias |
8 | Oct 20 | #11 | | Language models (presenters: Gianina & Sen) |
9 | Oct 25 | #12 | | Accountability to communities (presenters: Soobin & Mack) |
9 | Oct 27 | Midterm 2nd submission | Text data | Tidy text format |
10 | Nov 1 | | | Word and document frequency |
10 | Nov 3 | Progress report 1 | Data sharing | Data Sharing for Linguists |
11 | Nov 8 | #13 | | Follow-up Q&A with Lauren Collister |
11 | Nov 10 | #14 | Web scraping | Web scraping |
12 | Nov 15 | Progress report 2 | | |
12 | Nov 17 | #15 | | |
13 | Nov 22 | | No class (Thanksgiving) | 🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃 |
13 | Nov 24 | | | |
14 | Nov 29 | | Flex time (likely project workdays) | Project workday |
14 | Dec 1 | Progress report 3 | | |
15 | Dec 6 | | Project presentations | Soobin, Katherine |
15 | Dec 8 | | | Sen, Mack, Gianina |
Finals | Dec 15 | Projects due @ 2:15pm | | |