Homepage

University of Pittsburgh

Instructor: Dan Villarreal, PhD

Class time: Tues/Thurs 1–2:15pm

Class location: First two weeks are virtual on Zoom; after that, you may attend class in person in Posvar Hall 5405 or remotely on Zoom.

Office hours: W 3–5pm. You may attend office hours remotely on Zoom or via Slack chat (see Communication). No appointment necessary.

Overview and course objectives

Data science is a fast-growing and diverse discipline that combines programming, statistics and machine learning, scientific communication, and domain expertise. While most data scientists work in industry, a healthy understanding of data science principles and practices has started to become indispensable in many fields of linguistics research. As a result, we will approach data science in this class with an emphasis on the principles and practices that linguists need to know about how to approach data—how to get data into shape, how to extract a story from data, and how to communicate that story.

Students who successfully complete this course will be able to:

  1. Describe data science practices for data wrangling, analysis, and reporting
  2. Utilize modern software tools for data science as practiced in linguistics research
  3. Apply data science practices and software tools to a linguistics research question

Assignments

Your grade in this class will be based on the following assignments:

Between to-dos and project milestones, there will always be some sort of assignment between classes. These assignments will be due an hour before class (12pm). Assignments will be linked on the schedule.

This course will move fast and cover a lot of topics (including programming topics) in a short amount of time. As such, you will be expected to spend a large amount of time outside of class practicing on your own; good self-study habits are essential to successfully absorbing the course material.
However, while this course is demanding in breadth of topical familiarity, I do not expect you to necessarily achieve a breadth of topical mastery. Or more plainly: Don’t freak out if it feels like the class is moving too fast for you, because everyone will take different things out of it! Given the massive range of data science tools out there, learning to prioritize and hone your skills selectively is an essential competency you will need to develop.

Percentage grades will be converted to letter grades using the usual grading scale: A+ ≥ 97 > A ≥ 94 > A−, etc.

Requirements

Please update your R to version 4.1.1 so you get the latest and greatest features!
You can easily update your packages by (1) prior to updating R, running dput(rownames(installed.packages())) in the console; (2) copying the result to your clipboard; (3) updating R itself; and (4) running install.packages() with the list of packages pasted between ( and ).
If you’ve used RStudio with a previous version of R, you’ll also need to point RStudio to the latest version: go to Tools > Global Options and change the directory under “R Version”.
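
If you'd rather not rely on the clipboard, the package-update steps above can be sketched with a small file instead. This is a variation on the steps, not a replacement for them: the file name installed-pkgs.rds is a hypothetical choice, and the sketch assumes you run the first part in your old R version and the second part in the new one, from the same working directory.

```r
# Hypothetical file name for the saved package list; any writable path works.
pkg_file <- "installed-pkgs.rds"

# --- Run this part BEFORE updating R (in the old version) ---
# Record the names of every package currently installed.
pkgs <- rownames(installed.packages())
saveRDS(pkgs, pkg_file)

# --- Run this part AFTER updating R (in the new version) ---
# Read the saved list back and find packages the new R doesn't have yet.
old_pkgs <- readRDS(pkg_file)
missing_pkgs <- setdiff(old_pkgs, rownames(installed.packages()))

# Install only what is missing; nothing happens if everything is present.
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

Because only missing packages are installed, the second part is safe to re-run if an installation is interrupted.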

If you’re reading this, you might have noticed the green bird on the cover of R for Data Science, which is also our course’s avatar. That’s a kākāpō, a critically endangered New Zealand parrot; as of the start of the semester, there are only 201 kākāpō left. Please consider making a donation to New Zealand’s Kākāpō Recovery.

Communication

Email is bad. All communication will take place through our class’s Slack workspace, which you can access via a browser, a desktop app, and/or a mobile app. (You’ll need to set up a free Slack account if you don’t already have one.) Our Slack workspace is your means for asking questions about course content or logistics, and it’s where I’ll post important info about the class. If you like, you can set Slack to notify you of new messages via email. I personally find the Slack mobile app quite useful.

Slack does have a DM feature, which you should use if you want to discuss sensitive information with me (e.g., family emergencies). However, for non-sensitive info, I’d much prefer that you post in the #q-and-a channel so the entire class can benefit from your question.

I’ll respond to Slack questions ASAP during office hours (every Wednesday from 3–5pm), which you can also attend via Zoom. However, questions can be posted to Slack at any time, regardless of whether it’s in the “office hours” timeslot.

During our class meetings, you are encouraged to use the Slack #in-class-chat channel to ask questions, respond to discussion prompts, and anything else you would normally use Zoom chat for. This is especially useful for students who are too shy to speak up or just prefer a written mode to engage with the class.

Course policies

The openness principle, your work, and privacy

Data science, as a young interdisciplinary field, has fully embraced the principle of an open, transparent, and collaborative mode of scientific inquiry. The overwhelming popularity of GitHub speaks volumes to this ethos. In this class, we too will adopt the principle of openness in most everything we do. However, I recognize that the course is foremost a learning platform and therefore there exist certain expectations of privacy in the work students submit. In order to balance the two considerations, the course will adopt the following tiered privacy policy:

  1. Fully public: Individual students’ projects will be completely open: projects will be developed and submitted via a GitHub public repository, where they are open to the world to view. Some To-dos may also fall under this designation.
  2. Private to the world, shared within class: Submission of most To-dos as well as distribution of data sets with limited access will be done via a private GitHub repository. Students’ work therefore will be fully visible to their classmates but won’t be shared with the world. This promotes collaborative learning among students while ensuring a reasonable level of privacy for their work.
  3. Private: Our class Canvas site will be utilized for the aspects of course work that should remain strictly private between a student and the instructor. Those include posting of grades and feedback on student performance.

Course materials will largely follow the same format. Most materials will be available through our class GitHub organization, and lecture recordings will be available on Canvas.

Late work

Assignments (To-dos and project milestones) are due an hour before each class (12pm). The nature of this course is that much of our work in class meetings will build directly on these assignments, so punctuality is a must.

Collaboration on Assignments

If done properly, working together on assignments leads to better learning outcomes for all parties involved. Improper collaboration, however, negatively affects learning. In this class, group work is allowed (and even encouraged!)—provided that the following conditions are met:

  1. Equal contribution: One student’s contribution must not exceed 150% of another’s.
  2. Individual work before a study group: If you work with other students out of class, do not show up to the study group without having worked on the assignment on your own beforehand.
  3. Individual work after a study group: Do not write up your assignments while working in a group, which leads to copying others’ answers. Always finish up your answers by yourself afterwards, using your own words.
  4. Do NOT pass files: Do not, under any circumstances, send or receive script files prior to submitting your work.
  5. Disclosure: You must disclose any classmates you worked with.

Academic integrity

Students in this course will be expected to comply with the University of Pittsburgh’s Policy on Academic Integrity. Any student suspected of violating this obligation for any reason during the semester will be required to participate in the procedural process, initiated at the instructor level, as outlined in the University Guidelines on Academic Integrity. This may include, but is not limited to, the confiscation of the examination of any individual suspected of violating University Policy. Furthermore, no student may bring any unauthorized materials to an exam, including dictionaries and programmable calculators. To learn more about Academic Integrity, visit the Academic Integrity Guide for an overview of the topic. For hands-on practice, complete the Understanding and Avoiding Plagiarism tutorial.

Disability resources

If you have a disability for which you are or may be requesting an accommodation, you are encouraged to contact both your instructor and Disability Resources and Services (DRS), 140 William Pitt Union, (412) 648-7890, drsrecep@pitt.edu, (412) 228-5347 for P3 ASL users, as early as possible in the term. DRS will verify your disability and determine reasonable accommodations for this course.

Coronavirus health and safety

During this pandemic, it is extremely important that you abide by the public health regulations, the University of Pittsburgh’s health standards and guidelines, and Pitt’s Health Rules. These rules have been developed to protect the health and safety of all of us. Universal face covering is required in all classrooms and in every building on campus, without exceptions, regardless of vaccination status. This means you must wear a face covering that properly covers your nose and mouth when you are in the classroom. If you do not comply, you will be asked to leave class. It is your responsibility to have the required face covering when entering a university building or classroom. For the most up-to-date information and guidance, please visit coronavirus.pitt.edu and check your Pitt email for updates before each class.

If you are required to isolate or quarantine, become sick, or are unable to come to class, contact me ASAP to discuss arrangements.

Diversity and inclusion

The University of Pittsburgh does not tolerate any form of discrimination, harassment, or retaliation based on disability, race, color, religion, national origin, ancestry, genetic information, marital status, familial status, sex, age, sexual orientation, veteran status or gender identity or other factors as stated in the University’s Title IX policy. The University is committed to taking prompt action to end a hostile environment that interferes with the University’s mission. For more information about policies, procedures, and practices, see http://diversity.pitt.edu/affirmativeaction/policies-procedures-and-practices.

I ask that everyone in the class strive to help ensure that other members of this class can learn in a supportive and respectful environment. If there are instances of the aforementioned issues, please contact the Title IX Coordinator by calling 412-648-7860 or e-mailing titleixcoordinator@pitt.edu. Reports can also be filed online. You may also choose to report this to a faculty/staff member; they are required to communicate this to the University’s Office of Diversity and Inclusion. If you wish to maintain complete confidentiality, you may also contact the University Counseling Center (412-648-7930).

Your well-being matters

College and grad school can be an exciting and challenging time for students. Taking time to care for yourself and seeking appropriate support can help you achieve your academic and professional goals.

It can be helpful to remember that we all benefit from assistance and guidance at times, and there are many resources available to support your well-being while you are at Pitt. If you or anyone you know experiences overwhelming academic stress, persistent difficult feelings and/or challenging life events, you are strongly encouraged to seek support. In addition to reaching out to friends and loved ones, consider connecting with a faculty member you trust for assistance connecting to helpful resources. The University Counseling Center is also here for you. You can call 412-648-7930 at any time to connect with a clinician.

If you or someone you know is feeling suicidal, please call the University Counseling Center at any time at 412-648-7930. You can also contact Resolve Crisis Network at 888-796-8226. If the situation is life threatening, call Pitt Police at 412-624-2121 or dial 911.

Please take care of yourself and your fellow students.

The only constant is change.

This syllabus is subject to change, because—well, because these are some mighty uncertain times. Regardless of how things are going out in the wider world, every group of students is a little different, so I’ll be tweaking the course’s content and logistics to best serve our particular group. If—when—the syllabus changes, I’ll communicate as much as I can as soon as I can, keeping in mind that the pandemic situation is constantly evolving. In the meantime, let’s all practice patience and understanding, and we’ll all get through this in one piece.

Class schedule

To-dos and project milestones are due before class @ 12pm.

| Week | Date | To-do | Project milestone | Unit | Topic |
|---|---|---|---|---|---|
| 1 | Aug 31 | | | Intro | Intro, Git |
| | Sep 2 | #1 | | | GitHub, R for DS |
| 2 | Sep 7 | #2 | | Data wrangling | Data visualization |
| | Sep 9 | #3 | | | Data transformation |
| 3 | Sep 14 | #4 | | | Review: Data visualization & transformation |
| | Sep 16 | #5 | | | Tidy data |
| 4 | Sep 21 | #6 | | | Data import/Relational data |
| | Sep 23 | #7 | | | Strings, regular expressions |
| 5 | Sep 28 | #8 | | | Git conflicts |
| | Sep 30 | | | | More regex |
| 6 | Oct 5 | | Project ideas | | Programming for data science |
| | Oct 7 | #9 | | Data publishing | Open access, data publishing (guest speaker Lauren Collister) |
| 7 | Oct 12 | | Project plan | Speech data | Praat scripting (self-study) |
| | Oct 14 | #10 | | | Praat & FastTrack |
| 8 | Oct 19 | #11 | | | ASR, forced alignment |
| | Oct 21 | #12 | | Text data | Web scraping |
| 9 | Oct 26 | | | | |
| | Oct 28 | | Progress report 1 | | Text analysis |
| 10 | Nov 2 | #13 | | Modeling | Mixed-effects regression |
| | Nov 4 | | | | Model comparison |
| 11 | Nov 9 | | | Machine learning | Classification |
| | Nov 11 | | | | Classification methods |
| 12 | Nov 16 | | Progress report 2 | | More ML |
| | Nov 18 | | | | |
| 13 | Nov 23 | | | | No class (Thanksgiving) 🦃 |
| | Nov 25 | | | | |
| 14 | Nov 30 | | | Flex time | Project workday |
| | Dec 2 | | Progress report 3 | | Project half-workday; ASR-assisted transcription |
| 15 | Dec 7 | | | Project presentations | Presentations: Rossina, Shaohua, Miroo |
| | Dec 9 | | | | Presentations: Yan, Angela, Joe, (Katherine if time) |
| Finals | Dec 15 | | Projects due @ 2pm | | |