Homepage

University of Pittsburgh

Instructor: Dan Villarreal, PhD

Class time: Tues/Thurs 1–2:15pm

Class location: Cathedral of Learning 317. (Or remotely on Zoom if I get COVID.)

Office hours: Tues/Thurs 11am–12n. Dan’s office (CL 2806, hang a right if you’ve got your back to Shelome’s office), or remotely on Zoom (ping Dan on Slack first), or via Slack chat (see Communication). No appointment necessary.

Textbook: See below

Overview and course objectives

Data science is a fast-growing and diverse discipline that combines programming, statistics and machine learning, scientific communication, and domain expertise. While most data scientists work in industry, a healthy understanding of data science principles and practices has started to become indispensable in many fields of linguistics research. As a result, we will approach data science in this class with an emphasis on the principles and practices that linguists need to know about how to approach data—how to get data into shape, how to extract a story from data, and how to communicate that story.

Students who successfully complete this course will be able to:

  1. Describe data science practices for data wrangling, analysis, and reporting
  2. Utilize modern software tools for data science as practiced in linguistics research
  3. Apply data science practices and software tools to a linguistics research question

Assignments

Your grade in this class will be based on the following assignments:

Between to-dos and project milestones, there will always be some sort of assignment between classes. These assignments will be due an hour before class (12pm). Assignments will be linked on the schedule.

This course will move fast and cover a lot of topics (including programming topics) in a short amount of time. As such, you will be expected to spend a large amount of time outside of class practicing on your own; good self-study habits are essential to successfully absorbing the course material.
However, while this course demands familiarity with a broad range of topics, I do not expect you to master all of them. This will be increasingly the case as the semester proceeds and we get into more specialized tools that may not apply to your own research. Given the massive range of data science tools out there, learning to prioritize and hone your skills selectively is an essential competency you will need to develop.

Percentage grades will be converted to letter grades using the usual grading scale: A+ ≥ 97 > A ≥ 94 > A−, etc.

Requirements

Please update your R to version 4.2.1 so you get the latest and greatest features!
You can easily update your packages by (1) prior to updating R, running `dput(rownames(installed.packages()))` in the console; (2) copying the result to your clipboard; (3) updating R itself; and (4) running `install.packages(PASTE HERE)`, with the copied result pasted in place of PASTE HERE.
If you’ve used RStudio with a previous version of R, you’ll also need to point RStudio to the latest version; go to Tools > Global Options and change the directory under “R Version”.
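
If you would rather not rely on the clipboard, the same steps can be sketched with a small file instead. This is only a sketch, not an official procedure: the helper names (`packages_to_reinstall`, `save_package_list`, `restore_packages`) and the file name `old_r_packages.rds` are assumptions made up for illustration, and the file just needs to live somewhere that survives the R update.

```r
# Hypothetical helper: which previously installed packages are missing now?
packages_to_reinstall <- function(old_pkgs,
                                  current_pkgs = rownames(installed.packages())) {
  setdiff(old_pkgs, current_pkgs)
}

# Run BEFORE updating R: save the names of all installed packages to a file.
# The default path is an assumption; change it to suit your machine.
save_package_list <- function(path = "~/old_r_packages.rds") {
  saveRDS(rownames(installed.packages()), path.expand(path))
  invisible(path)
}

# Run AFTER updating R: reinstall only the packages that are not there yet.
restore_packages <- function(path = "~/old_r_packages.rds") {
  missing_pkgs <- packages_to_reinstall(readRDS(path.expand(path)))
  if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
  invisible(missing_pkgs)
}
```

To use it, call `save_package_list()` in your old version of R, update R, then call `restore_packages()` in the new version. Comparing against what is already installed skips the base packages that ship with R, which `install.packages()` would otherwise warn about.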

Clever you, finding an HTML easter egg! You might have noticed the green bird on the cover of R for data science, which is also our course’s avatar. That’s a kākāpō, a critically endangered New Zealand parrot; as of the start of the semester, there are only 252 kākāpō left. Please consider making a donation to New Zealand’s Kākāpō Recovery—I’ll double-match any student donations (if you donate $10, I’ll donate $20).

Communication

Email is bad. All communication will take place through our class’s Slack workspace (signup link), which you can access via a browser, a desktop app, and/or a mobile app. (You’ll need to set up a free Slack account if you don’t already have one.) Our Slack workspace is your means for asking questions about course content or logistics, and it’s where I’ll post important info about the class. If you like, you can set Slack to notify you of new messages via email. I personally find the Slack mobile app quite useful.

Slack does have a DM feature, which you should use if you want to discuss sensitive information with me (e.g., family emergencies). However, for non-sensitive info, I’d much prefer that you post in the #q-and-a channel so the entire class can benefit from your question.

I’ll respond to Slack questions ASAP during office hours (Tuesday/Thursday from 11am–12n), which you can also attend in person (CL 2806, hang a right if you’ve got your back to Shelome’s office). However, questions can be posted to Slack at any time, regardless of whether it’s in the “office hours” timeslot; please note that I generally don’t check Slack on weekends.

During our class meetings, you are encouraged to use the Slack #in-class-chat channel to ask questions, respond to discussion prompts, make a pun to lighten the mood, and do anything else you would normally use Zoom chat for. This is especially useful for students who are too shy to speak up or who just prefer a written mode of engaging with the class.

Course policies

The openness principle, your work, and privacy

Data science, as a young interdisciplinary field, has fully embraced the principle of an open, transparent, and collaborative mode of scientific inquiry. The overwhelming popularity of GitHub speaks volumes to this ethos. In this class, we too will adopt the principle of openness in most everything we do. However, I recognize that the course is foremost a learning platform and therefore there exist certain expectations of privacy in the work students submit. In order to balance the two considerations, the course will adopt the following tiered privacy policy:

  1. Fully public: Individual students’ projects will be completely open: projects will be developed and submitted via a public GitHub repository in our class GitHub organization, where they are open to the world to view. Some To-dos may also fall under this designation.
  2. Private to the world, shared within class: Submission of most To-dos as well as distribution of data sets with limited access will be done via a private GitHub repository. Students’ work therefore will be fully visible to their classmates but won’t be shared with the world. This promotes collaborative learning among students while ensuring a reasonable level of privacy for their work.
  3. Private: Our class Canvas site will be utilized for the aspects of course work that should remain strictly private between a student and the instructor. Those include posting of grades and feedback on student performance.

Course materials will largely follow the same format. Most materials will be available through our class GitHub organization, and lecture recordings will be available on Canvas.

Late work

Assignments (To-dos and project milestones) are due an hour before each class (12pm). The nature of this course is that much of our work in class meetings will build directly on these assignments, so punctuality is a must.

Collaboration on Assignments

If done properly, working together on assignments leads to better learning outcomes for all parties involved. Improper collaboration, however, negatively affects learning. In this class, group work is allowed (and even encouraged!)—provided that the following conditions are met:

  1. Equal contribution: One student’s contribution must not exceed 150% of another’s.
  2. Individual work before a study group: If you work with other students out of class, do not show up to the study group without having worked on the assignment on your own beforehand.
  3. Individual work after a study group: Do not write up your assignments while working in a group, which leads to copying others’ answers. Always finish up your answers by yourself afterwards, using your own words.
  4. Do NOT pass files: Do not, under any circumstances, send or receive script files prior to submitting your work.
  5. Disclosure: You must disclose any classmates you worked with.
  6. Not on the midterm: Your initial answers to the midterm must be your own work.

Academic integrity

Students in this course will be expected to comply with the University of Pittsburgh’s Policy on Academic Integrity. Any student suspected of violating this obligation for any reason during the semester will be required to participate in the procedural process, initiated at the instructor level, as outlined in the University Guidelines on Academic Integrity. This may include, but is not limited to, the confiscation of the examination of any individual suspected of violating University Policy. Furthermore, no student may bring any unauthorized materials to an exam, including dictionaries and programmable calculators.

To learn more about Academic Integrity, visit the Academic Integrity Guide for an overview of the topic. For hands-on practice, complete the Understanding and Avoiding Plagiarism tutorial.

Disability resources

If you have a disability for which you are or may be requesting an accommodation, you are encouraged to contact both your instructor and Disability Resources and Services (DRS), 140 William Pitt Union, (412) 648-7890, drsrecep@pitt.edu, (412) 228-5347 for P3 ASL users, as early as possible in the term. DRS will verify your disability and determine reasonable accommodations for this course.

Coronavirus health and safety

During this pandemic, it is extremely important that you abide by the public health regulations, the University of Pittsburgh’s health standards and guidelines, and Pitt’s Health Rules. These rules have been developed to protect the health and safety of all of us. The University’s requirements for face coverings will at a minimum be consistent with CDC guidance and masks are required indoors (campus buildings and shuttles) on campuses in which COVID-19 Community Levels are High. This means that when COVID-19 Community Levels are High, you must wear a face covering that properly covers your nose and mouth when you are in the classroom. If you do not comply, you will be asked to leave class. It is your responsibility to have the required face covering when entering a university building or classroom. Masks are optional indoors for campuses in which county levels are Medium or Low. Be aware of your Community Level as it changes each Thursday. Read answers to frequently asked questions regarding face coverings. For the most up-to-date information and guidance, please visit the Power of Pitt site and check your Pitt email for updates before each class.

If you are required to isolate or quarantine, become sick, or are unable to come to class, contact me ASAP to discuss arrangements. You don’t necessarily need to test positive to be covered under this policy (e.g., you’re testing negative but you have symptoms and a known exposure). If I get COVID, we’ll move to online instruction.

Diversity and inclusion

The University of Pittsburgh does not tolerate any form of discrimination, harassment, or retaliation based on disability, race, color, religion, national origin, ancestry, genetic information, marital status, familial status, sex, age, sexual orientation, veteran status or gender identity or other factors as stated in the University’s Title IX policy. The University is committed to taking prompt action to end a hostile environment that interferes with the University’s mission. For more information about policies, procedures, and practices, visit the Civil Rights & Title IX Compliance web page.

I ask that everyone in the class strive to help ensure that other members of this class can learn in a supportive and respectful environment. If there are instances of the aforementioned issues, please contact the Title IX Coordinator, by calling 412-648-7860, or e-mailing titleixcoordinator@pitt.edu. Reports can also be filed online. You may also choose to report this to a faculty/staff member; they are required to communicate this to the University’s Office of Diversity and Inclusion. If you wish to maintain complete confidentiality, you may also contact the University Counseling Center (412-648-7930).

Your well-being matters

College/Graduate school can be an exciting and challenging time for students. Taking time to maintain your well-being and seek appropriate support can help you achieve your goals and lead a fulfilling life. It can be helpful to remember that we all benefit from assistance and guidance at times, and there are many resources available to support your well-being while you are at Pitt. You are encouraged to visit Thrive@Pitt to learn more about well-being and the many campus resources available to help you thrive.

If you or anyone you know experiences overwhelming academic stress, persistent difficult feelings and/or challenging life events, you are strongly encouraged to seek support. In addition to reaching out to friends and loved ones, consider connecting with a faculty member you trust for assistance connecting to helpful resources.

The University Counseling Center is also here for you. You can call 412-648-7930 at any time to connect with a clinician. If you or someone you know is feeling suicidal, please call the University Counseling Center at any time at 412-648-7930. You can also contact Resolve Crisis Network at 888-796-8226. If the situation is life threatening, call Pitt Police at 412-624-2121 or dial 911.

Please take care of yourself and your fellow students.

We’re building the plane as we’re flying it.

This syllabus is subject to change, because this course is still pretty new and every set of students is a little different. I will be tweaking the course’s content and logistics to best serve our particular group. When (not if!) the syllabus changes, I’ll communicate as much as I can as soon as I can. Fortunately, the pandemic situation seems to have stabilized, so this semester should be less chaotic than we’re used to!

Class schedule

| Week | Date | Due (before class @ 12pm): To-do or project milestone | Unit | Topic |
|---|---|---|---|---|
| 1 | Aug 30 | | Intro | Intro, RStudio, Git |
| | Sep 1 | #1 | | Git, GitHub |
| 2 | Sep 6 | #2 | | GitHub |
| | Sep 8 | #3 | | Markdown, R Markdown |
| 3 | Sep 13 | #4 | Data wrangling | `ggplot2` & `dplyr` |
| | Sep 15 | #5 | | Tidy data |
| 4 | Sep 20 | #6 | | Relational data |
| | Sep 22 | #7 | | Review; Intro to data structures |
| 5 | Sep 27 | #8 | | Iteration |
| | Sep 29 | #9 | | Data import & export: Rectangular data |
| 6 | Oct 4 | Project ideas | | Data import & export: Hierarchical data |
| | Oct 6 | #10 | | Strings & regular expressions |
| 7 | Oct 11 | | | Midterm review; Discuss project plans |
| | Oct 13 | Project plan | Data ethics | Intro to data ethics; Discuss project plans |
| 8 | Oct 18 | Midterm 1st submission | | ASR bias |
| | Oct 20 | #11 | | Language models (presenters: Gianina & Sen) |
| 9 | Oct 25 | #12 | | Accountability to communities (presenters: Soobin & Mack) |
| | Oct 27 | Midterm 2nd submission | Text data | Tidy text format |
| 10 | Nov 1 | | | Word and document frequency |
| | Nov 3 | Progress report 1 | Data sharing | Data Sharing for Linguists |
| 11 | Nov 8 | #13 | | Follow-up Q&A with Lauren Collister |
| | Nov 10 | #14 | Web scraping | Web scraping |
| 12 | Nov 15 | Progress report 2 | | |
| | Nov 17 | #15 | | |
| 13 | Nov 22 | | | No class (Thanksgiving) 🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃🦃 |
| | Nov 24 | | | |
| 14 | Nov 29 | | Flex time (likely project workdays) | Project workday |
| | Dec 1 | Progress report 3 | | |
| 15 | Dec 6 | | Project presentations | Soobin, Katherine |
| | Dec 8 | | | Sen, Mack, Gianina |
| Finals | Dec 15 | Projects due @ 2:15pm | | |