Syllabus - Core Principles of Data Science / Fall 2024

Course Description

Modern technology has led to the generation of unprecedented amounts of data, prompting the need to train researchers to leverage data for decision-making in public health and medicine. This course assumes no prior knowledge and serves as a gentle, practical introduction to wrangling, visualizing, and modeling data using the R statistical programming language. We also emphasize the importance of reproducible research and effective data science communication.

Learning Objectives

Upon successful completion of this course, you should be able to:

Write reproducible code using the statistical programming language R
Clean and wrangle data for downstream analysis
Perform exploratory data analysis, including visualizations
Apply machine learning models for regression and classification
Interpret and communicate key results

What is the structure of this course?

This course will be held synchronously and in-person. We encourage students to attend class and participate, but attendance is not required. All lectures and lab sessions will be recorded and available on the course Canvas site.

Grades will be based on:

4 homework assignments (40%)
1 take-home midterm (25%)
1 final project (35%)
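
As a rough illustration of how these weights combine, the sketch below computes a weighted course grade in R. The component scores are made up for the example, and it assumes each component is reported on a 0-100 scale; how individual homework scores are combined into the homework component is not specified here.

```r
# Component weights as stated in the syllabus (they sum to 1).
weights <- c(homework = 0.40, midterm = 0.25, project = 0.35)

# Hypothetical component scores on a 0-100 scale, for illustration only.
scores <- c(homework = 90, midterm = 80, project = 85)

# Weighted average: each score times its weight, then summed.
# Indexing weights by name keeps the two vectors aligned.
course_grade <- sum(scores * weights[names(scores)])

# 36 + 20 + 29.75 = 85.75
print(course_grade)
```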

Homeworks

All homeworks will involve writing code and communicating results. Students must submit both the RMarkdown file (.Rmd file) AND the knitted HTML file (.html file) associated with each assignment in their individual repository. A private repository for each assignment will be created for each student and will only be visible to the student and course teaching staff.

Each student is given two late days per homework assignment. A late day extends the individual homework deadline by 24 hours without penalty. No more than two late days may be used on any one assignment. Late days are intended to provide flexibility; they can be used for any reason, no questions asked. Students don’t get any bonus points for not using late days. Late days can only be used for individual homework deadlines; all other deadlines (e.g., project milestones, midterm exam) are firm.

Even if students exceed their late days, we will accept their homework but will deduct 10% (10 points) for each extra late day.
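
The late-day rule can be sketched as a small R function. This is only an illustration of the arithmetic: the function name and arguments are hypothetical, and it assumes lateness is counted in whole 24-hour days and that a homework is scored out of 100 points, so 10% equals 10 points.

```r
# Points deducted for a late homework, assuming whole days and a 100-point scale.
# free_late_days and points_per_extra_day mirror the policy stated above.
late_penalty <- function(days_late, free_late_days = 2, points_per_extra_day = 10) {
  # Only days beyond the free late days are penalized.
  extra_days <- max(0, days_late - free_late_days)
  extra_days * points_per_extra_day
}

late_penalty(1)  # within the two free late days -> 0 points deducted
late_penalty(4)  # 2 extra days -> 20 points deducted
```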

Due to the unpredictable nature of COVID-19, students in need of extra time to complete assignments should reach out to Student Affairs at StudentAffairs[at]hsph.harvard.edu. A staff member will work with you and Dr. Mattie to accommodate you. You can also contact Student Affairs if you have a learning disability that requires accommodations. We will ensure you are accommodated as needed.

The teaching fellows (TFs) must be able to knit submitted RMarkdown files. If a file cannot be knit during grading, a penalty applies that increases with each subsequent homework assignment, as outlined below. To avoid this, students should use relative paths to files, data, images, etc., rather than absolute paths (paths specific to your computer). Examples of how to include paths will be given in lecture and lab sessions. Students may also double-check with the teaching staff before submitting assignments.

0 points for HW1
5 points for HW2
10 points for HW3
15 points for HW4
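
To make the difference between relative and absolute paths concrete, here is a minimal R sketch. The file names and folder layout are hypothetical, and the is_absolute() helper is written only for this example; it is not part of base R. By default, knitting resolves relative paths from the folder that contains the .Rmd file, which is why they work on a TF's computer as long as the repository layout is the same.

```r
# Relative path: resolved from the folder containing the .Rmd file when knitting.
relative_path <- file.path("data", "measurements.csv")  # hypothetical file

# Absolute path: tied to one computer, so knitting fails on anyone else's machine.
absolute_path <- "/Users/yourname/Documents/hw1/data/measurements.csv"  # hypothetical

# Hypothetical helper: TRUE if a path starts with "/", "~", or a drive letter like "C:/".
is_absolute <- function(path) {
  grepl("^(/|~|[A-Za-z]:[/\\\\])", path)
}

is_absolute(relative_path)  # FALSE
is_absolute(absolute_path)  # TRUE

# In an assignment you would read the data with the relative path, e.g.:
# dat <- read.csv(relative_path)
```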

Students may ask questions about the assignments during lecture, but questions about grading should be directed to Dr. Mattie via email, outside of lecture and lab sessions.

Students may work together on assignments, but all responses to text questions must be in their own words and not copied from another student’s assignment.

In this course, we recognize the growing presence and impact of generative Artificial Intelligence (AI) tools (e.g., ChatGPT, DALL-E) in academic and professional environments. These tools can be powerful aids in the learning process when used responsibly. The following guidelines are designed to ensure that generative AI is used ethically and effectively in your homework assignments:

Permissible Use

Assistance: You may use generative AI tools to assist in brainstorming ideas, drafting outlines, generating code snippets, or seeking clarification on complex topics.
Supplemental Learning: These tools can serve as supplemental resources to explain concepts and provide additional examples that may aid in your understanding.

Impermissible Use

Plagiarism: You must not directly copy and submit AI-generated content as your own work. This includes text, code, images, or any other type of content that has not been adequately personalized or appropriately cited.
Substitution for Learning: Relying on AI tools to complete assignments in lieu of engaging with the course material is discouraged. Your primary goal should be to develop a deep understanding of the subject matter.

Transparency and Citation

Citation: Cite AI tools appropriately in your bibliography or reference sections in accordance with the citation style prescribed for the course.

Academic Integrity

Originality: Ensure that your submissions reflect your own understanding and synthesis of the material. AI tools should not replace your critical thinking and analytic skills.
Integrity: Any use of AI tools should uphold the principles of academic integrity outlined in the academic policies of the Harvard T.H. Chan School of Public Health.

By adhering to these guidelines, you will be able to harness the benefits of generative AI while maintaining the highest standards of academic integrity and personal learning. If you have any questions regarding the appropriate use of these tools, please feel free to reach out to Dr. Mattie for clarification.

Take-home Midterm

A take-home midterm will be distributed in the form of an RMarkdown file in October (date TBD) to test comprehension of course material. The exam will consist of multiple-choice questions (which may or may not require writing code), coding questions, and short-answer questions. All code and text answers must be submitted in the RMarkdown file. Students will have one week to work on the exam and must submit it via their individual GitHub repository by 11:59 pm on the deadline (TBD). Students are encouraged to use lecture slides and code, lab material, homework assignments, and the Internet to work on the exam, but may not work or consult with other students. The teaching staff will be available to answer any questions concerning the exam. Students may not use any form of generative AI on the exam.

Final Project

Students will work in small groups on a month-long data science project. The goal of the project is to go through the complete data science process to answer an assigned prompt. You will be given a dataset and a series of questions to answer. You will design your visualizations, provide summary statistics, build machine learning models, and communicate results. A full description is available on the course website.

Course Readings

Students are encouraged to read the lecture documents and other resources available on the course Canvas site and the course GitHub repository.

Optional Readings (Suggested):

R for Data Science (2nd edition, 2023; open access)
Storytelling with Data (available via Harvard Hollis)
ggplot2: Elegant Graphics for Data Analysis (3rd edition; open access)
Happy Git and GitHub for the useR (mainly chapter 12; open access)
An Introduction to Statistical Learning with Applications in R (aka ISLR; free download)
R for Health Data Science (open access)

Course Website

GitHub Page