rbtl - Data Science Lifecycle

Lars Schöbitz

Global Health Engineering - ETH Zurich

2022-04-29

Welcome back!

You got your data!

What’s happening next?

Today

  1. Classroom tools
  2. Data Science Lifecycle
  3. R basics: Functions, Arguments, Objects, Operators
  4. Live Coding Exercise
  5. Pair Programming Exercise
  6. Homework and Project Report

Learning Objectives

  1. Learners can import their data from a CSV file to a team repository on GitHub
  2. Learners can list the six elements of the data science lifecycle
  3. Learners know three different ways of getting support in solving coding problems online

Classroom tools

Live Coding Exercises

  • Instructor writes and narrates code out loud
  • Instructor explains elements and principles that are relevant
  • Code is displayed on projector screen
  • Learners join by writing and executing the same code
  • Learners “code-along” with the instructor

Pair Programming Exercises

  • Two learners work together on one computer
  • One person (the driver) does the typing
  • The other person (the navigator) offers comments and suggestions
  • Roles get switched

Taking Notes Together

  • Questions in shared online document: https://docs.google.com/document/d/1B_fGhU2-p7GdMjDdRq73JXAhAM7VU2JDu6t1foSQ0og/edit

Sticky Notes

  • Use as status flags
  • Orange: Exercise completed
  • Pink: Problem, need support

Data Science Lifecycle

Deep End

R

Packages

base R

sqrt(49)
sum(1, 2)
  • Functions come with R

R Packages

library(dplyr)
  • Installed once in the Console: install.packages("dplyr")
  • Loaded per script
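
The install-once, load-per-script pattern above can be sketched as follows. This is a minimal sketch: the requireNamespace() guard is an addition for illustration (it is not on the slides) so that the package is only installed when it is missing.

```r
# Install once. This is normally typed in the Console, not saved in a script.
# The requireNamespace() guard is an illustrative addition: it skips the
# installation when dplyr is already available on this machine.
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Load at the top of every script that uses the package.
library(dplyr)

# Check that the package is now attached to the current session.
is_loaded <- "package:dplyr" %in% search()
is_loaded
```

Installing changes what is available on the computer; loading only affects the current R session, which is why library() is repeated in every script.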

Functions & Arguments

library(dplyr)
library(gapminder)  # provides the gapminder data frame

filter(.data = gapminder,
       year == 2007)
  • Function: filter()
  • Argument: .data =
  • Arguments following: year == 2007 (what to do with the data)
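
To try the function and argument pattern without the gapminder data, here is a minimal sketch on a small made-up data frame. The object names countries and rows_2007 and all values are hypothetical and only serve as illustration.

```r
library(dplyr)

# Hypothetical example data (not the gapminder dataset from the slides).
countries <- data.frame(
  country = c("A", "B", "A", "B"),
  year    = c(2002, 2002, 2007, 2007),
  pop     = c(10, 20, 11, 22)
)

# Function:           filter()
# Argument:           .data = countries   (the data frame to work on)
# Following argument: year == 2007        (which rows to keep)
rows_2007 <- filter(.data = countries, year == 2007)

# Number of rows that passed the condition.
nrow(rows_2007)
```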

Objects

library(dplyr)
library(gapminder)  # provides the gapminder data frame

gapminder_yr_2007 <- filter(.data = gapminder,
                            year == 2007)
  • Function: filter()
  • Argument: .data =
  • Arguments following: year == 2007 (what to do with the data)
  • Object: gapminder_yr_2007

Operators

library(dplyr)
library(gapminder)  # provides the gapminder data frame

gapminder_yr_2007 <- gapminder %>%
  filter(year == 2007)
  • Function: filter()
  • Argument: .data = (filled in by the pipe)
  • Arguments following: year == 2007 (what to do with the data)
  • Object: gapminder_yr_2007
  • Assignment operator: <-
  • Pipe operator: %>%
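
The two operators can be compared side by side in a short sketch. The data frame measurements and its values are hypothetical. The point is that the pipe passes the object on its left as the first argument of the function on its right, so both versions should give the same result.

```r
library(dplyr)

# Hypothetical example data for illustration only.
measurements <- data.frame(
  site  = c("north", "south", "north", "south"),
  year  = c(2002, 2002, 2007, 2007),
  value = c(1.5, 2.5, 3.5, 4.5)
)

# Without the pipe: the data frame is named explicitly via .data.
without_pipe <- filter(.data = measurements, year == 2007)

# With the pipe: measurements is passed on as the first argument of filter().
with_pipe <- measurements %>%
  filter(year == 2007)

# Compare both results.
all.equal(without_pipe, with_pipe)
```

In both versions the assignment operator <- stores the result in a new object so that it can be reused later in the script.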

Rules

Rules of dplyr functions:

  • First argument is always a data frame
  • Subsequent arguments say what to do with that data frame
  • Always return a data frame
  • Don’t modify in place
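
The rules above can be checked with a short sketch. The data frame waste and its values are hypothetical. The sketch shows that filter() returns a new data frame and leaves the original untouched, so the result has to be assigned to an object if it is needed later.

```r
library(dplyr)

# Hypothetical example data for illustration only.
waste <- data.frame(
  city          = c("X", "Y", "Z"),
  kg_per_capita = c(120, 340, 560)
)

# First argument (via the pipe) is a data frame; the following argument
# says which rows to keep. The result is stored in a new object.
high_waste <- waste %>%
  filter(kg_per_capita > 300)

is.data.frame(high_waste)  # the result is again a data frame
nrow(waste)                # original is unchanged: 3 rows
nrow(high_waste)           # new object holds the 2 matching rows
```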

Live Coding Exercise

ae-10-data-science-lifecycle

  1. Head over to the GitHub Organisation for the course.
  2. Find the repo for week 10 that has your GitHub username.
  3. Clone the repo with your username to RStudio Cloud.
  4. Open the file: ae-10a-lifecycle.qmd
  5. Use your Sticky Notes to let me know when you are ready.

Break

15:00

Solving coding problems

Tips for search engines

  • Use verbs that describe what you want to do
  • Be specific
  • Add R to the search query
  • Add the name of the R package to the search query
  • Scroll through the top 5 results (don’t just pick the first)

Example: “How to remove a legend from a plot in R ggplot2”

Stack Overflow

What is it?

  • The biggest support network for (coding) problems
  • Can be intimidating at first
  • Upvote system

Workflow

  • First, briefly read the question that was posted
  • Then, read the answer marked as “correct”
  • Then, read one or two more answers with high votes
  • Then, check out the “Linked” posts
  • Always give credit for the solution

Give credit

ggplot(data = global_waste_data_kg_year,
       mapping = aes(x = income_id, 
                     y = capita_kg_year,
                     color = income_id)) +
  ## Remove legend ref: https://stackoverflow.com/a/35622358/6816220
  theme(legend.position = "none")

Other sources for help

  • RStudio Community Forum: https://community.rstudio.com/
  • Our rbtl Slack channel
  • Documentation websites: https://dplyr.tidyverse.org/
  • Twitter community: #rstats

Minimal reproducible example (reprex)

  • Needed when asking questions online
  • We will practice this in another class
  • Good support information: https://www.tidyverse.org/help/#reprex
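
As a rough idea of what such an example contains, here is a minimal sketch of a script that someone else could run without access to your files. The column names are borrowed from the plot example earlier in these slides; the values are made up. The reprex package described at the link above can help prepare such a script for posting; that step is left out here.

```r
# 1. Load only the packages needed to reproduce the problem.
library(dplyr)

# 2. Create a small dataset inline instead of reading a local file,
#    so that others can run the code. Values are hypothetical.
example_data <- data.frame(
  income_id      = c("low", "high", "low"),
  capita_kg_year = c(100, 500, 150)
)

# 3. Add the shortest piece of code that shows the problem or question.
result <- example_data %>%
  filter(income_id == "low")

# 4. Print the result so that readers can see what you see.
result
```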

Break

10:00

Pair Programming

ae-10-data-science-lifecycle

  1. Team up in pairs
  2. Decide who writes code, and who supports without writing code
  3. Open the file: ae-10b-lifecycle.qmd
  4. Work through the exercises
  5. Use your sticky notes to indicate if you need support
30:00

Homework Assignment

Teams on GitHub

Research Project Template

  • Report template
  • Complete list of items for grading (also as open Google Sheet)
  • Information on intentions for publishing

Discuss and decide

  • Data: Who will upload which data as part of the homework?
  • Results & Discussion:
    • Who writes in file:
      • 03-01-results.qmd?
      • 03-02-results.qmd?
      • 03-03-results.qmd?
03:00

Submission

  • All details in assignment week 10
  • Due: Tuesday, 4th May at 23:59 (2 points)

Thanks! 🌻

Slides created via revealjs and Quarto: https://quarto.org/docs/presentations/revealjs/

Access slides as PDF on GitHub

All material is licensed under Creative Commons Attribution Share Alike 4.0 International.