class: center, middle, inverse, title-slide .title[ # Tidyverse, data manipulation ] --- ## Tidyverse https://www.tidyverse.org - A collection of R packages designed for data science - Provides tools for data manipulation, visualization, and modeling .pull-left[<img src="img/tidyverse-packages.png" height=300 >] .pull-right[ .small[ggplot2: For creating complex and customizable visualizations dplyr: For data manipulation, including filtering, summarizing, and joining datasets tidyr: For reshaping and tidying data readr: For reading and writing rectangular data, such as CSV files purrr: For functional programming with data, allowing iteration and mapping tibble: Modern reimagining of data frames, providing a more user-friendly print method and more stringr: For string manipulation and regular expressions forcats: For working with categorical data (factors) ] ] .small[ https://www.tidyverse.org/ ] --- ## Tidyverse https://www.tidyverse.org - A collection of R packages designed for data science - Provides tools for data manipulation, visualization, and modeling .pull-left[<img src="img/tidyverse-packages.png" height=300 >] .pull-right[ .small[purrr: For functional programming with data, allowing iteration and mapping tibble: Modern reimagining of data frames, providing a more user-friendly print method and more stringr: For string manipulation and regular expressions forcats: For working with factors ] ] --- ## The tidy tools manifesto The `tidyverse` is based on four principles for handling data: 1. Reuse existing data structures 2. Compose simple functions with **the pipe %>%** 3. Embrace functional programming 4. Design for humans The R project for Statistical Computing was built for a different age; the tidyverse is a collection of tools for *our* age .small[ https://tidyverse.tidyverse.org/articles/manifesto.html ] --- ## Tidyverse core packages - **ggplot2** - data visualisation - **dplyr** - data wrangling - **readr** - reading data - **tibble** - modern data frames - **stringr** - string manipulation - **forcats** - dealing with factors - **tidyr** - data tidying - **purrr** - functional programming Each tidyverse package has a website at [PKGNAME].tidyverse.org, check https://ggplot2.tidyverse.org .small[ https://www.tidyverse.org/packages/#core-tidyverse ] --- class: center, middle # Reading in data <!--- ## Base R functions for read-write the data - `scan()` - Read data into a vector or list from the console or file - `read.table()`, `read.csv()`, `read.delim()` - Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file - `write.table()`, `write.csv()` - Saves the object (data.frame) to a file - `?data.table::fread` for very fast data read into R - "File -> Import Dataset" in RStudio --> --- ## readr - There are some built-in functions for reading in data in text files. These functions are _read-dot-something_, for example, `read.csv()` reads in comma-delimited text data; `read.delim()` reads in tab-delimited text, etc. - `readr` package provides fast and intelligent data reading capabilities. Very similar looking functions, named _read-underscore-something_ -- e.g., `read_csv()` - They're good at guessing the types of data in the columns, they don't do some of the other silly things that the base functions do - Play nicely with `dplyr` - data manipulation package .small[ http://readr.tidyverse.org/ ] --- ## tibbles Data frames are great! Except for - printing them - working with both characters and factors - manipulating multiple columns <!-- ~~You need to remember to set `options(stringsAsFactors = FALSE)`~~ [Not needed from R v.4.0.0](https://stackoverflow.com/questions/61536155/r-4-0-0-and-stringsasfactors) --> - If you want a one-collumn data frame, you need to use `dat[, "column1", drop = FALSE]` tibbles are the data frame alternative simplifying work with data frame-like objects .small[https://tibble.tidyverse.org/] --- ## tibbles - A `tibble`, or `tbl_df`, is a modern reimagining of the `data.frame`, keeping what time has proven to be effective, and throwing out what is not - Tibbles are `data.frames` that are lazy and surly: they do less (i.e., they don't change variable names or types, and don't do partial matching) and complain more (e.g., when a variable does not exist) - This forces you to confront problems earlier, typically leading to cleaner, more expressive code. Tibbles also have an enhanced `print` method which makes them easier to use with large datasets containing complex objects - Hadley Wickham, Chief Scientist at Posit - `glimpse()` into tibble, analog of `str()` --- ## Making the data tidy with `tidyr` - Principles of tidy data - Each _column_ is a _variable_ - Each _row_ is an _observation_ .center[<img src="img/tidy_data.png" width = 500>] .small[ Tidy data paper, http://www.jstatsoft.org/v59/i10/paper ] --- ## Making the data tidy with `tidyr` - Principles of tidy data - Each _column_ is a _variable_ - Each _row_ is an _observation_ .center[<img src="img/tidyr-longer-wider-modified.gif" width = 300>] .small[ https://x.com/dataandme/status/1175010415341985793 ] --- ## Making the data tidy with `tidyr` - `tidyr` - flexible data reshaping - `pivot_longer()` - "lengthens" data, increasing the number of rows and decreasing the number of columns - `pivot_wider()` - "widens" data, increasing the number of columns and decreasing the number of rows Example of converting the wide data into tidy data .center[<img src="img/tidy_data.png" width = 500>] .small[ https://tidyr.tidyverse.org/index.html, vignette("tidy-data"), vignette("pivot") ] --- class: center, middle # Data manipulation with dplyr --- ## Why not base R data subsetting? - Bracket subsetting is useful but can be difficult to read - Various ways of subsetting (by index, names, dollar sign) make interpretation less intuitive --- ## dplyr: data manipulation with R 80% of your work will be data preparation - getting data (from databases, spreadsheets, flat-files) - performing exploratory/diagnostic data analysis - reshaping data - visualizing data --- ## dplyr: data manipulation with R 80% of your work will be data preparation - Filtering rows (to create a subset) - Selecting columns of data (i.e., selecting variables) - Adding new variables - Sorting - Aggregating - Joining --- ## dplyr: A grammar of data manipulation - The `dplyr` package gives you a handful of useful **verbs** for managing data. On their own they don't do anything that base R can't do - Basic `dplyr` verbs - `filter()` rows - `select()` columns - `mutate()` columns - `arrange()` columns - `summarize()` columns - `group_by()` columns - They all take a data frame or tibble as their input for the first argument, and they all return a data frame or tibble as output .small[ https://dplyr.tidyverse.org/ ] --- ## The pipe %>% operator - Pipe `%>%` output of one command into an input of another command - chain commands together. (Think about the "|" operator in Linux) - Imported from `magrittr` package - Read as "then". Take the dataset (or object), _then_ do ... ``` r library(dplyr) round( sqrt(1000), 3) ``` ``` ## [1] 31.623 ``` ``` r 1000 %>% sqrt %>% round(., 3) ``` ``` ## [1] 31.623 ``` --- ## The pipe %>% operator - For example, we can view the head of the `diamonds` data.frame using either of the last two lines of code here: ``` r library(dplyr) library(ggplot2) data(diamonds) # glimpse(diamonds) diamonds %>% glimpse() ``` ``` ## # A tibble: 6 × 10 ## carat cut color clarity depth table price x y z ## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> ## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 ## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 ## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 ## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 ## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 ## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 ``` --- ## The pipe %>% operator For example, read the last line of code as: "Take the `price` column of the `diamonds` data.frame and _then_ summarize it" ``` r library(dplyr) data(diamonds) head(diamonds) diamonds %>% head summary(diamonds$price) diamonds$price %>% summary(object = .) ``` - (Ctrl/CMD)+SHIFT+M - a shortcut to insert the `%>%` sequence - you can see what it is by clicking the _Tools_ menu in RStudio, then selecting _Keyboard Shortcut Help_ - On Mac, it's CMD-SHIFT-M --- ## The other pipe |> operator - Built-in R operator (from version 4.1.0) - Also flows the data from left to right - Don't have "." data placeholder ``` r 1000 |> sqrt() |> round(3) ``` ``` ## [1] 31.623 ``` ``` r # mtcars |> lm(mpg ~ hp + cyl, data = .) # Won't work ``` .small[https://www.r-bloggers.com/2021/05/the-new-r-pipe/] --- ## dplyr::filter() If you want to filter **rows** of the data where some condition is true, use the `filter()` function. 1. The first argument is the data frame you want to filter, e.g. `filter(mydata, ...`. 2. The second argument is a condition you must satisfy, e.g. `filter(ydat, symbol == "LEU1")`. If you want to satisfy *all* of multiple conditions, you can use the "and" operator, `&`. The "or" operator `|` (the pipe character, usually shift-backslash) will return a subset that meet *any* of the conditions. - `==`: Equal to - `!=`: Not equal to - `>`, `>=`: Greater than, greater than or equal to - `<`, `<=`: Less than, less than or equal to --- ## dplyr::filter() For example, keep only the entries with Ideal cut ``` r df.diamonds_ideal <- filter(diamonds, cut == "Ideal") head(df.diamonds_ideal, n = 3) ``` ``` ## # A tibble: 3 × 10 ## carat cut color clarity depth table price x y z ## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> ## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 ## 2 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46 ## 3 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71 ``` --- ## dplyr::filter() We can achieve this same result using the `%>%` operator ``` r df.diamonds_ideal <- diamonds %>% filter(cut == "Ideal") df.diamonds_ideal %>% head(., n = 3) ``` ``` ## # A tibble: 3 × 10 ## carat cut color clarity depth table price x y z ## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> ## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 ## 2 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46 ## 3 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71 ``` --- ## dplyr::select() - The `filter()` function allows you to return only certain _rows_ matching a condition. The `select()` function returns only certain _columns_. The first argument is the data, and subsequent arguments are the columns you want. - Syntax: `select(data, columns)` ``` r df.diamonds_ideal %>% head select(df.diamonds_ideal, carat, cut, color, price, clarity) df.diamonds_ideal <- df.diamonds_ideal %>% select(., carat, cut, color, price, clarity) ``` --- ## dplyr::mutate() - The `mutate()` function adds new columns to the data that are functions of old columns - It doesn't actually modify the data frame you're operating on, and the result is transient unless you assign it to a new object or reassign it back to itself (generally, not a good practice) - Syntax: `mutate(data, new_column = function(old_columns))` ``` r df.diamonds_ideal %>% head mutate(df.diamonds_ideal, price_per_carat = price/carat) df.diamonds_ideal <- df.diamonds_ideal %>% mutate(price_per_carat = price/carat) ``` --- ## dplyr::arrange() - The `arrange()` function does what it sounds like - sorts things - It takes a `data.frame` or `tbl_df` and arranges (or sorts) by column(s) of interest - The first argument is the data, and subsequent arguments are columns to sort on. Use the `desc()` function to arrange by descending - Syntax: `arrange(data, column_to_sort_by)` ``` r df.diamonds_ideal %>% head arrange(df.diamonds_ideal, price) df.diamonds_ideal %>% arrange(price, price_per_carat) ``` --- ## dplyr::summarize() - The `summarize()` function summarizes multiple values to a single value - The power of `summarize()` comes from a few convenience functions called `n()` and `n_distinct()` that tell you the number of observations or the number of distinct values of a particular variable. - Syntax: `summarize(function_of_variables)` ``` r summarize(df.diamonds_ideal, length = n(), avg_price = mean(price)) df.diamonds_ideal %>% summarize(length = n(), avg_price = mean(price)) ``` --- ## dplyr::group_by() - Summarize *subsets of* columns by custom summary statistics - Syntax: `group_by(data, column_to_group)` ``` r group_by(diamonds, cut, color) %>% summarize(mean(price)) ``` ``` ## `summarise()` has grouped output by 'cut'. You can override using the `.groups` ## argument. ``` ``` ## # A tibble: 35 × 3 ## # Groups: cut [5] ## cut color `mean(price)` ## <ord> <ord> <dbl> ## 1 Fair D 4291. ## 2 Fair E 3682. ## 3 Fair F 3827. ## 4 Fair G 4239. ## 5 Fair H 5136. ## 6 Fair I 4685. ## 7 Fair J 4976. ## 8 Good D 3405. ## 9 Good E 3424. ## 10 Good F 3496. ## # ℹ 25 more rows ``` --- ## The power of pipe %>% - Summarize *subsets of* columns by custom summary statistics ``` r arrange(mutate(arrange(filter(tbl_df(diamonds), cut == "Ideal"), price), price_per_carat = price/carat), price_per_carat) ``` ``` r arrange( mutate( arrange( filter(tbl_df(diamonds), cut == "Ideal"), price), price_per_carat = price/carat), price_per_carat) ``` ``` r diamonds %>% filter(cut == "Ideal") %>% arrange(price) %>% mutate(price_per_carat = price/carat) %>% arrange(price_per_carat) ``` --- ## Counting - Count number of observations for each factor or combination of factors ``` r diamonds %>% count(cut, sort = TRUE) ``` ``` ## # A tibble: 5 × 2 ## cut n ## <ord> <int> ## 1 Ideal 21551 ## 2 Premium 13791 ## 3 Very Good 12082 ## 4 Good 4906 ## 5 Fair 1610 ``` ``` r diamonds %>% group_by(cut) %>% summarise(count = n()) %>% arrange(desc(count)) ``` ``` ## # A tibble: 5 × 2 ## cut count ## <ord> <int> ## 1 Ideal 21551 ## 2 Premium 13791 ## 3 Very Good 12082 ## 4 Good 4906 ## 5 Fair 1610 ``` --- ## Joining data frames - `inner_join(x, y)`: Keep only rows where there are observations in both `x` and `y` - `left_join(x, y)`: Keep all rows from `x`, whether they have a match in `y` or not (unmatched rows are filled with NAs) - `right_join(x, y)`: Keep all rows from `y`, whether they have a match in `x` or not - `full_join(x, y)`: Keep all rows from both `x` and `y`, whether they have a match in the other dataset or not .small[ Review https://ready4r.netlify.app/labbook/part-5-doing-useful-things-with-multiple-tables.html#joining-tables ] --- ## Working with factors tidyverse way `library(forcats)` - `fct_rev()` - Reverse order of factor levels - `fct_reorder()` - Reordering a factor by another variable - `fct_collapse()` - Collapse multiple categories into one category - `fct_lump()` - Collapsing the least/most frequent values of a factor into “other” - `fct_infreq()` - Reordering a factor by the frequency of values - `fct_relevel()` - Changing the order of a factor by hand .small[ https://forcats.tidyverse.org/ ] --- ## But wait... There's more - [tidymodels](https://www.tidymodels.org/start/) - a collection of packages for statistical inference, modeling and machine learning using tidyverse principles. - [rvest](https://rvest.tidyverse.org) - scrape (or harvest) data from web pages. - [dbplyr](https://dbplyr.tidyverse.org/articles/dbplyr.html) and [dtplyr](https://dtplyr.tidyverse.org/articles/translation.html) - two packages that provide interfaces for translations between dplyr and SQL and data.table code .small[ https://www.tidymodels.org https://rvest.tidyverse.org https://dbplyr.tidyverse.org ] --- ## Useful links - RStudio/Help/Cheat Sheets/Data Transformation with dplyr - Teaching the Tidyverse in 2020, by Mine Çetinkaya-Rundel. [Part 1](https://education.rstudio.com/blog/2020/07/teaching-the-tidyverse-in-2020-part-1-getting-started/), [Part 3](https://education.rstudio.com/blog/2020/07/teaching-the-tidyverse-in-2020-part-3-data-wrangling-and-tidying/), [Part 4](https://education.rstudio.com/blog/2020/07/teaching-the-tidyverse-in-2020-part-4-when-to-purrr/) - [Teaching the Tidyverse in 2021](https://www.tidyverse.org/blog/2021/08/teach-tidyverse-2021/), by Mine Çetinkaya-Rundel. - [Tidy Animated Verbs](https://www.garrickadenbuie.com/project/tidyexplain/) - [Reshaping the data](https://uclouvain-cbio.github.io/WSBIM1207/sec-dplyr.html#reshaping-data) - [Join](https://github.com/rstudio-education/remaster-the-tidyverse/tree/master/Data-Wrangling-With-The-Tidyverse/keynotes) from [Remaster the Tidyverse](https://github.com/rstudio-education/remaster-the-tidyverse) class materials built by Garrett Grolemund