class: center, middle, inverse, title-slide .title[ # R preliminaries ] --- ## Assignment operator - We often need to save a function's result or output. For this we use the assignment operator: ` <- `, preferred over ` = ` ``` r scores <- mtcars ``` Now we can use `scores` as an argument to other functions. For example, compute summary statistics for each column in the data: ``` r summary(scores[1:4]) # First four elements ``` ``` ## mpg cyl disp hp ## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0 ## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5 ## Median :19.20 Median :6.000 Median :196.3 Median :123.0 ## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7 ## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0 ## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0 ``` ??? Objects and output of functions are stored in variables. For this, we use the assignment operator. It looks like a left-facing arrow. You can use the 'equal sign' operator for assignment, but it is considered bad practice. You can simply type the arrow operator or use a keyboard shortcut. Press `Alt + -` on Windows or `Option + -` on a Mac to quickly insert the arrow operator. After a variable is assigned, you can apply functions to it, like the 'summary' function in the example. Note that in R, the hash sigh is used for comments. --- ## Variables Variables can store various types of data. Scalars, or zero-dimensional variables, are the simplest, they store a single number. Vectors are one-dimensional variables, they store a sequence of numbers. You can access any element in a vector by using square brackets and a positional index. Note positions in R start from 1, so if you are looking for the second element, simply use 2 - **Scalars** (0-dimensional): `a = 42`, `b = a / 7` - **Vectors** (1-dimensional): `b = c(12, 14, 16)` - Access vector element as `b[2]` (returns 14) ??? Variables can store various types of data. Scalars, or zero-dimensional variables, are the simplest, they store a single number. Vectors are one-dimensional variables, they store a sequence of numbers. You can access any element in a vector by using square brackets and a positional index. Note positions in R start from 1, so if you are looking for the second element, simply use 2. --- ## Variables Matrices are two-dimensional variables storing the data in rows and columns. You can subset a matrix using square brackets and the comma-separated indices for rows (which it the first axis) and columns (the second axis). If you want to select the full row or column, simply leave the empty space for the corresponding axis ``` r mtx = matrix(data = c(3, 1, 3, 2, 3, 2), ncol = 3) mtx ``` ``` ## [,1] [,2] [,3] ## [1,] 3 3 3 ## [2,] 1 2 2 ``` ``` r mtx[2, ] # Second row ``` ``` ## [1] 1 2 2 ``` ``` r mtx[1:2, 1:2] # Top left 2x2 submatrix ``` ``` ## [,1] [,2] ## [1,] 3 3 ## [2,] 1 2 ``` ??? Matrices are two-dimensional variables storing the data in rows and columns. You can subset a matrix using square brackets and the comma-separated indices for rows (which it the first axis) and columns (the second axis). If you want to select the full row or column, simply leave the empty space for the corresponding axis. --- ## Variable names - camelCase, snake_case, PascalCase - Be careful not to name your variables as function names. E.g., `c` is a bad variable name because `c()` is a function for combining variables. Check `?Reserved` for the list of reserved words - With auto-completion in RStudio, you don't need to worry about variable name length - make names that are self-explanatory - Check "Google R Style Guide", https://google.github.io/styleguide/Rguide.html) and "The Tidyverse Style Guide", http://adv-r.had.co.nz/Style.html for variable naming best practices .small[ https://betterprogramming.pub/string-case-styles-camel-pascal-snake-and-kebab-case-981407998841 ] ??? A few words how to name your variables. First, R is case-sensitive. Second, R is flexible and allow you to name your variables and functions any way you like. Alphanumeric characters, dots and underscores are permissible. Avoid using special characters and reserved words for variable names, e.g., built-in function names. Use several words connected by underscores to make variable names self-explanatory. With autocompletion in RStudio you don't need to worry about length of your variable names. --- ## Variable types ``` r # numeric: real or decimal numbers, sometimes referred to as “double” integer a <- 2 # character: sometimes referred to as string data, surrounded by double quotes a <- "sun" # Character vector: a sequence of characters a <- c("a", "beer", "2") # logical: Boolean data (TRUE and FALSE) a <- TRUE # Logical vector a <- c(TRUE, FALSE) # Can be used to subset matrices, positions evaluated to TRUE will be selected mtx # Reminder how our matrix looks like ``` ``` ## [,1] [,2] [,3] ## [1,] 3 3 3 ## [2,] 1 2 2 ``` ``` r mtx[a, ] # Select all columns for row indices evaluating to TRUE ``` ``` ## [1] 3 3 3 ``` ??? So far, we discussed numerical variables. But variables can also contain characters, wrapped in double quotes. You can combine multiple characters into a character vector by using the 'c()' function. 'c' stands for concatenate and will concatenate comma-separated characters or numbers. There is also the logical or Boolean data type that can be combined into vectors. Such vectors can be used to subset the data. Data with indices that evaluate to TRUE will be returned. <!-- - complex: complex numbers with real and imaginary parts (e.g., 1 + 4i) --> <!-- - raw: bytes of data (machine-readable, but not human readable) --> --- ## Helper functions for variables ``` r a <- c("a", "b", "c") # Character vector class(a) # Getting its class ``` ``` ## [1] "character" ``` ``` r str(a) # STRucture also shows data type ``` ``` ## chr [1:3] "a" "b" "c" ``` ``` r is.numeric(a) # Checks whether a variable is a number, returns TRUE if matches ``` ``` ## [1] FALSE ``` ``` r is.character(a) ``` ``` ## [1] TRUE ``` ``` r as.numeric(c("2", "a")) # Attempt to convert types, unsuccessful will be NA ``` ``` ## Warning: NAs introduced by coercion ``` ``` ## [1] 2 NA ``` ??? Several functions allow you to investigate class and structure of your variables. Use 'class' to see the data type, and use 'str' (short for structure) to get a more detailed overview. You can also check whether a variable is of special type by using 'is.numeric' or 'is.character' functions. You can also coerce one data type in another using 'as.numeric' or 'as.character'. If a variable cannot be converted, NA, not available, is returned. --- ## Factors - Factor variables are how R represents categorical data - Factors come in two main types, each designed for different kinds of categorical data: - `factor()` - used for nominal data ("Cats", "Cats", "Dogs", "Birds") - `ordered()` - used for ordinal data ("First", "Second", "Second", "Third") ``` r factor(c("Cats", "Cats", "Dogs", "Birds")) ``` ``` ## [1] Cats Cats Dogs Birds ## Levels: Birds Cats Dogs ``` ``` r ordered(c("First", "Second", "Second", "Third")) ``` ``` ## [1] First Second Second Third ## Levels: First < Second < Third ``` ??? Factor variables are how R represents categorical data. Factors come in two main types, each designed for different kinds of categorical data. Nominal factors are most frequently used, where the order of categories doesn't inherently have special meaning. Ordinal factors store data where categories have a meaningful order. --- ## Factors helper functions - `levels()` - get/set levels (unique categories) of a factor. Also, an argument in the `factor()` function allowing to set the order manually - `relevel()` - reorder factor levels. `rev()` - reverse factor levels - `is.factor()`, `as.factor()` ``` r a <- factor(c("Cats", "Cats", "Dogs", "Birds")) a ``` ``` ## [1] Cats Cats Dogs Birds ## Levels: Birds Cats Dogs ``` ``` r relevel(a, ref = "Cats") ``` ``` ## [1] Cats Cats Dogs Birds ## Levels: Cats Birds Dogs ``` ??? Unique categories in a factor are called levels. By default, levels for nominal factors are sorted alphabetically. We can override the default order with the 'relevel' function to set the new first, or reference, level. --- ## Conditional Operators Conditional Operators are used to compare variables and return TRUE/FALSE boolean statements - `==` - Equal to - `!=` - Not equal to - `<` - Less than - `>` - Greater than - `<=` - Less than or equal to - `>=` - Greater than or equal to ``` r c(1, 4) > 3 ``` ``` ## [1] FALSE TRUE ``` ``` r "a" == tolower("A") ``` ``` ## [1] TRUE ``` ??? In R, conditional operators are used to compare values and determine if conditions are met. They are fundamental for decision-making in programming. The primary conditional operators include equal to (double equal sign), not equal to (exclamation mark negates the equal sign), less than, greater than, less than or equal to, greater than or equal to. --- ## Logical Operators Logical operators are used to combine or modify logical statements - `&` - Logical AND - `|` - Logical OR - `!` - Logical NOT ``` r TRUE & FALSE # returns FALSE because both conditions must be true for AND. ``` ``` ## [1] FALSE ``` ``` r TRUE | FALSE # returns TRUE because only one condition needs to be true for OR. ``` ``` ## [1] TRUE ``` ``` r !TRUE # returns FALSE because NOT reverses the logical state. ``` ``` ## [1] FALSE ``` ??? Logical operators help in constructing more complex conditions. They allow to combine multiple conditional statements and return the joint boolean output. They include the ampersand sign, logical and, the pipe sign, logical or, and the exclamation mark, the negation operator --- ## Conditional and logical operators By combining conditional and logical operators, R allows for flexible condition checking ``` r age <- c(22, 17, 30) income <- c(25000, 5000, 60000) is_student <- c(TRUE, FALSE, FALSE) # Eligibility for a loan eligibility <- (age > 18 & income > 30000 & !is_student) eligibility # Outputs a logical vector indicating eligibility ``` ``` ## [1] FALSE FALSE TRUE ``` ??? We can combine various comparisons to get a consensus boolean vector satisfying the comparisons and the logical operators. For example, if we have three vectors of data, we can apply conditional statements to each and then combine them to get the resulting boolean vector satisfying the combination logic --- ## Conditional and logical operators - Like arithmetic operations, logic statements follow the order of preference - Operators `>`, `==`, `!` etc. are evaluated before `&` and `|` - If in doubt, wrap your expressions in parentheses ``` r 3 < 4 & "a" == "b" ``` ``` r (3 < 4) & ("a" == "b") ``` ??? Like arithmetic operations, logic statements follow the order of preference. Operators more than, equal to, not, and others are evaluated before and and pipe operators. If unsure, wrap the conditional statements in parentheses before combining them with the logical operators. --- ## Conditional and logical operators What do you think will happen if we evaluate `0.1 + 0.2 == 0.3`? ``` r 0.1 + 0.2 == 0.3 ``` ``` ## [1] FALSE ``` **Problem:** Computers represent numbers as binary (i.e. base 2) floating-point numbers. Read more https://floating-point-gui.de/basic/ **Solution:** Use `all.equal()` for evaluating floats (i.e fractions). ``` r all.equal(0.1 + 0.2, 0.3) ``` ``` ## [1] TRUE ``` ??? When comparing decimal numbers, conditional comparisons may fail because R compares such numbers as floating point numbers at a certain precision level. R has the all.equal function for such cases. The all.equal function can be applied to compare character vectors and other data types. --- ## Value matching - To see whether an object is contained within (i.e. matches one of) a list of items, use `%in%` ``` r 5 %in% 1:10 ``` ``` ## [1] TRUE ``` ``` r 1:10 %in% 5 ``` ``` ## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE ``` ``` r "x" %in% c("z", "y", "x") # Equivalent to "x" %in% c("x", "y", "z") ``` ``` ## [1] TRUE ``` <!-- - Value matching can be useful to subset R objects ``` r pvals <- c(0.05, 0.04, 0.09, 0.03, 0.12) pvals[pvals <= 0.05] ``` ``` ## [1] 0.05 0.04 0.03 ``` --> ??? To see whether an object is contained within a list of items, use in wrapped in percent sign. This operator is particularly useful when the order of variables to be matched with an object is unknown. This and other conditional operators can be useful to cubset R objects. --- ## Clean up your environment ``` r z <- c(1, 2, 3) ls() ``` ``` ## [1] "a" "age" "eligibility" "income" "is_student" ## [6] "mtx" "pvals" "scores" "z" ``` ``` r rm(z) # Remove one variable ls() ``` ``` ## [1] "a" "age" "eligibility" "income" "is_student" ## [6] "mtx" "pvals" "scores" ``` ``` r # Remove everything from the environment rm(list = ls()) # Not the same as restarting R session ls() ``` ``` ## character(0) ``` Best practice is to use Session/Restart R in RStudio to clean up your environment and loaded packages ??? As you create R objects, your environment can become cluttered. You can use the rm function to remove one or all variables. Note that this function does not remove loaded packaged. The safest way to clean your R session is to restart R using the Session/Restart R menu in RStudio.