class: center, middle, inverse, title-slide .title[ # Summary statistics, if-else statements, loops ] --- ## Summary statistics - Simple statistical functions: `min()`, `max()`, `mean()`, `median()`, `sd()`, `cor()`, `summary()`, `quantile()`) - These, and many other functions, have settings to properly handle NAs, e.g., `mean(x, na.rm = FALSE, ...)` - `unique()` - unique elements in a vector. Combine with `length()` to get the number of unique elements - `table()` - contingency table for a vector (the number of elements per unique level) --- ## Summary statistics ``` r data(mtcars) # simple summary # ?mtcars head(mtcars) ``` ``` ## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 ## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 ## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 ``` ``` r mean(mtcars$mpg) # Try median, sd, var, min, max ``` ``` ## [1] 20.09062 ``` ``` r summary(mtcars$mpg) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 10.40 15.43 19.20 20.09 22.80 33.90 ``` --- ## Summary statistics ``` r quantile(mtcars$mpg, probs = c(.20, .80)) ``` ``` ## 20% 80% ## 15.20 24.08 ``` ``` r cor(mtcars$mpg, mtcars$hp) # sample correlation coeficient ``` ``` ## [1] -0.7761684 ``` ``` r table(mtcars$cyl) ``` ``` ## ## 4 6 8 ## 11 7 14 ``` ``` r table(mtcars$cyl)/length(mtcars$cyl) # normalized by the total number of observations = 32 ``` ``` ## ## 4 6 8 ## 0.34375 0.21875 0.43750 ``` --- ## Handling NAs - NAs create problems in calculations. Many functions have built-in mechanism to handle NAs. - Check for presence of NAs in your data using `is.na()` ``` r vec_with_NAs <- c(1, 2, NA, 3) sum(is.na(vec_with_NAs)) ``` ``` ## [1] 1 ``` - `complete.cases()` on a matrix/data frame returns row-wise logical with TRUE for rows without NAs. Remove all **rows** containing NAs from a data frame ``` r dat <- dat[complete.cases(dat), ] ``` --- ## Random samples and permutations ``` r sample(1:10) ``` ``` ## [1] 3 2 10 7 9 1 6 4 5 8 ``` ``` r sample(letters, 5) ``` ``` ## [1] "h" "q" "m" "n" "c" ``` ``` r sample(1:10, replace = TRUE) ``` ``` ## [1] 7 6 9 9 9 9 10 2 10 7 ``` ``` r set.seed(1) ``` --- ## Statistical distribution functions Prefix each R distribution name with "d" for the density or mass function, "p" for the CDF, "q" for the percentile function (also called the quantile), "r" for the generation of pseudorandom variables | Function | Distribution | |----------|--------------| | (d,p,q,r)norm | Normal | | (d,p,q,r)pois | Poisson | | (d,p,q,r)binom | Binomial | | (d,p,q,r)chisq | Chi-squared | | (d,p,q,r)t | Student's t | | (d,p,q,r)unif | Uniform | --- ## More useful stats functions ``` r t.test(iris$Sepal.Length,iris$Petal.Length) ``` ``` ## ## Welch Two Sample t-test ## ## data: iris$Sepal.Length and iris$Petal.Length ## t = 13.098, df = 211.54, p-value < 2.2e-16 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## 1.771500 2.399166 ## sample estimates: ## mean of x mean of y ## 5.843333 3.758000 ``` ``` r aov(Sepal.Length~Species,data=iris) ``` ``` ## Call: ## aov(formula = Sepal.Length ~ Species, data = iris) ## ## Terms: ## Species Residuals ## Sum of Squares 63.21213 38.95620 ## Deg. of Freedom 2 147 ## ## Residual standard error: 0.5147894 ## Estimated effects may be unbalanced ``` --- ## More useful stats functions ``` r chisq.test(iris$Petal.Length,iris$Species) ``` ``` ## Warning in chisq.test(iris$Petal.Length, iris$Species): Chi-squared ## approximation may be incorrect ``` ``` ## ## Pearson's Chi-squared test ## ## data: iris$Petal.Length and iris$Species ## X-squared = 271.8, df = 84, p-value < 2.2e-16 ``` ``` r fisher.test(mtcars$gear, mtcars$carb) ``` ``` ## ## Fisher's Exact Test for Count Data ## ## data: mtcars$gear and mtcars$carb ## p-value = 0.2434 ## alternative hypothesis: two.sided ``` --- ## More useful stats functions ``` r lm(Sepal.Length~Sepal.Width, data=iris) # simple linear regression ``` ``` ## ## Call: ## lm(formula = Sepal.Length ~ Sepal.Width, data = iris) ## ## Coefficients: ## (Intercept) Sepal.Width ## 6.5262 -0.2234 ``` ``` r glm(ifelse(Species=="setosa",1,0)~Sepal.Width, family="binomial",data=iris) # logistic regression ``` ``` ## ## Call: glm(formula = ifelse(Species == "setosa", 1, 0) ~ Sepal.Width, ## family = "binomial", data = iris) ## ## Coefficients: ## (Intercept) Sepal.Width ## -15.72 4.79 ## ## Degrees of Freedom: 149 Total (i.e. Null); 148 Residual ## Null Deviance: 191 ## Residual Deviance: 123.8 AIC: 127.8 ``` --- ## Manipulating the `lm` object | Function | Description | |:---------:|:-----------:| | summary | Get a variety of information on the model, including coefficients and p-values for the coefficients | | coefficients | Pull out just the coefficients for a model | | fitted | Get the fitted values from the model (for the data used to fit the model) | | plot | Create plots to help assess model assumptions | residuals | Get the model residuals | --- # Control structures inside R/functions - `if, else` - `for` - `while` - `repeat` - `break` - `next` --- ## If-else statement Conditional code execution ``` r if (condition) { # do something } else { # do something else } ``` - `==`: Equal to - `!=`: Not equal to - `>`, `>=`: Greater than, greater than or equal to - `<`, `<=`: Less than, less than or equal to ``` r x <- 1:15 if (sample(x, 1) <= 10) { print("x is less than 10") } else { print("x is greater than 10") } ``` ``` ## [1] "x is less than 10" ``` --- ## For loop Repetitive code execution ``` r for (i in 1:5) { cat(i) } ``` ``` ## 12345 ``` Compare with ``` r for (i in 1:5) { print(i) } ``` ``` ## [1] 1 ## [1] 2 ## [1] 3 ## [1] 4 ## [1] 5 ``` --- ## More uses of For loops ``` r x <- c("apples", "oranges", "bananas", "strawberries") for (i in x) { cat(i); cat(" ") } ``` ``` ## apples oranges bananas strawberries ``` ``` r for (i in 1:4) { cat(x[i]); cat(" ") } ``` ``` ## apples oranges bananas strawberries ``` ``` r for (i in seq(x)) { cat(x[i]); cat(" ") } ``` ``` ## apples oranges bananas strawberries ``` --- ## Nested For loops ``` r m <- matrix(1:10, 2) m ``` ``` ## [,1] [,2] [,3] [,4] [,5] ## [1,] 1 3 5 7 9 ## [2,] 2 4 6 8 10 ``` ``` r for (i in seq(nrow(m))) { for (j in seq(ncol(m))) { print(m[i, j]) } } ``` ``` ## [1] 1 ## [1] 3 ## [1] 5 ## [1] 7 ## [1] 9 ## [1] 2 ## [1] 4 ## [1] 6 ## [1] 8 ## [1] 10 ``` --- ## while, repeat loops ``` r i <- 1 while (i < 10) { print(i) i <- i + 1 } # Be sure there is a way to exit out of a while loop ``` ``` ## [1] 1 ## [1] 2 ## [1] 3 ## [1] 4 ## [1] 5 ## [1] 6 ## [1] 7 ## [1] 8 ## [1] 9 ``` ``` r repeat { # simulations; generate some value have an expectation if within some range, # then exit the loop if ((value - expectation) <= threshold) { break } } ``` --- ## Combine any statements/functions ``` r for (i in 1:20) { if (i%%2 == 1) { next # skip printing over odd numbers } else { print(i) } } ``` ``` ## [1] 2 ## [1] 4 ## [1] 6 ## [1] 8 ## [1] 10 ## [1] 12 ## [1] 14 ## [1] 16 ## [1] 18 ## [1] 20 ``` --- ## Vectorized operation Many operations in R are already vectorized, making code more efficient, concise, and easier to read ``` r x <- 1:4; y <- 6:9 x ``` ``` ## [1] 1 2 3 4 ``` ``` r y ``` ``` ## [1] 6 7 8 9 ``` ``` r x * y ``` ``` ## [1] 6 14 24 36 ``` ``` r x / y ``` ``` ## [1] 0.1666667 0.2857143 0.3750000 0.4444444 ``` --- ## Manipulating vectors ``` r ages <- c(40, 50, 60, 70, 80) # add a value to end of vector ages <- c(ages, 90) # add value at the beginning ages <- c(30, ages) # extracting second value ages[2] ``` ``` ## [1] 40 ``` ``` r # excluding second value ages[-2] ``` ``` ## [1] 30 50 60 70 80 90 ``` ``` r # extracting first and third values ages[c(1, 3)] ``` ``` ## [1] 30 50 ``` --- ## `apply` family of functions Writing for, while loops in R are inefficient, and we want to vectorize computation in R. - `apply()` - apply a function over the margins of an array - `lapply()` - loop over a list and evaluate a function on each element - `sapply()` - same as lapply but try to simplify results, if the result is a list where every element is length 1, then a vector is returned - `mapply()` - multivariate version of lapply - `tapply()` - apply a function over subsets of a vector --- ## apply examples ``` r x <- 1:4 lapply(x, runif) ``` ``` ## [[1]] ## [1] 0.3721239 ## ## [[2]] ## [1] 0.5728534 0.9082078 ## ## [[3]] ## [1] 0.2016819 0.8983897 0.9446753 ## ## [[4]] ## [1] 0.66079779 0.62911404 0.06178627 0.20597457 ``` ``` r x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1)) sapply(x, mean) ``` ``` ## a b c ## 2.5000000 -0.1007662 0.9634915 ``` --- ## apply examples <!-- ``` r #If the result is a list where every element is a vector of the same length (> 1), a matrix is returned. x <- list(rnorm(100), runif(100), rpois(100, 1)) sapply(x, quantile, probs = c(0.25, 0.75)) ``` ``` ## [,1] [,2] [,3] ## 25% -0.7957767 0.1894651 0 ## 75% 0.6253511 0.7212868 2 ``` --> ``` r x <- matrix(rnorm(200), 20, 10) apply(x, 1, sum) ``` ``` ## [1] 1.8140048 3.0824614 -2.7211809 -5.0145234 1.7467243 -3.5266602 ## [7] 1.3334932 3.5583679 1.5893239 1.5345070 1.4851436 2.0284045 ## [13] -5.5429089 1.8442124 0.2154637 5.3294698 -3.7333796 -3.7938359 ## [19] 0.8003891 1.7423342 ``` ``` r apply(x, 2, mean) ``` ``` ## [1] 0.024159081 0.012556520 0.089704142 0.014767588 0.067265778 ## [6] 0.026121116 -0.077858902 0.069852367 0.005350539 -0.043327692 ``` --- ## apply examples For sums and means of matrix dimensions, we have some shortcuts ``` r rowSums = apply(x, 1, sum) rowMeans = apply(x, 1, mean) colSums = apply(x, 2, sum) colMeans = apply(x, 2, mean) ``` Check `?rowSums` help on these base R functions --- ## tapply Apply a function to groups of values of an array defined by the levels of certain factors. ``` r function (X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE) X is a vector INDEX is a factor or a list of factors (or else they are coerced to factors) FUN is a function to be applied ... contains other arguments to be passed FUN simplify, should we simplify the result? ``` ``` r x <- c(rnorm(10), runif(10), rnorm(10, 1)) f <- gl(3, 10) # [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 # Levels: 1 2 3 tapply(x, f, mean) ``` ``` ## 1 2 3 ## 0.4824261 0.2253849 0.6480306 ``` --- ## mapply .panelset[ .panel[.panel-name[mapply] mapply is a multivariate version of sapply. mapply applies FUN to the first elements of each ... argument, the second elements, the third elements, and so on. Arguments are recycled if necessary. ``` r function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE) FUN is a function to apply ... contains arguments to apply over MoreArgs is a list of other arguments to FUN. SIMPLIFY indicates whether the result should be simplified ``` ] .panel[.panel-name[Example 1] ``` r mapply(rep, 1:4, 4:1) ``` ``` ## [[1]] ## [1] 1 1 1 1 ## ## [[2]] ## [1] 2 2 2 ## ## [[3]] ## [1] 3 3 ## ## [[4]] ## [1] 4 ``` rep(x, length.out) ] .panel[.panel-name[Example 1] ``` r mapply(rnorm, mean = 1:3, sd = 1:3, n = seq(5, 15, by = 5)) ``` ``` ## [[1]] ## [1] 2.8939108 -0.7602263 1.9245468 0.4434578 0.8194159 ## ## [[2]] ## [1] 4.8948806 0.7857371 3.3587249 1.8128847 1.0198274 4.8213188 1.5508524 ## [8] 1.5750090 3.3927569 3.8303650 ## ## [[3]] ## [1] 0.2298771 6.4406198 1.0924049 0.3406701 -3.9994101 2.5635276 ## [7] 3.9508921 0.8775900 6.7264667 4.8604566 3.2997092 8.4231105 ## [13] -1.5072899 3.8569214 5.5371209 ``` rnorm(n, mean = 0, sd = 1) ] ]