Summary statistics, if-else statements, loops

class: center, middle, inverse, title-slide

.title[
# Summary statistics, if-else statements, loops
]

---

## Summary statistics

- Simple statistical functions: `min()`, `max()`, `mean()`, `median()`, `sd()`, `cor()`, `summary()`, `quantile()`)

- These, and many other functions, have settings to properly handle NAs, e.g., `mean(x, na.rm = FALSE, ...)`

- `unique()` - unique elements in a vector. Combine with `length()` to get the number of unique elements

- `table()` - contingency table for a vector (the number of elements per unique level)

---
## Summary statistics

``` r
data(mtcars)    # simple summary 
# ?mtcars
head(mtcars)
```

```
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```

``` r
mean(mtcars$mpg) # Try median, sd, var, min, max
```

```
## [1] 20.09062
```

``` r
summary(mtcars$mpg)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.43   19.20   20.09   22.80   33.90
```

---
## Summary statistics

``` r
quantile(mtcars$mpg, probs = c(.20, .80))
```

```
##   20%   80% 
## 15.20 24.08
```

``` r
cor(mtcars$mpg, mtcars$hp) # sample correlation coeficient
```

```
## [1] -0.7761684
```

``` r
table(mtcars$cyl)
```

```
## 
##  4  6  8 
## 11  7 14
```

``` r
table(mtcars$cyl)/length(mtcars$cyl) # normalized by the total number of observations = 32
```

```
## 
##       4       6       8 
## 0.34375 0.21875 0.43750
```

---
## Handling NAs

- NAs create problems in calculations. Many functions have built-in mechanism to handle NAs.

- Check for presence of NAs in your data using `is.na()`

``` r
vec_with_NAs <- c(1, 2, NA, 3)
sum(is.na(vec_with_NAs))
```

```
## [1] 1
```

- `complete.cases()` on a matrix/data frame returns row-wise logical with TRUE for rows without NAs. Remove all **rows** containing NAs from a data frame

``` r
dat <- dat[complete.cases(dat), ]
```

---
## Random samples and permutations

``` r
sample(1:10)
```

```
##  [1]  3  2 10  7  9  1  6  4  5  8
```

``` r
sample(letters, 5)
```

```
## [1] "h" "q" "m" "n" "c"
```

``` r
sample(1:10, replace = TRUE)
```

```
##  [1]  7  6  9  9  9  9 10  2 10  7
```

``` r
set.seed(1)
```

---
## Statistical distribution functions

Prefix each R distribution name with "d" for the density or mass function, "p" for the CDF, "q" for the percentile function (also called the quantile), "r" for the generation of pseudorandom variables

| Function | Distribution |
|----------|--------------|
| (d,p,q,r)norm    | Normal       |
| (d,p,q,r)pois    | Poisson      |
| (d,p,q,r)binom   | Binomial     |
| (d,p,q,r)chisq   | Chi-squared  |
| (d,p,q,r)t       | Student's t  |
| (d,p,q,r)unif    | Uniform      |

---
## More useful stats functions

``` r
t.test(iris$Sepal.Length,iris$Petal.Length)
```

```
## 
## 	Welch Two Sample t-test
## 
## data:  iris$Sepal.Length and iris$Petal.Length
## t = 13.098, df = 211.54, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.771500 2.399166
## sample estimates:
## mean of x mean of y 
##  5.843333  3.758000
```

``` r
aov(Sepal.Length~Species,data=iris)
```

```
## Call:
##    aov(formula = Sepal.Length ~ Species, data = iris)
## 
## Terms:
##                  Species Residuals
## Sum of Squares  63.21213  38.95620
## Deg. of Freedom        2       147
## 
## Residual standard error: 0.5147894
## Estimated effects may be unbalanced
```

---
## More useful stats functions

``` r
chisq.test(iris$Petal.Length,iris$Species)
```

```
## Warning in chisq.test(iris$Petal.Length, iris$Species): Chi-squared
## approximation may be incorrect
```

```
## 
## 	Pearson's Chi-squared test
## 
## data:  iris$Petal.Length and iris$Species
## X-squared = 271.8, df = 84, p-value < 2.2e-16
```

``` r
fisher.test(mtcars$gear, mtcars$carb)
```

```
## 
## 	Fisher's Exact Test for Count Data
## 
## data:  mtcars$gear and mtcars$carb
## p-value = 0.2434
## alternative hypothesis: two.sided
```

---
## More useful stats functions

``` r
lm(Sepal.Length~Sepal.Width, data=iris) # simple linear regression
```

```
## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = iris)
## 
## Coefficients:
## (Intercept)  Sepal.Width  
##      6.5262      -0.2234
```

``` r
glm(ifelse(Species=="setosa",1,0)~Sepal.Width, family="binomial",data=iris)  # logistic regression
```

```
## 
## Call:  glm(formula = ifelse(Species == "setosa", 1, 0) ~ Sepal.Width, 
##     family = "binomial", data = iris)
## 
## Coefficients:
## (Intercept)  Sepal.Width  
##      -15.72         4.79  
## 
## Degrees of Freedom: 149 Total (i.e. Null);  148 Residual
## Null Deviance:	    191 
## Residual Deviance: 123.8 	AIC: 127.8
```

---
## Manipulating the `lm` object

---
# Control structures inside R/functions

- `if, else`

- `for`

- `while`

- `repeat`

- `break`

- `next`

---
## If-else statement

Conditional code execution

``` r
if (condition) {
  # do something
} else {
  # do something else
}
```

- `==`: Equal to
- `!=`: Not equal to
- `>`, `>=`: Greater than, greater than or equal to
- `<`, `<=`: Less than, less than or equal to

``` r
x <- 1:15
if (sample(x, 1) <= 10) {
  print("x is less than 10")
} else {
  print("x is greater than 10")
}
```

```
## [1] "x is less than 10"
```

---
## For loop

Repetitive code execution

``` r
for (i in 1:5) {
  cat(i)
}
```

```
## 12345
```

Compare with

``` r
for (i in 1:5) {
  print(i)
}
```

```
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
```

---
## More uses of For loops

``` r
x <- c("apples", "oranges", "bananas", "strawberries")

for (i in x) {
  cat(i); cat(" ")
}
```

```
## apples oranges bananas strawberries
```

``` r
for (i in 1:4) {
  cat(x[i]); cat(" ")
}
```

```
## apples oranges bananas strawberries
```

``` r
for (i in seq(x)) {
  cat(x[i]); cat(" ")
}
```

```
## apples oranges bananas strawberries
```

---
## Nested For loops

``` r
m <- matrix(1:10, 2)
m
```

```
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
```

``` r
for (i in seq(nrow(m))) {
  for (j in seq(ncol(m))) {
    print(m[i, j])
  }
}
```

```
## [1] 1
## [1] 3
## [1] 5
## [1] 7
## [1] 9
## [1] 2
## [1] 4
## [1] 6
## [1] 8
## [1] 10
```

---
## while, repeat loops

``` r
i <- 1
while (i < 10) {
  print(i)
  i <- i + 1
} # Be sure there is a way to exit out of a while loop
```

```
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
```

``` r
repeat {
  # simulations; generate some value have an expectation if within some range,
  # then exit the loop
  if ((value - expectation) <= threshold) {
    break
  }
}
```

---
## Combine any statements/functions

``` r
for (i in 1:20) {           
  if (i%%2 == 1) {
    next                # skip printing over odd numbers
  } else {
    print(i)
  }
}
```

```
## [1] 2
## [1] 4
## [1] 6
## [1] 8
## [1] 10
## [1] 12
## [1] 14
## [1] 16
## [1] 18
## [1] 20
```

---
## Vectorized operation

Many operations in R are already vectorized, making code more efficient, concise, and easier to read

``` r
x <- 1:4; y <- 6:9
x
```

```
## [1] 1 2 3 4
```

``` r
y
```

```
## [1] 6 7 8 9
```

``` r
x * y
```

```
## [1]  6 14 24 36
```

``` r
x / y
```

```
## [1] 0.1666667 0.2857143 0.3750000 0.4444444
```

---
## Manipulating vectors

``` r
ages <- c(40, 50, 60, 70, 80)
# add a value to end of vector
ages <- c(ages, 90) 
# add value at the beginning
ages <- c(30, ages)
# extracting second value
ages[2]
```

```
## [1] 40
```

``` r
# excluding second value
ages[-2]
```

```
## [1] 30 50 60 70 80 90
```

``` r
# extracting first and third values
ages[c(1, 3)] 
```

```
## [1] 30 50
```

---
## `apply` family of functions

Writing for, while loops in R are inefficient, and we want to vectorize computation in R.

- `apply()` - apply a function over the margins of an array

- `lapply()` - loop over a list and evaluate a function on each element

- `sapply()` - same as lapply but try to simplify results, if the result is a list where every element is length 1, then a vector is returned

- `mapply()` - multivariate version of lapply

- `tapply()` -  apply a function over subsets of a vector

---
## apply examples

``` r
x <- 1:4
lapply(x, runif)
```

```
## [[1]]
## [1] 0.3721239
## 
## [[2]]
## [1] 0.5728534 0.9082078
## 
## [[3]]
## [1] 0.2016819 0.8983897 0.9446753
## 
## [[4]]
## [1] 0.66079779 0.62911404 0.06178627 0.20597457
```

``` r
x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1))
sapply(x, mean)
```

```
##          a          b          c 
##  2.5000000 -0.1007662  0.9634915
```

---
## apply examples

<!--

``` r
#If the result is a list where every element is a vector of the same length (> 1), a matrix is returned.
x <- list(rnorm(100), runif(100), rpois(100, 1))
sapply(x, quantile, probs = c(0.25, 0.75))
```

```
##           [,1]      [,2] [,3]
## 25% -0.7957767 0.1894651    0
## 75%  0.6253511 0.7212868    2
```
-->

``` r
x <- matrix(rnorm(200), 20, 10)
apply(x, 1, sum)
```

```
##  [1]  1.8140048  3.0824614 -2.7211809 -5.0145234  1.7467243 -3.5266602
##  [7]  1.3334932  3.5583679  1.5893239  1.5345070  1.4851436  2.0284045
## [13] -5.5429089  1.8442124  0.2154637  5.3294698 -3.7333796 -3.7938359
## [19]  0.8003891  1.7423342
```

``` r
apply(x, 2, mean)
```

```
##  [1]  0.024159081  0.012556520  0.089704142  0.014767588  0.067265778
##  [6]  0.026121116 -0.077858902  0.069852367  0.005350539 -0.043327692
```

---
## apply examples

For sums and means of matrix dimensions, we have some shortcuts

``` r
rowSums  = apply(x, 1, sum)
rowMeans = apply(x, 1, mean)
colSums  = apply(x, 2, sum)
colMeans = apply(x, 2, mean)
```

Check `?rowSums` help on these base R functions

---
## tapply

Apply a function to groups of values of an array defined by the levels of certain factors.

``` r
function (X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)
X is a vector
INDEX is a factor or a list of factors (or else they are coerced to factors)
FUN is a function to be applied
... contains other arguments to be passed FUN
simplify, should we simplify the result?
```

``` r
x <- c(rnorm(10), runif(10), rnorm(10, 1))
f <- gl(3, 10)
#  [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
# Levels: 1 2 3
tapply(x, f, mean)
```

```
##         1         2         3 
## 0.4824261 0.2253849 0.6480306
```

---
## mapply

.panelset[
.panel[.panel-name[mapply]
mapply is a multivariate version of sapply. mapply applies FUN to the first elements of each ... argument, the second elements, the third elements, and so on. Arguments are recycled if necessary.

``` r
function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
FUN is a function to apply
... contains arguments to apply over
MoreArgs is a list of other arguments to FUN.
SIMPLIFY indicates whether the result should be simplified
```
]
.panel[.panel-name[Example 1]

``` r
mapply(rep, 1:4, 4:1)
```

```
## [[1]]
## [1] 1 1 1 1
## 
## [[2]]
## [1] 2 2 2
## 
## [[3]]
## [1] 3 3
## 
## [[4]]
## [1] 4
```
rep(x, length.out)
]
.panel[.panel-name[Example 1]

``` r
mapply(rnorm, mean = 1:3, sd = 1:3, n = seq(5, 15, by = 5))
```

```
## [[1]]
## [1]  2.8939108 -0.7602263  1.9245468  0.4434578  0.8194159
## 
## [[2]]
##  [1] 4.8948806 0.7857371 3.3587249 1.8128847 1.0198274 4.8213188 1.5508524
##  [8] 1.5750090 3.3927569 3.8303650
## 
## [[3]]
##  [1]  0.2298771  6.4406198  1.0924049  0.3406701 -3.9994101  2.5635276
##  [7]  3.9508921  0.8775900  6.7264667  4.8604566  3.2997092  8.4231105
## [13] -1.5072899  3.8569214  5.5371209
```
rnorm(n, mean = 0, sd = 1)
]
]