Starting with R

notes

Starting with the very basics

Author

Josef Fruehwald

Published

September 4, 2024

Running R Code in a Quarto Notebook

To run R code in a Quarto notebook, you need to insert a “code chunk”. In visual editor mode, you can do that by typing a forward slash (/) and then starting to type “R Code Chunk”. In source editor mode, you need a line with ```{r} (three “backticks” followed by “r” in curly braces), then a few blank lines for your code, followed by a line with another ```.

```{r}
1+1
```

[1] 2

To actually run the code, you can either click on the green play button or press the appropriate hotkey for your system.

Mathematical Operations

Addition

1 + 4

[1] 5

Subtraction

1 - 4

[1] -3

Multiplication

5 * 4

[1] 20

Division

5 / 4

[1] 1.25

Exponentiation

5 ^ 4

[1] 625

Order of Operations

Honestly, instead of gambling on how R may or may not interpret PEMDAS, just add parentheses ( ) around every operation in the order you want it to happen.

(5 ^ (2 * 2)) / 6

[1] 104.1667
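As a small illustration of why the parentheses are worth it, here is a sketch comparing the same numbers with and without explicit grouping. The variable names no_parens and with_parens are made up for this example.

```r
# Without parentheses, ^ is applied first, then * and / from left to right
no_parens <- 5 ^ 2 * 2 / 6      # same as ((5 ^ 2) * 2) / 6

# With parentheses, we decide the order ourselves
with_parens <- (5 ^ (2 * 2)) / 6

no_parens
with_parens

# A leading minus sign is applied after ^, which surprises a lot of people
-2 ^ 2      # -4
(-2) ^ 2    # 4
```

The two results are very different, even though the numbers and operators are the same.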

Math

The formula to convert Celsius to Fahrenheit is

\[ \frac{9}{5}\text{C} + 32 \]

Somewhere around 20℃, the website Tops Aff declares it tops aff weather. What temperature is that in ℉?

While in mathematical formulas we can write $\frac{9}{5}\text{C}$ to mean multiplication, in R code we need to make the multiplication explicit.

# this won't work
# (9/5)C ______

# this will
(9/5) * C ______

Assignment

To assign values to a variable in R, you can use either <- or ->. Most style guides shun ->, but I actually wind up using it a lot.

my_variable <- 4 * 5
print(my_variable)

[1] 20

my_variable / 2

[1] 10
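The example above only uses <-. As a quick sketch of the other direction, the code below assigns the same value both ways. The names left_assigned and right_assigned are hypothetical, chosen just for this illustration.

```r
# Leftward assignment: the variable name comes first
left_assigned <- 4 * 5

# Rightward assignment: the value comes first, the name last
4 * 5 -> right_assigned

# Both variables hold the same value
left_assigned == right_assigned   # TRUE
```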

Tip

Assign different values to C to see their conversion to Fahrenheit.
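One way the tip might look in practice is sketched below. It assumes the temperature is stored in a variable called C, as in the formula, and saves the result in a variable called fahrenheit, which is a name made up for this example.

```r
# C is the temperature in Celsius; try other values here
C <- 20

# The multiplication has to be written out with *
fahrenheit <- ((9 / 5) * C) + 32
fahrenheit   # 68
```

Changing the first line and re-running the chunk is enough to convert a different temperature.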

Data Types

Numeric

When using a number in R, we can only use digits and dots (.). If we try to enter “one hundred thousand” with a comma separator, we’ll get an error.

big_number <- 100,000

Error: <text>:1:18: unexpected ','
1: big_number <- 100,
                     ^

We also can’t use any percent signs (%) or currency symbols ($, £, €).
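Here is a rough sketch of how these numbers can be entered instead. The variable names are hypothetical. It also uses scientific notation (1e5), which goes slightly beyond “digits and dots” but is the same notation R uses when it prints very large numbers.

```r
# One hundred thousand, with no comma separator
big_number <- 100000

# Scientific notation is another way to write the same number
also_big <- 1e5
big_number == also_big   # TRUE

# 25% has to be written as a proportion, without the percent sign
quarter <- 25 / 100
quarter   # 0.25
```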

Characters

When we type in text without any quotes, R will assume it’s a variable or function that’s already been defined and go looking for it.

large <- 100000
large

[1] 1e+05

If the variable hasn’t been created already, we’ll get an error.

small

Error in eval(expr, envir, enclos): object 'small' not found

If we enter text inside of quotation marks, either single quotes ' or double quotes ", R will instead treat the text as a value that we could, for example, assign to a variable, or just print out.

"small"

[1] "small"

tiny_synonym <- "small"
tiny_synonym

[1] "small"
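To check that the two kinds of quotation marks really behave the same way, a sketch like the following can help. The variable names are made up, and class() is a base R function, not covered above, that reports what type of value something is.

```r
# The same text, entered with single and with double quotes
single_quoted <- 'small'
double_quoted <- "small"

single_quoted == double_quoted   # TRUE

# class() reports what type of value a variable holds
class(double_quoted)   # "character"
class(100000)          # "numeric"
```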

Common Error

You will often get confused about this and get the Error: object '' not found message. Even if you do this for 15 years, you will still sometimes enter plain text when you meant to put it in quotes, and put text in quotes you meant to enter without. It’s always annoying, but doesn’t mean you’re bad at doing this.

Exercise

What value is going to be printed below? Change the code so that "fruit" gets printed.

fruit <- "tomato"
apple <- fruit

print(apple)

1: The first line assigns the value "tomato" to the variable fruit.
2: The second line assigns the value of fruit to the variable apple. They’ll share the same value, "tomato".

[1] "tomato"

Logical

There are two specialized values that you could call “True/False” or “Logical” or “Boolean” values.

# Full names
TRUE

[1] TRUE

FALSE

[1] FALSE

# Short Forms
T

[1] TRUE

F

[1] FALSE

These are often created using logical comparisons.

large  <- 100000
medium <-    600

large < medium

[1] FALSE

short_word <- "to"

nchar(short_word) == 2

[1] TRUE

Exercise

Is 16℃ hotter than 50℉?
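One possible way to approach this, sketched with made-up variable names, is to convert the Celsius value to Fahrenheit first and then use a logical comparison.

```r
celsius_temp <- 16
fahrenheit_temp <- 50

# Convert the Celsius value so both temperatures are on the same scale
celsius_in_f <- ((9 / 5) * celsius_temp) + 32
celsius_in_f                      # 60.8

# Now the comparison is between like and like
celsius_in_f > fahrenheit_temp    # TRUE
```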

NA

When a value is missing, it’s represented with the special NA value.

numbers <- c(1, NA, 5)
numbers

[1]  1 NA  5
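To find out where the missing values are, base R has the is.na() function. The sketch below is not from the original notes, but it shows why is.na() is the tool to use, and why comparing against NA with == doesn’t do what you might expect.

```r
numbers <- c(1, NA, 5)

# is.na() returns TRUE wherever a value is missing
is.na(numbers)        # FALSE  TRUE FALSE

# Comparing with == gives NA back, not TRUE or FALSE
numbers == NA         # NA NA NA

# Logical values can be summed, which counts the missing values
sum(is.na(numbers))   # 1
```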

Missing vs 0

Distinguishing between missing data and 0 data is super important for data analysis, but isn’t always done well. For example, if we asked 3 people what their names were, and only remembered to ask 2 of them what their age was, we’d get a really different estimate of their average age if we entered 0 for the missing person!

names <- c(
  "Skylar",
  "Oakley",
  "Jessie"
)

# Entering missing data as 0
ages_0 <- c(
  30,
  0,
  35
)

# Entering missing data as NA
ages_na <- c(
  30,
  NA,
  35
)

mean(ages_0, na.rm = T)

[1] 21.66667

mean(ages_na, na.rm = T)

[1] 32.5
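It may help to see what the na.rm argument is doing here. The sketch below reuses the ages_na values and shows the mean with and without it.

```r
ages_na <- c(30, NA, 35)

# Without na.rm, one missing value makes the whole mean missing
mean(ages_na)                  # NA

# With na.rm = TRUE, the NA is dropped before averaging
mean(ages_na, na.rm = TRUE)    # 32.5

# Which is the same as averaging only the two known ages
(30 + 35) / 2                  # 32.5
```

So na.rm only matters for the NA version of the data. The 0 in ages_0 is a real number as far as R is concerned, and gets averaged in.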

Vectors

Vectors are basically 1 dimensional lists of values.¹ You can have numeric, character or logical vectors in R, but you can’t mix types. One way to create vectors is with the c() (for concatenate) function. There needs to be a comma , between every value that you add to a vector.

digital_words <- c(
  "enshittification",
  "chat",
  "gamers",
  "ice cream so good",
  "millennial pause",
  "skibidi"
)
print(digital_words)

1: New lines, and “whitespace” in general, don’t matter.

[1] "enshittification"  "chat"              "gamers"           
[4] "ice cream so good" "millennial pause"  "skibidi"

digital_word_votes <- c(
  111,
  59,
  11,
  11,
  45,
  46
)
print(digital_word_votes)

[1] 111  59  11  11  45  46
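As for not being able to mix types: if you try, R won’t give you an error. It will quietly convert everything to a single type. Here is a small sketch of what that looks like, with a made-up variable name.

```r
# Mixing a number, some text, and a logical value
mixed <- c(1, "two", TRUE)

# Everything has been converted to character
mixed          # "1"    "two"  "TRUE"
class(mixed)   # "character"

# Mixing numbers and logicals turns TRUE into 1
c(1, TRUE)     # 1 1
```

This is worth knowing because a single stray piece of text in a vector of numbers will turn all of the numbers into text.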

You can also create vectors of sequential numbers with the : operator.

1:10

 [1]  1  2  3  4  5  6  7  8  9 10

Exercise

Create a vector containing the names of three cities.

More vector creating possibilities

There are a lot of functions for creating vectors.

seq(from = 1, to = 5, length.out = 10)

 [1] 1.000000 1.444444 1.888889 2.333333 2.777778 3.222222 3.666667 4.111111
 [9] 4.555556 5.000000

seq_along(digital_words)

[1] 1 2 3 4 5 6

rep(c("a", "b"), times = 3)

[1] "a" "b" "a" "b" "a" "b"

rep(c("a", "b"), each = 3)

[1] "a" "a" "a" "b" "b" "b"

Vector Arithmetic

You can do arithmetic on a whole vector of numbers. digital_word_votes is a vector of how many votes each word got. We can get the sum like so:

total_votes <- sum(digital_word_votes)
total_votes

[1] 283

Any single value we add, subtract, multiply, or divide will be applied to each value in the vector.

digital_word_votes * 10

[1] 1110  590  110  110  450  460

Exercise

Convert the raw counts of votes in digital_word_votes to proportional votes.

Proportions are calculated by dividing each single amount by the total amount.

part1 <- 25
part2 <- 75

proportion1 <- part1/(part1 + part2)
proportion2 <- part2/(part1 + part2)

print(proportion1)

[1] 0.25

print(proportion2)

[1] 0.75
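The same idea can be applied to the whole vector at once. This is one possible sketch, not the only solution. It redefines digital_word_votes so the chunk runs on its own, and vote_proportions is a hypothetical name.

```r
digital_word_votes <- c(111, 59, 11, 11, 45, 46)

# Divide every count by the total number of votes
vote_proportions <- digital_word_votes / sum(digital_word_votes)
vote_proportions

# As a sanity check, proportions should add up to 1
sum(vote_proportions)
```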

Indexing

Indexing from 1

If you’ve never programmed before, this part will make sense, and if you have programmed before, this part will be confusing.

If you have a vector, and you want to get the first value from it, you put square brackets [] after the variable name, and put 1 inside.

print(digital_words)

[1] "enshittification"  "chat"              "gamers"           
[4] "ice cream so good" "millennial pause"  "skibidi"

digital_words[1]

[1] "enshittification"

If you want a range of values from a vector, you can give it a vector of numeric indices.

digital_words[2:5]

[1] "chat"              "gamers"            "ice cream so good"
[4] "millennial pause"
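The vector of indices doesn’t have to be a continuous range. Here is a sketch of a few other possibilities, which go a little beyond what’s described above. It redefines digital_words so it can run on its own.

```r
digital_words <- c(
  "enshittification", "chat", "gamers",
  "ice cream so good", "millennial pause", "skibidi"
)

# Any vector of positions works, not just a range
digital_words[c(1, 6)]    # "enshittification" "skibidi"

# A negative index drops that position instead of selecting it
digital_words[-1]

# length() gives the position of the last value
digital_words[length(digital_words)]   # "skibidi"
```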

Exercise

The vector letters is built into R, and contains the 26 letters of the alphabet.

letters

 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"

Get the 13th letter of the alphabet.
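One way to do it, following the same square bracket pattern as digital_words[1]:

```r
# letters is built into R, so nothing needs to be defined first
letters[13]   # "m"
```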

Logical Indexing

Logical indexing is also really useful. For example, if we wanted to see which digital words got twenty or fewer votes, we could start with a logical comparison:

digital_word_votes <= 20

[1] FALSE FALSE  TRUE  TRUE FALSE FALSE

We can use this sequence of TRUE and FALSE values to get the actual words from the digital_words vector.

digital_words[digital_word_votes <= 20]

[1] "gamers"            "ice cream so good"

Exercise

Using logical indexing, get the word that got the most votes out of digital_words.

If we compare all of the values in digital_word_votes to max(digital_word_votes), we’ll get back a vector with TRUE where the value is the max, and FALSE elsewhere.

digital_word_votes == max(digital_word_votes)

[1]  TRUE FALSE FALSE FALSE FALSE FALSE

We can use that logical comparison as an index vector.

digital_words[digital_word_votes == max(digital_word_votes)]

[1] "enshittification"

To write more readable code, it might be nice to create intermediate variables, or use more newlines.

is_max <- digital_word_votes == max(digital_word_votes)

digital_words[is_max]

[1] "enshittification"

digital_words[
  digital_word_votes == max(digital_word_votes)
]

[1] "enshittification"

Data Frames

The most common kind of data structure we’re going to be working with is the data frame. Data frames are two dimensional structures with rows and columns. The data types within each column all need to be the same.

library(tibble)

word_df <- tibble(
  type = "digital",
  word = digital_words,
  votes = digital_word_votes
)

print(word_df)

# A tibble: 6 × 3
  type    word              votes
  <chr>   <chr>             <dbl>
1 digital enshittification    111
2 digital chat                 59
3 digital gamers               11
4 digital ice cream so good    11
5 digital millennial pause     45
6 digital skibidi              46
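If the tibble package isn’t installed yet, a very similar data frame can be sketched with base R’s data.frame(). The name word_df_base is made up for this example, and the printed result will look a little different from the tibble above.

```r
digital_words <- c(
  "enshittification", "chat", "gamers",
  "ice cream so good", "millennial pause", "skibidi"
)
digital_word_votes <- c(111, 59, 11, 11, 45, 46)

# data.frame() is the base R counterpart of tibble()
word_df_base <- data.frame(
  type = "digital",
  word = digital_words,
  votes = digital_word_votes
)

# The single value "digital" was repeated for all six rows
word_df_base$type

# Each column has one data type; sapply() runs class() on every column
sapply(word_df_base, class)
```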

Navigating data frames

To navigate data frames, there are a few handy functions. First, in RStudio you can launch a viewer with View().

View(word_df)

Keeping things inside the Quarto notebook, other useful functions are summary(), nrow(), ncol() and colnames().

summary(word_df)

     type               word               votes       
 Length:6           Length:6           Min.   : 11.00  
 Class :character   Class :character   1st Qu.: 19.50  
 Mode  :character   Mode  :character   Median : 45.50  
                                       Mean   : 47.17  
                                       3rd Qu.: 55.75  
                                       Max.   :111.00

nrow(word_df)

[1] 6

ncol(word_df)

[1] 3

colnames(word_df)

[1] "type"  "word"  "votes"

Indexing Data Frames

To get all of the data from a single column of a data frame, we can put $ after the data frame variable name, then the name of the column.

word_df$word

[1] "enshittification"  "chat"              "gamers"           
[4] "ice cream so good" "millennial pause"  "skibidi"
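Because a column pulled out with $ is just a vector, everything from the vectors section applies to it. The sketch below assumes R 4.0 or later and uses a base R data frame with the hypothetical name word_df_base so that it runs without any packages.

```r
word_df_base <- data.frame(
  word = c("enshittification", "chat", "gamers",
           "ice cream so good", "millennial pause", "skibidi"),
  votes = c(111, 59, 11, 11, 45, 46)
)

# A column pulled out with $ is an ordinary vector
word_df_base$votes * 10
mean(word_df_base$votes)

# Double square brackets with the column name in quotes do the same job as $
word_df_base[["word"]]

# Logical indexing works here too
word_df_base$word[word_df_base$votes <= 20]   # "gamers" "ice cream so good"
```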

We’re going to have more interesting ways to get specific rows from a data frame later on in the course, but for now, if we want to subset just the rows that have 20 or fewer votes, we can use subset().

subset(word_df, votes <= 20)

Pipe Preview

The “pipe” (|>) is going to play a big role in our R workflow. What it does is take whatever is on its left hand side and insert it as the first argument to the function on its right hand side. Here’s a preview.

word_df |> 
  subset(votes <= 20)
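To check that the piped version really is the same function call, here is a sketch comparing the two. It uses a base R data frame called word_df_base (a made-up name) so it runs without any packages, and it needs R 4.1 or later, which is when |> was added.

```r
word_df_base <- data.frame(
  word = c("enshittification", "chat", "gamers",
           "ice cream so good", "millennial pause", "skibidi"),
  votes = c(111, 59, 11, 11, 45, 46)
)

# Ordinary function call, with the data frame as the first argument
nested <- subset(word_df_base, votes <= 20)

# The same call written with the pipe
piped <- word_df_base |>
  subset(votes <= 20)

identical(nested, piped)   # TRUE

# Pipes can be chained, passing each result along to the next function
word_df_base |>
  subset(votes <= 20) |>
  nrow()                   # 2
```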

Exercise

Subset the word_df data frame to give us back the row with the most votes.

subset(
  word_df,
  votes == max(votes)
)

Packages

Packages get installed once with install.packages().

# Only needs to be run once ever, or when updating
install.packages("tidyverse")

But they need to be loaded every time with library().

# Needs to be run every time
library(tidyverse)

If you try to load a package that you haven’t installed yet, you’ll get this error:

library(fake_library)

Error in library(fake_library): there is no package called 'fake_library'
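If you’d like to check whether a package is installed without triggering that error, base R’s requireNamespace() returns TRUE or FALSE instead. This is just a sketch of one way to do the check, with made-up variable names. It doesn’t install anything.

```r
# "stats" ships with R, so it is always available
has_stats <- requireNamespace("stats", quietly = TRUE)
has_stats   # TRUE

# A package that is not installed gives FALSE instead of an error
has_fake <- requireNamespace("fake_library", quietly = TRUE)
has_fake    # FALSE

# The result can be used to print a reminder
if (!has_fake) {
  message("fake_library is not installed")
}
```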

Footnotes

1. The reason they aren’t called “lists” is that there’s another kind of data object called a list that has different properties.

Reuse

CC-BY 4.0

Citation

BibTeX citation:

@online{fruehwald2024,
  author = {Fruehwald, Josef},
  title = {Starting with {R}},
  date = {2024-09-04},
  url = {https://lin611-2024.github.io/notes/meetings/2024-09-04_starting-r.html},
  langid = {en}
}

For attribution, please cite this work as:

Fruehwald, Josef. 2024. “Starting with R.” September 4, 2024. https://lin611-2024.github.io/notes/meetings/2024-09-04_starting-r.html.