Starting with R

notes
r

Starting with the very basics

Author

Josef Fruehwald

Published

September 4, 2024

Running R Code in a Quarto Notebook

To run R code in a Quarto notebook, you need to insert a “code chunk”. In visual editor mode, you can do that by typing a forward slash (/) and then starting to type “R Code Chunk”. In source editor mode, you need a line with ```{r} (three “backticks” followed by “r” in curly braces), then your code on the following lines, and then a closing line with another ```.

```{r}
1+1
```
[1] 2

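To actually run the code, you can either click on the green play button or press the appropriate hotkey for your system (in RStudio, Ctrl+Shift+Enter on Windows and Linux or Cmd+Shift+Enter on macOS runs the current chunk).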

Mathematical Operations

Addition

1 + 4
[1] 5

Subtraction

1 - 4
[1] -3

Multiplication

5 * 4
[1] 20

Division

5 / 4
[1] 1.25

Exponentiation

5 ^ 4
[1] 625

Order of Operations

Honestly, instead of gambling on how R may or may not interpret PEMDAS, just add parentheses ( ) around every operation in the order you want it to happen.

(5 ^ (2 * 2)) / 6
[1] 104.1667
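
To see why the parentheses matter, here is a minimal sketch comparing the same numbers with and without them. The values are illustrative only and are not taken from the notes.

```r
# Exponentiation binds more tightly than multiplication,
# so these two expressions give different results.
without_parens <- 5 ^ 2 * 2      # evaluated as (5 ^ 2) * 2
with_parens    <- 5 ^ (2 * 2)    # the exponent is computed first

print(without_parens)  # 50
print(with_parens)     # 625

# Exponentiation also happens before a leading minus sign is applied.
print(-2 ^ 2)    # -4
print((-2) ^ 2)  # 4
```

Writing the parentheses out makes the intended order visible to the reader as well as to R.
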
Math

The formula to convert Celsius to Fahrenheit is

\[ \frac{9}{5}\text{C} + 32 \]

Somewhere around 20℃, the website Tops Aff declares it tops aff weather. What temperature is that in ℉?

While in mathematical formulas we can write \(\frac{9}{5}\text{C}\) to mean multiplication, in R code we need to make the multiplication explicit.

# this won't work
# (9/5)C ______

# this will
(9/5) * C ______
((9/5) * C) + 32

Assignment

To assign values to a variable in R, you can use either <- or ->. Most style guides shun ->, but I actually wind up using it a lot.

my_variable <- 4 * 5
print(my_variable)
[1] 20
my_variable / 2
[1] 10
Tip

Assign different values to C to see their conversion to Fahrenheit.
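
As a sketch of what the tip suggests, the code below assigns a value to C and applies the conversion formula, using both assignment arrows. The variable name fahrenheit and the example temperatures are assumptions made for illustration.

```r
# C holds a temperature in Celsius (20 is just an example value)
C <- 20

# Right-hand assignment with -> stores the result in `fahrenheit`
((9/5) * C) + 32 -> fahrenheit
print(fahrenheit)  # 68

# Re-assigning C does not update fahrenheit; the formula must be re-run
C <- 37
fahrenheit <- ((9/5) * C) + 32
print(fahrenheit)  # 98.6
```

The result is deliberately not stored in a variable called F, since F is already R's short form for FALSE.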

Data Types

Numeric

When using a number in R, we can only use digits and dots (.). If we try to enter “one hundred thousand” with a comma separator, we’ll get an error.

big_number <- 100,000
Error: <text>:1:18: unexpected ','
1: big_number <- 100,
                     ^

We also can’t use percent signs (%) or currency symbols ($, £).

Characters

When we type in text without any quotes, R will assume it’s a variable or function that’s already been defined and go looking for it.

large <- 100000
large
[1] 1e+05

If the variable hasn’t been created already, we’ll get an error.

small
Error in eval(expr, envir, enclos): object 'small' not found

If we enter text inside of quotation marks, either single quotes ' or double quotes ", R will instead treat the text as a value that we could, for example, assign to a variable, or just print out.

"small"
[1] "small"
tiny_synonym <- "small"
tiny_synonym
[1] "small"
Common Error

You will often get confused about this and get the Error: object '' not found message. Even if you do this for 15 years, you will still sometimes enter plain text when you meant to put it in quotes, and put text in quotes you meant to enter without. It’s always annoying, but doesn’t mean you’re bad at doing this.
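
The difference between quoted and unquoted text can be sketched as follows. The name not_yet_defined is made up for this illustration, and tryCatch() is used only so that the error message can be inspected without stopping the code.

```r
tiny_synonym <- "small"

# With quotes: R treats the text as a character value
class("small")       # "character"

# Without quotes: R looks up a variable with that name
class(tiny_synonym)  # "character", because the variable holds text

# An undefined bare name raises an error; tryCatch() captures
# the message instead of halting.
msg <- tryCatch(
  not_yet_defined,
  error = function(e) conditionMessage(e)
)
print(msg)  # the message says the object was not found
```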

Exercise

What value is going to be printed below? Change the code so that "fruit" gets printed.

fruit <- "tomato"  # assigns the value "tomato" to the variable fruit
apple <- fruit     # assigns the value of fruit to apple; both now hold "tomato"

print(apple)
[1] "tomato"

To print "fruit" instead, put it in quotes so that R treats it as a character value rather than a variable name.

fruit <- "tomato"
apple <- "fruit"   # assigns the value "fruit" to apple

print(apple)
[1] "fruit"

Logical

There are two specialized values that you could call “True/False”, “Logical”, or “Boolean” values.

# Full names
TRUE
[1] TRUE
FALSE
[1] FALSE
# Short forms (T and F can be overwritten like any variable, so TRUE and FALSE are safer)
T
[1] TRUE
F
[1] FALSE

These are often created using logical comparisons.

large  <- 100000
medium <-    600

large < medium
[1] FALSE
short_word <- "to"

nchar(short_word) == 2
[1] TRUE
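
For reference, here is a short sketch of the other common comparison operators, reusing the variables above. The variable is_larger is introduced only for this example.

```r
large  <- 100000
medium <-    600

large > medium    # TRUE:  greater than
large >= medium   # TRUE:  greater than or equal to
large == medium   # FALSE: equal to (two equals signs)
large != medium   # TRUE:  not equal to

# The result of a comparison can itself be assigned to a variable
is_larger <- large > medium
is_larger         # TRUE

# Comparisons also work on character values
short_word <- "to"
short_word == "to"      # TRUE
nchar(short_word) > 2   # FALSE
```
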
Exercise

Is 16℃ hotter than 50℉?

You’ll want to convert the temperatures to a common scale.

C_16 <- 16
F_50 <- 50

C_to_F <- ((9/5) * C_16) + 32
______

The direction of the comparison doesn’t matter so much.

C_16 <- 16
F_50 <- 50

C_to_F <- ((9/5) * C_16) + 32

C_to_F < F_50

NA

When you have a missing value, that’s given a special NA value.

numbers <- c(1, NA, 5)
numbers
[1]  1 NA  5
Missing vs 0

Distinguishing between missing data and 0 data is super important for data analysis, but isn’t always done well. For example, if we asked 3 people what their names were, and only remembered to ask 2 of them what their age was, we’d get a really different estimate of their average age if we entered 0 for the missing person!

names <- c(
  "Skylar",
  "Oakley",
  "Jessie"
)

ages_0 <- c(
  30,
  0,
  35
)

ages_na <- c(
  30,
  NA,
  35
)

mean(ages_0, na.rm = T)   # missing data entered as 0
[1] 21.66667
mean(ages_na, na.rm = T)  # missing data entered as NA
[1] 32.5
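
A brief sketch of how NA behaves in calculations, and how to find it, may help here. It reuses the ages_na values from above; is.na() is a base R function.

```r
ages_na <- c(30, NA, 35)

# Without na.rm = TRUE, a single NA makes the whole result NA
mean(ages_na)                 # NA

# is.na() marks which values are missing
is.na(ages_na)                # FALSE TRUE FALSE

# Summing the logical vector counts the missing values
sum(is.na(ages_na))           # 1

# na.rm = TRUE drops the NA before averaging
mean(ages_na, na.rm = TRUE)   # 32.5
```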

Vectors

Vectors are basically one-dimensional lists of values (see footnote 1). You can have numeric, character, or logical vectors in R, but you can’t mix types within a single vector. One way to create vectors is with the c() (for concatenate) function. There needs to be a comma (,) between every value that you add to a vector.

digital_words <- c(
  "enshittification",
  "chat",
  "gamers",
  "ice cream so good",
  "millennial pause",
  "skibidi"
)
# Note: newlines, and "whitespace" in general, don't matter inside c().
print(digital_words)
[1] "enshittification"  "chat"              "gamers"           
[4] "ice cream so good" "millennial pause"  "skibidi"          
digital_word_votes <- c(
  111,
  59,
  11,
  11,
  45,
  46
)
print(digital_word_votes)
[1] 111  59  11  11  45  46
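
The rule that a vector can't mix types can be sketched like this: if you try, R does not raise an error but silently converts every value to one common type. The vectors below are made up for illustration.

```r
# Mixing numbers and text: everything becomes character
mixed <- c(1, "two", 3)
print(mixed)          # "1"   "two" "3"
class(mixed)          # "character"

# Mixing logicals and numbers: TRUE becomes 1, FALSE becomes 0
mixed_logical <- c(TRUE, FALSE, 5)
print(mixed_logical)  # 1 0 5
class(mixed_logical)  # "numeric"
```

This is worth remembering when a column of numbers unexpectedly shows up with quotes around each value.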

You can also create vectors of sequential numbers with the : operator.

1:10
 [1]  1  2  3  4  5  6  7  8  9 10
Exercise

Create a vector containing the names of three cities.

You’ll probably want to use the c() function.

cities <- c(______)

Make sure your city names go in quotes, with commas in between.

A solution might be:

cities <- c("Philadelphia", "Edinburgh", "Lexington")
More vector creating possibilities

There are a lot of functions for creating vectors.

seq(from = 1, to = 5, length.out = 10)
 [1] 1.000000 1.444444 1.888889 2.333333 2.777778 3.222222 3.666667 4.111111
 [9] 4.555556 5.000000
seq_along(digital_words)
[1] 1 2 3 4 5 6
rep(c("a", "b"), times = 3)
[1] "a" "b" "a" "b" "a" "b"
rep(c("a", "b"), each = 3)
[1] "a" "a" "a" "b" "b" "b"

Vector Arithmetic

You can do arithmetic on a whole vector of numbers. digital_word_votes is a vector of how many votes each word got. We can get the sum like so:

total_votes <- sum(digital_word_votes)
total_votes
[1] 283

Any single value we add, subtract, multiply, or divide will be applied to each value in the vector.

digital_word_votes * 10
[1] 1110  590  110  110  450  460
Exercise

Convert the raw counts of votes in digital_word_votes to proportional votes.

Proportions are calculated by dividing each single amount by the total amount.

part1 <- 25
part2 <- 75

proportion1 <- part1 / (part1 + part2)
proportion2 <- part2 / (part1 + part2)

print(proportion1)
[1] 0.25
print(proportion2)
[1] 0.75

We already did an important step here by getting the sum of digital_word_votes.

sum(digital_word_votes)
[1] 283
digital_word_votes/sum(digital_word_votes)
[1] 0.39222615 0.20848057 0.03886926 0.03886926 0.15901060 0.16254417
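
As a quick sanity check on the result, proportions of a whole should add up to 1. The sketch below also converts them to rounded percentages; the variable names proportions and percentages are introduced here for illustration.

```r
digital_word_votes <- c(111, 59, 11, 11, 45, 46)

proportions <- digital_word_votes / sum(digital_word_votes)

# Proportions of a whole should add up to 1
sum(proportions)

# Multiply by 100 and round to one decimal place for percentages
percentages <- round(proportions * 100, digits = 1)
print(percentages)  # 39.2 20.8  3.9  3.9 15.9 16.3
```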

Indexing

Indexing from 1

If you’ve never programmed before, this part will make sense, and if you have programmed before, this part may be confusing: R starts counting indices at 1, not at 0 as many other languages do.

If you have a vector, and you want to get the first value from it, you put square brackets [] after the variable name, and put 1 inside.

print(digital_words)
[1] "enshittification"  "chat"              "gamers"           
[4] "ice cream so good" "millennial pause"  "skibidi"          
digital_words[1]
[1] "enshittification"

If you want a range of values from a vector, you can give it a vector of numeric indices.

digital_words[2:5]
[1] "chat"              "gamers"            "ice cream so good"
[4] "millennial pause" 
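
The index doesn't have to be a continuous range. As a small sketch, any numeric vector works, and length() gives the position of the last value:

```r
digital_words <- c(
  "enshittification", "chat", "gamers",
  "ice cream so good", "millennial pause", "skibidi"
)

# Any numeric vector works as an index, not just a range
digital_words[c(1, 6)]   # "enshittification" "skibidi"

# length() gives the number of values, which is also the last index
length(digital_words)                  # 6
digital_words[length(digital_words)]   # "skibidi"
```
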
Exercise

The vector letters is built into R, and contains the 26 letters of the alphabet.

letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"

Get the 13th letter of the alphabet.

letters[13]
[1] "m"

Logical Indexing

Also really useful is the ability to do logical indexing. For example, if we wanted to see which digital words got twenty or fewer votes, we can do

digital_word_votes <= 20
[1] FALSE FALSE  TRUE  TRUE FALSE FALSE

We can use this sequence of TRUE and FALSE values to get the actual words from the digital_words vector.

digital_words[digital_word_votes <= 20]
[1] "gamers"            "ice cream so good"
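
Two related base R helpers can be sketched here: which() turns a logical vector into numeric positions, and sum() counts the TRUE values. The variable low_votes is introduced only for this example.

```r
digital_words <- c(
  "enshittification", "chat", "gamers",
  "ice cream so good", "millennial pause", "skibidi"
)
digital_word_votes <- c(111, 59, 11, 11, 45, 46)

low_votes <- digital_word_votes <= 20

# which() converts TRUE positions into numeric indices
which(low_votes)   # 3 4

# TRUE counts as 1, so sum() counts how many values passed the test
sum(low_votes)     # 2

# Either form of index selects the same words
digital_words[low_votes]
digital_words[which(low_votes)]
```
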
Exercise

Using logical indexing, get the word that got the most votes out of digital_words.

The maximum value in digital_word_votes will be the most votes.

max(digital_word_votes)
[1] 111

If we compare all of the values in digital_word_votes to max(digital_word_votes), we’ll get back a vector with TRUE where the value is the max, and FALSE elsewhere.

digital_word_votes == max(digital_word_votes)
[1]  TRUE FALSE FALSE FALSE FALSE FALSE

We can use that logical comparison as an index vector.

digital_words[digital_word_votes == max(digital_word_votes)]
[1] "enshittification"

To write more readable code, it might be nice to create intermediate variables or use more newlines:

is_max <- digital_word_votes == max(digital_word_votes)

digital_words[is_max]
[1] "enshittification"

or

digital_words[
  digital_word_votes == max(digital_word_votes)
]
[1] "enshittification"

Data Frames

The most common kind of data structure we’re going to be working with is the data frame. Data frames are two-dimensional structures with rows and columns. The values within each column all need to be of the same data type.

library(tibble)

word_df <- tibble(
  type = "digital",
  word = digital_words,
  votes = digital_word_votes
)

print(word_df)
# A tibble: 6 × 3
  type    word              votes
  <chr>   <chr>             <dbl>
1 digital enshittification    111
2 digital chat                 59
3 digital gamers               11
4 digital ice cream so good    11
5 digital millennial pause     45
6 digital skibidi              46

Indexing Data Frames

To get all of the data from a single column of a data frame, we can put $ after the data frame variable name, then the name of the column.

word_df$word
[1] "enshittification"  "chat"              "gamers"           
[4] "ice cream so good" "millennial pause"  "skibidi"          

We’re going to see more interesting ways to get specific rows from a data frame later on in the course, but for now, if we want to subset just the rows that have 20 or fewer votes, we can use subset().

subset(word_df, votes <= 20)
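
To connect subset() back to logical indexing, here is a minimal sketch. It assumes base R's data.frame() in place of tibble() so that it runs without any extra packages, and the names low and same_rows are made up for the example.

```r
# Base R data.frame() stands in for tibble() so no package is needed
word_df <- data.frame(
  type  = "digital",
  word  = c("enshittification", "chat", "gamers",
            "ice cream so good", "millennial pause", "skibidi"),
  votes = c(111, 59, 11, 11, 45, 46),
  stringsAsFactors = FALSE
)

low <- subset(word_df, votes <= 20)
low$word    # "gamers" "ice cream so good"
nrow(low)   # 2

# The same rows via logical indexing: the row index goes before the comma
same_rows <- word_df[word_df$votes <= 20, ]
same_rows$word   # "gamers" "ice cream so good"
```

Inside subset(), column names like votes can be used directly, whereas with square brackets the column has to be spelled out as word_df$votes.
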
Pipe Preview

The “pipe” (|>) is going to play a big role in our R workflow. It takes whatever is on its left-hand side and inserts it as the first argument to the function on its right-hand side. Here’s a preview.

word_df |> 
  subset(votes <= 20)
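
The equivalence between nested calls and piped calls can be sketched as below. The native pipe requires R 4.1 or later, and the variable names here are assumptions made for illustration.

```r
# The native pipe |> requires R 4.1 or later
votes <- c(111, 59, 11, 11, 45, 46)

# Nested call: read from the inside out
nested <- round(mean(votes), 1)

# Piped call: read left to right, each result feeds the next function
piped <- votes |>
  mean() |>
  round(1)

print(nested)
print(piped)
identical(nested, piped)   # TRUE
```
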
Exercise

Subset the word_df data frame to give us back the row with the most votes.

Think back to how we got the word with most votes out of the vector before.

Start with this code, and then modify it

subset(
  word_df, 
  votes <= 20
)

A solution might be:

subset(
  word_df,
  votes == max(votes)
)

Packages

Packages get installed once with install.packages():

# Only needs to be run once ever, or when updating
install.packages("tidyverse")

But they need to be loaded every time with library():

# Needs to be run every time
library(tidyverse)

If you try to load a package that you haven’t installed yet, you’ll get this error:

library(fake_library)
Error in library(fake_library): there is no package called 'fake_library'
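
One way to avoid that error is to check whether a package is installed before loading it. The sketch below uses base R's requireNamespace(); the helper is_installed() is a hypothetical name introduced here, not part of any package.

```r
# requireNamespace() reports whether a package can be loaded,
# returning TRUE or FALSE instead of raising an error
is_installed <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

is_installed("stats")         # TRUE: ships with R
is_installed("fake_library")  # FALSE (assuming no such package is installed)

# A common pattern: install only when the package is missing
# if (!is_installed("tidyverse")) install.packages("tidyverse")
```

The install line is left commented out so that running the chunk doesn't trigger a download.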

Footnotes

  1. The reason they aren’t called “lists” is because there’s another kind of data object called a list that has different properties.

Reuse

CC-BY 4.0

Citation

BibTeX citation:
@online{fruehwald2024,
  author = {Fruehwald, Josef},
  title = {Starting with {R}},
  date = {2024-09-04},
  url = {https://lin611-2024.github.io/notes/meetings/2024-09-04_starting-r.html},
  langid = {en}
}
For attribution, please cite this work as:
Fruehwald, Josef. 2024. “Starting with R.” September 4, 2024. https://lin611-2024.github.io/notes/meetings/2024-09-04_starting-r.html.