G.H. Koo
By the end of this tutorial, you will be able to:
Before working with datasets, it helps to understand a few basic terms used in R. R stores information in different data types. Understanding these types helps ensure your data is read and analyzed correctly.
Numeric
Numbers that can include decimals.
Example: 2.5 inches of snow.
Integer
Whole numbers without decimals.
Example: number of cars = 2. In R, a bare 2 is stored as numeric; write 2L to store it as an integer.
Character (string)
Text data such as letters, words, or sentences.
Example: "Washington DC".
Logical
Values that represent true or false.
In R, these are written as TRUE or FALSE.
Factor
Categorical variables with a fixed set of possible values (called levels).
Examples: "Child" vs "Adult", "Group A" vs "Group B".
When importing data, R may not always assign the correct type automatically. It is good practice to check and convert variable types before analysis.
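As a minimal sketch of checking types, the snippet below creates one value of each type described above and inspects it with class(). The object names (snow_inches, n_cars, city, is_snowing, age_group) are hypothetical and chosen only to mirror the examples in the definitions.

```r
# Hypothetical example values, one per data type described above
snow_inches <- 2.5                                   # numeric (decimal number)
n_cars      <- 2L                                    # integer (the L suffix requests integer storage)
city        <- "Washington DC"                       # character
is_snowing  <- TRUE                                  # logical
age_group   <- factor(c("Child", "Adult", "Adult"))  # factor with two levels

class(snow_inches)   # "numeric"
class(n_cars)        # "integer"
class(city)          # "character"
class(is_snowing)    # "logical"
class(age_group)     # "factor"
levels(age_group)    # "Adult" "Child" (levels are sorted alphabetically by default)

# A bare whole number is numeric unless you convert it
class(2)               # "numeric"
class(as.integer(2))   # "integer"
```

Running class() on each column after importing a dataset is a quick way to spot variables that were read in with an unexpected type.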
Object
A general term for anything you create and store in R’s memory. For example, when you assign a value using <-, you are creating an object.
Vector
A sequence of values stored in order where all elements are the same type (such as numbers or characters). Vectors are the most basic data structure in R.
List
A container that can store multiple items of different types and lengths. Lists are flexible and can grow as needed. You can access elements by position (e.g., mylist[[1]]) or by name (e.g., mylist$age).
Matrix
A two-dimensional structure made of rows and columns. All values in a matrix must be the same type, and its size is fixed when created.
Data frame
A table-like structure commonly used for datasets. Each column contains one type of data, columns have names, and different columns can store different types (numeric, character, etc.).
These structures form the foundation for organizing and analyzing data in R and will be used throughout the rest of the book.
[1] "windy" "cloudy" "snowy" "cold" "hot"
[1] 5
[1] "windy"
[1] "windy" "cold"
[1] "cloudy" "cold" "hot" "snowy" "windy"
[1] "character"
[1] 1 3 5 7
[1] 1 2 3 4 5
[1] TRUE FALSE TRUE FALSE
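The code that produced the output above is not shown. One sequence of calls consistent with it might look like the following sketch; the vector name weather is an assumption, and each line corresponds to one line of output in order.

```r
# `weather` is a hypothetical name; the original code is not shown
weather <- c("windy", "cloudy", "snowy", "cold", "hot")

weather               # print every element
length(weather)       # number of elements: 5
weather[1]            # first element: "windy"
weather[c(1, 4)]      # first and fourth elements: "windy" "cold"
sort(weather)         # alphabetical order
class(weather)        # "character"

seq(1, 7, by = 2)             # 1 3 5 7
1:5                           # 1 2 3 4 5
c(TRUE, FALSE, TRUE, FALSE)   # a logical vector
```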
Vector: one data type, stored in a sequence
[1] 1 2 3 4 5
[1] 1 2 3
[1] "Apple"
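A sketch of code that could produce the three lines of vector output above. The names numbers and fruits are assumptions borrowed from the column names of the data frame printed later in this section.

```r
# Hypothetical vectors; names mirror the data frame columns shown later
numbers <- c(1, 2, 3, 4, 5)
fruits  <- c("Apple", "Banana", "Orange", "Raspberry", "Strawberry")

numbers        # the whole numeric vector
numbers[1:3]   # the first three elements: 1 2 3
fruits[1]      # the first element: "Apple"
```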
List: can store multiple objects of different types
Matrix: one data type, arranged in rows and columns
Convert matrix to data frame
Data frame: columns can be different types (each column is one type)
Convert a column (example)
[1] "character"
chr [1:32] "2.62" "2.875" "2.32" "3.215" "3.44" "3.46" "3.57" "3.19" ...
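The headings above name five steps whose code is not shown. The sketch below illustrates each step in turn. All object names (my_list, m, m_df, df, cars) are hypothetical. The last step works on a copy of mtcars so the built-in dataset is left unchanged, and its str() call gives the chr [1:32] output shown above.

```r
# List: can store multiple objects of different types
my_list <- list(name = "Apple", count = 3L, in_stock = TRUE, prices = c(1.2, 1.5))
my_list[[1]]     # access by position: "Apple"
my_list$count    # access by name: 3

# Matrix: one data type, arranged in rows and columns
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)           # 2 rows, 3 columns

# Convert matrix to data frame
m_df <- as.data.frame(m)
class(m_df)      # "data.frame"
names(m_df)      # default column names: "V1" "V2" "V3"

# Data frame: columns can be different types (each column is one type)
df <- data.frame(
  numbers = 1:3,
  fruits  = c("Apple", "Banana", "Orange"),
  stringsAsFactors = FALSE   # keep text as character on every R version
)
sapply(df, class)            # numbers is "integer", fruits is "character"

# Convert a column (example): numeric weight column to character
cars <- mtcars               # copy, so the built-in mtcars is not modified
cars$wt <- as.character(cars$wt)
class(cars$wt)               # "character"
str(cars$wt)                 # chr [1:32] "2.62" "2.875" ...
```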
Matrices must contain a single data type. If you mix numbers and text, R will coerce everything into one type (often character). Data frames are usually more flexible for real-world datasets.
  numbers     fruits
1       1      Apple
2       2     Banana
3       3     Orange
4       4  Raspberry
5       5 Strawberry
  numbers fruits
1       1  Apple
[1] 1 2 3 4 5
  numbers fruits
1       1  Apple
2       2 Banana
3       3 Orange
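The data frame output above could come from code along these lines. This is a sketch: the object name df is an assumption, while the column names and values are taken from the printed output.

```r
# Build the data frame; `df` is a hypothetical name
df <- data.frame(
  numbers = c(1, 2, 3, 4, 5),
  fruits  = c("Apple", "Banana", "Orange", "Raspberry", "Strawberry"),
  stringsAsFactors = FALSE
)

df            # print all five rows
df[1, ]       # first row, all columns
df$numbers    # one column, returned as a vector
head(df, 3)   # first three rows
```

The pattern df[row, column] selects by position, and leaving the column slot empty (as in df[1, ]) keeps every column.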
Matrix example: mixing numeric + character coerces to character
Convert to numeric
Check the type of an object
Convert character to numeric
Common coercion functions
Caution: non-numeric text becomes NA
When coercion produces NA, it usually means the original value could not be converted. Always check for unexpected NA values after converting variables.
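The coercion steps listed above can be sketched as follows. All object names and example values are hypothetical. suppressWarnings() is used only to keep the example quiet; without it, R prints the warning "NAs introduced by coercion".

```r
# Matrix example: mixing numeric + character coerces to character
# (c() already converts everything to character; the matrix keeps that single type)
mixed <- matrix(c(1, 2, "three", 4), nrow = 2)
typeof(mixed)          # "character"

# Check the type of an object
x <- c("1", "2", "3")
class(x)               # "character"

# Convert character to numeric
x_num <- as.numeric(x)
class(x_num)           # "numeric"
sum(x_num)             # 6

# Common coercion functions
as.character(10)       # "10"
as.integer("7")        # 7
as.logical("TRUE")     # TRUE
as.factor(c("a", "b", "a"))

# Caution: non-numeric text becomes NA
y <- suppressWarnings(as.numeric(c("10", "abc", "30")))
y                      # 10 NA 30
sum(is.na(y))          # 1 value could not be converted
```

Counting missing values with sum(is.na(...)) before and after a conversion is a simple way to detect values that failed to convert.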
In this part of the module, we will now use mtcars, a built-in dataset from Motor Trend magazine (1974), to learn basic data cleaning and preprocessing.
When working with large datasets, selecting the variables of interest, filtering rows, summarizing, arranging, and mutating (for example, creating a new column that averages several scores) can save considerable time and memory. We will use the ‘tidyverse’ package for these tasks. Loading tidyverse automatically loads packages that you are likely to use in everyday data analysis, such as dplyr, readr, and tidyr.
Check the first few rows: use the head() function to print the first few rows of the dataset, and View() to open the full dataset in a viewer.
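A minimal sketch of inspecting mtcars with base R. View() is left commented out because it opens an interactive viewer and is meant to be run from a session such as RStudio.

```r
head(mtcars)       # first six rows (the default)
head(mtcars, 3)    # first three rows
dim(mtcars)        # 32 rows, 11 columns
names(mtcars)      # column names

# View(mtcars)     # opens the dataset in a spreadsheet-style viewer (interactive only)
```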
|>, %>%
In R version 4.1 and later, the base pipe operator |> works in much the same way as %>%. The difference is that |> is built directly into R, so you do not need to install or load a package such as dplyr to use it.
In RStudio, you can make the pipe keyboard shortcut insert |> instead of %>% through Tools > Global Options > Code.
Basic code (without pipe operator)
Bonus: If we do not use c(), R treats the numbers as separate function arguments instead of as one set of data values. Try mean(1, 2, 3, 4) and check its result.
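The difference can be seen by running both calls. In the second call only the first value is treated as data, because mean() matches the remaining values to its other arguments (trim, na.rm, and so on).

```r
# With c(): the four numbers form one vector, which is the data
mean(c(1, 2, 3, 4))   # 2.5

# Without c(): only the first number is the data; the rest become other arguments
mean(1, 2, 3, 4)      # 1
```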
The pipe operator |> takes the result of one command and passes it into the next command.
The %>% operator works almost the same way as |> but requires installing and loading a package such as dplyr.
Pipe operators are useful for longer code because they reduce repetition, improve readability, simplify debugging and modification, and avoid creating temporary variables (more examples below).
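A small sketch comparing nested calls with piped calls, using only base R (the |> operator requires R 4.1 or later). The vector name scores is hypothetical.

```r
scores <- c(1, 2, 3, 4)   # hypothetical data

# Nested calls: must be read from the inside out
round(sqrt(mean(scores)), 2)               # 1.58

# Piped calls: read left to right, one step at a time
scores |> mean() |> sqrt() |> round(2)     # 1.58

# The same pipeline with %>% (requires a package such as dplyr to be loaded):
# scores %>% mean() %>% sqrt() %>% round(2)
```

Both versions compute the same value; the piped form lists the steps in the order they happen, which makes longer pipelines easier to follow and edit.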
We will use the select(), filter(), mutate(), arrange(), and summarize() functions. You are also encouraged to use pipe operators.
Three options (with/without pipe operators):
Select three columns from the mtcars dataset
If you want to see the first few rows of specific columns:
                   mpg  hp    wt
Mazda RX4         21.0 110 2.620
Mazda RX4 Wag     21.0 110 2.875
Datsun 710        22.8  93 2.320
Hornet 4 Drive    21.4 110 3.215
Hornet Sportabout 18.7 175 3.440
Valiant           18.1 105 3.460
                   mpg  hp    wt
Mazda RX4         21.0 110 2.620
Mazda RX4 Wag     21.0 110 2.875
Datsun 710        22.8  93 2.320
Hornet 4 Drive    21.4 110 3.215
Hornet Sportabout 18.7 175 3.440
Valiant           18.1 105 3.460
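The original code for the three options is not shown. The sketch below assumes they are a plain function call, the base pipe, and the %>% pipe, and assumes the dplyr package is installed. Each option prints the same six rows of mpg, hp, and wt.

```r
library(dplyr)

# Option 1: without a pipe operator
head(select(mtcars, mpg, hp, wt))

# Option 2: with the base pipe |> (R 4.1 or later)
mtcars |>
  select(mpg, hp, wt) |>
  head()

# Option 3: with the %>% pipe (available once dplyr is loaded)
mtcars %>%
  select(mpg, hp, wt) %>%
  head()
```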
You can also create a new filtered data frame, ‘new_mtcars’.
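A sketch of building new_mtcars with filter(), mutate(), arrange(), and summarize(). The filter condition (four-cylinder cars) and the new column wt_lbs are assumptions chosen for illustration, since the original code is not shown. The example assumes dplyr is installed.

```r
library(dplyr)

new_mtcars <- mtcars |>
  filter(cyl == 4) |>              # keep only four-cylinder cars (hypothetical condition)
  mutate(wt_lbs = wt * 1000) |>    # wt is recorded in units of 1000 lbs
  arrange(desc(mpg))               # sort from highest to lowest mpg

nrow(new_mtcars)                   # 11 cars remain after filtering

# Summarize the filtered data in a single row
new_mtcars |>
  summarize(mean_mpg = mean(mpg), n_cars = n())
```

Assigning the result to a new object leaves mtcars itself unchanged, so you can return to the full dataset at any time.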
In this practice question, I will demonstrate how to clean and work with datasets from the HDSinRdata package. The HDSinRdata package includes ten datasets used in the chapters and exercises of Paul, Alice (2023) “Health Data Science in R” https://alicepaul.github.io/health-data-science-using-r/.
After loading the HDSinRdata package, we will use the ‘covidcases’ and ‘mobility’ datasets from the package.
head() displays the first few rows of the dataset. You can also specify the number of rows to view; here, it shows 5.
summary() displays the median, mean, minimum, maximum, and quartiles of each numeric column.
    state               date              samples             m50
 Length:9333        Length:9333        Min.   :   2353   Min.   : 0.018
 Class :character   Class :character   1st Qu.:  70184   1st Qu.: 3.377
 Mode  :character   Mode  :character   Median : 224983   Median : 5.834
                                       Mean   : 307367   Mean   : 6.653
                                       3rd Qu.: 411125   3rd Qu.: 8.797
                                       Max.   :2625149   Max.   :59.890
   m50_index
 Min.   :   0.00
 1st Qu.:  46.00
 Median :  72.54
 Mean   :  72.92
 3rd Qu.:  94.97
 Max.   :1563.00
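A sketch of the loading and inspection steps, assuming the HDSinRdata package has been installed (for example with install.packages("HDSinRdata")). The summary output above lists the columns state, date, samples, m50, and m50_index, which indicates it was produced from the mobility dataset.

```r
library(HDSinRdata)

# Load the two datasets used in this practice question
data(covidcases)
data(mobility)

head(mobility)       # first six rows (the default)
head(mobility, 5)    # first five rows
summary(mobility)    # min, quartiles, median, mean, and max for numeric columns
```

For character columns such as state and date, summary() reports only the length and class, which is why those columns show no mean or median.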
Check the names of the columns
Display the frequency of unique values in the ‘state’ column
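These two steps can be sketched with base R functions. The text does not say which dataset they were run on; covidcases is assumed here because it is the dataset filtered in the next step. The example assumes HDSinRdata is installed.

```r
library(HDSinRdata)
data(covidcases)

# Check the names of the columns
names(covidcases)

# Display the frequency of unique values in the 'state' column
table(covidcases$state)
```

table() returns one count per distinct value, so the result shows how many rows each state contributes to the dataset.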
Filter rows where the ‘state’ column is ‘Hawaii’. You could filter the dataset directly or create a new data frame with the filtered rows (recommended).
Directly filter:
# A tibble: 101 × 5
   state  county    week weekly_cases weekly_deaths
   <chr>  <chr>    <dbl>        <int>         <int>
 1 Hawaii Honolulu    10            2             0
 2 Hawaii Maui        11            3             0
 3 Hawaii Hawaii      11            1             0
 4 Hawaii Honolulu    11            6             0
 5 Hawaii Kauai       11            2             0
 6 Hawaii Maui        12            8             0
 7 Hawaii Hawaii      12            4             0
 8 Hawaii Honolulu    12           50             0
 9 Hawaii Kauai       12            1             0
10 Hawaii Maui        13           14             0
# ℹ 91 more rows
Create a new dataframe:
You can also use nrow() to count how many rows there are in the filtered dataset.
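A sketch of both filtering approaches followed by the row count, assuming dplyr and HDSinRdata are installed. The name hawaii_cases is hypothetical.

```r
library(dplyr)
library(HDSinRdata)
data(covidcases)

# Directly filter: prints the matching rows without saving them
covidcases |>
  filter(state == "Hawaii")

# Create a new data frame: the filtered rows can be reused later
hawaii_cases <- covidcases |>
  filter(state == "Hawaii")

nrow(hawaii_cases)   # number of rows for Hawaii (101 in the output shown above)
```

Saving the filtered rows under a new name is the recommended option because later steps can work on hawaii_cases without repeating the filter.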
In this chapter, you learned how to work with core R structures such as vectors, data frames, objects, lists, and matrices. You also practiced converting between data types, using pipe operators to write clear and readable code, and cleaning and summarizing real-world datasets. These foundational skills prepare you to move from data preparation to analysis. In the next chapter, we will build on these concepts to explore data visualization and interpretation.