2. Data structure & cleaning

Author

G.H. Koo

Learning goals

By the end of this tutorial, you will be able to:

  • Understand vectors, data types, and data frames in R
  • Convert between different data types
  • Use the tidyverse to clean and summarize data
  • Apply pipe operators to write readable R code

1. Data types and structures in R

Before working with datasets, it helps to understand a few basic terms used in R. R stores information in different data types. Understanding these types helps ensure your data is read and analyzed correctly.

Data types

  • Numeric
    Numbers that can include decimals.
    Example: 2.5 inches of snow.

  • Integer
    Whole numbers without decimals.
    Example: number of cars = 2 (written as 2L in R; a plain 2 is stored as numeric).

  • Character (string)
    Text data such as letters, words, or sentences.
    Example: "Washington DC".

  • Logical
    Values that represent true or false.
    In R, these are written as TRUE or FALSE.

  • Factor
    Categorical variables with a fixed set of possible values (called levels).
    Examples: "Child" vs "Adult", "Group A" vs "Group B".
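The types listed above can be checked with class(). The following is a small sketch using base R only; the object name age_group is made up for illustration.

```r
# Check the type of a value with class()
class(2.5)              # "numeric"
class(2L)               # "integer" (the L suffix marks an integer)
class(2)                # "numeric": a plain 2 is stored as a decimal number
class("Washington DC")  # "character"
class(TRUE)             # "logical"

# A factor stores categories and remembers its levels
age_group <- factor(c("Child", "Adult", "Adult"))
class(age_group)        # "factor"
levels(age_group)       # "Adult" "Child" (alphabetical by default)
```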

Note

When importing data, R may not always assign the correct type automatically. It is good practice to check and convert variable types before analysis.

Data structures

  • Object
    A general term for anything you create and store in R’s memory. For example, when you assign a value using <-, you are creating an object.

  • Vector
    A sequence of values stored in order where all elements are the same type (such as numbers or characters). Vectors are the most basic data structure in R.

  • List
    A container that can store multiple items of different types and lengths. Lists are flexible and can grow as needed. You can access elements by position (e.g., mylist[[1]]) or by name (e.g., mylist$age).

  • Matrix
    A two-dimensional structure made of rows and columns. All values in a matrix must be the same type, and its size is fixed when created.

  • Data frame
    A table-like structure commonly used for datasets. Each column contains one type of data, columns have names, and different columns can store different types (numeric, character, etc.).

Note

These structures form the foundation for organizing and analyzing data in R and will be used throughout the rest of the book.
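
As a quick illustration of the list access styles described above, here is a minimal sketch. The list mylist and its contents are hypothetical and only serve to show [[ ]] and $ in action.

```r
# Hypothetical list describing one person
mylist <- list(name = "Ana", age = 34, scores = c(90, 85, 77))

mylist[[1]]         # by position: "Ana"
mylist$age          # by name: 34
mylist[["scores"]]  # by name with [[ ]]: 90 85 77
length(mylist)      # 3 elements

# Lists can grow: assigning to a new name adds an element
mylist$city <- "Washington DC"
names(mylist)       # "name" "age" "scores" "city"
```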

1.1 Vectors

# Vector of strings
weather <- c("windy", "cloudy", "snowy", "cold", "hot")
weather
[1] "windy"  "cloudy" "snowy"  "cold"   "hot"   
length(weather)
[1] 5
weather[1]           # first element
[1] "windy"
weather[c(1, 4)]     # first and fourth elements
[1] "windy" "cold" 
sort(weather)        # alphabetical order
[1] "cloudy" "cold"   "hot"    "snowy"  "windy" 
weather[1] <- "breezy"
class(weather)
[1] "character"
# Numeric vector
numbers <- c(7, 1, 3, 5)
sort(numbers)
[1] 1 3 5 7
# Sequence
number_seq <- 1:5
number_seq
[1] 1 2 3 4 5
# Logical vector
veracity <- c(TRUE, FALSE, TRUE, FALSE)
veracity
[1]  TRUE FALSE  TRUE FALSE

1.2 Object

first_computer <- 1946
first_moon <- 1969
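
Once values are stored as objects, they can be reused by name. A short sketch, where years_between is an illustrative name that does not appear in the original:

```r
first_computer <- 1946
first_moon <- 1969

# Use the stored values in a calculation
years_between <- first_moon - first_computer
years_between   # 23

# ls() lists the objects currently stored in your workspace
ls()
```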

1.3 Data frames

weather_df <- data.frame(
  day = c("Mon", "Tue", "Wed"),
  temp = c(70, 75, 80),
  snow = c(TRUE, FALSE, TRUE)
)
weather_df
  day temp  snow
1 Mon   70  TRUE
2 Tue   75 FALSE
3 Wed   80  TRUE
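
To confirm that each column holds one type, and to pull a single column out as a vector, something like the following may help. It reuses the weather_df example and only base R functions.

```r
weather_df <- data.frame(
  day = c("Mon", "Tue", "Wed"),
  temp = c(70, 75, 80),
  snow = c(TRUE, FALSE, TRUE)
)

str(weather_df)        # one line per column: name, type, first values
weather_df$temp        # one column as a vector: 70 75 80
mean(weather_df$temp)  # 75
nrow(weather_df)       # 3 rows
ncol(weather_df)       # 3 columns
```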

1.4 List: Vectors and lists

Vector: one data type, stored in a sequence

numbers <- c(1, 2, 3, 4, 5)
numbers
[1] 1 2 3 4 5
numbers[1:3]
[1] 1 2 3
fruits <- c("Apple", "Banana", "Orange")
fruits[1]
[1] "Apple"

List: can store multiple objects of different types

combined <- list(numbers = numbers, fruits = fruits)
combined
$numbers
[1] 1 2 3 4 5

$fruits
[1] "Apple"  "Banana" "Orange"
combined[["numbers"]]
[1] 1 2 3 4 5
combined[["fruits"]]
[1] "Apple"  "Banana" "Orange"
combined[["numbers"]][1:3]
[1] 1 2 3
combined[["fruits"]][1]
[1] "Apple"

1.5 Matrices: Matrix and data frame

Matrix: one data type, arranged in rows and columns

a <- matrix(0, nrow = 5, ncol = 5)
b <- matrix(rnorm(25), nrow = 5, ncol = 5)

Convert matrix to data frame

df <- as.data.frame(b)
df[1, 5]   # row 1, column 5; your value will differ because rnorm() draws random numbers
[1] 0.07675458

Data frame: columns can be different types (each column is one type)

new_dataset <- mtcars
new_dataset$wt[1:5]
[1] 2.620 2.875 2.320 3.215 3.440

Convert a column (example)

new_dataset$wt <- as.character(new_dataset$wt)
class(new_dataset$wt)
[1] "character"
str(new_dataset$wt)
 chr [1:32] "2.62" "2.875" "2.32" "3.215" "3.44" "3.46" "3.57" "3.19" ...

Tip

Matrices must contain a single data type. If you mix numbers and text, R will coerce everything into one type (often character). Data frames are usually more flexible for real-world datasets.

Comparing list, matrix, and data frame

# Data frames require columns of equal length
fruits <- c("Apple", "Banana", "Orange", 
            "Raspberry", "Strawberry")
numbers <- c(1, 2, 3, 4, 5)
df <- data.frame(numbers = numbers, fruits = fruits)
df
  numbers     fruits
1       1      Apple
2       2     Banana
3       3     Orange
4       4  Raspberry
5       5 Strawberry
df[1, ]
  numbers fruits
1       1  Apple
df[, 1]
[1] 1 2 3 4 5
df[1:3, ]
  numbers fruits
1       1  Apple
2       2 Banana
3       3 Orange

Matrix example: mixing numeric + character coerces to character

numbers <- c(1, 2, 3)
fruits  <- c("Apple", "Banana", "Orange")
m <- matrix(c(numbers, fruits), nrow = 3, ncol = 2)
m
     [,1] [,2]    
[1,] "1"  "Apple" 
[2,] "2"  "Banana"
[3,] "3"  "Orange"
class(m)
[1] "matrix" "array" 
mode(m)
[1] "character"

1.6 Data type conversion

x <- 5
as.character(x)  # convert to character
as.numeric(x)    # convert to numeric
as.integer(x)    # convert to integer
as.logical(x)    # convert to logical
temps <- c("30", "25", "40")
class(temps)
[1] "character"

Convert to numeric

temps_num <- as.numeric(temps)
class(temps_num)
[1] "numeric"

Check the type of an object

variable <- "2.5"
class(variable)
[1] "character"

Convert character to numeric

numeric_variable <- as.numeric(variable)
class(numeric_variable)
numeric_variable

Common coercion functions

as.numeric(3.2)
as.character(10)
as.integer("3")
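
For reference, here is a sketch of what these coercion functions return, along with a few related cases that are easy to get wrong. All calls are base R.

```r
as.numeric(3.2)    # 3.2 (already numeric, unchanged)
as.character(10)   # "10"
as.integer("3")    # 3
as.integer(3.9)    # 3: the decimal part is dropped, not rounded
as.logical(5)      # TRUE: any non-zero number becomes TRUE
as.logical(0)      # FALSE
as.numeric(TRUE)   # 1
```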

Caution: non-numeric text becomes NA

as.numeric("three")
[1] NA

Note

When coercion produces NA, it usually means the original value could not be converted. Always check for unexpected NA values after converting variables.
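
One way to check for unexpected NA values after a conversion is sketched below. The vector answers is hypothetical data made up for this example.

```r
# Hypothetical survey answers where one value was typed as a word
answers <- c("30", "25", "three", "40")

# as.numeric() also prints a warning that NAs were introduced
answers_num <- as.numeric(answers)
answers_num                      # 30 25 NA 40

is.na(answers_num)               # FALSE FALSE  TRUE FALSE
sum(is.na(answers_num))          # 1 value failed to convert
answers[is.na(answers_num)]      # "three": the original value behind the NA

mean(answers_num)                # NA: missing values propagate
mean(answers_num, na.rm = TRUE)  # 31.66667
```

Looking up the original values behind each NA, as in the second-to-last step, usually shows whether the problem is a typo, a unit, or a genuinely missing entry.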

Introduction to tidyverse and the mtcars dataset

In this part of the module, we use mtcars, a built-in dataset extracted from the 1974 Motor Trend magazine, to learn basic data cleaning and preprocessing.

Note

When working with large datasets, selecting the variables of interest, filtering rows, arranging, summarizing, and mutating (for example, creating a new column of averaged scores) early on can save both time and memory. We will use the tidyverse package for these tasks. Loading tidyverse attaches the packages you are likely to use in everyday data analysis, such as dplyr, readr, and tidyr.


2. Data cleaning and summarizing

2.1 Take a first look at the data

# install.packages("tidyverse") # Run once if needed
library(tidyverse)
data(mtcars) # Load a sample dataset
?mtcars # To learn more about the dataset

Check the first few rows: use the head() function to print the first few rows of the dataset, and View() to open the full dataset in a spreadsheet-style viewer.

head(mtcars) 
View(mtcars)
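
Beyond head() and View(), a few other base R functions give a quick overview of a dataset. This is an optional sketch and not part of the original walkthrough.

```r
data(mtcars)

dim(mtcars)          # 32 rows, 11 columns
names(mtcars)        # column names
str(mtcars)          # type and first values of every column
summary(mtcars$mpg)  # min, quartiles, median, mean, and max of one column
head(mtcars, n = 3)  # first three rows instead of the default six
```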

2.2 Pipe operators: |>, %>%

Tip

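In R version 4.1 and later, the base pipe operator |> works the same way as %>% for the simple cases in this tutorial. The difference is that |> is built directly into R, whereas %>% comes from the magrittr package, which is loaded along with the tidyverse (including dplyr).

You do not need to turn |> on. In RStudio, you can make the pipe keyboard shortcut (Ctrl/Cmd + Shift + M) insert |> instead of %>% through Tools > Global Options > Code, by checking the "Use native pipe operator" option.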

Basic code (without pipe operator)

mean(c(1, 2, 3, 4)) # mean(): we get the mean score. 
[1] 2.5

Note

Bonus: If we do not use c(), R treats the numbers as separate function arguments instead of one set of data values. Only the first number (1) is used as data, so the result is 1. Try mean(1, 2, 3, 4) and check the result:

mean(1, 2, 3, 4)
[1] 1

Using |> (basic pipe)

It takes the result of one command and passes it into the next command.

c(1, 2, 3, 4) |> mean()
[1] 2.5

Using %>%

It works almost the same way as |>, but it is not part of base R: you need to install and load a package that provides it, such as the tidyverse.

library(tidyverse)
c(1, 2, 3, 4) %>% mean()
[1] 2.5

Pipe operators are useful for longer code because they reduce repetition, improve readability, simplify debugging and modification, and avoid creating temporary variables (more examples below).
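
To make the readability claim concrete, here is a sketch of the same calculation written three ways. It uses only base R (version 4.1 or later for |>); the names values, roots, and avg_root are illustrative.

```r
values <- c(1, 4, 9, 16)

# Nested calls: read from the inside out
round(mean(sqrt(values)), 1)

# Temporary variables: readable, but adds objects you may never reuse
roots <- sqrt(values)
avg_root <- mean(roots)
round(avg_root, 1)

# Pipe: read top to bottom, one step per line
values |>
  sqrt() |>
  mean() |>
  round(1)
```

All three versions return 2.5. The piped version makes it easy to comment out or insert a step without rebalancing parentheses.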

2.3 Data cleaning

We will use the select(), filter(), mutate(), arrange(), and summarize() functions. You are also encouraged to use pipe operators.

Example 1: Select specific columns (Useful when you have a huge dataset).

Three options (with/without pipe operators):

  1. Without pipe operator:

Select three columns from the mtcars dataset

select(mtcars, mpg, hp, wt) 

If you want to see only the first few rows of specific columns:

head(select(mtcars, mpg, hp, wt)) 
                   mpg  hp    wt
Mazda RX4         21.0 110 2.620
Mazda RX4 Wag     21.0 110 2.875
Datsun 710        22.8  93 2.320
Hornet 4 Drive    21.4 110 3.215
Hornet Sportabout 18.7 175 3.440
Valiant           18.1 105 3.460

  2. With base pipe |>

mtcars |>
  select(mpg, hp, wt) |>
  head()
                   mpg  hp    wt
Mazda RX4         21.0 110 2.620
Mazda RX4 Wag     21.0 110 2.875
Datsun 710        22.8  93 2.320
Hornet 4 Drive    21.4 110 3.215
Hornet Sportabout 18.7 175 3.440
Valiant           18.1 105 3.460

  3. With %>%

mtcars %>%
  select(mpg, hp, wt) %>% # Select three columns 
  head() # View the first few rows
                   mpg  hp    wt
Mazda RX4         21.0 110 2.620
Mazda RX4 Wag     21.0 110 2.875
Datsun 710        22.8  93 2.320
Hornet 4 Drive    21.4 110 3.215
Hornet Sportabout 18.7 175 3.440
Valiant           18.1 105 3.460

Example 2: Filter rows where Miles/gallon (mpg) is greater than 30

mtcars |> 
  filter(mpg > 30)

You can also save the filtered rows as a new data frame, ‘new_mtcars’:

new_mtcars <- mtcars |> 
  filter(mpg > 30)

Example 3: Mutate (create a new column) to show TRUE if mpg is over 30 and FALSE otherwise

mtcars |>
  mutate(mpg_over_30 = mpg > 30)  # the comparison already returns TRUE or FALSE

Example 4: Arrange data by horsepower (hp) in descending order

mtcars |> 
  arrange(desc(hp))

Example 5: Summarize data to get the average (mean) mpg and hp

mtcars |>
  summarize(
    avg_mpg = mean(mpg),
    avg_hp = mean(hp)
    )
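
The five verbs can also be chained in a single pipeline. The following is a minimal sketch, assuming dplyr is installed (it is loaded here directly instead of through tidyverse); efficient_cars and wt_lbs are illustrative names that do not appear in the original.

```r
library(dplyr)  # part of the tidyverse

efficient_cars <- mtcars |>
  filter(mpg > 30) |>              # keep cars above 30 mpg
  mutate(wt_lbs = wt * 1000) |>    # wt is recorded in units of 1000 lbs
  arrange(desc(mpg)) |>            # most efficient car first
  select(mpg, hp, wt_lbs)          # keep only the columns of interest

efficient_cars                     # 4 cars remain

# Summarize the filtered data
efficient_cars |>
  summarize(avg_mpg = mean(mpg))   # 31.775
```

Each step receives the result of the step before it, so the order matters: for example, select() comes last here because the earlier steps still need the wt column.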

3. Practice: Working with HDSinRdata

Note

In this practice section, I will demonstrate how to clean and work with datasets from the HDSinRdata package. The HDSinRdata package includes ten datasets used in the chapters and exercises of Paul, Alice (2023), “Health Data Science in R”: https://alicepaul.github.io/health-data-science-using-r/.

After loading the HDSinRdata package, we will use the ‘covidcases’ and ‘mobility’ datasets from it.

#install.packages("HDSinRdata")
library(HDSinRdata) 
data(covidcases) 
data(mobility)

Display the first few rows of the dataset. You can also specify the number of rows to show; here, n = 5.

head(mobility) 
head(mobility, n = 5) 

Display summary statistics for each column (minimum, quartiles, median, mean, and maximum)

summary(mobility) 
    state               date              samples             m50        
 Length:9333        Length:9333        Min.   :   2353   Min.   : 0.018  
 Class :character   Class :character   1st Qu.:  70184   1st Qu.: 3.377  
 Mode  :character   Mode  :character   Median : 224983   Median : 5.834  
                                       Mean   : 307367   Mean   : 6.653  
                                       3rd Qu.: 411125   3rd Qu.: 8.797  
                                       Max.   :2625149   Max.   :59.890  
   m50_index      
 Min.   :   0.00  
 1st Qu.:  46.00  
 Median :  72.54  
 Mean   :  72.92  
 3rd Qu.:  94.97  
 Max.   :1563.00  

Pre-processing the data (select and summarize)

Check the names of the columns

colnames(covidcases)
colnames(mobility)

Display the frequency of unique values in the ‘state’ column

table(covidcases$state)
table(mobility$state)

Filter rows where the ‘state’ column is ‘Hawaii’. You can filter directly or create a new data frame with the filtered rows (recommended).

Directly filter:

covidcases |> 
  filter(state == "Hawaii")
# A tibble: 101 × 5
   state  county    week weekly_cases weekly_deaths
   <chr>  <chr>    <dbl>        <int>         <int>
 1 Hawaii Honolulu    10            2             0
 2 Hawaii Maui        11            3             0
 3 Hawaii Hawaii      11            1             0
 4 Hawaii Honolulu    11            6             0
 5 Hawaii Kauai       11            2             0
 6 Hawaii Maui        12            8             0
 7 Hawaii Hawaii      12            4             0
 8 Hawaii Honolulu    12           50             0
 9 Hawaii Kauai       12            1             0
10 Hawaii Maui        13           14             0
# ℹ 91 more rows

Create a new dataframe:

hawaii_covidcases <- covidcases |>
  filter(state == "Hawaii")
print(hawaii_covidcases)
hawaii_mobility <- mobility |>
  filter(state == "Hawaii")
print(hawaii_mobility)

You can also use nrow() to count the rows in each filtered dataset

nrow(hawaii_covidcases)  
[1] 101
nrow(hawaii_mobility)
[1] 183
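
The same select() and summarize() verbs from the mtcars examples apply to the filtered Hawaii data. This is a sketch, assuming the HDSinRdata and dplyr packages are installed and that the column names match the tibble printed above; total_cases and avg_weekly_cases are illustrative names, and na.rm = TRUE is added as a precaution in case any counts are missing.

```r
library(dplyr)
library(HDSinRdata)
data(covidcases)

hawaii_covidcases <- covidcases |>
  filter(state == "Hawaii")

# Keep only the columns needed for a county-level look
hawaii_covidcases |>
  select(county, week, weekly_cases) |>
  head()

# Total and average weekly cases across all Hawaii rows
hawaii_covidcases |>
  summarize(
    total_cases = sum(weekly_cases, na.rm = TRUE),
    avg_weekly_cases = mean(weekly_cases, na.rm = TRUE)
  )
```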

Summary

In this chapter, you learned how to work with core R structures such as vectors, data frames, objects, lists, and matrices. You also practiced converting between data types, using pipe operators to write clear and readable code, and cleaning and summarizing real-world datasets. These foundational skills prepare you to move from data preparation to analysis. In the next chapter, we will build on these concepts to explore data visualization and interpretation.