4. Data access: Social media API

Author

G.H. Koo

Learning goals

By the end of this tutorial, you will be able to:

  • Explain how APIs enable researchers to collect data from social media platforms
  • Retrieve Reddit data using RedditExtractoR, including posts, threads, and comments
  • Authenticate and collect data from the YouTube API using the tuber package
  • Filter and organize retrieved data by keyword, channel, and time frame
  • Save and manage collected data for later cleaning and analysis
  • Automate repetitive tasks in R using for() loops and lapply()

Social Media Data Collection

Disclaimer
Access to social media APIs can change at any time depending on platform policies, and specific datasets may not remain available. Researchers should understand how APIs function so they do not rely on a single platform and can adapt to new data sources. Always review current platform policies and documentation before collecting data.

This module demonstrates how to use social media APIs to retrieve data in R. APIs allow researchers to communicate with platforms and request structured data. Note that platforms regulate the type and volume of accessible data, and users must follow each platform’s terms of service. This form of data collection requires basic coding skills developed earlier in the semester.


1. Reddit Data Collection

1.1 Using RedditExtractoR (Wrapper for the Reddit API)

The RedditExtractoR package allows collection of posts, comments, and metadata from Reddit.

CRAN documentation:
https://cran.r-project.org/web/packages/RedditExtractoR/RedditExtractoR.pdf

Important limitation:
Most queries return at most about 1,000 posts per subreddit or search, so the package works best for recent or popular content rather than full historical archives.

# install.packages("RedditExtractoR")
library(RedditExtractoR)

# View available functions
ls("package:RedditExtractoR")

Step 1: Find Relevant Subreddits

subreddits <- find_subreddits("washingtondc")
head(subreddits)

This returns a data frame with subreddit names, descriptions, and subscriber counts.
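To focus on the most active communities, you can sort the results by subscriber count. The column names used below (subreddit, subscribers) match the current RedditExtractoR output but may change between package versions:

subreddits_sorted <- subreddits[order(-subreddits$subscribers), ]
head(subreddits_sorted[, c("subreddit", "subscribers")])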

Step 2: Retrieve Thread URLs

thread_urls <- find_thread_urls(
  subreddit = "washingtondc",
  sort_by = "top",
  period = "week"
)

head(thread_urls)
nrow(thread_urls)

Search by keyword:

weekly_dc_snow <- find_thread_urls(
  keywords = "snow",
  sort_by = "top",
  subreddit = "washingtondc",
  period = "week"
)

head(weekly_dc_snow)
nrow(weekly_dc_snow)

write.csv(weekly_dc_snow, 
          "weekly_DCsnow_reddit.csv", row.names = FALSE)

If you want to use multiple keywords (“a” OR “b” OR “c”):

keywords <- c("snow", "blizzard", "ice")

weekly_dc_snow_ver2 <- do.call(
  rbind,
  lapply(keywords, function(k) {
    find_thread_urls(
      keywords = k,
      sort_by = "top",
      subreddit = "washingtondc",
      period = "week"
    )
  })
)

nrow(weekly_dc_snow_ver2)
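Because a single thread can match more than one keyword, the combined data frame may contain duplicate rows. Assuming the output includes a url column (as in current versions of RedditExtractoR), you can drop repeated threads:

weekly_dc_snow_ver2 <- 
  weekly_dc_snow_ver2[!duplicated(weekly_dc_snow_ver2$url), ]
nrow(weekly_dc_snow_ver2)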

Step 3: Extract Post and Comment Content

thread_content <- 
  get_thread_content(thread_urls$url[1])

# get_thread_content() returns a list of two data frames:
# $threads (post metadata) and $comments (individual comments)
str(thread_content, max.level = 1)

write.csv(thread_content$comments, 
          "DC_thread_content.csv", row.names = FALSE)

Step 4: Retrieve Data from a Specific User

user_content <- get_user_content("enterusernamehere")
head(user_content)

Filtering by Date

library(lubridate)
library(dplyr)

thread_urls$date <- as_datetime(thread_urls$timestamp)

filtered_threads <- thread_urls %>%
  filter(date >= as_datetime("2024-03-01") &
         date <= as_datetime("2024-03-10"))

1.2 Using the Reddit API Directly

For more control, use Reddit’s official API.
You must create an app and authenticate using:

  • client ID
  • client secret
  • username and password

Documentation:
https://support.reddithelp.com/hc/en-us/articles/16160319875092-Reddit-Data-API-Wiki

Rate limit: ~60 requests per minute for personal use.

Steps:

  1. Go to https://www.reddit.com/prefs/apps
  2. Click “Create App”
  3. Select type: Script
  4. Redirect URI: http://localhost:1410
  5. Save client ID and client secret
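Once the app is created, authentication can be sketched with the httr package. The endpoint and password grant type follow Reddit's API documentation; the app name and credentials below are placeholders you must replace, and Reddit requires a descriptive User-Agent string on every request:

library(httr)

# Exchange your credentials for an access token (password grant)
resp <- POST(
  "https://www.reddit.com/api/v1/access_token",
  authenticate("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET"),
  body = list(
    grant_type = "password",
    username   = "YOUR_USERNAME",
    password   = "YOUR_PASSWORD"
  ),
  user_agent("script:my-research-app:v1.0 (by /u/YOUR_USERNAME)")
)
token <- content(resp)$access_token

# Authenticated requests go to oauth.reddit.com
hot_posts <- GET(
  "https://oauth.reddit.com/r/washingtondc/hot",
  add_headers(Authorization = paste("bearer", token)),
  user_agent("script:my-research-app:v1.0 (by /u/YOUR_USERNAME)"),
  query = list(limit = 25)
)

# Inspect the first post's title in the returned JSON
content(hot_posts)$data$children[[1]]$data$title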

2. YouTube API

Requirements:

  • Google account
  • API credentials
  • Daily quota: 10,000 units

Install and Authenticate

library(devtools)
devtools::install_github("soodoku/tuber", 
                         build_vignettes = TRUE)

library(tuber)
options(scipen = 999)

client_ID <- "YOUR_CLIENT_ID"
client_secret <- "YOUR_CLIENT_SECRET"

yt_oauth(client_ID, client_secret, token = "")

A browser window will open for Google sign-in; after granting access, return to R when prompted.


Retrieve Channel Information

channel_resources <- get_channel_stats(
  channel_id = "UCBi2mrWuNuyYy4gbM6fU18Q"
) |> as.data.frame()

channel_resources

To find a channel ID:

YouTube → Channel → About → Share → Copy channel ID


Search Videos by Keyword and Date

protest_abcnews <- yt_search(
  "protest",
  channel_id = "UCBi2mrWuNuyYy4gbM6fU18Q",
  published_after = "2024-12-01T00:00:00Z",
  published_before = "2025-02-01T00:00:00Z",
  max_results = 10
)

nrow(protest_abcnews)
head(protest_abcnews)

write.csv(protest_abcnews, 
          "protest_abcnews.csv", row.names = FALSE)

Retrieve Video Details and Comments

details <- 
  get_video_details(video_id = "Yvb6ko7ZWaI") |> 
  as.data.frame()

stats <- get_stats(video_id = "Yvb6ko7ZWaI") |> 
  as.data.frame()

ABC_comments <- get_comment_threads(
  filter = c(video_id = "Yvb6ko7ZWaI"),
  max_results = 20
)

head(ABC_comments)
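get_comment_threads() returns a data frame whose column names come from the YouTube API; in current versions of tuber these include textOriginal and authorDisplayName, though the exact names can vary by package version:

# Inspect the comment text and the most frequent commenters
ABC_comments$textOriginal[1:3]
head(sort(table(ABC_comments$authorDisplayName), decreasing = TRUE))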

Optional table display:

library(knitr)
kable(head(ABC_comments, 10))

Retrieve Captions

# remotes::install_github("jooyoungseo/youtubecaption")
library(youtubecaption)

url <- "https://www.youtube.com/watch?v=Yvb6ko7ZWaI"
captions <- get_caption(url)

head(captions)
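The captions object is a data frame with one row per caption segment. Assuming it has a text column (as in current versions of youtubecaption), you can collapse the segments into a single transcript string for later text analysis:

transcript <- paste(captions$text, collapse = " ")
nchar(transcript)
substr(transcript, 1, 200)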

3. Loops in R

A for() loop repeats code for each element in a sequence.

for (i in 1:5) {
  print(paste("Student", i))
}
[1] "Student 1"
[1] "Student 2"
[1] "Student 3"
[1] "Student 4"
[1] "Student 5"

Example with text:

seasons <- c("Spring", "Summer", "Fall", "Winter")

for (s in seasons) {
  print(paste("I love", s))
}
[1] "I love Spring"
[1] "I love Summer"
[1] "I love Fall"
[1] "I love Winter"

lapply() Function

lapply() applies a function to each element of a vector or list and always returns a list.

numbers <- c(4, 9, 16)
lapply(numbers, sqrt)
[[1]]
[1] 2

[[2]]
[1] 3

[[3]]
[1] 4

words <- c("snow", "rain", "sun")
lapply(words, toupper)
[[1]]
[1] "SNOW"

[[2]]
[1] "RAIN"

[[3]]
[1] "SUN"

Looping Over Reddit Threads

threads <- find_thread_urls(
  subreddit = "washingtondc",
  sort_by = "top",
  period = "week"
)

urls <- threads$url[1:3]

thread_contents <- lapply(urls, function(u) {
  get_thread_content(u)
})

thread_contents[[1]]
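Each element of thread_contents is a list holding $threads and $comments data frames (the structure returned by current versions of get_thread_content()), so the comments from all three threads can be stacked into one data frame:

all_comments <- do.call(
  rbind,
  lapply(thread_contents, function(x) x$comments)
)
nrow(all_comments)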

Using a for loop:

thread_contents <- list()

for (i in 1:3) {
  thread_contents[[i]] <- 
    get_thread_content(threads$url[i])
}
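When looping over many API calls, add a short pause between requests so you stay under the platform's rate limit. Sys.sleep() is base R; the two-second delay below is an arbitrary polite default, not a documented requirement:

thread_contents <- list()

for (i in 1:3) {
  thread_contents[[i]] <- 
    get_thread_content(threads$url[i])
  Sys.sleep(2)  # pause two seconds between requests
}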

Bonus: Notifications for Long Scripts

This plays a sound when code finishes running.

# install.packages("beepr")
library(beepr)
beep()

Summary

This module introduced practical techniques for collecting digital data directly from online platforms. Because APIs and platform policies change frequently, always verify current documentation before starting a project.