5. Basic textual analysis

Author

G.H. Koo

Learning goals

By the end of this tutorial, you will be able to:

Understand how text data is structured for analysis in R
Tokenize text into words and n-grams
Remove stopwords and clean textual data
Calculate word frequencies and visualize them
Create word clouds
Identify co-occurring words using bigrams

1. Introduction

We will learn how to preprocess text data, remove noise, and analyze patterns in language. First, install and load the packages

packages <- c("tidytext", "tokenizers", "wordcloud", 
              "textdata", "RColorBrewer", "htmltools")
install.packages(packages)

library(tidyverse)
library(tidytext)
library(tokenizers)
library(wordcloud)
library(textdata)
library(htmltools)
library(dplyr)

In this module, we use YouTube data collected via the YouTube API. The dataset was gathered using the keyword “protest” from 12-01-2024 to 02-01-2025.

data <- read.csv("data/protest_YouTube.csv")

Let’s create a subset with five key variables to begin the analysis.

YT_data <- data[, c("channelTitle", "title", "description", 
                    "publishedAt", "video_id")]

2. Pre-processing text

Text data is often unstructured, so it needs to be processed before analysis. One key step is tokenization, which splits text into individual words.

Example sentence: I am going to school today
Tokenized output: I, am, going, to, school, today

This processed text data can then be used for various analyses, such as identifying word frequency (which words appear most often) or detecting sentiment by analyzing the presence of positive or negative words.

2-1. Tokenization

Common words like “the,” “and,” and “to” appear frequently but carry little meaning. These are called stopwords. Let’s tokenize the description column in YT_data and create a word column.

YT_tokenized <- YT_data |>
  unnest_tokens(word, description)

Tokenization breaks text into individual words and removes sentence structure.If you examine the first 20 words, you’ll notice common words like “to,” “the,” “and,” and “of,” which add little (or no) meaning.

head(YT_tokenized$word, 20)

 [1] "people's"  "march"     "adviser"   "ash"       "lee"       "woodard"  
 [7] "henderson" "discusses" "the"       "upcoming"  "march"     "to"       
[13] "protest"   "donald"    "trump's"   "second"    "term"      "and"      
[19] "ahead"     "of"

2-2. Removing stopwords

One of the first steps in Natural Language Processing (NLP) is removing words that carry little to no meaningful information, such as “a,” “the,” and “is” These are known as stopwords.

In tidytext, there is a predefined list of stopwords (stop_words).

YT_tokenized_cleaned <- YT_tokenized |>
  anti_join(stop_words, by = "word")
head(YT_tokenized_cleaned)

Note

To view some preset stopwords, you can simply run ‘stop_words’

2-3. Customize stopwords

If there is a specific stopword you’d like to add—such as “lol”, you can create a custom stopword list and merge it with the existing tidytext stopwords dataset.For example, the following code adds “lol” to the stopword list (creating a new dataframe with your custom stopwords):

Create a custom stopword list (e.g., adding “lol”):

custom_stop <- data.frame(word = c("lol"), lexicon = "custom")

Combine the custom stopwords with the default stopwords

final_stop <- bind_rows(stop_words, custom_stop)

This code removes unwanted words from the tokenized dataset.

YT_tokenized_cleaned <- YT_tokenized |>
  anti_join(final_stop, by = "word")

Try: head(YT_tokenized_cleaned$word, 20) to display the cleaned words with stopwords removed.

3. Word frequency analysis

Count word frequencies

word_counts <- YT_tokenized_cleaned |>
  count(word, sort = TRUE)
head(word_counts, 20)

         word   n
1     protest 180
2        bpsc  53
3        news  53
4       https  45
5    protests  37
6        live  33
7   president  30
8  protesters  28
9       trump  27
10  subscribe  25
11         na  24
12     police  24
13     donald  23
14    trump's  22
15      bihar  21
16      march  21
17         de  20
18         se  20
19     shorts  20
20  thousands  20

Tip

It is natural for the keyword used in data collection to appear most frequently. If you want to remove “protest”, you can add it to the stopword list.

3-1. Visualization

You can also visualize this using ggplot2. For example, to display the top 10 most frequently occurring words:

word_counts <- YT_tokenized_cleaned |>
  count(word, sort = TRUE) |>
  slice_max(n, n = 10)

Here, I selected top 15 words. If you don’t specify a number or use this function, the graph will include the entire dataset, which may be difficult to interpret.

ggplot(word_counts, 
       aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 15 Most Frequent Words",
       x = "Words",
       y = "Frequency")

Word frequency does not capture context, sarcasm, or its underlying meaning. Text analysis should be interpreted carefully and ideally combined with qualitative review.

4. Word cloud

You can also generate a word cloud, which visually represents the most common words in a text. In a word cloud, larger words indicate higher frequency. This can be a useful way to quickly identify the most important terms in a large dataset.

Let’s count word frequency

word_counts <- YT_tokenized_cleaned |>
  count(word, sort = TRUE)

Also, create a word cloud (using wordcloud and RColorBrewer)

wordcloud(words = word_counts$word,
          freq = word_counts$n,
          min.freq = 2,
          max.words = 100,
          colors = brewer.pal(8, "Dark2"),
          random.order = FALSE)

Tip

min.freq = Minimum frequency for a word to be displayed (display words that appear at least twice).
max.words = Maximum words to show.
colors = brewer.pal(8, "Dark2") = Color palette (this palette use the max of 8 colors; you can change it). If you’d like to see what color palettes are available, you can use display.brewer.all().
random.order = FALSE Arrange by frequency.

5. N-gram analysis

An n-gram is a sequence of words. This analysis helps examine sequences of words in text data.

A bigram consists of two consecutive words, such as communication theory. A trigram contains three words in sequence, like school of communication. A 5-gram (quintgram) includes five consecutive words, for example, communication theory and research foundation.

You can use n-gram analysis in R using the unnest_tokens() function to extract meaningful word patterns from text. We previous tokenized our text and looked at each individual word, but if we want to analyze n-grams, we can set the token argument to “ngrams”.

Bigrams (2-word sequence)

YT_bigram <- YT_data |>
  unnest_tokens(bigram, description, token = "ngrams", n = 2)

Trigrams (3-word sequence) Additionally, you can specify the n argument to define the length of the n-gram. For example, setting n = 3 will extract trigrams (three-word sequences).

YT_trigram <- YT_data |>
  unnest_tokens(trigram, description, token = "ngrams", n = 3)
head(YT_trigram$trigram)

[1] "people's march adviser"      "march adviser ash"          
[3] "adviser ash lee"             "ash lee woodard"            
[5] "lee woodard henderson"       "woodard henderson discusses"

Let’s identify top 5 trigram (by frequency)

YT_trigram |>
  count(trigram, sort = TRUE) |>
  head(5)

               trigram   n
1                 <NA> 119
2     subscribe to sun  14
3      news channel to  13
4     sun news channel  13
5 sunnews subscribe to  13

Q. How do we use it in our research?

If you want to analyze how different phrases are used—for instance, if your bigrams show “vaccine mandate,” “vaccine hesitancy,” “vaccine conspiracy,” and “vaccine autism”*, this dataset may primarily reflect anti-vaccine sentiment. By examining the frequency and context of these phrases, you can identify the patterns or trends of online discourse.

Improve the n-gram explanation with a cleaning step

When doing bigrams, many results include stopwords like: ‘of the,’ ‘in the,’ ‘to the.’ You may want to remove stopwords from bigrams:

YT_bigram_clean <- YT_data |>
  unnest_tokens(bigram, description, 
                token = "ngrams", n = 2) |>
  separate(bigram, into = c("word1", "word2"), sep = " ") |>
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word)

YT_bigram_clean now contains two-word combinations, no stopwords, and cleaned phrases ready for frequency analysis

Note

description is the text column being analyzed
unnest_tokens() splits text into tokens
token = "ngrams" tells R to create word sequences
n = 2 means two-word sequences (bigrams)
The new column created is named bigram
Each bigram like “people protest” is split into: word1 = “people” & word2 = “protest”
filter() removes bigrams where either word is a common stopword.

Bi-gram: Words associated with a keyword

We can identify words associated with specific keywords in the dataset.

For example, to find words that appear with “president,” we tokenize the text into bigrams using unnest_tokens(), split each bigram into two words, and filter for pairs that include “president”.

Counting the accompanying words shows which terms most often co-occur with “president,” helping us see how the term is described or used in context.

president_bigrams <- YT_data |>
  unnest_tokens(bigram, description, 
                token = "ngrams", n = 2) |>
  separate(bigram, into = c("word1", "word2"), sep = " ") |>
  filter(word1 == "president" | word2 == "president") |>
  count(word1, word2, sort = TRUE)

Let’s look at the results:

president_bigrams

       word1      word2 n
1  president     donald 7
2  president      elect 6
3    against  president 4
4       news  president 4
5  president    trump's 3
6  president       yoon 3
7         us  president 3
8        for  president 2
9         of  president 2
10        to  president 2
11   america  president 1
12       and  president 1
13        as  president 1
14   hegseth  president 1
15    oppose  president 1
16 president aleksandar 1
17 president        has 1
18 president       news 1
19 president         of 1
20 president      putin 1
21 president  subscribe 1
22 president      trump 1
23 president   vladimir 1
24 president    william 1
25 president   zelensky 1
26   protest  president 1
27     putin  president 1
28   russian  president 1
29   sudan's  president 1
30       u.s  president 1
31      with  president 1
32  zionists  president 1

This captures immediate neighboring words (co-occurrences) such as “president trump” or “president elect.” We will also cover the semantic network analysis which allows us to looks at broader relationships among words across the dataset, not just adjacent ones.

Summary

In this module, you learned how to structure and clean textual data in R. You practiced tokenizing text, removing stopwords, analyzing word frequency, generating visualizations, and examining word sequences using n-grams. These techniques provide a foundation for computational text analysis and prepare you for more advanced methods such as sentiment analysis, topic modeling, and semantic network analysis.