EPPS 6302 Assignment 4

  1. Read about the package quanteda at https://quanteda.io/

  2. Download quanteda_textanalytics01.R from Teams or class GitHub

  3. Analyze:

    a. Biden-Xi summit data

    Here are the top ten hashtags regarding the Biden-Xi summit:

    [1] "#china"          "#biden"          "#xijinping"     
    [4] "#joebiden"       "#america"        "#americans"     
    [7] "#coronavirus"    "#fentanyl"       "#xi"            
    [10] "#uyghurgenocide"

    b. US presidential inaugural speeches

    i. Any similarities and differences over time and among presidents?

    The presidents used similar words in their speeches. Many of the words common to both previous and current presidents are America-centric, relating to "America" and American values. Below are some charts showcasing these similarities and differences. The first chart is a word cloud of the most common words used by former Presidents George Bush, Barack Obama, and Donald Trump. All three presidents used very similar wording; the largest words, i.e. the most frequently used, related to America, the US, or the country as a whole.

    As can be seen in the chart below, the four most recent presidents, including the current president, Joe Biden, still use similar phrasing. "America", "nation", and "people"/"citizens" topped all four presidents' most commonly used words. The ten most common words still related to America and American values, such as "freedom" and "liberty".
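    A comparison along these lines can be sketched as follows (assuming quanteda and quanteda.textstats are installed): build a dfm of the post-2000 inaugural speeches with stopwords removed, then list each president's ten most frequent terms.

```r
library(quanteda)
library(quanteda.textstats)

# Inaugural speeches by the most recent presidents (post-2000)
recent <- corpus_subset(data_corpus_inaugural, Year > 2000)

# Tokenize, drop punctuation and English stopwords, build the dfm
recent_dfm <- tokens(recent, remove_punct = TRUE) %>%
  tokens_remove(stopwords("english")) %>%
  dfm()

# Ten most frequent terms per president
textstat_frequency(recent_dfm, n = 10, groups = recent_dfm$President)
```

The `groups` argument tallies frequencies separately per president, which makes the shared America-centric vocabulary easy to spot side by side.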

    ii. Analyze positions of different presidents.

    The chart below depicts the relative frequency of the word "American" as used by the ten most recent presidents of the United States. As can be seen, the frequency of usage is sporadic. The most notable feature of the chart, however, is how much more frequently Donald Trump used the word "American" than the other presidents did.

    c. What is Wordfish?

    Here is an excerpt from the quanteda website (https://quanteda.io/) describing what Wordfish is:

    “Wordfish is a Poisson scaling model of one-dimensional document positions (Slapin and Proksch 2008). Wordfish also allows for scaling documents, but compared to Wordscores reference scores/texts are not required. Wordfish is an unsupervised one-dimensional text scaling method, meaning that it estimates the positions of documents solely based on the observed word frequencies.” (Quanteda)

    Here is another description of what Wordfish is from Slapin and Proksch (2008), the developers of Wordfish:

    “Recent advances in computational content analysis have provided scholars promising new ways for estimating party positions. However, existing text-based methods face challenges in producing valid and reliable time-series data. This article proposes a scaling algorithm called WORDFISH to estimate policy positions based on word frequencies in texts. The technique allows researchers to locate parties in one or multiple elections.” (Slapin and Proksch 2008)

    Here is the link to their peer-reviewed article for more information: https://doi.org/10.1111/j.1540-5907.2008.00338.x
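    As a minimal sketch of the technique (assuming quanteda and quanteda.textmodels are installed), Wordfish can be estimated on any dfm with no reference texts; the `dir` argument only anchors the direction of the latent scale by naming two documents expected to sit at opposite ends:

```r
library(quanteda)
library(quanteda.textmodels)

# Build a dfm from the bundled Irish budget debate corpus
toks <- tokens(data_corpus_irishbudget2010, remove_punct = TRUE)
wf_dfm <- dfm(toks)

# Unsupervised one-dimensional scaling; dir = c(6, 5) fixes the
# orientation of the scale using documents 6 and 5 as anchors
wf <- textmodel_wordfish(wf_dfm, dir = c(6, 5))

# Estimated document positions (theta) with standard errors
summary(wf)
```

The estimated theta values are the one-dimensional document positions the Slapin and Proksch (2008) excerpt describes.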

# Sample program for using quanteda for text modeling and analysis
# Use vignette("auth", package = "rtweet") for authentication
# Documentation: vignette("quickstart", package = "quanteda")
# Website: https://quanteda.io/

library(quanteda)
Package version: 3.2.3
Unicode version: 14.0
ICU version: 70.1
Parallel computing: 8 of 8 threads used.
See https://quanteda.io for tutorials and examples.
library(quanteda.textmodels)
library(quanteda.textplots)
library(readr)
library(ggplot2)

# Twitter data about President Biden and Xi summit in November 2021
# Do some background search/study on the event

summit <- read_csv("https://raw.githubusercontent.com/datageneration/datamethods/master/textanalytics/summit_11162021.csv")
Rows: 14520 Columns: 90
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (50): screen_name, text, source, reply_to_screen_name, hashtags, symbol...
dbl  (26): user_id, status_id, display_text_width, reply_to_status_id, reply...
lgl  (10): is_quote, is_retweet, quote_count, reply_count, ext_media_type, q...
dttm  (4): created_at, quoted_created_at, retweet_created_at, account_create...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(summit)
# A tibble: 6 × 90
  user_id status_id created_at          screen_name    text              source 
    <dbl>     <dbl> <dttm>              <chr>          <chr>             <chr>  
1 1.38e18   1.46e18 2021-11-16 20:10:23 DSJ78992721    "Breaking News: … Twitte…
2 2.60e 8   1.46e18 2021-11-16 20:10:17 bradhooperarch "https://t.co/rK… Twitte…
3 3.00e 9   1.46e18 2021-11-16 20:10:10 scarecrow1113  "[Recap] Biden u… Twitte…
4 3.00e 9   1.46e18 2021-11-15 19:24:04 scarecrow1113  "U.S. President … Twitte…
5 1.36e18   1.46e18 2021-11-16 06:22:29 Internl_Leaks  "#BREAKING Biden… Twitte…
6 1.36e18   1.46e18 2021-11-16 20:09:36 Internl_Leaks  "#BREAKING Biden… Twitte…
# … with 84 more variables: display_text_width <dbl>, reply_to_status_id <dbl>,
#   reply_to_user_id <dbl>, reply_to_screen_name <chr>, is_quote <lgl>,
#   is_retweet <lgl>, favorite_count <dbl>, retweet_count <dbl>,
#   quote_count <lgl>, reply_count <lgl>, hashtags <chr>, symbols <chr>,
#   urls_url <chr>, urls_t.co <chr>, urls_expanded_url <chr>, media_url <chr>,
#   media_t.co <chr>, media_expanded_url <chr>, media_type <chr>,
#   ext_media_url <chr>, ext_media_t.co <chr>, ext_media_expanded_url <chr>, …
sum_twt <- summit$text
toks <- tokens(sum_twt)
sumtwtdfm <- dfm(toks)

# Latent Semantic Analysis
sum_lsa <- textmodel_lsa(sumtwtdfm)
summary(sum_lsa)
                Length    Class     Mode   
sk                     10 -none-    numeric
docs               145200 -none-    numeric
features           160020 -none-    numeric
matrix_low_rank 232349040 -none-    numeric
data            232349040 dgCMatrix S4     
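A quick sketch of inspecting the fitted LSA object (field names as returned by quanteda.textmodels' textmodel_lsa): the documents' coordinates in the latent space and the singular values, which indicate how much variation each of the 10 dimensions captures.

```r
# Document coordinates in the first two latent dimensions
head(sum_lsa$docs[, 1:2])

# Singular values of the truncated decomposition (10 by default)
sum_lsa$sk
```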
tweet_dfm <- tokens(sum_twt, remove_punct = TRUE) %>%
  dfm()
head(tweet_dfm)
Document-feature matrix of: 6 documents, 15,941 features (99.89% sparse) and 0 docvars.
       features
docs    breaking news us president biden amp communist china leader xi
  text1        1    1  1         1     1   1         1     2      1  1
  text2        0    0  0         0     0   0         0     0      0  0
  text3        0    0  0         0     1   0         0     0      0  1
  text4        0    0  0         1     1   0         0     0      0  1
  text5        0    0  0         0     1   0         0     0      0  1
  text6        0    0  0         0     1   0         0     0      0  1
[ reached max_nfeat ... 15,931 more features ]
tag_dfm <- dfm_select(tweet_dfm, pattern = "#*")
toptag <- names(topfeatures(tag_dfm, 50))
head(toptag, 10)
 [1] "#china"          "#biden"          "#xijinping"      "#joebiden"      
 [5] "#america"        "#americans"      "#coronavirus"    "#fentanyl"      
 [9] "#xi"             "#uyghurgenocide"
library("quanteda.textplots")
tag_fcm <- fcm(tag_dfm)
head(tag_fcm)
Feature co-occurrence matrix of: 6 by 685 features.
               features
features        #breaking #breakingnews #biden #china #usa #pray4america
  #breaking             0             4      4      5    5             0
  #breakingnews         0             0      4      5    4             0
  #biden                0             0      0    415   44             0
  #china                0             0      0      8   76             0
  #usa                  0             0      0      0    6             0
  #pray4america         0             0      0      0    0             0
               features
features        #joebiden #xijinping #america #americans
  #breaking             0          0        0          0
  #breakingnews         0          0        0          0
  #biden              299        366      301        295
  #china              339        433      308        295
  #usa                 12         14        0          0
  #pray4america         0          0        0          0
[ reached max_nfeat ... 675 more features ]
topgat_fcm <- fcm_select(tag_fcm, pattern = toptag)
textplot_network(topgat_fcm, min_freq = 50, edge_alpha = 0.8, edge_size = 5)

user_dfm <- dfm_select(tweet_dfm, pattern = "@*")
topuser <- names(topfeatures(user_dfm, 50))
head(topuser, 20)
 [1] "@potus"           "@joebiden"        "@politico"        "@eneskanter"     
 [5] "@jendeben"        "@nwadhams"        "@nba"             "@washwizards"    
 [9] "@pelicansnba"     "@capitalonearena" "@kevinliptakcnn"  "@foxbusiness"    
[13] "@morningsmaria"   "@scmpnews"        "@uyghur_american" "@nytimes"        
[17] "@petermartin_pcm" "@nahaltoosi"      "@phelimkine"      "@kaylatausche"   
user_fcm <- fcm(user_dfm)
head(user_fcm, 20)
Feature co-occurrence matrix of: 20 by 741 features.
                 features
features          @youtube @bfmtv @cnn @lauhaim @barackobama @joebiden
  @youtube               0      0    0        0            0         0
  @bfmtv                 0      0    1        1            1         1
  @cnn                   0      0    0        1            1         1
  @lauhaim               0      0    0        0            1         1
  @barackobama           0      0    0        0            0         1
  @joebiden              0      0    0        0            0         3
  @kamalaharris          0      0    0        0            0         0
  @hillaryclinton        0      0    0        0            0         0
  @billclinton           0      0    0        0            0         0
  @cbsnews               0      0    0        0            0         0
                 features
features          @kamalaharris @hillaryclinton @billclinton @cbsnews
  @youtube                    0               0            0        0
  @bfmtv                      1               1            1        1
  @cnn                        1               1            1        1
  @lauhaim                    1               1            1        1
  @barackobama                1               1            1        1
  @joebiden                   1               1            1        1
  @kamalaharris               0               1            1        1
  @hillaryclinton             0               0            1        1
  @billclinton                0               0            0        1
  @cbsnews                    0               0            0        0
[ reached max_feat ... 10 more features, reached max_nfeat ... 731 more features ]
user_fcm <- fcm_select(user_fcm, pattern = topuser)
textplot_network(user_fcm, min_freq = 20, edge_color = "firebrick", edge_alpha = 0.8, edge_size = 5)

# Wordcloud
# based on US presidential inaugural address texts, and metadata (for the corpus), from 1789 to present.
dfm_inaug <- corpus_subset(data_corpus_inaugural, Year <= 1826) %>% 
  tokens(remove_punct = TRUE) %>%
  tokens_remove(stopwords("english")) %>%
  dfm() %>%
  dfm_trim(min_termfreq = 10, verbose = FALSE)
set.seed(100)
textplot_wordcloud(dfm_inaug)

inaug_speech = data_corpus_inaugural

corpus_subset(data_corpus_inaugural, 
              President %in% c("Trump", "Obama", "Bush")) %>%
  tokens(remove_punct = TRUE) %>%
  tokens_remove(stopwords("english")) %>%
  dfm() %>%
  dfm_group(groups = President) %>%
  dfm_trim(min_termfreq = 5, verbose = FALSE) %>%
  textplot_wordcloud(comparison = TRUE)
Warning in wordcloud_comparison(x, min_size, max_size, min_count, max_words, :
throughout could not be fit on page. It will not be plotted.
Warning in wordcloud_comparison(x, min_size, max_size, min_count, max_words, :
children could not be fit on page. It will not be plotted.

textplot_wordcloud(dfm_inaug, min_count = 10,
                   color = c('red', 'pink', 'green', 'purple', 'orange', 'blue'))

data_corpus_inaugural_subset <- 
  corpus_subset(data_corpus_inaugural, Year > 1949)
kwic(tokens(data_corpus_inaugural_subset), pattern = "american") %>%
  textplot_xray()

textplot_xray(
  kwic(tokens(data_corpus_inaugural_subset), pattern = "american"),
  kwic(tokens(data_corpus_inaugural_subset), pattern = "people"),
  kwic(tokens(data_corpus_inaugural_subset), pattern = "communist")
)
theme_set(theme_bw())
g <- textplot_xray(
  kwic(tokens(data_corpus_inaugural_subset), pattern = "american"),
  kwic(tokens(data_corpus_inaugural_subset), pattern = "people"),
  kwic(tokens(data_corpus_inaugural_subset), pattern = "communist")
)
g + aes(color = keyword) + 
  scale_color_manual(values = c("blue", "red", "green")) +
  theme(legend.position = "none")

library("quanteda.textstats")
features_dfm_inaug <- textstat_frequency(dfm_inaug, n = 100)

# Sort by reverse frequency order
features_dfm_inaug$feature <- with(features_dfm_inaug, reorder(feature, -frequency))

ggplot(features_dfm_inaug, aes(x = feature, y = frequency)) +
  geom_point() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

# Get frequency grouped by president
freq_grouped <- textstat_frequency(dfm(tokens(data_corpus_inaugural_subset)), 
                                   groups = data_corpus_inaugural_subset$President)

# Filter the term "american"
freq_american <- subset(freq_grouped, freq_grouped$feature %in% "american")  

ggplot(freq_american, aes(x = group, y = frequency)) +
  geom_point() + 
  scale_y_continuous(limits = c(0, 14), breaks = c(seq(0, 14, 2))) +
  xlab(NULL) + 
  ylab("Frequency") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

dfm_rel_freq <- dfm_weight(dfm(tokens(data_corpus_inaugural_subset)), scheme = "prop") * 100
head(dfm_rel_freq)
Document-feature matrix of: 6 documents, 4,346 features (85.57% sparse) and 4 docvars.
                 features
docs                      my    friends        ,    before          i
  1953-Eisenhower 0.14582574 0.14582574 4.593511 0.1822822 0.10936930
  1957-Eisenhower 0.20975354 0.10487677 6.345045 0.1573152 0.05243838
  1961-Kennedy    0.19467878 0.06489293 5.451006 0.1297859 0.32446463
  1965-Johnson    0.17543860 0.05847953 5.555556 0.2339181 0.87719298
  1969-Nixon      0.28973510 0          5.546358 0.1241722 0.86920530
  1973-Nixon      0.05012531 0.05012531 4.812030 0.2005013 0.60150376
                 features
docs                   begin      the expression       of     those
  1953-Eisenhower 0.03645643 6.234050 0.03645643 5.176814 0.1458257
  1957-Eisenhower 0          5.977976 0          5.034085 0.1573152
  1961-Kennedy    0.19467878 5.580792 0          4.218040 0.4542505
  1965-Johnson    0          4.502924 0          3.333333 0.1754386
  1969-Nixon      0          5.629139 0          3.890728 0.4552980
  1973-Nixon      0          4.160401 0          3.408521 0.3007519
[ reached max_nfeat ... 4,336 more features ]
rel_freq <- textstat_frequency(dfm_rel_freq, groups = dfm_rel_freq$President)

# Filter the term "american"
rel_freq_american <- subset(rel_freq, feature %in% "american")  

ggplot(rel_freq_american, aes(x = group, y = frequency)) +
  geom_point() + 
  scale_y_continuous(limits = c(0, 0.7), breaks = c(seq(0, 0.7, 0.1))) +
  xlab(NULL) + 
  ylab("Relative frequency") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

dfm_weight_pres <- data_corpus_inaugural %>%
  corpus_subset(Year > 2000) %>%
  tokens(remove_punct = TRUE) %>%
  tokens_remove(stopwords("english")) %>%
  dfm() %>%
  dfm_weight(scheme = "prop")

# Calculate relative frequency by president
freq_weight <- textstat_frequency(dfm_weight_pres, n = 10, 
                                  groups = dfm_weight_pres$President)

ggplot(data = freq_weight, aes(x = nrow(freq_weight):1, y = frequency)) +
  geom_point() +
  facet_wrap(~ group, scales = "free") +
  coord_flip() +
  scale_x_continuous(breaks = nrow(freq_weight):1,
                     labels = freq_weight$feature) +
  labs(x = NULL, y = "Relative frequency")

# Only select speeches by Obama and Trump
pres_corpus <- corpus_subset(data_corpus_inaugural, 
                             President %in% c("Obama", "Trump"))

# Create a dfm grouped by president
pres_dfm <- tokens(pres_corpus, remove_punct = TRUE) %>%
  tokens_remove(stopwords("english")) %>%
  tokens_group(groups = President) %>%
  dfm()

# Calculate keyness and determine Trump as target group
result_keyness <- textstat_keyness(pres_dfm, target = "Trump")

# Plot estimated word keyness
textplot_keyness(result_keyness) 

# Plot without the reference text (in this case Obama)
textplot_keyness(result_keyness, show_reference = FALSE)

library("quanteda.textmodels")

# Transform corpus to dfm
data(data_corpus_irishbudget2010, package = "quanteda.textmodels")
ie_dfm <- dfm(tokens(data_corpus_irishbudget2010))

# Set reference scores
refscores <- c(rep(NA, 4), 1, -1, rep(NA, 8))

# Predict Wordscores model
ws <- textmodel_wordscores(ie_dfm, y = refscores, smooth = 1)

# Plot estimated word positions (highlight words and print them in red)
textplot_scale1d(ws,
                 highlighted = c("minister", "have", "our", "budget"), 
                 highlighted_color = "red")

# Get predictions
pred <- predict(ws, se.fit = TRUE)

# Plot estimated document positions and group by "party" variable
textplot_scale1d(pred, margin = "documents",
                 groups = docvars(data_corpus_irishbudget2010, "party"))

# Plot estimated document positions using the LBG transformation and group by "party" variable

pred_lbg <- predict(ws, se.fit = TRUE, rescaling = "lbg")

textplot_scale1d(pred_lbg, margin = "documents",
                 groups = docvars(data_corpus_irishbudget2010, "party"))

# Estimate Wordfish model
wf <- textmodel_wordfish(dfm(tokens(data_corpus_irishbudget2010)), dir = c(6, 5))

# Plot estimated word positions
textplot_scale1d(wf, margin = "features", 
                 highlighted = c("government", "global", "children", 
                                 "bank", "economy", "the", "citizenship",
                                 "productivity", "deficit"), 
                 highlighted_color = "red")

# Plot estimated document positions
textplot_scale1d(wf, groups = data_corpus_irishbudget2010$party)

# Transform corpus to dfm
ie_dfm <- dfm(tokens(data_corpus_irishbudget2010))

# Run correspondence analysis on dfm
ca <- textmodel_ca(ie_dfm)

# Plot estimated positions and group by party
textplot_scale1d(ca, margin = "documents",
                 groups = docvars(data_corpus_irishbudget2010, "party"))