In a few cases, one can find CSV or other types of flat files available on the internet. However, most of the time the data are not organised into a file that you can directly download into R. In these cases, we must use an Application Programming Interface (API): a description of the requests that can be sent to a certain service (database, website, etc.) and the kind of data that are returned. Many sources of data have made their data available via APIs over the internet; a computer program, or client, makes requests to the server, and the server responds with the data, or with an error message.
We have seen the functionality of some packages (tidyverse, wbstats, eurostat) that provide API wrappers and make life easier. However, there are many cases where you would like to get data or use a service with an API query when no ready-made R package exists; a minimal sketch of what such a raw API call looks like is shown below. In addition, one can also scrape data off a website, such as tables that appear in a Wikipedia entry, or download someone's tweets; we will be looking at this in another section.
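As a taster, here is a minimal sketch of calling an API directly from R without a wrapper package, using the httr and jsonlite packages; the URL is a placeholder, and the shape of the response depends entirely on the service you query.

library(httr)     # send HTTP requests
library(jsonlite) # parse JSON responses

# placeholder endpoint -- replace with the URL of the API you actually want to query
url <- "https://api.example.com/v1/data"

# send a GET request, optionally with query parameters appended to the URL
resp <- GET(url, query = list(format = "json"))

if (status_code(resp) == 200) { # 200 means the request succeeded
  parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
  str(parsed)                   # inspect whatever structure the service returned
} else {
  warning("Request failed with HTTP status ", status_code(resp))
}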
Many APIs require you to register for access. This allows them to track who is querying their services and, more importantly, to manage demand: if you submit too many queries too quickly, you might be rate-limited and your requests de-prioritised or blocked. You should always check the API access policy of the website to determine what these limits are.
Let us consider an example. OpenCage (opencagedata.com) provides an API to geocode, namely to convert back and forth between geographical coordinates (longitude/latitude) and addresses. I actually believe this is a better geocoder than the one used in Google Maps.
In order to be able to use their service you must:

- Install the package with install.packages("opencage").
- Go to the opencagedata website and sign up for an account; make sure you select both forward and reverse geocoding. By default you get the free service, which allows 2,500 requests/day and limits you to 1 request/sec.
- Once you register, you get an email with your API key; alternatively, you can go to the dashboard, where you can see and copy your API key.
All functions of the opencage package will conveniently look for your API key, so before using the service, you must save your API key as an R environment variable, rather than having to input it manually in every single function call.
To save your API key, you must create (or edit) a file called .Renviron; this is a hidden file that lives in your home directory. The easiest way to find and edit .Renviron is with a function from the usethis package. In R, after you load the usethis package with library(usethis), you just invoke
usethis::edit_r_environ()
Your .Renviron file should show up in your editor, where you add a line

OPENCAGE_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

with your own, unique API key. Before you exit, make sure your .Renviron file ends with a blank line, then save and close it. Restart RStudio after modifying .Renviron in order to load the API key into memory. To check everything worked, go to the console and type Sys.getenv("OPENCAGE_KEY").
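To make the check concrete, here is a small sketch: the first line simply prints the key, and the guard is just an illustrative pattern you could put at the top of a script.

# confirm the key is now visible to your R session
Sys.getenv("OPENCAGE_KEY")

# optional guard: fail early if the key has not been set
if (identical(Sys.getenv("OPENCAGE_KEY"), "")) {
  stop("OPENCAGE_KEY is not set; edit your .Renviron and restart R")
}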
Now that we have set the API key as an environment variable, we can geocode the LBS postcode using opencage_forward(). The results we get are not just the latitude/longitude coordinates that allow us to pinpoint NW1 4SA on a map, but a wealth of other information.
We can also use opencage_reverse(), where we pass the latitude/longitude coordinates and, in our example, we get back information on Soho's John Snow pub (and no, there is no pub named after the fictional Jon Snow).
library(opencage)
library(kableExtra) # for kable_styling() and scroll_box()

# Forward geocode London Business School's postcode, NW1 4SA
lbs_geocode <- opencage_forward("NW1 4SA")

lbs_geocode$results %>%
  knitr::kable() %>%
  kable_styling(c("striped", "bordered")) %>%
  scroll_box(width = "100%", height = "200px")
(Output: a wide, scrollable one-row table of geocoding results for NW1 4SA. Key columns include geometry.lat and geometry.lng (approximately 51.527, -0.161), formatted ("London NW1 4SA, United Kingdom"), components.postcode (NW1 4SA), components.suburb (Regent's Park), components.city (London), components.country (United Kingdom) and confidence (10), plus dozens of annotation columns covering currency, calling code, timezone, sunrise/sunset, OSM links, MGRS/Maidenhead/geohash references and what3words.)
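If all you need are the coordinates, you can pull just those two columns out of the results tibble; a minimal sketch, using the geometry.lat and geometry.lng column names that appear in the output above.

# keep only the coordinates from the forward-geocoding results
lbs_geocode$results %>%
  dplyr::select(geometry.lat, geometry.lng)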
# Reverse geocode the latitude/longitude that corresponds
# to The John Snow pub at 39 Broadwick Street, Soho, London
reverse_john_snow <- opencage_reverse(51.51328, -0.13657)

reverse_john_snow$results %>%
  knitr::kable() %>%
  kable_styling(c("striped", "bordered")) %>%
  scroll_box(width = "100%", height = "200px")
(Output: a wide, scrollable one-row table of reverse-geocoding results. Key columns include formatted ("The John Snow, 39 Broadwick Street, London W1F 9QJ, United Kingdom"), components.pub (The John Snow), components.house_number (39), components.road (Broadwick Street), components.postcode (W1F 9QJ), components.suburb (Soho), components.city (London), components._type (pub), confidence (9) and the query coordinates (51.51328, -0.13657), plus the same set of annotation columns as above.)
rtweet: Twitter data and Text Mining

The rtweet package allows us to download Twitter data. Besides having your own Twitter account, you must create a Twitter app in order to get a Twitter API access token. To do this, go to the Twitter developer portal (developer.twitter.com), apply for a developer account, and create a new app.
Once you create your application and it’s approved, go to the Keys and Tokens tab, and find the values Consumer Key (aka “API Key”) and Consumer Secret (aka “API Secret”).
Copy and paste the two keys (along with the name of your app) into an R script file and pass them along to create_token(), using your own keys rather than xxxx.
library(rtweet)

## authenticate via web browser
token <- create_token(
  app = "rtweet_tokens",
  consumer_key = "xxxxxxxxxxxxxxxx",
  consumer_secret = "xxxxxxxxxxxxxxxx",
  access_token = "xxxxxxxxxxxxxxxx",
  access_secret = "xxxxxxxxxxxxxxxx")
A browser window should pop up. If you are logged in to your Twitter account, click to approve and return to R. The rtweet::create_token() function should automatically save your token as an environment variable for you. To make sure it worked, compare the created token object to the object returned by rtweet::get_token().
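A one-line sanity check along those lines (purely illustrative):

# TRUE if the token we just created is the one rtweet will pick up by default
identical(token, rtweet::get_token())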
Now that you have authorised the Twitter API, let us retrieve the most recent 3,200 tweets of [a couple of Twitter users who appear to be friends](https://twitter.com/realDonaldTrump/status/1001961235838103552), Sesame Street's Cookie Monster, and the BBC Breaking News service. For all users, we will plot their tweet frequency over time and perform a text mining analysis.
# load twitter library
library(rtweet)
# plotting and pipes and dplyr - tidyverse
library(tidyverse)
# text mining
library(tidytext)
library(textdata)
# retrieve the most recent 3,200 tweets of some twitter users; this is the max we can retrieve
twitter_users <- get_timeline(
user = c("BBCBreaking", "KimKardashian", "MeCookieMonster", "realDonaldTrump"),
n = 3200
)
# group by user and plot weekly tweet frequency
twitter_users %>%
group_by(screen_name) %>%
ts_plot(by = "months")
Some lines seem to stop earlier than others; it is not that there was no activity before that point, but rather that some users are busier tweeting than others. Donald Trump has been the busiest: his last 3,200 tweets cover just a few months, compared to the other users.
It would be interesting to do a quick text mining analysis. Let us first glimpse the dataframe to see its structure, variable names, etc.
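The listing that follows is the sort of output produced by dplyr's glimpse(); a minimal call would be:

# inspect the structure: number of rows/columns, column names, types, and sample values
glimpse(twitter_users)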
## Rows: 7,295
## Columns: 90
## $ user_id <chr> "5402612", "5402612", "5402612", "5402612",...
## $ status_id <chr> "1283020124220514305", "1283005301508255745...
## $ created_at <dttm> 2020-07-14 12:46:49, 2020-07-14 11:47:55, ...
## $ screen_name <chr> "BBCBreaking", "BBCBreaking", "BBCBreaking"...
## $ text <chr> "US government puts to death man who killed...
## $ source <chr> "TweetDeck", "TweetDeck", "SocialFlow", "So...
## $ display_text_width <dbl> 124, 125, 110, 165, 111, 132, 116, 140, 142...
## $ reply_to_status_id <chr> "1282929825947279360", NA, NA, NA, NA, NA, ...
## $ reply_to_user_id <chr> "5402612", NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ reply_to_screen_name <chr> "BBCBreaking", NA, NA, NA, NA, NA, NA, NA, ...
## $ is_quote <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ is_retweet <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ favorite_count <int> 973, 5899, 1582, 936, 3524, 12208, 3490, 0,...
## $ retweet_count <int> 206, 2339, 476, 353, 999, 3866, 1499, 1015,...
## $ quote_count <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ reply_count <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ hashtags <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ symbols <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ urls_url <list> ["bbc.in/392pRRa", "bbc.in/396b489", "bbc....
## $ urls_t.co <list> ["https://t.co/vEQL9jy50o", "https://t.co/...
## $ urls_expanded_url <list> ["https://bbc.in/392pRRa", "https://bbc.in...
## $ media_url <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ media_t.co <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ media_expanded_url <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ media_type <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ ext_media_url <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ ext_media_t.co <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ ext_media_expanded_url <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ ext_media_type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ mentions_user_id <list> [NA, NA, NA, NA, NA, NA, NA, "265902729", ...
## $ mentions_screen_name <list> [NA, NA, NA, NA, NA, NA, NA, "BBCSport", N...
## $ lang <chr> "en", "en", "en", "en", "en", "en", "en", "...
## $ quoted_status_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ quoted_text <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ quoted_created_at <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ quoted_source <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ quoted_favorite_count <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ quoted_retweet_count <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ quoted_user_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ quoted_screen_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ quoted_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ quoted_followers_count <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ quoted_friends_count <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ quoted_statuses_count <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ quoted_location <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ quoted_description <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ quoted_verified <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ retweet_status_id <chr> NA, NA, NA, NA, NA, NA, NA, "12825949926960...
## $ retweet_text <chr> NA, NA, NA, NA, NA, NA, NA, "Manchester Cit...
## $ retweet_created_at <dttm> NA, NA, NA, NA, NA, NA, NA, 2020-07-13 08:...
## $ retweet_source <chr> NA, NA, NA, NA, NA, NA, NA, "TweetDeck", NA...
## $ retweet_favorite_count <int> NA, NA, NA, NA, NA, NA, NA, 3990, NA, 4014,...
## $ retweet_retweet_count <int> NA, NA, NA, NA, NA, NA, NA, 1015, NA, 820, ...
## $ retweet_user_id <chr> NA, NA, NA, NA, NA, NA, NA, "265902729", NA...
## $ retweet_screen_name <chr> NA, NA, NA, NA, NA, NA, NA, "BBCSport", NA,...
## $ retweet_name <chr> NA, NA, NA, NA, NA, NA, NA, "BBC Sport", NA...
## $ retweet_followers_count <int> NA, NA, NA, NA, NA, NA, NA, 8465538, NA, 84...
## $ retweet_friends_count <int> NA, NA, NA, NA, NA, NA, NA, 326, NA, 326, N...
## $ retweet_statuses_count <int> NA, NA, NA, NA, NA, NA, NA, 468003, NA, 468...
## $ retweet_location <chr> NA, NA, NA, NA, NA, NA, NA, "MediaCityUK, S...
## $ retweet_description <chr> NA, NA, NA, NA, NA, NA, NA, "Official https...
## $ retweet_verified <lgl> NA, NA, NA, NA, NA, NA, NA, TRUE, NA, TRUE,...
## $ place_url <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ place_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ place_full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ place_type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ country <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ country_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ geo_coords <list> [<NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <...
## $ coords_coords <list> [<NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <...
## $ bbox_coords <list> [<NA, NA, NA, NA, NA, NA, NA, NA>, <NA, NA...
## $ status_url <chr> "https://twitter.com/BBCBreaking/status/128...
## $ name <chr> "BBC Breaking News", "BBC Breaking News", "...
## $ location <chr> "London, UK", "London, UK", "London, UK", "...
## $ description <chr> "Breaking news alerts and updates from the ...
## $ url <chr> "http://t.co/vBzl7LOaso", "http://t.co/vBzl...
## $ protected <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ followers_count <int> 44557304, 44557304, 44557304, 44557304, 445...
## $ friends_count <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ listed_count <int> 135627, 135627, 135627, 135627, 135627, 135...
## $ statuses_count <int> 36640, 36640, 36640, 36640, 36640, 36640, 3...
## $ favourites_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ account_created_at <dttm> 2007-04-22 14:42:37, 2007-04-22 14:42:37, ...
## $ verified <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T...
## $ profile_url <chr> "http://t.co/vBzl7LOaso", "http://t.co/vBzl...
## $ profile_expanded_url <chr> "http://www.bbc.co.uk/news", "http://www.bb...
## $ account_lang <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ profile_banner_url <chr> "https://pbs.twimg.com/profile_banners/5402...
## $ profile_background_url <chr> "http://abs.twimg.com/images/themes/theme1/...
## $ profile_image_url <chr> "http://pbs.twimg.com/profile_images/115071...
Text mining is essentially a neat way to count words, calculate their frequency, and map words onto sentiments using a sentiment analysis lexicon. The results of text mining seem impressive, but it isn't magic; it's really just counting.
The first step in text mining is to take text and tokenise it, namely to take whatever is written and turn it into a one-token-per-row column of words. In addition, when we process text, we filter out the most common words in a language; these are called stop words, for example and, but, the, is, at, etc.
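To see tokenisation in isolation, here is a tiny self-contained illustration on a made-up sentence (toy data, not the tweets):

library(dplyr)
library(tidytext)

toy <- tibble(line = 1, text = "Text mining is really just counting words")

# one row per token; unnest_tokens() also lower-cases and strips punctuation by default
toy %>%
  unnest_tokens(word, text)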
In our case, there are 90 variables (or columns), and we just want to concentrate on what users wrote themselves, rather than what they retweeted (is_retweet). We will use dplyr to filter out retweets and then use tidytext::unnest_tokens() to split each tweet into a one-word-per-row format, as shown below.
tidy_tweets <- twitter_users %>%   # take the data frame, and then
  filter(is_retweet == FALSE) %>%  # keep just their original tweets, and then
  select(screen_name, text) %>%    # select variables of interest, and then
  unnest_tokens(word, text)        # split the text column into a one word (or token) per row format

tidy_tweets %>%                                  # take the data frame, and then
  filter(screen_name == "realDonaldTrump") %>%   # filter tweets for a certain user, and then
  head(20) %>%                                   # show the first 20 rows, and then
  knitr::kable() %>%                             # use kable to make the table look good
  kable_styling(c("striped", "bordered"))
screen_name | word |
---|---|
realDonaldTrump | would |
realDonaldTrump | be |
realDonaldTrump | so |
realDonaldTrump | great |
realDonaldTrump | if |
realDonaldTrump | the |
realDonaldTrump | media |
realDonaldTrump | would |
realDonaldTrump | get |
realDonaldTrump | the |
realDonaldTrump | word |
realDonaldTrump | out |
realDonaldTrump | to |
realDonaldTrump | the |
realDonaldTrump | people |
realDonaldTrump | in |
realDonaldTrump | a |
realDonaldTrump | fair |
realDonaldTrump | and |
realDonaldTrump | balanced |
Reading text like this is very unwieldy for humans, but very efficient for computers, as we can easily group words, count them, etc.
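For instance, counting how often each user used each word is a one-liner (shown here purely as an illustration of that point):

# count word occurrences per user, most frequent first
tidy_tweets %>%
  count(screen_name, word, sort = TRUE) %>%
  head(10)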
As we mentioned earlier, when we process text, we filter out the most common words used in a language; these are called stop words, for example and, but, the, is, at, etc. Below we can see the first few entries of such stop words.
stop_words %>% #take the stop_words, and then
head(20) %>% #show the first 20 rows, and then
knitr::kable() %>% #make the table better looking
kable_styling(c("striped", "bordered"))
word | lexicon |
---|---|
a | SMART |
a’s | SMART |
able | SMART |
about | SMART |
above | SMART |
according | SMART |
accordingly | SMART |
across | SMART |
actually | SMART |
after | SMART |
afterwards | SMART |
again | SMART |
against | SMART |
ain’t | SMART |
all | SMART |
allow | SMART |
allows | SMART |
almost | SMART |
alone | SMART |
along | SMART |
In addition to these stop words, we should create another set of stop words specific to Twitter: words such as https for webpage addresses, rt for retweet, t.co, which is Twitter's URL-shortening domain, and amp, an artefact of HTML-encoded ampersands.
twitter_stop_words <- tibble( # construct a dataframe of Twitter-specific stop words
  word = c(
    "https",
    "t.co",
    "rt",
    "amp"
  ),
  lexicon = "twitter"
)

# Combine the two sets of stop words
all_stop_words <- stop_words %>%
  bind_rows(twitter_stop_words) # stack the two data frames row-wise

# Remove tokens that are numbers
no_numbers <- tidy_tweets %>%
  filter(is.na(as.numeric(word))) # keep only tokens that cannot be parsed as numbers
So far, we have defined our stop words and got rid of tokens that are numbers. We will now use anti_join() to get rid of all stop words in our dataframe: anti_join() returns all rows from the dataframe that have no matching value in all_stop_words.
# Get rid of the combined stop words by using anti_join().
# anti_join() returns all rows from x where there are not matching values in y
no_stop_words <- no_numbers %>%
anti_join(all_stop_words, by = "word")
# instead of anti_join() we could also use
# filter(!(word %in% all_stop_words$word))
no_stop_words <- no_numbers %>%
filter(!(word %in% all_stop_words$word))
# We group by screen_name, and then
# count and sort number of times each word appears, and then
# sort the list, and then
# keep the top 20 words
words_count <- no_stop_words %>%
dplyr::group_by(screen_name) %>%
dplyr::count(word, sort = TRUE) %>%
top_n(20) %>%
ungroup()
The great majority of tweeted words are stop words. Removing the stop words is important for visualisation and sentiment analysis: we only want to plot and analyse the top 20 interesting words for each user. This is a slightly trickier plot, as we have to get the top words per user. To do this, we shall use tidytext::reorder_within() with three arguments:

- word;
- n, the number of times a word was used; and
- screen_name.

After we reorder_within(), we use scale_x_reordered() to finish making this plot.
words_count %>%
mutate(word = reorder_within(word, n, screen_name)) %>%
ggplot(aes(x=word, y=n, fill = screen_name)) +
geom_col(show.legend = FALSE) +
facet_wrap(~screen_name, scales = "free") +
coord_flip() +
scale_x_reordered()+
theme_bw(8)+
labs(
title = "What are the most common words in tweets?",
x = "",
y = "count of words in tweets"
)
We can also look at the frequency of pairs of words, or bigrams, rather than looking at single words. We will look at common bigrams and again filter out stop words, as we do not want things like of, the, and, that, etc.
tweet_bigrams <- twitter_users %>%  # take the data frame, and then
  filter(is_retweet == FALSE) %>%   # keep just their original tweets, and then
  select(screen_name, text) %>%     # select variables of interest, and then
# create column bigram with tweet text with n=2 words-per-row
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
# Split the bigram column into two columns
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% all_stop_words$word,
!word2 %in% all_stop_words$word) %>%
# Put the two word columns back together
unite(bigram, word1, word2, sep = " ") %>%
dplyr::group_by(screen_name) %>%
dplyr::count(bigram, sort = TRUE) %>%
top_n(20) %>%
ungroup()
tweet_bigrams %>%
mutate(bigram = reorder_within(bigram, n, screen_name)) %>%
ggplot(aes(x=bigram, y=n, fill = screen_name)) +
geom_col(show.legend = FALSE) +
facet_wrap(~screen_name, scales = "free") +
coord_flip() +
scale_x_reordered()+
theme_bw(8)+
labs(
title = "What are the most common bigram in tweets?",
x = "",
y = "count of bigrams in tweets"
)
To perform sentiment analysis, we must use a sentiment lexicon that assigns individual words to different sentiments (negative, positive, uncertain, etc.). Here we use the Loughran-McDonald lexicon, which was developed for analysing financial documents.
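Before joining the lexicon to the tweets, it can help to peek at it; a quick way to see which sentiment categories it contains and how many words fall into each (assuming the tidytext and textdata packages loaded earlier):

# how many words the lexicon assigns to each sentiment category
get_sentiments("loughran") %>%
  count(sentiment, sort = TRUE)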
sentiment <- get_sentiments("loughran") # get specific sentiment lexicon
sentiment_words <- no_stop_words %>%
inner_join(sentiment, by="word")
sentiment_words %>%
  group_by(screen_name, sentiment) %>% # group by user and sentiment type
  summarise(n = n()) %>%               # count words in each group
mutate(freq = n / sum(n)) %>% #calculate frequency (%) of sentiments
ungroup() %>%
mutate(sentiment = reorder_within(sentiment, freq, screen_name)) %>%
#and now plot the data
ggplot(aes(x=sentiment, y=freq, fill = screen_name)) +
geom_col(show.legend = FALSE) +
scale_y_continuous(labels = scales::percent) +
facet_wrap(~screen_name, scales = "free") +
coord_flip() +
theme_bw()+
scale_x_reordered()+
labs(
title = "What is the prevalent sentiment in tweets?",
x = "",
y = "Frequency of sentiment in tweets"
)
Finally, we can plot a word cloud with the words that realDonaldTrump used the most, using the wordcloud2 package.
library(wordcloud2)
# plot using wordcloud2
trump_word_count <- no_stop_words %>%
filter(screen_name == 'realDonaldTrump') %>%
dplyr::count(word, sort = TRUE) %>%
top_n(500)
wordcloud2(trump_word_count)
rvest: scrape web data

This is a placeholder; material on scraping web pages will appear here.
This page last updated on: 2020-07-14