Digital trace data for Bayer stock price analysis in R

In this article I post a script that queries financial stock data via quantmod, using the Bayer stock ticker on Yahoo Finance. In addition, I query Google search intensity data using gtrendsR, tweets via twitteR, and news articles from The Guardian using GuardianR. All queries relate to the global enterprise Bayer AG. I plot the relationship between these data sets, applying feature scaling and sentiment analysis for analysis and better visualization.

This article is of interest to supply chain managers, since digital trace data, i.e. data “left” as “traces” while surfing and using the web, is relevant to e.g. supply chain risk management. Knowing how to query such data can help in modeling and preventing supply chain disruptions.

First, I need to load the required packages in R.

rm(list=ls())              # clears the R workspace
library(GuardianR)         # package for streaming Guardian news feed into the model
library(SentimentAnalysis) # package for analyzing sentiment in text
## 
## Attaching package: 'SentimentAnalysis'
## The following object is masked from 'package:base':
## 
##     write
library(dplyr)             # package for data manipulation 
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)           # package for data visualization
library(quantmod)          # package for streaming and modelling financial data    
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## Loading required package: TTR
## Version 0.4-0 included new data defaults. See ?getSymbols.
library(gtrendsR)          # package for streaming google trends data
library(twitteR)           # package for querying tweets from Twitter
## 
## Attaching package: 'twitteR'
## The following objects are masked from 'package:dplyr':
## 
##     id, location

We will start by looking at news articles published in The Guardian that are related to BAYER. Using sentiment analysis methodology, I assess the sentiment content of these articles and visualize the outcome along a timeline.

For this I need to set up an API key, referred to as “guardian_key” in my script.
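
A minimal sketch of how the key could be provided, assuming it is stored in an environment variable named GUARDIAN_API_KEY (the variable name is my own convention, e.g. set in .Renviron, and not prescribed by GuardianR):

# hedged sketch: read the Guardian API key from an environment variable so it
# is never hard-coded in the script (GUARDIAN_API_KEY is an assumed name)
guardian_key <- Sys.getenv("GUARDIAN_API_KEY")
if (guardian_key == "") stop("Set GUARDIAN_API_KEY before running the query.")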

Once the API key has been set up, it is time to read in the news feed from The Guardian. Using the GuardianR package, this can be done directly in R. I consider articles published in the business, economy, world and money sections.

news_df <- get_guardian("+Bayer+",from.date="2008-01-01",
                        to.date ="2019-05-14",
                        api.key=guardian_key,
                        section=c("business",
                                  "economy",
                                  "world",
                                  "money"))
## [1] "Fetched page #1 of 1"

The above query is simplified. For a production-ready R script, I would apply regular expressions to filter out articles that contain the term BAYER but are not related to the Bayer AG enterprise, as sketched below.
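
As an illustration only, such a filter could look like the following sketch; the patterns are assumptions on my side and would need tuning (for example, the football club Bayer Leverkusen also matches the plain search term):

# illustrative sketch of a regex-based relevance filter (patterns are assumptions,
# not a vetted rule set); keeps articles that look like they discuss the company
is_company  <- grepl("Bayer AG|Bayer CropScience|pharmaceutical|Monsanto",
                     news_df$body, ignore.case = TRUE)
is_football <- grepl("Bayer Leverkusen", news_df$body, ignore.case = TRUE)
news_filtered_df <- news_df[is_company & !is_football, ]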

Next, we can calculate the sentiment of the news articles from The Guardian. For this I apply the SentimentAnalysis package in R. Sentiment analysis methodology is a scientific area of its own, and much research is still being done. Most sentiment analysis methods implemented in R use some kind of dictionary that maps words to sentiment scores. In addition, a set of rules defines how combinations of words are to be treated; e.g. “not good” carries a negative sentiment while “very good” carries a strongly positive one.
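
To get a feeling for how the scoring behaves, one can run analyzeSentiment on a couple of toy phrases and use convertToDirection (also from the SentimentAnalysis package) to turn the numeric scores into coarse labels:

# quick illustration of dictionary-based scoring on two toy phrases
toy_scores <- analyzeSentiment(c("This product is not good.",
                                 "This product is very good."))
toy_scores$SentimentQDAP                       # numeric sentiment scores
convertToDirection(toy_scores$SentimentQDAP)   # negative / neutral / positive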

# calculate sentiment scores using SentimentAnalysis package methodology
sentiment_df <- analyzeSentiment(as.character(news_df$body))
# add date stamp from The Guardian news feed
sentiment_df$date <- as.Date(as.character(news_df$webPublicationDate))

The sentiment scores of The Guardian's articles depend on the exact sentiment calculation methodology. The differences mostly originate from the fact that different methods use different baseline dictionaries; it is in the underlying rules and in the mapping of words to sentiment scores that the methods differ.
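
One quick way to quantify how much the methods agree is to correlate the per-article scores that analyzeSentiment has already returned (a rough check, not a formal comparison):

# pairwise correlation of the four method-specific scores across all articles
cor(sentiment_df[, c("SentimentGI", "SentimentLM", "SentimentHE", "SentimentQDAP")],
    use = "pairwise.complete.obs")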

The chart below visualizes the differences between the various methods' sentiment calculations for the same Guardian news feed. All methods referred to here are implemented in the SentimentAnalysis package.

results_df <- as.data.frame(matrix(nrow=nrow(sentiment_df),ncol=0))
results_df$date <- sentiment_df$date

results_df$GI   <- sentiment_df$SentimentGI
results_df$LM   <- sentiment_df$SentimentLM
results_df$HE   <- sentiment_df$SentimentHE
results_df$QDAP <- sentiment_df$SentimentQDAP
results_df$mean <- (sentiment_df$SentimentGI+sentiment_df$SentimentHE+sentiment_df$SentimentLM+sentiment_df$SentimentQDAP)/4

plotable_df <- as.data.frame(matrix(nrow=nrow(results_df)*5,ncol=0))

plotable_df$date <- c(results_df$date,
                      results_df$date,
                      results_df$date,
                      results_df$date,results_df$date)

plotable_df$score <- c(results_df$GI,
                       results_df$LM,
                       results_df$HE,
                       results_df$QDAP,results_df$mean)

plotable_df$type <- c(rep("GI",times   = nrow(results_df)),
                      rep("LM",times   = nrow(results_df)),
                      rep("HE",times   = nrow(results_df)),
                      rep("QDAP",times = nrow(results_df)),
                      rep("mean",times = nrow(results_df)))

ggplot(plotable_df) + geom_smooth(mapping=aes(x=date,y=score,color=type)) + ggtitle("Sentiment content of BAYER articles on The Guardian") +
  xlab("Time") +
  ylab("Sentiment scores")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Digital trace data streaming - comparison of results from different sentiment methods
Using publicly available online media articles is one way of using digital trace data

Moving ahead, instead of considering The Guardian news articles, we now want to look at tweets and their sentiment. We want to consider Bayer-related tweets only, and we start by crawling through tweets published by Bayer (i.e. the ones available through the REST API). For this, a set of keys and tokens must be specified. I did so and stored the values in the variables “consumer_key”, “consumer_secret”, “access_token” and “access_secret”, as sketched below. With these keys I can access Bayer's tweet timeline. However, only the most recent tweets are available through the open and free functionality of the twitteR package in R.
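
A minimal sketch of how these credentials could be provided without hard-coding them in the script; the environment variable names below are my own convention, not part of twitteR:

# hedged sketch: read the Twitter API credentials from environment variables
# (the variable names are assumptions, e.g. set them in .Renviron)
consumer_key    <- Sys.getenv("TWITTER_CONSUMER_KEY")
consumer_secret <- Sys.getenv("TWITTER_CONSUMER_SECRET")
access_token    <- Sys.getenv("TWITTER_ACCESS_TOKEN")
access_secret   <- Sys.getenv("TWITTER_ACCESS_SECRET")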

# querying Twitter via the twitteR package; userTimeline only returns an account's most recent tweets
twitter_df <- as.data.frame(matrix(nrow=0,ncol=2))
colnames(twitter_df) <- c("text","created")
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
## [1] "Using direct authentication"
timeline_ls <- twitteR::userTimeline("Bayer",n=200000)  # the REST API returns at most the ~3,200 most recent tweets of an account
twitter_df = rbind(twitter_df,select(twitteR::twListToDF(timeline_ls),text,created))
twitter_df$date <- as.Date(twitter_df$created)

# some text cleaning
cleaningfunction <- function(x){
  return(gsub("[^[:alnum:][:blank:]?&/\\-]", "", x))
}
twitter_df$text <- as.character(lapply(twitter_df$text,cleaningfunction))

# calculate mean sentiment scores, considering all four approaches
# (run analyzeSentiment once and average the four method-specific scores)
timeline_senti <- analyzeSentiment(twitter_df$text)
twitter_df$sentiment <- (timeline_senti$SentimentGI+timeline_senti$SentimentHE+timeline_senti$SentimentLM+timeline_senti$SentimentQDAP)/4

# visualizing twitter sentiment for @Bayer along the timeline
ggplot(twitter_df) + geom_smooth(mapping=aes(x=date,y = sentiment)) + ggtitle("Avg. daily sentiment of BAYER tweets") +
  xlab("Time") +
  ylab("Sentiment score")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Digital trace data streaming - avg daily sentiment of most recent Bayer tweets
Streaming tweets into a model is another broadly applied use of digital trace data

Now, I crawl Twitter's search function for tweets mentioning Bayer, regardless of tweet author. Only the most recent tweets are available through the REST API accessible via the twitteR package.

# querying Twitter via the twitteR package. Unfortunately, tweets can only be downloaded a few days back using the searchTwitter function
twitter_df <- as.data.frame(matrix(nrow=0,ncol=2))
colnames(twitter_df) <- c("text","created")
search_ls <- twitteR::searchTwitter("@Bayer",n=20000,retryOnRateLimit = 1000000)
## [1] "Rate limited .... blocking for a minute and retrying up to 999999 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 999998 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 999997 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 999996 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 999995 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 999994 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 999993 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 999992 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 999991 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 999990 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 999989 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 999988 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 999987 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 999986 times ..."
twitter_df = rbind(twitter_df,select(twitteR::twListToDF(search_ls),text,created))
twitter_df$date <- as.Date(twitter_df$created)

# some text cleaning
twitter_df$text <- as.character(lapply(twitter_df$text,cleaningfunction))

# calculate mean sentiment scores, considering all four approaches
# (again run analyzeSentiment once and average the four method-specific scores)
search_senti <- analyzeSentiment(twitter_df$text)
twitter_df$sentiment <- (search_senti$SentimentGI+search_senti$SentimentHE+search_senti$SentimentLM+search_senti$SentimentQDAP)/4

# visualizing twitter sentiment for @Bayer along the timeline
twitter_df <- twitter_df %>% group_by(date) %>% summarize(meanSentiment=mean(sentiment))
ggplot(twitter_df) + geom_line(mapping=aes(x=date,y = meanSentiment)) + 
  ggtitle("Mean sentiment of reply tweets to BAYER") +
  xlab("Time") +
  ylab("Sentiment score")
Digital trace data streaming - sentiment of most recent Bayer tweets
In this case, digital trace data from Twitter does not help us much, since free data streams only provide the most recent tweets

One more data source is of interest to me. I want to see how Google search intensity for the term “Bayer AG” has evolved over time.

# query and arrange Google search trend data using the gtrendsR package
google_df <- gtrends(keyword = "Bayer AG",time="2010-01-01 2019-05-20")
google_df <- google_df$interest_over_time %>% select(date,hits)
google_df$date  <- as.Date(google_df$date)
google_df$hits <- as.numeric(google_df$hits)

# visualize the Google search trend for "Bayer AG"
ggplot(google_df) + geom_line(mapping=aes(x=date,y=hits)) + ggtitle("Google search intensity index for BAYER") +
  xlab("Time") +
  ylab("Google trends score (normed, MAX=100)")
Digital trace data streaming - Google search intensity for the Bayer search term
Google search trend values are the result of digital traces left on Google's search engine. We can use this digital trace data in e.g. regression analysis

Having visualized the sentiment score development of news articles and tweets along the timeline, I now combine this information with the actual stock price development. For this I use the quantmod package, which serves as an interface for extracting financial data via Yahoo Finance stock tickers. To enable comparison, I apply feature scaling to the observation values.

# define a function for feature scaling
feature_scaling <- function(x){
  (x-min(x))/(max(x)-min(x))
}
# --- setup finance df ---
getSymbols("BAYN.DE",src="yahoo")
## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
## 
## This message is shown once per session and may be disabled by setting 
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
## 
## WARNING: There have been significant changes to Yahoo Finance data.
## Please see the Warning section of '?getSymbols.yahoo' for details.
## 
## This message is shown once per session and may be disabled by setting
## options("getSymbols.yahoo.warning"=FALSE).
## [1] "BAYN.DE"
finance_df <- as.data.frame(BAYN.DE)
finance_df$source = rep("quantmod",nrow(finance_df))
finance_df$date <- as.Date(rownames(finance_df))
# feature scaling
finance_df <- finance_df %>% select(date,BAYN.DE.Adjusted,source)
colnames(finance_df) <- c("date","value","source")
finance_df$value <- feature_scaling(finance_df$value)
# --- setup guardian sentiment df ---
guardianSenti_df <- dplyr::filter(plotable_df,type=="mean")
guardianSenti_df$source <- rep("guardian",nrow(guardianSenti_df))
guardianSenti_df$date <- as.Date(guardianSenti_df$date)
# feature scaling
guardianSenti_df <- guardianSenti_df %>% select(date,score,source)
colnames(guardianSenti_df) <- c("date","value","source")
guardianSenti_df$value <- feature_scaling(guardianSenti_df$value)
# -- setup google trends df ---
gtrends_df <- google_df %>% select(date,hits)
colnames(gtrends_df) <- c("date","value")
gtrends_df$date <- as.Date(gtrends_df$date)
gtrends_df$source <- rep("google",nrow(gtrends_df))
# feature scaling
gtrends_df$value <- feature_scaling(gtrends_df$value)
# --- setup twitter sentiment df --
twitterSenti_df <- twitter_df %>% select(date,meanSentiment)
colnames(twitterSenti_df) <- c("date","value")
twitterSenti_df$source <- rep("twitter",nrow(twitterSenti_df))
twitterSenti_df$date <- as.Date(twitterSenti_df$date)
# feature scaling
twitterSenti_df$value <- feature_scaling(twitterSenti_df$value)

# merge frames
combined_df <- rbind(finance_df,
                     guardianSenti_df,
                     gtrends_df,
                     twitterSenti_df)

# visualize - is there a correlation?
ggplot(combined_df) + geom_point(mapping=aes(x=date,y=value,color=source)) + ggtitle("Timeline for BAYER data") + 
  xlab("Time") +
  ylab("Observation value (normed)")
Digital trace data streaming - historical values for all relevant observation topics

As a next step, one could continue with regression analysis, including regression models from time series analysis; a rough sketch follows below. On my blog you can find posts on regression analysis in R, too.
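
As a hedged sketch of what such a follow-up could look like, assuming we reshape combined_df so that the scaled stock price and the scaled Google trend score share one row per date, a simple linear regression might be set up as follows (illustrative only, not a validated time series model):

# rough sketch: regress the scaled stock price on the scaled Google trend score;
# inner_join keeps only the dates present in both series
stock_df <- combined_df %>% filter(source == "quantmod") %>% select(date, stock = value)
trend_df <- combined_df %>% filter(source == "google")   %>% select(date, trend = value)
reg_df   <- inner_join(stock_df, trend_df, by = "date")
summary(lm(stock ~ trend, data = reg_df))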
