Data Science: WEEK_3: Twitter tweet analysis

Hi there! Welcome to the week 3 session.

Today we will discuss Twitter tweet analysis.

Prerequisites:

A Windows/Mac/Linux machine with r-base and RStudio installed (if you don't have them yet, refer to my previous post to set them up on your PC).
Basic understanding of R data types and syntax.
And finally, YOU.

The very first thing is to create the data set for our analysis. For this, we need to download our Twitter archive. Follow the instructions below to get your Twitter archive.

Step: 1
Navigate to your Twitter account settings page by following this link.

Step: 2
Request your Twitter archive by clicking on the Request Your Archive link.
Twitter will send your archive via email. Check the inbox of the email address associated with your Twitter account and download the archive file.

Step: 3
Extract the zipped file and find the tweets.csv file. Copy the file to your working directory.
By default, RStudio sets the Documents folder as the working directory, but you can change it by running the setwd() command in RStudio.
Ex:
setwd("C:/Users/Sharath/Downloads")
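
Before moving on, it can help to confirm that R is really looking in the folder where you copied the file. The following is only a small sketch using base R functions; the variable name csv_path is my own and not part of the original steps.

```r
# Build the full path to tweets.csv inside the current working directory.
csv_path <- file.path(getwd(), "tweets.csv")

# Report whether the file is where read.csv() will later look for it.
if (file.exists(csv_path)) {
  message("Found ", csv_path)
} else {
  message("tweets.csv is not in ", getwd())
}
```

If the second message appears, either copy the file into that folder or point setwd() at the folder that contains it.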

So, now we have the data source. Let us jump into the R code.

We need three packages: ggplot2, lubridate, and scales.

Install them first.


install.packages("ggplot2")
install.packages("lubridate")

install.packages("scales")

Let us load those libraries.


library(ggplot2)
library(lubridate)
library(scales)

Read data from tweets.csv


tweets <- read.csv("tweets.csv", stringsAsFactors = FALSE)
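
The later steps rely on is.na() to detect retweets and replies, which works because read.csv() turns blank fields in numeric columns into NA. Here is a minimal sketch of that behaviour on a tiny made-up file; the rows and the tweet_id column are assumptions for illustration only, not real archive data.

```r
# Write a tiny, hypothetical CSV to a temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "tweet_id,in_reply_to_status_id,timestamp,text",
  "1,,2016-03-07 18:30:00 +0000,hello world",
  "2,1,2016-03-07 18:31:00 +0000,a reply"
), tmp)

# Read it the same way as tweets.csv.
toy <- read.csv(tmp, stringsAsFactors = FALSE)

# Inspect the column types, then check which rows have no reply id.
str(toy)
print(is.na(toy$in_reply_to_status_id))

# Clean up the temporary file.
unlink(tmp)
```

Running str(tweets) on your own data in the same way is a quick check that the columns used below (timestamp, text, retweeted_status_id, in_reply_to_status_id) are present.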

Convert the timestamp column to a date-time object and shift it to your own time zone (America/Chicago in this example).


tweets$timestamp <- ymd_hms(tweets$timestamp)
tweets$timestamp <- with_tz(tweets$timestamp, "America/Chicago")
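
To see what these two calls do, here is a short sketch on a single made-up timestamp. It assumes the value is in UTC; the sample string is hypothetical and written without an offset, which ymd_hms() parses as UTC by default.

```r
library(lubridate)

# Hypothetical sample value, parsed as UTC.
utc_time <- ymd_hms("2016-03-07 18:30:00")

# with_tz() keeps the same instant and only changes the displayed time zone.
local_time <- with_tz(utc_time, "America/Chicago")

print(utc_time)
print(local_time)

# Both objects refer to the same moment in time.
print(utc_time == local_time)
```

Replace "America/Chicago" with your own zone name (OlsonNames() lists the valid ones) so that the time-of-day plots reflect your local clock.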

Now let us analyze your tweeting trend, for example when you tweet the most.

#basic histogram showing the distribution of my tweets over time
ggplot(data = tweets, aes(x = timestamp)) +
  geom_histogram(aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Time") + ylab("Number of tweets") + 
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")



#tweets by year (one bin per year; the last break is 2016.5 so that 2016 tweets are counted)
ggplot(data = tweets, aes(x = year(timestamp))) +
  geom_histogram(breaks = seq(2007.5, 2016.5, by = 1), aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Year") + ylab("Number of tweets") + 
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")

#tweets by day of the week
#(wday(label = TRUE) returns a factor, so geom_bar() is used instead of geom_histogram())
ggplot(data = tweets, aes(x = wday(timestamp, label = TRUE))) +
  geom_bar(aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Day of the Week") + ylab("Number of tweets") + 
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")
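
The day-of-week code depends on how wday() numbers the days, so here is a quick sketch on a single date. The week_start = 7 argument is passed explicitly only to make the numbering independent of any local option; it matches lubridate's usual default of starting the week on Sunday.

```r
library(lubridate)

# Monday, March 7, 2016.
d <- as.Date("2016-03-07")

# Numeric index with Sunday = 1, so a Monday is 2.
day_index <- wday(d, week_start = 7)
print(day_index)

# With label = TRUE the result is an ordered factor of day names.
print(wday(d, label = TRUE, week_start = 7))
```

With this numbering, positions 2 to 5 of a weekday table are Monday through Thursday, and positions 1, 6 and 7 are Sunday, Friday and Saturday.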






#chi-square test to test the distribution of my tweets over week days
chisq.test(table(wday(tweets$timestamp, label = TRUE)))


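#ratio of the average Monday-Thursday count to the average Friday-Sunday count
#(wday() numbers the days 1 = Sunday through 7 = Saturday)
myTable <- table(wday(tweets$timestamp, label = TRUE))
mean(myTable[c(2:5)])/mean(myTable[c(1,6,7)])

#chi-square test against an expected distribution in which Monday-Thursday
#each get 5/32 of the tweets and Friday-Sunday each get 4/32
chisq.test(table(wday(tweets$timestamp, label = TRUE)), p = c(4, 5, 5, 5, 5, 4, 4)/32)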
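
The two chisq.test() calls above test different null hypotheses: the first assumes every day is equally likely, while the second assumes the shares given in p. The sketch below shows the difference on made-up counts; the numbers are purely hypothetical and were chosen to match the weighted shares exactly.

```r
# Hypothetical counts, Sunday through Saturday (not real data).
counts <- c(Sun = 40, Mon = 50, Tue = 50, Wed = 50, Thu = 50, Fri = 40, Sat = 40)

# Null hypothesis 1: every day is equally likely (the default).
uniform_test <- chisq.test(counts)

# Null hypothesis 2: Monday-Thursday each get 5/32, the other days 4/32.
expected_share <- c(4, 5, 5, 5, 5, 4, 4) / 32
weighted_test <- chisq.test(counts, p = expected_share)

# The shares passed in p must add up to 1.
print(sum(expected_share))

# Compare the test statistics under the two hypotheses.
print(uniform_test$statistic)
print(weighted_test$statistic)
```

Because the made-up counts follow the 4:5 pattern exactly, the second statistic is 0, whereas the uniform test still reports a positive statistic. With real data, a large p-value for the weighted test would suggest that the weekday pattern is consistent with that 4:5 split.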

#tweets by month
#(month(label = TRUE) returns a factor, so geom_bar() is used instead of geom_histogram())
ggplot(data = tweets, aes(x = month(timestamp, label = TRUE))) +
  geom_bar(aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Month") + ylab("Number of tweets") + 
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")



#chi-square test of whether the tweets are spread evenly across the months
chisq.test(table(month(tweets$timestamp, label = TRUE)))


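#extract the time of day (seconds since midnight) and add it as a new column
tweets$timeonly <- as.numeric(difftime(tweets$timestamp, trunc(tweets$timestamp, "days"), units = "secs"))

#tweets stamped exactly on the hour (minute and second both 0) are set to NA,
#presumably because such timestamps are placeholders without real time information
tweets$timeonly[minute(tweets$timestamp) == 0 & second(tweets$timestamp) == 0] <- NA
#fraction of tweets without a usable time of day
mean(is.na(tweets$timeonly))

#convert the seconds to a date-time so that scale_x_datetime() can label the axis
tweets$timeonly <- as.POSIXct(tweets$timeonly, origin = "1970-01-01", tz = "UTC")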
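
The time-of-day trick is easier to follow on two made-up timestamps. This is a base R sketch, not the exact code of this post: it asks difftime() for seconds explicitly and then stores those seconds as a date-time on 1970-01-01, so that only the clock time differs between rows.

```r
# Two hypothetical timestamps on the same day.
ts <- as.POSIXct(c("2016-03-07 00:00:30", "2016-03-07 13:45:00"), tz = "UTC")

# Midnight of each timestamp's day.
midnight <- as.POSIXct(trunc(ts, "days"))

# Seconds elapsed since midnight; the unit is fixed instead of chosen automatically.
secs <- as.numeric(difftime(ts, midnight, units = "secs"))
print(secs)

# Re-express the seconds as a date-time so that a datetime axis can format them.
timeonly <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
print(format(timeonly, "%H:%M:%S"))
```

Fixing units = "secs" matters because a plain subtraction of date-times lets R pick seconds, minutes or hours depending on the data, which would silently change the meaning of the numbers.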

#number of tweets by time
ggplot(data = tweets, aes(x = timeonly)) +
  geom_histogram(aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Time") + ylab("Number of tweets") + 
  scale_x_datetime(breaks = date_breaks("3 hours"), 
                   labels = date_format("%H:00")) +
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")

#late night tweets by year
latenighttweets <- tweets[(hour(tweets$timestamp) < 6),]
ggplot(data = latenighttweets, aes(x = timestamp)) +
  geom_histogram(aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Time") + ylab("Number of tweets") + ggtitle("Late Night Tweets") +
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")

#number of tweets with hashtags
ggplot(tweets, aes(factor(grepl("#", text)))) +
  geom_bar(fill = "midnightblue") + 
  theme(legend.position="none", axis.title.x = element_blank()) +
  ylab("Number of tweets") + 
  ggtitle("Tweets with Hashtags") +
  scale_x_discrete(labels=c("No hashtags", "Tweets with hashtags"))
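
The hashtag plot rests on a single grepl() call, sketched here on two made-up tweet texts. The fixed = TRUE argument is my addition; it simply treats "#" as a literal character rather than a regular expression.

```r
# Two hypothetical tweet texts.
texts <- c("no tags here", "learning #rstats today")

# TRUE where the text contains a "#" character.
has_hashtag <- grepl("#", texts, fixed = TRUE)
print(has_hashtag)

# factor() orders the levels FALSE, TRUE, which is why the axis labels
# are given as "No hashtags" first and "Tweets with hashtags" second.
print(levels(factor(has_hashtag)))
```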

#number of tweets retweeted
ggplot(tweets, aes(factor(!is.na(retweeted_status_id)))) +
  geom_bar(fill = "midnightblue") +
  theme(legend.position="none", axis.title.x = element_blank()) +
  ylab("Number of tweets") +
  ggtitle("Retweeted Tweets") +
  scale_x_discrete(labels=c("Not retweeted", "Retweeted tweets"))

#number of replied tweets
ggplot(tweets, aes(factor(!is.na(in_reply_to_status_id)))) +
  geom_bar(fill = "midnightblue") + 
  theme(legend.position="none", axis.title.x = element_blank()) +
  ylab("Number of tweets") + 
  ggtitle("Replied Tweets") +
  scale_x_discrete(labels=c("Not in reply", "Replied tweets"))


#categorize tweets by type: plain tweet, retweet (RT), or reply
tweets$type <- "tweet"
tweets$type[!is.na(tweets$retweeted_status_id)] <- "RT"
tweets$type[!is.na(tweets$in_reply_to_status_id)] <- "reply"
#set the level order explicitly so that it does not depend on the locale's sort order
tweets$type <- factor(tweets$type, levels = c("tweet", "reply", "RT"))
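
The categorization logic can be checked on a toy data frame. The three rows below are made up; only the two column names come from the archive. The sketch assigns by column name and spells out the level order, which is one way to avoid depending on column positions.

```r
# Hypothetical rows: a plain tweet, a retweet, and a reply.
toy <- data.frame(
  retweeted_status_id   = c(NA, 101, NA),
  in_reply_to_status_id = c(NA, NA, 202)
)

# Start with "tweet" and overwrite where an id is present.
toy$type <- "tweet"
toy$type[!is.na(toy$retweeted_status_id)] <- "RT"
toy$type[!is.na(toy$in_reply_to_status_id)] <- "reply"

# An explicit level order controls the legend and colour order in the plots.
toy$type <- factor(toy$type, levels = c("tweet", "reply", "RT"))
print(table(toy$type))
```

Note that the reply check runs last, so a row that has both ids would end up labelled as a reply.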

#plot with types tweeting, retweeting, and replying
ggplot(data = tweets, aes(x = timestamp, fill = type)) +
  geom_histogram() +
  xlab("Time") + ylab("Number of tweets") +
  scale_fill_manual(values = c("midnightblue", "deepskyblue4", "aquamarine3"))


#proportion of each tweet type over time
#(timestamp is continuous, so it is binned with geom_histogram() before the bars are filled to 100%)
ggplot(data = tweets, aes(x = timestamp, fill = type)) +
  geom_histogram(position = "fill") +
  xlab("Time") + ylab("Proportion of tweets") +
  scale_fill_manual(values = c("midnightblue", "deepskyblue4", "aquamarine3"))


#calculate characters per tweet (nchar() is vectorized, so no sapply() is needed)
tweets$charsintweet <- nchar(tweets$text)
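
As a quick illustration of the character count, here is nchar() applied to two made-up strings. By default it counts characters rather than bytes.

```r
# Two hypothetical tweet texts.
texts <- c("hello", "#rstats is fun")

# nchar() works on the whole vector at once.
chars <- nchar(texts)
print(chars)
```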


#plot char per tweet
ggplot(data = tweets, aes(x = charsintweet)) +
  geom_histogram(aes(fill = ..count..), binwidth = 8) +
  theme(legend.position = "none") +
  xlab("Characters per Tweet") + ylab("Number of tweets") + 
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")

We will discuss Twitter sentiment analysis in the next post.

Thanks for visiting my blog. I always love to hear constructive feedback. Please give your feedback in the comment section below or write to me personally here.

Monday, March 7, 2016
