Wednesday, March 30, 2016

WEEK_9: Amazon review analysis – Ranking reviews, filtering duplicate and irrelevant reviews

Hi there!


Nowadays e-commerce is hugely popular and keeps capturing a larger share of the retail market. One of the reasons customers like e-commerce platforms is the product review feature: customers can review the products they have purchased, which helps both the seller and other customers.
But because of the competition between sellers and manufacturers, we come across many spam reviews. Paid reviewers post biased reviews, duplicate reviews, or reviews that are irrelevant to the product.
Filtering those reviews by hand is a tedious process for the seller, so let us try to build a solution using R.
Logic used:
In this example, I have taken reviews of an iPhone from the Amazon website.
Step 1:
We need a list of keywords: the words we expect to see in genuine reviews. As we are analyzing iPhone reviews, our keyword list will contain terms like "camera", "battery", "life", "screen" and "heat".
Step 2:
In the second step, we count how many times these keywords appear in each review and build a matrix to store the counts (a toy version of this idea is sketched after this list).
If a review does not contain any of the keywords, it is possibly spam and is of no use to the seller or to other customers. If a review contains many of the keywords, it should be considered first: the reviewer might be describing a serious issue with the product.
Step 3:
Depending on the number of keywords found, calculate a relevance score for each review and sort the reviews by it.
Step 4:
In the final step we eliminate duplicate reviews. I have used a selection-sort-style pairwise comparison to compare the reviews.
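Before the full script, here is a minimal illustration of the counting idea from Step 2, using str_count from the stringr package on a made-up review (the review text and keyword list below are only examples, not part of the actual data set):

library(stringr)

toy_review   <- "The camera is great but the battery drains fast and the screen scratches easily"
toy_keywords <- c("camera", "battery", "screen", "heat")

#count how many times each keyword occurs in the review
toy_counts <- sapply(toy_keywords, function(k) sum(str_count(toy_review, k)))
toy_counts        #camera 1, battery 1, screen 1, heat 0
sum(toy_counts)   #relevance score of this toy review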
Now let us get into coding. 


Here is the R script.
(You can find materials and scripts used in this post on my Github repo.)

#cleanup the work space
rm(list = setdiff(ls(), lsf.str()))
library(stringr)
#################################################################################
#read important keywords
#################################################################################
keywords = scan('KeyWords.txt',
                what='character', comment.char=';',sep = "\n")
#################################################################################
#read the data to be evaluated; review.txt contains 11 reviews, one per line
#################################################################################
reviews <- scan('review.txt',
                what='character', comment.char=';',sep = "\n")
#################################################################################
#score it and compare
#################################################################################
findScore <- function(review, k) {
  keyLength <- length(k)
  matScore <- c()
  for(i in 1:keyLength) {
    #one row per keyword: the keyword and how many times it occurs in the review
    tDF <- c(k[i], sum(str_count(review, k[i])))
    matScore <- rbind(matScore, tDF)
  }
  return(matScore)
}


reviewLength <- length(reviews)
#first column holds the keywords; each subsequent column holds one review's counts
score <- keywords
for (i in 1:reviewLength) {
  tScore <- findScore(reviews[i], keywords)
  score <- cbind(score, tScore[,2])
}

View(score)
#################################################################################
#function to calculate relevance of reviews
#################################################################################

findRel <- function(reviewScore) {
  totalScore <- 0
  keyLength <- length(keywords)
  for(i in 1:keyLength) {
    totalScore <- as.numeric(reviewScore[i])+totalScore
  }
  return(totalScore)
}

#find irrelevant reviews
totalScoreR <-c()
for (i in 2:dim(score)[2]) {
  totalScoreR[i-1] <- findRel(score[,i])
}

findIfRel <- function (totalScoreR) {
  for (i in 1:length(totalScoreR)) {
    if(as.numeric(totalScoreR[i]) == 0)
      cat("Review:",i,"is irrelevant\n")
    else
      cat("Number of keywords found in review:",i,"is ", totalScoreR[i],"\n")
  }
}

#call the above function to find relevance of reviews
findIfRel(totalScoreR)



#################################################################################
#sort reviews according to their importance
#################################################################################
#high value review
highValueR <- cbind(totalScoreR, c(1:length(totalScoreR)))
highValueR <- data.frame(highValueR)
highValueR <- highValueR[order(highValueR$totalScoreR, decreasing = TRUE),]

print("Reviews in decreasing order of importance:")
for (i in 1:dim(highValueR)[1]) {
  cat("Rank ",i,":\n")
  cat(reviews[highValueR[i,2]])
  cat("\n##########################################\n")
}
 
#################################################################################
#find similar reviews
#################################################################################
for (i in 1:length(reviews)) {
  for (j in i:length(reviews))  {
    if(i != j)
      if(identical(reviews[i],reviews[j]))
        cat("\nReview ",i," and review ",j," are the same\n")
  }
}
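Note that identical() only catches exact copies. If you also want to flag near-duplicates (for example, a spammer who changes a couple of words), base R's adist() gives an edit distance you can threshold. This is only a sketch on top of the reviews vector above, and the 0.2 cutoff is an arbitrary assumption you would have to tune:

#normalized edit distance between every pair of reviews
distMat <- adist(reviews) / outer(nchar(reviews), nchar(reviews), pmax)
for (i in 1:(length(reviews) - 1)) {
  for (j in (i + 1):length(reviews)) {
    if (distMat[i, j] < 0.2)   #illustrative threshold, not tuned
      cat("Review", i, "and review", j, "look like near-duplicates\n")
  }
}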



Thanks for visiting my blog. I always love to hear constructive feedback. Please give your feedback in the comment section below or write to me personally here.  

WEEK_8: Review analysis

Hi there!

This post is about review analysis. Review analysis is a very useful tool for sellers, especially for those who sell online.
Customers give feedback after every purchase. It is essential to analyze those reviews and take the necessary actions to improve the business.
Unfortunately there are many spammers. Spammers may post unrelated reviews or post the same review multiple times, which makes it very difficult for the seller to identify the genuine reviews.
So here is a solution using R. This R script flags the irrelevant reviews.

Logic of the code:
Build a list of all the words we are interested in. For example, for phone reviews the keyword list would contain camera, display, heating, battery and so on.

Read all the reviews to be analyzed and score them with a simple algorithm based on the keyword list: the score of a review is the number of occurrences of the keywords in it.

Depending on the scores, take the appropriate action.
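If all the reviews live in one file (one review per line), the scoring step can also be done without a separate variable per review. A minimal sketch, assuming a hypothetical reviews.txt alongside the keywords.txt used below:

library(stringr)

keywords <- scan('keywords.txt', what = 'character', comment.char = ';', sep = "\n")
reviews  <- scan('reviews.txt',  what = 'character', comment.char = ';', sep = "\n")

#rows = keywords, columns = reviews, entries = occurrence counts
scoreMat <- sapply(reviews, function(r) str_count(r, keywords))
rownames(scoreMat) <- keywords
colSums(scoreMat)   #total keyword count per review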

Here is the R script:


#cleanup the work space
rm(list = setdiff(ls(), lsf.str()))

#load stringr library for string operations
library(stringr)
#################################################################################
#read the important keywords
#################################################################################
keywords = scan('keywords.txt',
                what='character', comment.char=';',sep = "\n")
#################################################################################
#read the data to be evaluated (you can use any method to read these files; you can also combine all reviews in a single file)
#################################################################################
review1 = scan('review1.txt',
               what='character', comment.char=';',sep = "\n")
review2 = scan('review2.txt',
               what='character', comment.char=';',sep = "\n")
review3 = scan('review3.txt',
               what='character', comment.char=';',sep = "\n")
review4 = scan('review4.txt',
               what='character', comment.char=';',sep = "\n")
#################################################################################
#score it and compare
#################################################################################
findScore <- function(review,k) {
  keyLength <- length(k)
  matScore <- c()
  for(i in 1:keyLength) {
    #one row per keyword: the keyword and its occurrence count in the review
    tDF <- c(k[i],sum(str_count(review,k[i])))
    matScore <- rbind(matScore,tDF)
  }
  return(matScore)
}


#call above function and score reviews
matScoreR1 <- findScore(review1,keywords)
matScoreR2 <- findScore(review2,keywords)
matScoreR3 <- findScore(review3,keywords)
matScoreR4 <- findScore(review4,keywords)
 

#view score matrices
View(matScoreR1)
View(matScoreR2)
View(matScoreR3)
View(matScoreR4)

#function to calculate the total relevance score of a review
findRel <- function(reviewScore) {
  totalScore <- 0
  keyLength <- length(keywords)
  for(i in 1:keyLength) {
    totalScore <- as.numeric(reviewScore[i,2])+totalScore
  }
  return(totalScore)
}
#################################################################################
#find irrelevant reviews
#################################################################################

totalScoreR1 <- findRel(matScoreR1)
totalScoreR2 <- findRel(matScoreR2)
totalScoreR3 <- findRel(matScoreR3)
totalScoreR4 <- findRel(matScoreR4)


#function to find if the review is relevant or not
findIfRel <- function (score,name) {
  if(as.numeric(score) == 0)
    cat(name," is irrelevant\n")
  else
    cat("Number of keywords found in ",name," :", score, "\n")
}


#check the relevance of each review
findIfRel(totalScoreR1,"review 1")
findIfRel(totalScoreR2,"review 2")
findIfRel(totalScoreR3,"review 3")
findIfRel(totalScoreR4,"review 4")
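Once the total scores are available, dropping the zero-score reviews is a one-liner. A small sketch, assuming each reviewN.txt holds a single line of text so the vectors line up:

totalScores <- c(totalScoreR1, totalScoreR2, totalScoreR3, totalScoreR4)
allReviews  <- c(review1, review2, review3, review4)
relevant    <- allReviews[totalScores > 0]   #keep only reviews mentioning at least one keyword
length(relevant)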




Just eliminating irrelevant reviews is not enough for a seller; there are many other aspects to analyze. We will discuss them in the next posts.
Find materials related to this post on my Github repo here.
Thanks for visiting my blog. I always love to hear constructive feedback. Please give your feedback in the comment section below or write to me personally here. 

WEEK_7: Spam message filtering

Hi there!

Today we will discuss spam message filtering using R.

We have seen the term "spam filtering" in many places. Whether it is email, SMS or any other communication medium, spammers will try to get your attention.
We need to filter those spam messages, and there are many algorithms for doing so.
I use the Naive Bayes method in this example. If you are not familiar with Naive Bayes, you can learn more here.
With Naive Bayes, the system learns from experience: initially we teach it how to categorize spam and ham by providing some sample spam and ham messages.

Logic used:
Step 1:
We need some sample spam and ham messages. Load them into separate variables first.
Then we need a keyword list: the keywords we might encounter in spam and ham messages.
Ex: money, account password, urgent etc.
Step 2:
Build a matrix that stores the keywords along one dimension and the number of times they appear in the spam and ham messages along the other.
Step 3:
Now load the new message which is yet to be classified.
Calculate how many times each keyword appears in it.
Step 4:
Using the matrix calculated above as a reference, apply the Naive Bayes formula to decide whether the new message is spam or ham.
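For reference, this is roughly what a textbook Naive Bayes decision looks like when you have keyword counts and add Laplace smoothing. It is a sketch of the idea with assumed numeric count vectors, not the exact formula used in the script below:

#spamCounts[i] / hamCounts[i]: occurrences of keyword i in the sample spam / ham messages
#msgCounts[i]: occurrences of keyword i in the new message
naiveBayesSpamProb <- function(msgCounts, spamCounts, hamCounts, pSpam = 0.5) {
  V <- length(msgCounts)
  #Laplace-smoothed per-keyword likelihoods
  pWordSpam <- (spamCounts + 1) / (sum(spamCounts) + V)
  pWordHam  <- (hamCounts  + 1) / (sum(hamCounts)  + V)
  #work in log space to avoid underflow; each keyword contributes once per occurrence
  logSpam <- log(pSpam)     + sum(msgCounts * log(pWordSpam))
  logHam  <- log(1 - pSpam) + sum(msgCounts * log(pWordHam))
  1 / (1 + exp(logHam - logSpam))   #P(spam | message)
}

#example call with made-up counts for three keywords
naiveBayesSpamProb(c(2, 0, 1), c(30, 5, 12), c(3, 40, 2))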

Here is the R script:

#clean up the workspace
rm(list = setdiff(ls(), lsf.str()))
library(stringr) #load required library, stringr is used in string comparison
#################################################################################
#read the sample spam,ham and keywords list
#################################################################################
ham = scan('ham.txt',
           what='character', comment.char=';',sep = "\n")
spam = scan('spam.txt',
           what='character', comment.char=';',sep = "\n")
keywords = scan('KeyWords.txt',
            what='character', comment.char=';',sep = "\n")

#################################################################################
#Calculate spam matrix
#################################################################################
keyLength <- length(keywords)
matSpam <- c()
matHam <- c()
for(i in 1:keyLength) {
  #count occurrences of each keyword in the sample spam and ham messages
  tDF <- c(keywords[i],sum(str_count(spam,keywords[i])))
  matSpam <- rbind(matSpam,tDF)
  tDF <- c(keywords[i],sum(str_count(ham,keywords[i])))
  matHam <- rbind(matHam,tDF)
}
#################################################################################
#read the data to be evaluated; message.txt contains all the messages to be classified, one per line
#################################################################################
message = scan('message.txt',
                what='character', comment.char=';',sep = "\n")
#################################################################################
#categorize the message as spam or ham
#################################################################################
 

#score it and build matrix
keyLength <- length(keywords)
matScore <- c()
for(i in 1:keyLength) {
  tDF <- c(keywords[i],sum(str_count(message,keywords[i])))
  matScore <- rbind(matScore,tDF)
}

#apply formula and find if it is spam
lengthSpam <- length(spam)
lengthHam <- length(ham)
totalScore <- 0
for (i in 1:keyLength) {
    totalScore <- totalScore+as.numeric(matScore[i,2])*((as.numeric(matSpam[i,2])/lengthSpam)-(as.numeric(matHam[i,2])/lengthHam))
}
if(totalScore<0)
    totalScore <- totalScore*(-1)
totalScore <- totalScore*100
cat("Percentage of being spam:", 100 - totalScore, "\n")
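One caveat: str_count(message, keywords[i]) returns one count per line of message.txt and sum() lumps them together, so the script above produces a single score for the whole file. If you want a verdict per message, you can score each element separately. A sketch reusing the objects defined above:

#keyword-count matrix: one row per keyword, one column per message
perMsgCounts <- sapply(message, function(m) str_count(m, keywords))
for (j in 1:ncol(perMsgCounts)) {
  s <- 0
  for (i in 1:keyLength) {
    s <- s + perMsgCounts[i, j] *
      ((as.numeric(matSpam[i, 2]) / lengthSpam) - (as.numeric(matHam[i, 2]) / lengthHam))
  }
  cat("Score of message", j, ":", s, "\n")
}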



Find materials for this post on my Github.
We will discuss review analysis in the next post.

Thanks for visiting my blog. I always love to hear constructive feedback. Please give your feedback in the comment section below or write to me personally here. 

Monday, March 7, 2016

WEEK_6: Amazon Review Analysis

Hi there!

Today we will discuss how to analyze Amazon reviews.

In this analysis, we will compute an overall score for each review and decide whether the review is positive or negative.
We will use three files in this program.
One file contains all the words that are treated as positive.
Another file contains all the words that are treated as negative.
And one more file contains the possible keywords (for example, for a mobile phone review the keywords would be screen, battery, camera etc).
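The core of the scoring below is simply tokenizing a review and counting how many tokens appear in a dictionary. A toy illustration of that idea (the sentence and word lists here are made up):

library(stringr)
toy_sentence <- tolower("Great camera, terrible battery life")
toy_words    <- unlist(str_split(gsub('[[:punct:]]', '', toy_sentence), '\\s+'))
toy_pos      <- c("great", "good", "excellent")
sum(!is.na(match(toy_words, toy_pos)))   #positive score of the toy sentence = 1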

#################################################################################
#read the dictionary files
#################################################################################
pos = scan('positive-words.txt',
           what='character', comment.char=';',sep = "\n")
neg = scan('negative-words.txt',
           what='character', comment.char=';',sep = "\n")
key = scan('key-words.txt',
           what='character', comment.char=';',sep = "\n")

#you can add more words to the list
pos.words = c(pos, 'awsm')
neg.words = c(neg, 'wait', 'lol')
key.words = c(key, 'graphics')

#################################################################################
#function to calculate sentiment per line
#here we pass the review and the dictionary words to the function. The function breaks the review into
#tokens and counts the occurrences of the dictionary words. If the dictionary file is the positive-words
#dictionary, then the return value is the positive score of the review
#################################################################################

score.sentiment = function(sentences, dic.words, .progress='none')
{
  require(plyr)
  require(stringr)
 
  scores = laply(sentences, function(sentence, dic.words) {
    #clean the data
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
   
    # and convert to lower case:
    sentence = tolower(sentence)
   
    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)
    # compare our words to the dictionaries of positive & negative terms
    dic.matches = match(words, dic.words)
   
    dic.matches = !is.na(dic.matches)
   
    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(dic.matches)
   
    return(score)
  }, dic.words, .progress=.progress)
 
  scores.df = data.frame(review=sentences, Senti_Score=scores)
  return(scores.df)
}



#################################################################################
#function to fetch only important reviews
#here we pass the important keywords along with the reviews. The function calculates how many
#keywords are found in each review and returns the matrix. We can eliminate unnecessary
#reviews by looking at their importance score
#################################################################################

impReviews = function(sentences, key.words, .progress='none')
{
  require(plyr)
  require(stringr)
 
  scores = laply(sentences, function(sentence, key.words) {
    #clean the data
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
   
    # and convert to lower case:
    sentence = tolower(sentence)
   
    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)
    # compare our words to the keyword dictionary
    key.matches = match(words, key.words)
   
    key.matches = !is.na(key.matches)
   
    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(key.matches)
   
    return(score)
  }, key.words, .progress=.progress)
 
  scores.df = data.frame(review=sentences, Imp_Score=scores)
  return(scores.df)
}

#################################################################################
#test data
#################################################################################

freeText1 = "Xiaomi played a Trick here but i am not sure if it would work. This phone is Actually The Redmi Note 4G, but with a Different name and an extra Sim Slot....What is New in this then ?? Same 1 year Old Model
I dont understand WHY to launch an already Discontinued devices when you have a Lot of devices (Mi 5 which might never launch i guess)
Xiaomi India is taking credit of it but i dont see anything NEW as such in the phone.
The Company is NOT launching any good phone now like REDMI NOTE 3 & REDMI NOTE 3 PRIME , due to Legal issues.
You launch Good devices in CHINA and launch such devices which dont sell there anymore ,to dispose off in India.
I WOULD NOT RECOMMEND THIS DEVICE. And customers should make it clear to such brand that there are Many other Brands Which we can Opt. Its not that Only Xiaomi is the One in Market.
If INDIAN Customers have given Xiaomi that Market BOOST , They can take that Back too.
Atleast it should keep in Mind that INDIAN users are Not to be Served an OUTDATED phone. We want better specifications too which have been launched worldwide.
This launch by Xiaomi shows they just want to OUTSTOCK their Old phones , and surprisingly customers are happy with this also.
I dont find any good reason to Buy this phone being an model 1 year Old , just a Big Publicity Launching would Not make it go Far..!!
THIS IS OUTDATED ....Not recommended at All !!!
LIKE THIS COMMENT TO SEND A MESSAGE THAT EVEN WE WANT UPDATED PHONES WHICH ARE LAUNCHED WORLDWIDE AND SUCH DEVICES ARE NOT ACCEPTED BEING OUT OF MARKET TRENDS...!!!"
freeText2 = "hi Dear 5 star Keyboard warriors .Please read my reviews and give expert advice . I brought this phone last week and in 3 days these are the defects i found

1 . ii have attached the screen shot for reference . This phone has a media server app that takes almost 70 % of your battery . This app cannot be force stopped (this app is used to scan all media files and refresh in your gallery). So if you charge your phone 100 % it will be 65 % in jus half and hour even on standby because of this app. And the back panel of phone heats so much that its is very useful during winter to keep you warm :P. Trust me my pant gets warm as soon i slide my phone in pocket in jus 3 minutes.Mounting External SD card will make the media server app go worse . So solution is dont add any files on your phone to keep media server quite and get high battery life..

2 The apps cannot be moved to memory card so you have to use the only 11 gb space available in phone for apps (half will be consumed by whats app :P ) . and don even think of rooting the phone ,if you root by seeing the you tube videos (MOST ARE ONLY FOR REDMI PRIME /REDMI NOTE 4G) .If u root it any ways you wont get any updates to install.

3 there is no search option in music player provided in phone , so if u like that one song among 700 songs you have to scroll way down to get that one song to listen.So if you install any other music app which has search option, it will only scan your internal storage songs not external storage songs.

I have reported all these bugs to xiaomi , still no action taken ..oooops i forgot why will they take any action .. i already got scammed with 8500 rs by them :P

online chat and support numbers also dont respond :D only updates and fixes can save my phone.

this is all i could find the most non user friendly thing in this outdated phone in 3 days . will post more on this .

OVERALL I HAVE MADE YET ANOTHER STUPID DECISON IN MY LIFE AND STILL LAUGH ABOUT IT :D"
freeText3 = "Please do not purchase this product. Too much heat generation on while using net and calling. Not user friendly. Not getting proper connectivity When comparing to other brands"
sample = c(freeText1,freeText2,freeText3)

#################################################################################
#call all functions
#################################################################################

result_senti_pos = score.sentiment(sample, pos.words)
View(result_senti_pos)

sample output:



result_senti_neg = score.sentiment(sample, neg.words)
View(result_senti_neg)






sample output:


result_imp_count = impReviews(sample, key.words)
View(result_imp_count)


sample output:
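The script stops at separate positive and negative scores. To actually decide whether each review is positive or negative, as promised at the top of the post, one simple option is to subtract one from the other. A minimal sketch built on the result data frames created above:

net_score <- result_senti_pos$Senti_Score - result_senti_neg$Senti_Score
verdict   <- ifelse(net_score > 0, "positive",
                    ifelse(net_score < 0, "negative", "neutral"))
data.frame(review = substr(sample, 1, 40), net_score, verdict)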



We will discuss spam filtering in the next post.

Thanks for visiting my blog. I always love to hear constructive feedback. Please give your feedback in the comment section below or write to me personally here. 

WEEK_5: WhatsApp sentiment analysis

Hi there!

Hope you had fun with your Twitter sentiment analysis last week. Today we will discuss sentiment analysis on WhatsApp data.

First of all, we need the WhatsApp chat archive.
Get your chat history using the 'email chat history' feature offered by WhatsApp (follow this link if you find it difficult to get the chat history).

#load required libraries
library(ggplot2)
library(lubridate)
library(scales)
library(reshape2)

#Read from chat history file
texts <- readLines("w.txt")
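Exported WhatsApp chats usually prefix every line with a date, time and sender name. If you want only the message text to feed into the word cloud and sentiment scoring, you can strip that prefix first. A rough sketch; the exact export format varies by phone and locale, so treat the pattern below as an assumption to adapt:

#drop a leading "d/m/yy, time - Sender: " style prefix where present
texts <- sub("^\\d{1,2}/\\d{1,2}/\\d{2,4}, [^-]+- [^:]+: ", "", texts)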

#load libraries to create wordcloud
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")

text <- texts
docs <- Corpus(VectorSource(text))
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, c("sharath","gunaje"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(8, "Dark2"))

#a wordcloud of the words used in the chat has been created; I cannot share my wordcloud here for obvious reasons :P
 

#sentiment analysis
#we use all packages that are used for twitter sentiment analysis
library(tm)
library(stringr)
library(syuzhet) #this library contain sentiment dictionary
library(lubridate) #provides tools that make it easier to parse and manipulate dates
library(ggplot2)
library(scales)
library(reshape2)
library(dplyr ) #dplyr provides a flexible grammar of data manipulation

#fetch sentiment scores from the chat texts
mySentiment <- get_nrc_sentiment(texts)
head(mySentiment)
text <- cbind(texts, mySentiment)

#count the sentiment words by category
sentimentTotals <- data.frame(colSums(text[,c(2:11)]))
names(sentimentTotals) <- "count"
sentimentTotals <- cbind("sentiment" = rownames(sentimentTotals), sentimentTotals)
rownames(sentimentTotals) <- NULL

#total sentiment score of all texts
ggplot(data = sentimentTotals, aes(x = sentiment, y = count)) +
  geom_bar(aes(fill = sentiment), stat = "identity") +
  theme(legend.position = "none") +
  xlab("Sentiment") + ylab("Total Count") + ggtitle("Total Sentiment Score for All Texts with XYZ")


#here is my output
 

You can use this code if you are clueless about where your chats are heading! Just kidding :P

We will discuss Amazon review analysis in the next post.

Thanks for visiting my blog. I always love to hear constructive feedback. Please give your feedback in the comment section below or write to me personally here. 

WEEK_4: Twitter sentiment analysis

Hi there!

This post is a continuation of the previous post.
You need to read the Twitter archive and store the tweets in the tweets variable (refer to the previous post for the steps).

Load required libraries.


library(tm)
library(stringr)
library(wordcloud)
library(syuzhet) #this library contain sentiment dictionary
library(lubridate) #provides tools that make it easier to parse and manipulate dates
library(ggplot2)
library(scales)
library(reshape2)
library(dplyr ) #dplyr provides a flexible grammar of data manipulation



#read tweets again (previous one is modified)
tweets <- read.csv("./tweets.csv", stringsAsFactors = FALSE)

# remove the Twitter handles
nohandles <- str_replace_all(tweets$text, "@\\w+", "")

#clean up the remaining text
wordCorpus <- Corpus(VectorSource(nohandles))
wordCorpus <- tm_map(wordCorpus, removePunctuation)
wordCorpus <- tm_map(wordCorpus, content_transformer(tolower))
wordCorpus <- tm_map(wordCorpus, removeWords, stopwords("english"))
wordCorpus <- tm_map(wordCorpus, removeWords, c("like", "video"))
wordCorpus <- tm_map(wordCorpus, stripWhitespace)
wordCorpus <- tm_map(wordCorpus, stemDocument)
pal <- brewer.pal(9,"YlGnBu")
pal <- pal[-(1:4)]
set.seed(123)

#create a word cloud
wordcloud(words = wordCorpus, scale=c(5,1), max.words=100, random.order=FALSE,    rot.per=0.35, use.r.layout=FALSE, colors=pal)


#this is the wordcloud of my tweets


#document term matrix creation
tdm <- TermDocumentMatrix(wordCorpus)
tdm
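If you want to see the most frequent terms rather than just the matrix summary, tm's findFreqTerms helps; a quick sketch (the threshold of 20 is arbitrary):

#terms that appear at least 20 times across all tweets
findFreqTerms(tdm, lowfreq = 20)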


#analyse the Twitter handles mentioned in the tweets
friends <- str_extract_all(tweets$text, "@\\w+")
namesCorpus <- Corpus(VectorSource(friends))

#wordcloud of Twitter handles
set.seed(146)
wordcloud(words = namesCorpus, scale=c(3,0.5), max.words=40, random.order=FALSE,
          rot.per=0.10, use.r.layout=FALSE, colors=pal)

#here is my Twitter handle wordcloud
#let us move to sentiment analysis
#fetch sentiment words from tweets
mySentiment <- get_nrc_sentiment(tweets$text)
head(mySentiment)
tweets <- cbind(tweets, mySentiment)

#count the sentiment words by category
sentimentTotals <- data.frame(colSums(tweets[,c(11:18)]))
names(sentimentTotals) <- "count"
sentimentTotals <- cbind("sentiment" = rownames(sentimentTotals), sentimentTotals)
rownames(sentimentTotals) <- NULL

#total sentiment score of all tweets
ggplot(data = sentimentTotals, aes(x = sentiment, y = count)) +
  geom_bar(aes(fill = sentiment), stat = "identity") +
  theme(legend.position = "none") +
  xlab("Sentiment") + ylab("Total Count") + ggtitle("Total Sentiment Score for All Tweets")

#my output


#categorize by time
tweets$timestamp <- with_tz(ymd_hms(tweets$timestamp), "Asia/Kolkata")
posnegtime <- tweets %>%
  group_by(timestamp = cut(timestamp, breaks="2 months")) %>%
  summarise(negative = mean(negative),
            positive = mean(positive)) %>% melt
names(posnegtime) <- c("timestamp", "sentiment", "meanvalue")
posnegtime$sentiment = factor(posnegtime$sentiment,levels(posnegtime$sentiment)[c(2,1)])

#sentiment over time
ggplot(data = posnegtime, aes(x = as.Date(timestamp), y = meanvalue, group = sentiment)) +
  geom_line(size = 2.5, alpha = 0.7, aes(color = sentiment)) +
  geom_point(size = 0.5) +
  ylim(0, NA) +
  scale_colour_manual(values = c("springgreen4", "firebrick3")) +
  theme(legend.title=element_blank(), axis.title.x = element_blank()) +
  scale_x_date(breaks = date_breaks("9 months"),
               labels = date_format("%Y-%b")) +
  ylab("Average sentiment score") +
  ggtitle("Sentiment Over Time")


#Sentiment During the Week
tweets$weekday <- wday(tweets$timestamp, label = TRUE)
weeklysentiment <- tweets %>% group_by(weekday) %>%
  summarise(anger = mean(anger),
            anticipation = mean(anticipation),
            disgust = mean(disgust),
            fear = mean(fear),
            joy = mean(joy),
            sadness = mean(sadness),
            surprise = mean(surprise),
            trust = mean(trust)) %>% melt
names(weeklysentiment) <- c("weekday", "sentiment", "meanvalue")

#plot Sentiment During the Week
ggplot(data = weeklysentiment, aes(x = weekday, y = meanvalue, group = sentiment)) +
  geom_line(size = 2.5, alpha = 0.7, aes(color = sentiment)) +
  geom_point(size = 0.5) +
  ylim(0, 0.6) +
  theme(legend.title=element_blank(), axis.title.x = element_blank()) +
  ylab("Average sentiment score") +
  ggtitle("Sentiment During the Week")


#Sentiment During the Year
tweets$month <- month(tweets$timestamp, label = TRUE)
monthlysentiment <- tweets %>% group_by(month) %>%
  summarise(anger = mean(anger),
            anticipation = mean(anticipation),
            disgust = mean(disgust),
            fear = mean(fear),
            joy = mean(joy),
            sadness = mean(sadness),
            surprise = mean(surprise),
            trust = mean(trust)) %>% melt
names(monthlysentiment) <- c("month", "sentiment", "meanvalue")

#Sentiment During the Year
ggplot(data = monthlysentiment, aes(x = month, y = meanvalue, group = sentiment)) +
  geom_line(size = 2.5, alpha = 0.7, aes(color = sentiment)) +
  geom_point(size = 0.5) +
  ylim(0, NA) +
  theme(legend.title=element_blank(), axis.title.x = element_blank()) +
  ylab("Average sentiment score") +
  ggtitle("Sentiment During the Year") 




I will be writing about WhatsApp sentiment analysis in the next post.

Thanks for visiting my blog. I always love to hear constructive feedback. Please give your feedback in the comment section below or write to me personally here.

WEEK_3: Twitter tweet analysis

Hi there! Welcome to the week 3 session.

Today we will discuss Twitter tweet analysis.

Pre-requisites:
  1. A Windows/Mac/Linux machine with r-base and RStudio installed (if you don't have them yet, you can refer to my previous post to get them on your PC).
  2. Basic understanding of R data types and syntax.
  3. And finally, YOU.
The very first thing is creating the data set for our analysis. We need to download our Twitter archive for this purpose. Follow the instructions below to get your Twitter archive.

Step 1:
Navigate to your twitter account settings page by following this link.

Step 2:
Request your Twitter archive by clicking the Request Your Archive link.
Twitter will send your archive via email; check the inbox of the email address associated with your Twitter account and download the archive file.

Step 3:
Extract the zipped file and find the tweets.csv file. Copy the file to your working directory.
By default, RStudio sets the Documents folder as the working directory, but you can change it by executing the setwd() command in RStudio.
Ex:
setwd("C:/Users/Sharath/Downloads")

So, now we have the data source. Let us jump into the R code.

We need three packages.

Install them first.

install.packages("ggplot2")
install.packages("lubridate")
install.packages("scales")


Let us load those libraries.

library(ggplot2)
library(lubridate)
library(scales)


Read data from tweets.csv

tweets <- read.csv("tweets.csv", stringsAsFactors = FALSE)



Convert the timestamp to a date-time object.

tweets$timestamp <- ymd_hms(tweets$timestamp)
tweets$timestamp <- with_tz(tweets$timestamp, "America/Chicago")


Now let us analyze your tweeting trend: when you tweet the most, and so on.

#basic histogram showing the distribution of my tweets over time
ggplot(data = tweets, aes(x = timestamp)) +
  geom_histogram(aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Time") + ylab("Number of tweets") +
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")



#tweets by year
ggplot(data = tweets, aes(x = year(timestamp))) +
  geom_histogram(breaks = seq(2007.5, 2016.2, by =1), aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Time") + ylab("Number of tweets") +
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")


#group by week days
ggplot(data = tweets, aes(x = wday(timestamp, label = TRUE))) +
  geom_histogram(breaks = seq(0.5, 7.5, by =1), aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Day of the Week") + ylab("Number of tweets") +
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")





#chi-square test to test the distribution of my tweets over week days
chisq.test(table(wday(tweets$timestamp, label = TRUE)))


###ratio of average tweet counts: days 2-5 of the weekday table vs. the remaining days
myTable <- table(wday(tweets$timestamp, label = TRUE))
mean(myTable[c(2:5)])/mean(myTable[c(1,6,7)])

###chi-square test against unequal expected proportions across the days of the week
chisq.test(table(wday(tweets$timestamp, label = TRUE)), p = c(4, 5, 5, 5, 5, 4, 4)/32)

#tweets by months
ggplot(data = tweets, aes(x = month(timestamp, label = TRUE))) +
  geom_histogram(aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Month") + ylab("Number of tweets") +
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")



###chi-square test on the distribution of tweets across months
chisq.test(table(month(tweets$timestamp, label = TRUE)))


#extract the time-of-day component of each tweet and store it as a new column
tweets$timeonly <- as.numeric(tweets$timestamp - trunc(tweets$timestamp, "days"))

#timestamps landing exactly on :00:00 most likely carry no real time information, so mark them NA
tweets[(minute(tweets$timestamp) == 0 & second(tweets$timestamp) == 0),11] <- NA
#fraction of tweets without usable time information
mean(is.na(tweets$timeonly))

#treat the time-only values as POSIXct so they can be plotted and labelled as times of day
class(tweets$timeonly) <- "POSIXct"

#number of tweets by time
ggplot(data = tweets, aes(x = timeonly)) +
  geom_histogram(aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Time") + ylab("Number of tweets") +
  scale_x_datetime(breaks = date_breaks("3 hours"),
                   labels = date_format("%H:00")) +
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")


#late night tweets by year
latenighttweets <- tweets[(hour(tweets$timestamp) < 6),]
ggplot(data = latenighttweets, aes(x = timestamp)) +
  geom_histogram(aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Time") + ylab("Number of tweets") + ggtitle("Late Night Tweets") +
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")
 
#number of tweets with hashtags
ggplot(tweets, aes(factor(grepl("#", tweets$text)))) +
  geom_bar(fill = "midnightblue") +
  theme(legend.position="none", axis.title.x = element_blank()) +
  ylab("Number of tweets") +
  ggtitle("Tweets with Hashtags") +
  scale_x_discrete(labels=c("No hashtags", "Tweets with hashtags"))
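The plot above only separates tweets with and without hashtags. If you also want to know how many hashtags each tweet contains, str_count from the stringr package (used heavily in the later posts of this series) can count them. A small sketch, assuming stringr is installed:

library(stringr)
hashtag_counts <- str_count(tweets$text, "#\\w+")
table(hashtag_counts)   #how many tweets use 0, 1, 2, ... hashtags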

#number of tweets retweeted
ggplot(tweets, aes(factor(!is.na(retweeted_status_id)))) +
  geom_bar(fill = "midnightblue") +
  theme(legend.position="none", axis.title.x = element_blank()) +
  ylab("Number of tweets") +
  ggtitle("Retweeted Tweets") +
  scale_x_discrete(labels=c("Not retweeted", "Retweeted tweets"))




 

#number of replied tweets
ggplot(tweets, aes(factor(!is.na(in_reply_to_status_id)))) +
  geom_bar(fill = "midnightblue") +
  theme(legend.position="none", axis.title.x = element_blank()) +
  ylab("Number of tweets") +
  ggtitle("Replied Tweets") +
  scale_x_discrete(labels=c("Not in reply", "Replied tweets"))


#categorize tweets under types
tweets$type <- "tweet"
tweets[(!is.na(tweets$retweeted_status_id)),12] <- "RT"
tweets[(!is.na(tweets$in_reply_to_status_id)),12] <- "reply"
tweets$type <- as.factor(tweets$type)
tweets$type = factor(tweets$type,levels(tweets$type)[c(3,1,2)])




#plot tweets over time, split by type: tweets, retweets, and replies
ggplot(data = tweets, aes(x = timestamp, fill = type)) +
  geom_histogram() +
  xlab("Time") + ylab("Number of tweets") +
  scale_fill_manual(values = c("midnightblue", "deepskyblue4", "aquamarine3"))




#proportion of each type over time
ggplot(data = tweets, aes(x = timestamp, fill = type)) +
  geom_bar(position = "fill") +
  xlab("Time") + ylab("Proportion of tweets") +
  scale_fill_manual(values = c("midnightblue", "deepskyblue4", "aquamarine3"))


#calculate characters per tweet
tweets$charsintweet <- sapply(tweets$text, function(x) nchar(x))


#plot characters per tweet
ggplot(data = tweets, aes(x = charsintweet)) +
  geom_histogram(aes(fill = ..count..), binwidth = 8) +
  theme(legend.position = "none") +
  xlab("Characters per Tweet") + ylab("Number of tweets") +
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")





We will discuss Twitter sentiment analysis in the next post.

Thanks for visiting my blog. I always love to hear constructive feedback. Please give your feedback in the comment section below or write to me personally here.