Data Science

Monday, March 7, 2016

WEEK_6: Amazon Review Analysis

Hi there!

Today we will discuss how to analyze Amazon reviews.

In this analysis, we will compute an overall score for each review and decide whether the review is positive or negative.
We will use 3 files in this program.
One file contains the words that are treated as positive when they appear in a review.
Another file contains the words that are treated as negative.
The third file contains possible keywords (for example, for a mobile phone review the keywords might be screen, battery, camera, etc.).

#################################################################################
#read the dictionary files
#################################################################################
pos = scan('positive-words.txt',
           what='character', comment.char=';',sep = "\n")
neg = scan('negative-words.txt',
           what='character', comment.char=';',sep = "\n")
key = scan('key-words.txt',
           what='character', comment.char=';',sep = "\n")

#you can add more words to the list
pos.words = c(pos, 'awsm')
neg.words = c(neg, 'wait', 'lol')
key.words = c(key, 'graphics')

#################################################################################
#function to calculate sentiment per line
#here we pass the reviews and the dictionary words to the function.
#the function breaks each review into tokens and counts the occurrences of the dictionary words.
#if the dictionary is the positive-words dictionary, the return value is the positive score of the review
#################################################################################

score.sentiment = function(sentences, dic.words, .progress='none')
{
require(plyr)
require(stringr)

scores = laply(sentences, function(sentence, dic.words) {
    #clean the data
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)

    # and convert to lower case:
    sentence = tolower(sentence)

    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)
    # compare our words to the dictionaries of positive & negative terms
    dic.matches = match(words, dic.words)

    dic.matches = !is.na(dic.matches)

    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(dic.matches)

    return(score)
}, dic.words, .progress=.progress)

scores.df = data.frame(review=sentences, Senti_Score=scores)
return(scores.df)
}
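
A small caveat on the cleaning step: punctuation is deleted rather than replaced with a space, so words joined only by punctuation get merged into one token. Here is a minimal sketch using a phrase from the first test review. The variable names are my own, and replacing punctuation with a space is just one possible alternative, not what the function above does.

```r
# A phrase taken from freeText1, where two words are joined only by dots.
sentence = 'extra Sim Slot....What is New'

# Deleting punctuation (as score.sentiment does) glues "Slot" and "What" together.
merged.tokens = strsplit(tolower(gsub('[[:punct:]]', '', sentence)), '\\s+')[[1]]

# Replacing punctuation with a space keeps them as separate tokens.
split.tokens = strsplit(tolower(gsub('[[:punct:]]', ' ', sentence)), '\\s+')[[1]]

print(merged.tokens)
print(split.tokens)
```

If your reviews contain a lot of text like this, the merged tokens will never match a dictionary word, so the scores may come out lower than expected.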

#################################################################################
#function to fetch only important reviews
#here we pass the important keywords along with the reviews. the function counts how many keywords are found in each review and returns a data frame. We can eliminate unnecessary reviews by looking at their importance score
#################################################################################

impReviews = function(sentences, key.words, .progress='none')
{
require(plyr)
require(stringr)

scores = laply(sentences, function(sentence, key.words) {
    #clean the data
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)

    # and convert to lower case:
    sentence = tolower(sentence)

    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)
    # compare our words to the list of keywords
    key.matches = match(words, key.words)

    key.matches = !is.na(key.matches)

    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(key.matches)

    return(score)
}, key.words, .progress=.progress)

scores.df = data.frame(review=sentences, Imp_Score=scores)
return(scores.df)
}
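
The comment above says unnecessary reviews can be eliminated by looking at the importance score. A minimal sketch of that filtering step could look like this. The data frame imp.demo is a hand-made stand-in for what impReviews returns, and the cut-off min.keywords is a hypothetical value you would tune yourself.

```r
# Hand-made stand-in for the data frame returned by impReviews()
# (same column names: review and Imp_Score).
imp.demo = data.frame(
  review = c('Battery drains fast and the screen flickers',
             'lol',
             'Camera is decent'),
  Imp_Score = c(2, 0, 1),
  stringsAsFactors = FALSE
)

# Hypothetical cut-off: keep reviews that mention at least one keyword.
min.keywords = 1
important = imp.demo[imp.demo$Imp_Score >= min.keywords, ]

print(important$review)
```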

#################################################################################
#test data
#################################################################################

freeText1 = "Xiaomi played a Trick here but i am not sure if it would work. This phone is Actually The Redmi Note 4G, but with a Different name and an extra Sim Slot....What is New in this then ?? Same 1 year Old Model
I dont understand WHY to launch an already Discontinued devices when you have a Lot of devices (Mi 5 which might never launch i guess)
Xiaomi India is taking credit of it but i dont see anything NEW as such in the phone.
The Company is NOT launching any good phone now like REDMI NOTE 3 & REDMI NOTE 3 PRIME , due to Legal issues.
You launch Good devices in CHINA and launch such devices which dont sell there anymore ,to dispose off in India.
I WOULD NOT RECOMMEND THIS DEVICE. And customers should make it clear to such brand that there are Many other Brands Which we can Opt. Its not that Only Xiaomi is the One in Market.
If INDIAN Customers have given Xiaomi that Market BOOST , They can take that Back too.
Atleast it should keep in Mind that INDIAN users are Not to be Served an OUTDATED phone. We want better specifications too which have been launched worldwide.
This launch by Xiaomi shows they just want to OUTSTOCK their Old phones , and surprisingly customers are happy with this also.
I dont find any good reason to Buy this phone being an model 1 year Old , just a Big Publicity Launching would Not make it go Far..!!
THIS IS OUTDATED ....Not recommended at All !!!
LIKE THIS COMMENT TO SEND A MESSAGE THAT EVEN WE WANT UPDATED PHONES WHICH ARE LAUNCHED WORLDWIDE AND SUCH DEVICES ARE NOT ACCEPTED BEING OUT OF MARKET TRENDS...!!!"
freeText2 = "hi Dear 5 star Keyboard warriors .Please read my reviews and give expert advice . I brought this phone last week and in 3 days these are the defects i found

1 . ii have attached the screen shot for reference . This phone has a media server app that takes almost 70 % of your battery . This app cannot be force stopped (this app is used to scan all media files and refresh in your gallery). So if you charge your phone 100 % it will be 65 % in jus half and hour even on standby because of this app. And the back panel of phone heats so much that its is very useful during winter to keep you warm :P. Trust me my pant gets warm as soon i slide my phone in pocket in jus 3 minutes.Mounting External SD card will make the media server app go worse . So solution is dont add any files on your phone to keep media server quite and get high battery life..

2 The apps cannot be moved to memory card so you have to use the only 11 gb space available in phone for apps (half will be consumed by whats app :P ) . and don even think of rooting the phone ,if you root by seeing the you tube videos (MOST ARE ONLY FOR REDMI PRIME /REDMI NOTE 4G) .If u root it any ways you wont get any updates to install.

3 there is no search option in music player provided in phone , so if u like that one song among 700 songs you have to scroll way down to get that one song to listen.So if you install any other music app which has search option, it will only scan your internal storage songs not external storage songs.

I have reported all these bugs to xiaomi , still no action taken ..oooops i forgot why will they take any action .. i already got scammed with 8500 rs by them :P

online chat and support numbers also dont respond :D only updates and fixes can save my phone.

this is all i could find the most non user friendly thing in this outdated phone in 3 days . will post more on this .

OVERALL I HAVE MADE YET ANOTHER STUPID DECISON IN MY LIFE AND STILL LAUGH ABOUT IT :D"
freeText3 = "Please do not purchase this product. Too much heat generation on while using net and calling. Not user friendly. Not getting proper connectivity When comparing to other brands"
sample = c(freeText1,freeText2,freeText3)

#################################################################################
#call all functions
#################################################################################

result_senti_pos = score.sentiment(sample, pos.words)
View(result_senti_pos)

sample output:

result_senti_neg = score.sentiment(sample, neg.words)
View(result_senti_neg)

sample output:

result_imp_count = impReviews(sample, key.words)
View(result_imp_count)

sample output:
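
The calls above give the positive and negative counts separately. To reach the positive or negative verdict promised at the start of the post, one simple option is to subtract the two counts. The sketch below uses only base R with tiny stand-in dictionaries. The helper count.hits, the demo word lists and the "net score" rule are my assumptions for illustration, not a standard method.

```r
# Hypothetical helper: count dictionary hits in one review using base R only.
count.hits = function(review, dic.words) {
  cleaned = tolower(gsub('[[:punct:][:cntrl:][:digit:]]', '', review))
  words = unlist(strsplit(cleaned, '\\s+'))
  sum(words %in% dic.words)
}

# Tiny stand-in dictionaries; the real ones come from the word-list files.
demo.pos = c('good', 'great', 'happy')
demo.neg = c('bad', 'outdated', 'heat')

demo.reviews = c('Great phone, good battery!',
                 'Outdated model. Too much heat, bad support.')

pos.count = sapply(demo.reviews, count.hits, dic.words = demo.pos, USE.NAMES = FALSE)
neg.count = sapply(demo.reviews, count.hits, dic.words = demo.neg, USE.NAMES = FALSE)

# Net score: positive hits minus negative hits.
overall = data.frame(review = demo.reviews,
                     net_score = pos.count - neg.count,
                     stringsAsFactors = FALSE)
overall$label = ifelse(overall$net_score > 0, 'positive',
                       ifelse(overall$net_score < 0, 'negative', 'neutral'))
print(overall)
```

With the real data you would subtract result_senti_neg$Senti_Score from result_senti_pos$Senti_Score in the same way.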

We will discuss spam filtering in the next post.

Thanks for visiting my blog. I always love to hear constructive feedback. Please give your feedback in the comment section below or write to me personally here.

WEEK_5: WhatsApp sentiment analysis

Hi there!

Hope you had fun with your Twitter sentiment analysis last week. Today we will discuss sentiment analysis on WhatsApp data.

First of all, we need to get the WhatsApp chat archive.
Get your chat history using 'email chat history' facility offered by WhatsApp (follow this link if you are finding it difficult to get chat history).

#load required libraries


library(ggplot2)
library(lubridate)
library(scales)
library(reshape2)

#Read from chat history file
texts <- readLines("w.txt")
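
Each line of the exported chat usually starts with a date, a time and the sender's name, and these end up in the wordcloud unless you strip them. The sketch below shows one way to do it with base R. The line format in chat.lines is an assumption on my part (the export format differs between phones and app versions), so adjust the pattern to match your own file.

```r
# Two made-up lines in an assumed export format: "date, time - sender: message".
chat.lines = c('07/03/2016, 9:15 PM - Sharath: Had a great day today',
               '07/03/2016, 9:16 PM - Friend: That is awesome news')

# Pattern for the assumed prefix; change it if your export looks different.
prefix = '^\\d{1,2}/\\d{1,2}/\\d{2,4}, \\d{1,2}:\\d{2}( [AP]M)? - [^:]+: '

# Keep only the message text.
messages = sub(prefix, '', chat.lines, perl = TRUE)
print(messages)
```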

#load libraries to create wordcloud
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")

text=texts;
docs <- Corpus(VectorSource(text))
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, c("sharath","gunaje")) 
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

#wordcloud of words used in chat has been created, I cannot share my wordcloud for obvious reasons :P

#sentiment analysis
#we use all packages that are used for twitter sentiment analysis
library(tm)
library(stringr)
library(syuzhet) #this library contain sentiment dictionary
library(lubridate) #provides tools that make it easier to parse and manipulate dates
library(ggplot2)
library(scales)
library(reshape2)
library(dplyr ) #dplyr provides a flexible grammar of data manipulation

#fetch sentiment words from the chat texts
mySentiment <- get_nrc_sentiment(texts)
head(mySentiment)
text <- cbind(texts, mySentiment)

#count the sentiment words by category
sentimentTotals <- data.frame(colSums(text[,c(2:11)]))
names(sentimentTotals) <- "count"
sentimentTotals <- cbind("sentiment" = rownames(sentimentTotals), sentimentTotals)
rownames(sentimentTotals) <- NULL
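
If the reshaping above looks confusing, here is the same idea on a tiny hand-made table. get_nrc_sentiment returns one row per text and one column per sentiment. The data frame fake.sentiment below is only a stand-in with three of those columns and made-up numbers.

```r
# Hand-made stand-in for get_nrc_sentiment() output (three columns only).
fake.sentiment = data.frame(joy   = c(1, 0, 2),
                            anger = c(0, 1, 0),
                            trust = c(1, 1, 1))

# Sum each column, then move the sentiment names from row names into a column.
totals = data.frame(count = colSums(fake.sentiment))
totals = cbind(sentiment = rownames(totals), totals)
rownames(totals) = NULL

print(totals)
```

The result has one row per sentiment, which is the shape ggplot needs for the bar chart.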

#total sentiment score of all texts
ggplot(data = sentimentTotals, aes(x = sentiment, y = count)) +
  geom_bar(aes(fill = sentiment), stat = "identity") +
  theme(legend.position = "none") +
  xlab("Sentiment") + ylab("Total Count") + ggtitle("Total Sentiment Score for All Texts with XYZ")

#here is my output

You can use this code if you are clueless about where your chat is heading! Just kidding :P

We will discuss Amazon review analysis in the next post.

Thanks for visiting my blog. I always love to hear constructive feedback. Please give your feedback in the comment section below or write to me personally here.

WEEK_4: Twitter sentiment analysis

Hi there!

This post is a continuation of the previous post.
You need to read the Twitter archive and store the tweets in the tweets variable (refer to the previous post for the steps).

Load required libraries.





library(tm)
library(stringr)
library(wordcloud)

library(syuzhet) #this library contain sentiment dictionary
library(lubridate) #provides tools that make it easier to parse and manipulate dates
library(ggplot2)
library(scales)
library(reshape2)
library(dplyr ) #dplyr provides a flexible grammar of data manipulation



#read tweets again (previous one is modified)
tweets <- read.csv("./tweets.csv", stringsAsFactors = FALSE)

# remove the Twitter handles
nohandles <- str_replace_all(tweets$text, "@\\w+", "")
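
To see what the pattern "@\\w+" matches, here is a small sketch on a made-up tweet. It uses base R (gsub and regmatches) instead of stringr so that it runs without extra packages. The tweet text and handles are invented.

```r
# A made-up tweet with two handles.
tweet = 'Thanks @alice and @bob_42 for the help!'

# Remove the handles (same pattern as above, base R instead of stringr).
no.handles = gsub('@\\w+', '', tweet)

# Extract the handles instead of removing them.
handles = regmatches(tweet, gregexpr('@\\w+', tweet))[[1]]

print(no.handles)
print(handles)
```

Note that removing a handle leaves a double space behind, which is why stripWhitespace is applied in the clean-up step.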


#clean up the remaining text
wordCorpus <- Corpus(VectorSource(nohandles))
wordCorpus <- tm_map(wordCorpus, removePunctuation)
wordCorpus <- tm_map(wordCorpus, content_transformer(tolower))
wordCorpus <- tm_map(wordCorpus, removeWords, stopwords("english"))
wordCorpus <- tm_map(wordCorpus, removeWords, c("like", "video"))
wordCorpus <- tm_map(wordCorpus, stripWhitespace)
wordCorpus <- tm_map(wordCorpus, stemDocument)
pal <- brewer.pal(9,"YlGnBu")
pal <- pal[-(1:4)]
set.seed(123)



#create a word cloud
wordcloud(words = wordCorpus, scale=c(5,1), max.words=100, random.order=FALSE,    rot.per=0.35, use.r.layout=FALSE, colors=pal)

#this is the wordcloud of my tweets


#document term matrix creation
tdm <- TermDocumentMatrix(wordCorpus)
tdm


#analyse the Twitter handles
friends <- str_extract_all(tweets$text, "@\\w+")
namesCorpus <- Corpus(VectorSource(friends))

#wordcloud of Twitter handles
set.seed(146)
wordcloud(words = namesCorpus, scale=c(3,0.5), max.words=40, random.order=FALSE, 
          rot.per=0.10, use.r.layout=FALSE, colors=pal)

#here is my Twitter handle wordcloud


#let us move to sentiment analysis

#fetch sentiment words from tweets
mySentiment <- get_nrc_sentiment(tweets$text)
head(mySentiment)
tweets <- cbind(tweets, mySentiment)

#count the sentiment words by category
sentimentTotals <- data.frame(colSums(tweets[,c(11:18)]))
names(sentimentTotals) <- "count"
sentimentTotals <- cbind("sentiment" = rownames(sentimentTotals), sentimentTotals)
rownames(sentimentTotals) <- NULL

#total sentiment score of all tweets
ggplot(data = sentimentTotals, aes(x = sentiment, y = count)) +
  geom_bar(aes(fill = sentiment), stat = "identity") +
  theme(legend.position = "none") +
  xlab("Sentiment") + ylab("Total Count") + ggtitle("Total Sentiment Score for All Tweets")

#my output


#categorize by time
tweets$timestamp <- with_tz(ymd_hms(tweets$timestamp), "Asia/Kolkata")
posnegtime <- tweets %>% 
  group_by(timestamp = cut(timestamp, breaks="2 months")) %>%
  summarise(negative = mean(negative),
            positive = mean(positive)) %>% melt
names(posnegtime) <- c("timestamp", "sentiment", "meanvalue")
posnegtime$sentiment = factor(posnegtime$sentiment,levels(posnegtime$sentiment)[c(2,1)])
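
The grouping above relies on cut() to put each timestamp into a two-month bin. Here is a minimal base R sketch of that step on three made-up timestamps, with tapply standing in for group_by and summarise. The values in positive are invented.

```r
# Three made-up timestamps and their (made-up) positive scores.
stamps = as.POSIXct(c('2015-01-10 10:00:00',
                      '2015-02-20 12:00:00',
                      '2015-03-05 09:00:00'), tz = 'UTC')
positive = c(1, 3, 5)

# Put each timestamp into a two-month bin, starting at the first month seen.
bins = cut(stamps, breaks = '2 months')

# Average score per bin (base R equivalent of group_by + summarise).
mean.by.bin = tapply(positive, bins, mean)

print(table(bins))
print(mean.by.bin)
```

The first two timestamps fall into the January-February bin and the third into the March-April bin.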

#sentiment over time
ggplot(data = posnegtime, aes(x = as.Date(timestamp), y = meanvalue, group = sentiment)) +
  geom_line(size = 2.5, alpha = 0.7, aes(color = sentiment)) +
  geom_point(size = 0.5) +
  ylim(0, NA) + 
  scale_colour_manual(values = c("springgreen4", "firebrick3")) +
  theme(legend.title=element_blank(), axis.title.x = element_blank()) +
  scale_x_date(breaks = date_breaks("9 months"), 
               labels = date_format("%Y-%b")) +
  ylab("Average sentiment score") + 
  ggtitle("Sentiment Over Time")


#Sentiment During the Week
tweets$weekday <- wday(tweets$timestamp, label = TRUE)
weeklysentiment <- tweets %>% group_by(weekday) %>% 
  summarise(anger = mean(anger), 
            anticipation = mean(anticipation), 
            disgust = mean(disgust), 
            fear = mean(fear), 
            joy = mean(joy), 
            sadness = mean(sadness), 
            surprise = mean(surprise), 
            trust = mean(trust)) %>% melt
names(weeklysentiment) <- c("weekday", "sentiment", "meanvalue")

#plot Sentiment During the Week
ggplot(data = weeklysentiment, aes(x = weekday, y = meanvalue, group = sentiment)) +
  geom_line(size = 2.5, alpha = 0.7, aes(color = sentiment)) +
  geom_point(size = 0.5) +
  ylim(0, 0.6) +
  theme(legend.title=element_blank(), axis.title.x = element_blank()) +
  ylab("Average sentiment score") + 
  ggtitle("Sentiment During the Week")


#Sentiment During the Year
tweets$month <- month(tweets$timestamp, label = TRUE)
monthlysentiment <- tweets %>% group_by(month) %>% 
  summarise(anger = mean(anger), 
            anticipation = mean(anticipation), 
            disgust = mean(disgust), 
            fear = mean(fear), 
            joy = mean(joy), 
            sadness = mean(sadness), 
            surprise = mean(surprise), 
            trust = mean(trust)) %>% melt
names(monthlysentiment) <- c("month", "sentiment", "meanvalue")

#Sentiment During the Year
ggplot(data = monthlysentiment, aes(x = month, y = meanvalue, group = sentiment)) +
  geom_line(size = 2.5, alpha = 0.7, aes(color = sentiment)) +
  geom_point(size = 0.5) +
  ylim(0, NA) +
  theme(legend.title=element_blank(), axis.title.x = element_blank()) +
  ylab("Average sentiment score") + 
  ggtitle("Sentiment During the Year")

I will be writing about WhatsApp sentiment analysis in the next post.

Thanks for visiting my blog. I always love to hear constructive feedback. Please give your feedback in the comment section below or write to me personally here.

WEEK_3: Twitter tweet analysis

Hi there! Welcome to the week 3 session.

Today we will discuss Twitter tweet analysis.

Prerequisites:

Windows/Mac/Linux machine with r-base and RStudio installed (if you don't have them yet, you can refer to my previous post and get them on your PC).
Basic understanding of R data types and syntax.
And finally, YOU.

The very first thing is creating a data set for our operations. We need to download our Twitter archive for this purpose. Follow the instructions below to get your Twitter archive.

Step: 1
Navigate to your twitter account settings page by following this link.

Step: 2
Request your twitter archive by clicking on Request Your Archive link.
Twitter will send your archive via email, check your email inbox (associated with twitter account) and download the archive file.

Step: 3
Extract the zipped file and find tweets.csv file. Copy the file to your working directory.
By default, RStudio sets the Documents folder as the working directory. But you can change the working directory by executing the setwd() command in RStudio.
Ex:
setwd("C:/Users/Sharath/Downloads")

So, now we have the data source. Let us jump into the R code.

We need to use 3 packages.

Install those libraries first.


install.packages("ggplot2")
install.packages("lubridate")

install.packages("scales")

Let us load those libraries.


library(ggplot2)
library(lubridate)
library(scales)

Read data from tweets.csv


tweets <- read.csv("tweets.csv", stringsAsFactors = FALSE)

Convert the timestamp to a date-time object


tweets$timestamp <- ymd_hms(tweets$timestamp)
tweets$timestamp <- with_tz(tweets$timestamp, "America/Chicago")
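
If you want to check what these two lines do, here is a rough base R equivalent on a single value. I am assuming the timestamps in tweets.csv look like "2016-03-01 14:05:09 +0000" (check your own file), and the variable names raw and utc are mine.

```r
# One timestamp in the assumed archive format (UTC offset at the end).
raw = '2016-03-01 14:05:09 +0000'

# Parse it as a date-time object in UTC (what ymd_hms does for us).
utc = as.POSIXct(raw, format = '%Y-%m-%d %H:%M:%S %z', tz = 'UTC')

# Show the same instant in another time zone (what with_tz does for us).
local.hour = format(utc, '%H', tz = 'America/Chicago')

print(utc)
print(format(utc, tz = 'America/Chicago', usetz = TRUE))
```

The instant itself does not change, only the clock it is displayed on. Use your own time zone name in place of "America/Chicago".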

Now let us analyze your tweeting trend, such as when you tweet the most.

#basic histogram showing the distribution of my tweets over time
ggplot(data = tweets, aes(x = timestamp)) +
  geom_histogram(aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Time") + ylab("Number of tweets") + 
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")



#tweets by year
ggplot(data = tweets, aes(x = year(timestamp))) +
  geom_histogram(breaks = seq(2007.5, 2016.2, by =1), aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Time") + ylab("Number of tweets") + 
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")

#group by week days
ggplot(data = tweets, aes(x = wday(timestamp, label = TRUE))) +
  geom_bar(aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Day of the Week") + ylab("Number of tweets") + 
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")






#chi-square test to test the distribution of my tweets over week days
chisq.test(table(wday(tweets$timestamp, label = TRUE)))


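#ratio of average tweets on Monday-Thursday to average tweets on Friday-Sunday
#(wday() numbers the days from Sunday = 1 by default)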
myTable <- table(wday(tweets$timestamp, label = TRUE))
mean(myTable[c(2:5)])/mean(myTable[c(1,6,7)])

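#chi-square test against the expectation that Monday-Thursday get 5 tweets for every 4 on the other days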
chisq.test(table(wday(tweets$timestamp, label = TRUE)), p = c(4, 5, 5, 5, 5, 4, 4)/32)
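
To make the difference between the two chi-square tests clearer, here is a small sketch with made-up daily counts. Without p, chisq.test checks whether all days are equally likely. With p, it checks the counts against the proportions you give. The numbers in counts are invented for illustration.

```r
# Made-up number of tweets per weekday, Sunday first.
counts = c(Sun = 40, Mon = 55, Tue = 50, Wed = 52, Thu = 48, Fri = 38, Sat = 37)

# Test 1: are tweets spread evenly over the seven days?
uniform = chisq.test(counts)

# Test 2: do the counts fit a 5:4 ratio of Mon-Thu to the other days?
weighted = chisq.test(counts, p = c(4, 5, 5, 5, 5, 4, 4) / 32)

print(c(uniform = uniform$p.value, weighted = weighted$p.value))
```

A larger p-value in the second test means the 5:4 expectation describes these counts better than the even split does.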

#tweets by months
ggplot(data = tweets, aes(x = month(timestamp, label = TRUE))) +
  geom_bar(aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Month") + ylab("Number of tweets") + 
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")



#chi-square test to test the distribution of my tweets over months
chisq.test(table(month(tweets$timestamp, label = TRUE)))


#fetch time of tweet and add it to existing tweet holder
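#seconds since midnight; the units are fixed to seconds because the class is set to POSIXct later
tweets$timeonly <- as.numeric(difftime(tweets$timestamp, trunc(tweets$timestamp, "days"), units = "secs"))

#set the time to NA when minute and second are both 0
tweets$timeonly[minute(tweets$timestamp) == 0 & second(tweets$timestamp) == 0] <- NA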
mean(is.na(tweets$timeonly))


class(tweets$timeonly) <- "POSIXct"
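
To see what the timeonly column holds, here is a minimal base R sketch for one timestamp. The value is the number of seconds since midnight of the same day. The variable names stamp and secs are mine.

```r
# One made-up timestamp.
stamp = as.POSIXct('2016-03-01 14:05:09', tz = 'UTC')

# Seconds elapsed since the start of that day.
secs = as.numeric(difftime(stamp, trunc(stamp, 'days'), units = 'secs'))

print(secs)  # 14*3600 + 5*60 + 9 = 50709
```

Treating this number as a POSIXct value puts every tweet on the same day (1 January 1970), so only the time of day is left to plot.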

#number of tweets by time
ggplot(data = tweets, aes(x = timeonly)) +
  geom_histogram(aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Time") + ylab("Number of tweets") + 
  scale_x_datetime(breaks = date_breaks("3 hours"), 
                   labels = date_format("%H:00")) +
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")

#late night tweets by year
latenighttweets <- tweets[(hour(tweets$timestamp) < 6),]
ggplot(data = latenighttweets, aes(x = timestamp)) +
  geom_histogram(aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Time") + ylab("Number of tweets") + ggtitle("Late Night Tweets") +
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")

#number of tweets with hashtags
ggplot(tweets, aes(factor(grepl("#", tweets$text)))) +
  geom_bar(fill = "midnightblue") + 
  theme(legend.position="none", axis.title.x = element_blank()) +
  ylab("Number of tweets") + 
  ggtitle("Tweets with Hashtags") +
  scale_x_discrete(labels=c("No hashtags", "Tweets with hashtags"))

#number of tweets retweeted

ggplot(tweets, aes(factor(!is.na(retweeted_status_id)))) +
geom_bar(fill = "midnightblue") +
theme(legend.position="none", axis.title.x = element_blank()) +
ylab("Number of tweets") +
ggtitle("Retweeted Tweets") +
scale_x_discrete(labels=c("Not retweeted", "Retweeted tweets"))

#number of replied tweets
ggplot(tweets, aes(factor(!is.na(in_reply_to_status_id)))) +
  geom_bar(fill = "midnightblue") + 
  theme(legend.position="none", axis.title.x = element_blank()) +
  ylab("Number of tweets") + 
  ggtitle("Replied Tweets") +
  scale_x_discrete(labels=c("Not in reply", "Replied tweets"))


#categorize tweets under types
tweets$type <- "tweet"
tweets$type[!is.na(tweets$retweeted_status_id)] <- "RT"
tweets$type[!is.na(tweets$in_reply_to_status_id)] <- "reply"
tweets$type <- as.factor(tweets$type)
tweets$type = factor(tweets$type,levels(tweets$type)[c(3,1,2)])
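
Here is the same categorization on a tiny made-up data frame, in case you want to try it without your archive. I set the level order by name here instead of by position, which is a small variation on the code above. The id values are invented.

```r
# Three made-up tweets: a plain tweet, a retweet and a reply.
demo = data.frame(retweeted_status_id   = c(NA, 101, NA),
                  in_reply_to_status_id = c(NA, NA, 202))

# Start with "tweet" and overwrite where an id is present.
demo$type = 'tweet'
demo$type[!is.na(demo$retweeted_status_id)] = 'RT'
demo$type[!is.na(demo$in_reply_to_status_id)] = 'reply'

# Fix the order of the levels by name, so the plot legend reads tweet, reply, RT.
demo$type = factor(demo$type, levels = c('tweet', 'reply', 'RT'))

print(table(demo$type))
```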

#plot with types tweeting, retweeting, and replying
ggplot(data = tweets, aes(x = timestamp, fill = type)) +
geom_histogram() +
xlab("Time") + ylab("Number of tweets") +
scale_fill_manual(values = c("midnightblue", "deepskyblue4", "aquamarine3"))


#proportion of tweets among them
ggplot(data = tweets, aes(x = timestamp, fill = type)) +
  geom_histogram(position = "fill") +
  xlab("Time") + ylab("Proportion of tweets") +
  scale_fill_manual(values = c("midnightblue", "deepskyblue4", "aquamarine3"))


#calculate characters per tweet
tweets$charsintweet <- sapply(tweets$text, function(x) nchar(x))


#plot char per tweet
ggplot(data = tweets, aes(x = charsintweet)) +
  geom_histogram(aes(fill = ..count..), binwidth = 8) +
  theme(legend.position = "none") +
  xlab("Characters per Tweet") + ylab("Number of tweets") + 
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")

We will discuss Twitter sentiment analysis in the next post.

Thanks for visiting my blog. I always love to hear constructive feedback. Please give your feedback in the comment section below or write to me personally here.

Wednesday, February 24, 2016

WEEK_2: Introduction to R programming

Hi there, welcome to week 2 session.

Today we will learn,

Why I chose R over Python
Introduction to R language
Basics of R

Why R over python?

We can choose R or Python for data analysis. If you are already familiar with Python, you can go with Python. But I was a newbie to both technologies.

I selected R because of the following reasons.

R is object-oriented
R is a functional programming language
Operator overloading is much easier in R than in Python
Parallelism in R has been much further developed than in Python
R is designed for statistical analysis
R is great for exploratory work
R has a huge number of packages and readily usable tests that often provide you with the necessary tools to get up and running quickly
R can even be part of a big data solution

Introduction to R language

R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team, of which John Chambers, the creator of the S language, is a member.

As you know, we need an environment to run any program. You need to have r-base to run R programs.

You can download r-base by following below links.

For Windows machine, click here

For mac OSX machine, click here

For Linux machine, click here

(If any of the links is broken, get r-base from the CRAN website.)

Now we have r-base. We can start coding! But we usually prefer working in an IDE to working on the command line. R has a beautiful IDE called RStudio.

RStudio is an open source IDE. You can download it from their website. Here is the link.

Basics of R

Hope you have installed r-base and RStudio on your machine. Now launch RStudio or r-base interface.

After R is started, there is a console waiting for input. At the prompt (>), you can enter numbers and perform calculations.

eg:

> 1 + 2

output:
[1] 3

Variable assignment

We assign values to variables with the assignment operator "=". Just typing the variable by itself at the prompt will print out the value. We should note that another form of assignment operator "<-" is also in use. I prefer using "<-" operator, for no specific reason!

eg:

> x = 1
> x

output:
[1] 1
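
Since I mentioned the "<-" operator, here is a tiny sketch showing both forms side by side. At the top level of a script they do the same thing. The variable names are arbitrary.

```r
# Both operators assign a value to a name.
x <- 1
y = 2

# Typing an expression evaluates it; print() shows the result in a script.
total <- x + y
print(total)
```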

Comments

All text after the pound sign "#" within the same line is considered as a comment.

eg:

> 1 + 1 # this is a comment

output:
[1] 2

Functions

An R function is invoked by its name, followed by parentheses containing zero or more arguments. The following applies the function c to combine three numeric values into a vector.

eg:

> c(1, 2, 3)

output:
[1] 1 2 3

Extension Package

Sometimes we need additional functionality beyond what the core R library offers. In order to install an extension package, you should invoke the install.packages function at the prompt and follow the instructions.

eg:

> install.packages("package_name")

Getting Help

R provides extensive documentation. For example, entering ?c or help(c) at the prompt gives documentation of the function c in R.

eg:

> help(c)

If you are not sure about the name of the function you are looking for, you can perform a fuzzy search with the apropos function.

eg:

> apropos("can")

output:

[1] ".rs.scanFiles" "canCoerce" "cancor" "scan" "volcano"

I will be writing about Sentiment analysis of twitter and WhatsApp data in the next post.

Thanks for visiting my blog. I always love to hear constructive feedback. Please give your feedback in the comment section below or write to me personally here.

(I use R-Bloggers for updates on R, consider visiting this blog too!)

Wednesday, February 17, 2016

WEEK_1: Introduction to Data Science

Hi there!

I am Sharath G S. I have started to learn Data Science.

This booming field was introduced to me by the organization I am working with.

I want to be a master of Data Science. So I have done a lot of research about Data Science. I will be sharing my learnings here. I will post on a weekly basis. I will try to summarize my learnings of the week in a single post.

First of all, we need to understand what Data Science is. The very first thing that we do is just 'Google it'. Even I did the same.

Here is how wikipedia defines Data Science.

"Data Science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields such as statistics, data mining, and predictive analysis, similar to Knowledge Discovery in Databases (KDD)."

Data Science often involves using mathematical and algorithmic techniques to solve some of the most analytically complex business problems, leveraging troves of raw information to figure out hidden insight that lies beneath the surface. It centers around evidence-based analytical rigor and building robust decision capabilities.

Data Science enables companies to operate and strategize more intelligently. That is the reason why Data Science is the booming field.

Here is an image which will summarize the role of Data Science.

Who is Data Scientist?

"A data scientist is simply someone who is highly adept at studying large amounts of often unorganized/undigested data."

Another definition for a Data Scientist.

"A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician."

I found a Data Scientist's learning map. You don't have to worry about this now. This is just for your reference!

Data Science learner's path

You need to be good with statistics to become a good Data Scientist. You can refer to the Probability and statistics course by Khan Academy. Follow this link to access the course.

We will start with Data Analysis.

This is how this page defines data analysis.

"Data Analysis is the process of systematically applying statistical and/or logical techniques to describe and illustrate, condense and recap, and evaluate data."

We can use R language or Python for this purpose. I would like to go with R.
Let us start with R language next week. We will be doing text mining and analysis in the next session. And you know what? It is real fun! You will be doing sentiment analysis of your Twitter tweets and WhatsApp chats.

Don't miss it, subscribe to the blog for all updates.

Thanks for visiting my blog. I always love to hear constructive feedback. Please give your feedback in the comment section below or write to me personally here.
(I use R-Bloggers for updates on R, consider visiting this blog too!)