Monday, May 2, 2016

WEEK_13: Forecasting Methods

Hi there!
Today we will discuss different forecasting methods.

What is forecasting?
Forecasting is the process of making predictions of the future based on past and present data and analysis of trends. 
There are several methods for doing this. Here are two of the most widely used:
  • Moving average
  • Exponential smoothing
Moving average:
The moving average forecast is based on the assumption of a constant model.
We estimate the single parameter of the model at time T as the average of the last m observations, where m is the moving average interval:

M_T = (x_T + x_(T-1) + ... + x_(T-m+1)) / m

Since the model assumes a constant underlying mean, the forecast for any number of periods in the future is the same as this estimate: F_(T+k) = M_T for every k > 0.
In practice the moving average will provide a good estimate of the mean of the time series if the mean is constant or slowly changing. In the case of a constant mean, the largest value of m will give the best estimates of the underlying mean, since a longer observation period averages out the effects of variability.
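To make this concrete, here is a minimal R sketch of a moving average forecast; the sample series and the variable names below are my own, hypothetical choices:

#hypothetical demand series
x <- c(102, 98, 105, 110, 99, 103, 108, 101)

m <- 3                      #moving average interval
M_T <- mean(tail(x, m))     #average of the last m observations

#under the constant model, the forecast for every future period is M_T
print(M_T)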
For more of the statistics and math behind the moving average, please check this website.

Exponential smoothing:

As with the moving average, this method assumes that the time series follows a constant model.
The estimate of the model parameter at time T, call it S_T, is computed as the weighted average of the last observation and the previous estimate:

S_T = a*x_T + (1 - a)*S_(T-1)

Here a (alpha) is a smoothing parameter in the interval [0, 1].
Rearranging gives an alternative form:

S_T = S_(T-1) + a*(x_T - S_(T-1))

That is, the new estimate is the old estimate plus a proportion a of the observed error.
Because we are assuming a constant model, the forecast for any future period is the same as the current estimate.
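Here is a matching minimal R sketch of exponential smoothing, using the same hypothetical series as above:

#hypothetical demand series
x <- c(102, 98, 105, 110, 99, 103, 108, 101)

alpha <- 0.3      #smoothing parameter in [0, 1]
S <- x[1]         #initialize the estimate with the first observation
for (t in 2:length(x)) {
  S <- alpha * x[t] + (1 - alpha) * S   #weighted average of observation and old estimate
}

#under the constant model, the forecast for every future period is S
print(S)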
 
For more of the statistics and math behind exponential smoothing, please check this website.
 
Note:
Find R code on forecasting here.
 
Thank you for visiting my blog.
 
 

Wednesday, April 27, 2016

WEEK_12: Logistic regression analysis using R

Hi there!

We discussed linear and multiple linear regression in the previous posts. Today we will discuss logistic regression.

Logistic regression is a regression model in which the response variable (dependent variable) takes categorical values such as True/False or 0/1. It models the probability of a binary response as a function of the predictor variables through a mathematical equation relating them.

The general mathematical equation for logistic regression is:
 
y = 1/(1+e^-(a+b1x1+b2x2+b3x3+...))

y is the response variable.
x1, x2, ... are the predictor variables.
a and b1, b2, ... are the coefficients, which are numeric constants.

We use the glm() function to create the regression model and get its summary for analysis.

Here is the syntax of the glm() function.
 
glm(formula,data,family)
   
Let us consider one example. Here, we will use a data set built into R called mtcars.

#select the response (am) and the predictors (cyl, hp, wt)
input <- mtcars[,c("am","cyl","hp","wt")]

#fit the logistic regression model
am.data = glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)
print(summary(am.data))
 
Output:
 
Call:
glm(formula = am ~ cyl + hp + wt, family = binomial, data = input)

Deviance Residuals: 
     Min        1Q      Median        3Q       Max  
-2.17272     -0.14907  -0.01464     0.14116   1.27641  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept) 19.70288    8.11637   2.428   0.0152 *
cyl          0.48760    1.07162   0.455   0.6491  
hp           0.03259    0.01886   1.728   0.0840 .
wt          -9.14947    4.15332  -2.203   0.0276 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.2297  on 31  degrees of freedom
Residual deviance:  9.8415  on 28  degrees of freedom
AIC: 17.841

Number of Fisher Scoring iterations: 8
 
Let us analyze the output:   
  • The Call section reminds us of the formula we used. 
  • Deviance residuals are a measure of model fit. This part of the output shows the distribution of the deviance residuals for the individual cases used in the model.
  • The next part of the output shows the coefficients, their standard errors, the z-statistic (sometimes called a Wald z-statistic), and the associated p-values.
  • The Akaike information criterion (AIC) is a measure of the relative quality of a statistical model for a given set of data. As such, AIC provides a means for model selection.
    AIC deals with the trade-off between the goodness of fit of the model and the complexity of the model. It is founded on information theory: it offers a relative estimate of the information lost when a given model is used to represent the process that generates the data.
  • Null deviance shows how well the response variable is predicted by a model that includes only the intercept. 
  • Residual deviance shows how well the response variable is predicted by the model after the independent variables are added; note that adding the three predictors reduced the degrees of freedom from 31 to 28.
  • Fisher scoring iterations has to do with how the model was estimated. A linear model can be fit by solving closed form equations. Unfortunately, that cannot be done with logistic regression. Instead, an iterative approach (the Newton-Raphson algorithm by default) is used. Loosely, the model is fit based on a guess about what the estimates might be. The algorithm then looks around to see if the fit would be improved by using different estimates instead. If so, it moves in that direction (say, using a higher value for the estimate) and then fits the model again. The algorithm stops when it doesn't perceive that moving again would yield much additional improvement. This line tells you how many iterations there were before the process stopped and output the results.  
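To see the fitted model in action, and to compare models by AIC as mentioned above, here is a minimal sketch (the values for the hypothetical new car are made up for illustration):

#predicted probability of a manual transmission (am = 1) for a hypothetical car
newcar <- data.frame(cyl = 4, hp = 120, wt = 2.8)
predict(am.data, newdata = newcar, type = "response")

#AIC-based comparison with a smaller model: the lower AIC is preferred
smaller <- glm(am ~ wt, data = input, family = binomial)
AIC(am.data, smaller)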
Thanks for visiting my blog.  I always love to hear constructive feedback.  Please give your feedback in the comment section below or write to me personally here.

Monday, April 25, 2016

WEEK_11: Multiple Regression using R

Hi there!

Today we will discuss multiple regression using R. In the previous post, we discussed linear regression using R. You need to know linear regression to understand multiple regression better. If you missed my previous post, find it here.

Multiple regression is an extension of linear regression to relationships among more than two variables. In simple linear regression we have one predictor and one response variable; in multiple regression we have more than one predictor variable and one response variable.

The general mathematical equation for multiple regression is:

y = a + b1x1 + b2x2 +...bnxn
 
y is the response variable.
a, b1, b2, ... bn are the coefficients,
and x1, x2, ... xn are the predictor variables.

We use the same lm() function (which we used for linear regression) to create the regression model, just with a different formula.
The basic syntax is:
 
lm(y ~ x1+x2+x3..., data)
 
Let us consider one example. Here we will analyze a data set which contains information used to estimate undergraduate enrollment at the University of New Mexico. Download the data set here.

#read data into variable
datavar <- read.csv("dataset_enrollmentForecast.csv")
 
#attach data variable
attach(datavar)
 
#predict the fall enrollment (ROLL) using the unemployment rate (UNEM) and number #of spring high school graduates (HGRAD).
twoPredictorModel <- lm(ROLL ~ UNEM + HGRAD, datavar)

#display model
twoPredictorModel


(Output: coefficients of twoPredictorModel — intercept -8255.8, UNEM 698.2, HGRAD 0.9)

From this output, we can determine that the intercept is -8255.8, the coefficient for the unemployment rate is 698.2, and the coefficient for number of spring high school graduates is 0.9. Therefore, the complete regression equation is Fall Enrollment = -8255.8 + 698.2 * Unemployment Rate + 0.9 * Number of Spring High School Graduates. This equation tells us that the predicted fall enrollment for the University of New Mexico will increase by 698.2 students for every one percent increase in the unemployment rate and 0.9 students for every one high school graduate.

#predict the fall enrollment (ROLL) using the unemployment rate (UNEM), number of #spring high school graduates (HGRAD), and per capita income (INC)
threePredictorModel <- lm(ROLL ~ UNEM + HGRAD + INC, datavar)

#display model
threePredictorModel



(Output: coefficients of threePredictorModel — intercept -9153.3, UNEM 450.1, HGRAD 0.4, INC 4.3)

From this output, we can determine that the intercept is -9153.3, the coefficient for the unemployment rate is 450.1, the coefficient for number of spring high school graduates is 0.4, and the coefficient for per capita income is 4.3. Therefore, the complete regression equation is Fall Enrollment = -9153.3 + 450.1 * Unemployment Rate + 0.4 * Number of Spring High School Graduates + 4.3 * Per Capita Income. This equation tells us that the predicted fall enrollment for the University of New Mexico will increase by 450.1 students for every one percent increase in the unemployment rate, 0.4 students for every one high school graduate, and 4.3 students for every one dollar of per capita income.
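As a quick sanity check, we can feed values into the fitted model with the predict() function. Here is a minimal sketch; the input numbers below are hypothetical:

#predict fall enrollment for a hypothetical year: 7% unemployment,
#20000 spring high school graduates, and $2500 per capita income
newyear <- data.frame(UNEM = 7, HGRAD = 20000, INC = 2500)
predict(threePredictorModel, newdata = newyear)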

#generate model summaries
summary(twoPredictorModel)


(Output: summary of twoPredictorModel — residuals, coefficient table, R-squared)


summary(threePredictorModel)

 

(Output: summary of threePredictorModel — residuals, coefficient table, R-squared)
 
#The meaning of these output values is the same as for the linear regression model.
#Please refer to my previous post for more info.
 
Thanks for visiting my blog.  I always love to hear constructive feedback.  Please give your feedback in the comment section below or write to me personally here.

Friday, April 22, 2016

WEEK_10: Linear Regression

Hi there!

Regression analysis is a very widely used statistical tool to establish a relationship model between two variables. One of these variables is called the predictor variable, whose value is gathered through experiments. The other is called the response variable, whose value is derived from the predictor variable.

Ex: y=mx+a

In the above example, y is the response variable and x is the predictor variable.
m is the slope and a is called the y-intercept; y equals a when x=0.

There are many types of regression models, such as linear regression, multiple linear regression, and logistic regression.

Today we will discuss linear regression.

In linear regression, the response and predictor variables are related through an equation in which the exponent (power) of both variables is 1. Mathematically, a linear relationship represents a straight line when plotted as a graph.

In general, we represent linear regression using the formula y=mx+a.
As an example, let us analyze the relationship between weight and average lifespan and try to predict lifespan.

Create data:

#create weight data
x <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

#create average lifespan data
y <- c(70, 67, 65, 62, 61, 68, 79, 80, 70, 62)



# Apply the lm() function.
relation <- lm(y~x)

print(relation)
 
#Output:
#Coefficients: (Intercept) approximately 58.745, x approximately 0.1479

#Here the intercept is the value of the response variable when the predictor is zero. 
#In other terms it is the y-intercept, the point where the line meets the y axis. 
#And the other coefficient is the slope of the line. 
 
#print summary of the relation
print(summary(relation)) 

#Output:
#(summary output: the residuals, the coefficient table with standard errors and
#p-values, and the R-squared of the fit)

#The difference between the observed value of the dependent variable and the 
#predicted value is called the residual.
#R-squared = Explained variation / Total variation. 
#To know more about R-squared, follow this link.
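#both quantities can be pulled out of the fitted model directly, for example:
residuals(relation)            #residual for each observation
summary(relation)$r.squared    #R-squared of the fit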
 
#let us predict the lifespan of a person who weighs 75 kg
a <- data.frame(x = 75)
result <- predict(relation,a)
print(result)
#Output: approximately 69.83
 
#let us visualize the regression graphically
#open a graphics device so the chart is saved to a file
png(file = "linearregression.png")

#plot the chart and add the fitted regression line
plot(x, y, col = "blue", main = "Weight & lifespan Regression",
     cex = 1.3, pch = 16, xlab = "Weight in Kg", ylab = "Age in years")
abline(relation)

#save the file
dev.off()
 
#Output: a scatter plot of weight vs. lifespan with the fitted regression line
Thanks for visiting my blog. I always love to hear constructive feedback. Please give your feedback in the comment section below or write to me personally here.

Wednesday, March 30, 2016

WEEK_9: Amazon review analysis –Ranking reviews, filtering duplicate and irrelevant reviews

Hi there!


Nowadays e-commerce is becoming very popular and is taking a huge share of the market. One of the reasons customers like e-commerce platforms is the product review feature. Customers can review the products they have purchased, which helps both the seller and other customers.
But because of the competition between sellers and manufacturers, we come across many spam reviews. Paid reviewers post biased reviews, duplicate reviews, or reviews which are irrelevant to the product.
Filtering those reviews is a very hectic process for the seller. So let us try to find a solution using the R language.
Logic used:
In this example, I have taken reviews of an iPhone from the Amazon website.
Step 1:
We need to have a list of keywords. These are the keywords we expect to see in the reviews. As we are analyzing reviews of an iPhone, our keyword list will contain keys like "camera", "battery", "life", "screen", "heat", etc.
Step 2:
In the second step, we find how many times these keys are repeated in the reviews and build a matrix to store the counts.
If a review doesn't contain any of the keywords, then that review is possibly spam, and it is not useful for the seller or the customer. If a review contains many of the keywords, that review should be considered first: the reviewer might be talking about some serious issue with the product.
Step 3:
Depending on the number of keywords found, calculate a relevance score for each review and sort the reviews by it.
Step 4:
In the final step we need to eliminate duplicate reviews. I have used a simple pairwise comparison of the reviews to find identical ones.
Now let us get into coding. 


Here is the R script.
(You can find materials and scripts used in this post on my Github repo.)

#clean up the workspace
rm(list = setdiff(ls(), lsf.str()))
library(stringr)
#################################################################################
#read important keywords
#################################################################################
keywords = scan('KeyWords.txt',
                what='character', comment.char=';',sep = "\n")
#################################################################################
#read the data to be evaluated; review.txt contains 11 reviews, each separated by a newline character
#################################################################################
reviews <- scan('review.txt',
                what='character', comment.char=';',sep = "\n")
#################################################################################
#score it and compare
#################################################################################
findScore <- function(review,k) {
  keyLength <- length(k)   #use the keyword vector passed in
  matScore <- c()
  tDF <- c()
  for(i in 1:keyLength) {
    tDF <- c(k[i],sum(str_count(review,k[i])))
    matScore <- rbind(matScore,tDF)
  }
  return(matScore)
}


reviewLength <- length(reviews)
score <- c()
score <- keywords
for (i in 1:reviewLength) {
  tScore <-c()
  tScore <- findScore(reviews[i],keywords)
  score <- cbind(score,tScore[,2])
}

View(score)
#################################################################################
#function to calculate relevance of reviews
#################################################################################

findRel <- function(reviewScore) {
  totalScore <- 0
  keyLength <- length(keywords)
  for(i in 1:keyLength) {
    totalScore <- as.numeric(reviewScore[i])+totalScore
  }
  return(totalScore)
}

#find irrelevant reviews
totalScoreR <-c()
for (i in 2:dim(score)[2]) {
  totalScoreR[i-1] <- findRel(score[,i])
}

findIfRel <- function (totalScoreR) {
  for (i in 1:length(totalScoreR)) {
    if(as.numeric(totalScoreR[i]) == 0)
      cat("Review:",i,"is irrelevant\n")
    else
      cat("Number of keywords found in review:",i,"is ", totalScoreR[i],"\n")
  }
}

#call the above function to find relevance of reviews
findIfRel(totalScoreR)



#################################################################################
#sort reviews according to their importance
#################################################################################
#high value review
  highValueR <- c()
  highValueR <- cbind(totalScoreR,c(1:length(totalScoreR)))
  highValueR <- data.frame(highValueR)
  highValueR <- highValueR[order(highValueR$totalScoreR, decreasing = TRUE),]
 
  print("Reviews in decresing order of importance:")
  for (i in 1:dim(highValueR)[1]) {
    cat("Rank ",i,":\n")
    cat(reviews[highValueR[i,2]])
    cat("\n##########################################\n")
  }
 
#################################################################################
#find similar reviews
#################################################################################
for (i in 1:length(reviews)) {
  for (j in i:length(reviews))  {
    if(!i==j)
      if(identical(reviews[i],reviews[j]))
        cat("\nReview ",i," and review ",j," are same\n")
  }
}



Thanks for visiting my blog. I always love to hear constructive feedback. Please give your feedback in the comment section below or write to me personally here.  

WEEK_8: Review analysis

Hi there!

This post is about review analysis. Review analysis is a very useful tool for sellers, especially those who sell online.
Customers give feedback after every purchase. It is essential to analyze those reviews and take the necessary actions to improve the business.
But unfortunately there are many spammers. Spammers may post unrelated reviews or post the same review multiple times. It is a very difficult task for the seller to determine which reviews are genuine.
So, here is a solution using R. This R script will eliminate the irrelevant reviews.

Logic of the code:
Build a list of all the words we are interested in. For example, for phone reviews the keyword list would be camera, display, heating, battery, etc.

Read all the reviews to be analyzed and score them with a simple algorithm using the keyword list. The score is the number of occurrences of the keywords in each review.

Depending on the scores, take the proper actions.

Here is the R script:


#clean up the workspace
rm(list = setdiff(ls(), lsf.str()))

#load stringr library for string operations
library(stringr)
#################################################################################
#read the important keywords
#################################################################################
keywords = scan('keywords.txt',
                what='character', comment.char=';',sep = "\n")
#################################################################################
#read the data to be evaluated (you can use any method to read these files; you can combine all reviews in one file too)
#################################################################################
review1 = scan('review1.txt',
               what='character', comment.char=';',sep = "\n")
review2 = scan('review2.txt',
               what='character', comment.char=';',sep = "\n")
review3 = scan('review3.txt',
               what='character', comment.char=';',sep = "\n")
review4 = scan('review4.txt',
               what='character', comment.char=';',sep = "\n")
#################################################################################
#score it and compare
#################################################################################
findScore <- function(review,k) {
  keyLength <- length(k)   #use the keyword vector passed in
  matScore <- c()          #start empty so row i of the matrix lines up with keyword i
  for(i in 1:keyLength) {
    tDF <- c(k[i],sum(str_count(review,k[i])))
    matScore <- rbind(matScore,tDF)
  }
  return(matScore)
}


#call above function and score reviews
matScoreR1 <- findScore(review1,keywords)
matScoreR2 <- findScore(review2,keywords)
matScoreR3 <- findScore(review3,keywords)
matScoreR4 <- findScore(review4,keywords)
 

#view score matrices
View(matScoreR1)
View(matScoreR2)
View(matScoreR3)
View(matScoreR4)

findRel <- function(reviewScore) {
  totalScore <- 0
  keyLength <- length(keywords)
  for(i in 1:keyLength) {
    totalScore <- as.numeric(reviewScore[i,2])+totalScore
  }
  return(totalScore)
}
#################################################################################
#find irrelevant reviews
#################################################################################

totalScoreR1 <- findRel(matScoreR1)
totalScoreR2 <- findRel(matScoreR2)
totalScoreR3 <- findRel(matScoreR3)
totalScoreR4 <- findRel(matScoreR4)


#function to find if the review is relevant or not
findIfRel <- function (score,name) {
  if(as.numeric(score) == 0)
    cat(name," is irrelevant")
  else
    cat("Number of keywords found in ",name," :", score)
}


#check review relevance of each review
findIfRel(totalScoreR1,"review 1")
findIfRel(totalScoreR2,"review 2")
findIfRel(totalScoreR3,"review 3")
findIfRel(totalScoreR4,"review 4")




Just eliminating irrelevant reviews is not enough for any seller. We have to analyze a lot of other aspects. We will discuss them in the next posts.
Find materials related to this post on my Github repo here.
Thanks for visiting my blog. I always love to hear constructive feedback. Please give your feedback in the comment section below or write to me personally here. 

WEEK_7: Spam message filtering

Hi there!

Today we will discuss spam message filtering using R.

We have seen the term spam filtering in many places. Whether it is email, SMS, or any other communication medium, spammers will try to get your attention.
We need to filter those spam messages. There are many algorithms to do that.
I use the Naive Bayes method in this example. If you are not familiar with Naive Bayes, you can learn more here.
In the Naive Bayes method, the system learns from experience. Initially we need to teach the system how to categorize spam and ham by providing it some sample spam and ham messages.

Logic used:
Step 1:
We need some sample spam and ham messages. Load them into separate variables first.
Then we need a keyword list. This is the list of keywords which we might encounter in spam and ham messages.
Ex: money, account password, urgent etc.
Step 2:
Build a matrix which stores the keywords in one dimension and the number of times they appear in the spam and ham messages in the other.
Step 3:
Now load the new message which is yet to be classified.
Calculate how many times each keyword is repeated in it.
Step 4:
Use the matrix calculated above as a reference and apply the Naive Bayes formula to find out whether the new message is spam or ham.
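For reference, the textbook Naive Bayes rule scores a message by multiplying the per-keyword probabilities (the script below actually uses a simpler frequency-difference heuristic inspired by this idea):

P(spam | message) ∝ P(spam) * P(w1 | spam) * P(w2 | spam) * ... * P(wn | spam)

where P(wi | spam) is the probability of keyword wi appearing in a spam message, and similarly for ham; the message is labeled with whichever class gets the higher score.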

Here is the R script:

#clean up the workspace
rm(list = setdiff(ls(), lsf.str()))
library(stringr) #load required library, stringr is used in string comparison
#################################################################################
#read the sample spam,ham and keywords list
#################################################################################
ham = scan('ham.txt',
           what='character', comment.char=';',sep = "\n")
spam = scan('spam.txt',
           what='character', comment.char=';',sep = "\n")
keywords = scan('KeyWords.txt',
            what='character', comment.char=';',sep = "\n")

#################################################################################
#Calculate spam matrix
#################################################################################
keyLength <- length(keywords)
matSpam <- c()   #start empty so row i corresponds to keyword i
matHam <- c()
for(i in 1:keyLength) {
  tDF <- c(keywords[i],sum(str_count(spam,keywords[i])))
  matSpam <- rbind(matSpam,tDF)
  tDF <- c(keywords[i],sum(str_count(ham,keywords[i])))
  matHam <- rbind(matHam,tDF)
}
#################################################################################
#read the data to be evaluated; message.txt contains all messages to be evaluated, each delimited by a new line.
#################################################################################
message = scan('message.txt',
                what='character', comment.char=';',sep = "\n")
#################################################################################
#categorize the message as spam or ham
#################################################################################
 

#score it and build matrix
keyLength <- length(keywords)
matScore <- c()   #start empty so row i corresponds to keyword i
for(i in 1:keyLength) {
  tDF <- c(keywords[i],sum(str_count(message,keywords[i])))
  matScore <- rbind(matScore,tDF)
}

#apply formula and find if it is spam
lengthSpam <- length(spam)
lengthHam <- length(ham)
totalScore <- 0
for (i in 1:keyLength) {
    totalScore <- totalScore+as.numeric(matScore[i,2])*((as.numeric(matSpam[i,2])/lengthSpam)-(as.numeric(matHam[i,2])/lengthHam))
}
if(totalScore<0)
    totalScore <- totalScore*(-1)
totalScore <- totalScore*100
print("Percentage of being spam")
100-totalScore



Find materials for this post on my Github.
We will discuss review analysis in the next post.

Thanks for visiting my blog. I always love to hear constructive feedback. Please give your feedback in the comment section below or write to me personally here.