Data Science: WEEK_9: Amazon review analysis – Ranking reviews, filtering duplicate and irrelevant reviews

Hi there!

Nowadays e-commerce is very popular and is capturing a large share of the market. One reason customers like e-commerce platforms is the product review feature that sellers offer. Customers can review the products they have purchased, which helps both the seller and other customers.

But because of the competition between sellers and manufacturers, we come across many spam reviews. Paid reviewers post biased reviews, duplicate reviews, or reviews that are irrelevant to the product.

Filtering those reviews by hand is a tedious process for the seller. So let us try to find a solution for this using the R language.

Logic used:

In this example, I have taken reviews of an iPhone from the Amazon website.

Step 1:

We need a list of keywords that we expect to see in the reviews. As we are analyzing reviews of an iPhone, our keyword list will contain words like "camera", "battery", "life", "screen", "heat", etc.
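As a minimal sketch of Step 1, the snippet below writes a small keyword file and reads it back with the same scan() arguments the script uses later. The file contents and the name keyword_file are assumptions for illustration; the actual KeyWords.txt is not shown in this post.

```r
# Hypothetical keyword file for illustration; the post's KeyWords.txt is not shown
keyword_file <- tempfile(fileext = ".txt")
writeLines(c("camera", "battery", "life", "screen", "heat"), keyword_file)

# Same scan() arguments as the script: one keyword per line, ';' starts a comment
keywords <- scan(keyword_file, what = "character",
                 comment.char = ";", sep = "\n", quiet = TRUE)
print(keywords)
```

Keeping the keywords in a plain text file means the list can be changed for another product without touching the script.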

Step 2:

In the second step, we count how many times these keywords occur in each review and build a matrix to store the counts.

If a review doesn’t contain any of the keywords, it is possibly spam and is not useful to the seller or the customer. If a review contains many of the keywords, it should be considered first, because the reviewer might be talking about a serious issue with the product.
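The counting idea in Step 2 can be sketched in base R as follows. The sample reviews, the helper count_keyword, and the choice of case-insensitive literal matching are assumptions for illustration; the script in this post uses str_count() from stringr instead, which is case-sensitive and treats each keyword as a regular expression.

```r
# Hypothetical sample data, not the reviews used in this post
keywords <- c("camera", "battery", "screen")
reviews <- c(
  "Great camera, and the battery lasts all day. Camera is sharp.",
  "Fast delivery, thanks!",
  "The screen cracked within a week."
)

# Count case-insensitive, literal occurrences of one keyword in each text
count_keyword <- function(keyword, texts) {
  lowered <- tolower(texts)
  hits <- gregexpr(keyword, lowered, fixed = TRUE)
  lengths(regmatches(lowered, hits))
}

# sapply() returns reviews as rows, so transpose: rows are keywords, columns are reviews
counts <- t(sapply(keywords, count_keyword, texts = reviews))
colnames(counts) <- paste0("review", seq_along(reviews))
print(counts)
```

A column of zeros, such as the second review here, marks a review that mentions none of the keywords.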

Step 3:

Depending on the number of keywords found, calculate a relevance score for each review and sort the reviews by that score.
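Here is a small sketch of Step 3, assuming the relevance score is simply the total number of keyword hits per review (which is what the script below computes). The counts matrix is made-up data for illustration.

```r
# Hypothetical keyword-by-review count matrix (filled column by column)
counts <- matrix(
  c(2, 1, 0,   # review1: camera, battery, screen
    0, 0, 0,   # review2
    0, 0, 1),  # review3
  nrow = 3,
  dimnames = list(c("camera", "battery", "screen"),
                  c("review1", "review2", "review3"))
)

# Relevance score: total keyword hits in each review
relevance <- colSums(counts)

# Review indices from most to least relevant
ranking <- order(relevance, decreasing = TRUE)

# Reviews with a score of zero are flagged as irrelevant
irrelevant <- names(relevance)[relevance == 0]

print(relevance[ranking])
print(irrelevant)
```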

Step 4:

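In the final step we need to find duplicate reviews so that they can be eliminated. I compare every pair of reviews using two nested loops (similar in shape to the passes of a selection sort) and report the pairs that are identical.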
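As an alternative sketch for Step 4, base R's duplicated() can flag repeated reviews without writing the loops by hand. The sample reviews and the normalization (lower-casing and trimming whitespace) are assumptions for illustration; the script in this post only treats exactly identical strings as duplicates.

```r
# Hypothetical sample reviews for illustration
reviews <- c("Great phone", "Battery drains fast", "great phone ", "Great phone")

# Assumption: reviews that differ only in case or surrounding spaces count as duplicates
normalized <- tolower(trimws(reviews))

# TRUE for every review that repeats an earlier one
is_dup <- duplicated(normalized)

# Keep only the first occurrence of each review
unique_reviews <- reviews[!is_dup]
print(unique_reviews)
```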

Now let us get into coding.

Here is the R script.
(You can find materials and scripts used in this post on my GitHub repo.)

#clean up the workspace (removes all objects except functions)

rm(list = setdiff(ls(), lsf.str()))
library(stringr)
#################################################################################
#read important keywords
#################################################################################
keywords = scan('KeyWords.txt',
                what='character', comment.char=';',sep = "\n")
#################################################################################
#read the data to be evaluated; review.txt contains 11 reviews, one per line
#################################################################################
reviews <- scan('review.txt',
                what='character', comment.char=';',sep = "\n")
#################################################################################
#score it and compare
#################################################################################
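findScore <- function(review,k) {
  #returns a two-column character matrix: keyword, number of occurrences in the review
  #note: str_count() treats each keyword as a regular expression and is case-sensitive
  matScore <- c()
  for(i in seq_along(k)) {
    tDF <- c(k[i],sum(str_count(review,k[i])))
    matScore <- rbind(matScore,tDF)
  }
  return(matScore)
}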


reviewLength <- length(reviews)
#first column holds the keywords; each following column holds the counts for one review
score <- keywords
for (i in seq_len(reviewLength)) {
  tScore <- findScore(reviews[i],keywords)
  score <- cbind(score,tScore[,2])
}

View(score)
#################################################################################
#function to calculate relevance of reviews
#################################################################################

findRel <- function(reviewScore) {
  #total number of keyword hits in one review (counts are stored as strings in score)
  return(sum(as.numeric(reviewScore)))
}

#find irrelevant reviews
totalScoreR <-c()
for (i in 2:dim(score)[2]) {
  totalScoreR[i-1] <- findRel(score[,i])
}

findIfRel <- function (totalScoreR) {
  for (i in 1:length(totalScoreR)) {
    if(as.numeric(totalScoreR[i])==0)
      cat("Review:",i,"is irrelevant\n")
    else
      cat("Number of keywords found in review:",i,"is ", totalScoreR[i],"\n")
  }
}

#call the above function to find relevance of reviews
findIfRel(totalScoreR)

#################################################################################
#sort reviews according to their importance
#################################################################################

#high value review
#pair each total score with the index of its review, then sort by score
highValueR <- data.frame(totalScoreR = totalScoreR,
                         reviewIndex = seq_along(totalScoreR))
highValueR <- highValueR[order(highValueR$totalScoreR, decreasing = TRUE),]

print("Reviews in decreasing order of importance:")
for (i in seq_len(nrow(highValueR))) {
  cat("Rank ",i,":\n")
  cat(reviews[highValueR$reviewIndex[i]])
  cat("\n##########################################\n")
}
  
#################################################################################
#find similar reviews
#################################################################################
#compare every pair of reviews once and report exact duplicates
for (i in seq_along(reviews)) {
  for (j in i:length(reviews))  {
    if(i != j && identical(reviews[i],reviews[j]))
      cat("\nReview ",i," and review ",j," are the same\n")
  }
}

Thanks for visiting my blog. I always love to hear constructive feedback. Please give your feedback in the comment section below or write to me personally here.

Wednesday, March 30, 2016
