Wednesday, April 27, 2016

WEEK_12: Logistic regression analysis using R

Hi there!

We discussed linear and multiple linear regression in the previous posts. Today we will discuss logistic regression.

Logistic regression is a regression model in which the response variable (dependent variable) takes categorical values such as True/False or 0/1. It models the probability of a binary response through a mathematical equation that relates it to the predictor variables.

The general mathematical equation for logistic regression is:
 
y = 1/(1+e^-(a+b1x1+b2x2+b3x3+...))

y is the response variable.
x1, x2, x3, ... are the predictor variables.
a and b1, b2, b3, ... are the coefficients, which are numeric constants.
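
To see that this equation always produces a value between 0 and 1 (a probability), here is a small sketch; the coefficient values below are made up purely for illustration:

a  <- -3                            #hypothetical intercept
b1 <- 0.05                          #hypothetical coefficient for a single predictor x1
x1 <- seq(0, 120, by = 20)
y  <- 1/(1 + exp(-(a + b1*x1)))     #logistic function
print(round(y, 3))                  #all values lie between 0 and 1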

We use the glm() function to create the regression model and get its summary for analysis.

Here is the syntax of the glm() function.
 
glm(formula,data,family)
   
Let us consider an example. Here, we will use R's built-in data set called mtcars.

#select the columns of interest from mtcars
input <- mtcars[,c("am","cyl","hp","wt")]
#fit the logistic regression: transmission type (am) modeled by cyl, hp and wt
am.data <- glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)
print(summary(am.data))
 
Output:
 
Call:
glm(formula = am ~ cyl + hp + wt, family = binomial, data = input)

Deviance Residuals: 
     Min        1Q      Median        3Q       Max  
-2.17272     -0.14907  -0.01464     0.14116   1.27641  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept) 19.70288    8.11637   2.428   0.0152 *
cyl          0.48760    1.07162   0.455   0.6491  
hp           0.03259    0.01886   1.728   0.0840 .
wt          -9.14947    4.15332  -2.203   0.0276 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.2297  on 31  degrees of freedom
Residual deviance:  9.8415  on 28  degrees of freedom
AIC: 17.841

Number of Fisher Scoring iterations: 8
 
Let us analyze the output:   
  • The Call section reminds us of the formula we used. 
  • Deviance residuals are a measure of model fit. This part of the output shows the distribution of the deviance residuals for the individual cases used in the model.
  • The next part of the output shows the coefficients, their standard errors, the z-statistic (sometimes called a Wald z-statistic), and the associated p-values.
  • The Akaike information criterion (AIC) is a measure of the relative quality of a statistical model for a given set of data. As such, AIC provides a means for model selection.
    AIC deals with the trade-off between the goodness of fit of the model and the complexity of the model. It is founded on information theory: it offers a relative estimate of the information lost when a given model is used to represent the process that generates the data.
  • Null deviance shows how well the response variable is predicted by a model that includes only the intercept. 
  • Residual deviance shows how well the response variable is predicted by the model after the independent variables are added; a large drop from the null deviance indicates that the predictors improve the fit.
  • Fisher scoring iterations have to do with how the model was estimated. A linear model can be fit by solving closed-form equations. Unfortunately, that cannot be done with logistic regression. Instead, an iterative approach (iteratively reweighted least squares, which for logistic regression is equivalent to the Newton-Raphson algorithm) is used. Loosely, the model is fit based on a guess about what the estimates might be. The algorithm then looks around to see if the fit would be improved by using different estimates instead. If so, it moves in that direction (say, using a higher value for the estimate) and then fits the model again. The algorithm stops when it doesn't perceive that moving again would yield much additional improvement. This line tells you how many iterations there were before the process stopped and reported the results.  
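
Once the model is fitted, it can be used to obtain the predicted probability of a manual transmission (am = 1) for each car. Here is a minimal sketch; predict() with type = "response" returns probabilities between 0 and 1:

#predicted probability of a manual transmission for each car
probabilities <- predict(am.data, type = "response")
print(head(round(probabilities, 3)))

#classify as manual (1) when the predicted probability exceeds 0.5
predicted.am <- ifelse(probabilities > 0.5, 1, 0)
table(Actual = input$am, Predicted = predicted.am)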
Thanks for visiting my blog.  I always love to hear constructive feedback.  Please give your feedback in the comment section below or write to me personally here.

Monday, April 25, 2016

WEEK_11: Multiple Regression using R

Hi there!

Today we will discuss Multiple Regression using R. In the previous post, we discussed Linear Regression using R. You need to know about Linear Regression to understand Multiple Regression better. If you missed my previous post, find it here.

Multiple regression is an extension of linear regression to relationships involving more than two variables. In a simple linear relation we have one predictor and one response variable, but in multiple regression we have more than one predictor variable and one response variable.

The general mathematical equation for multiple regression is:

y = a + b1x1 + b2x2 +...bnxn
 
y is the response variable.
a, b1, b2, ... bn are the coefficients.
x1, x2, ... xn are the predictor variables.

We use the same lm() function (which we used for Linear Regression) to create the regression model; here we simply call it with more predictor variables in the formula.
The basic syntax is:
 
lm(y ~ x1+x2+x3..., data)
 
Let us consider an example. Here we will analyze a data set that contains information used to estimate undergraduate enrollment at the University of New Mexico. Download the data set here.

#read data into variable
datavar <- read.csv("dataset_enrollmentForecast.csv")
 
#attach data variable
attach(datavar)
 
#predict the fall enrollment (ROLL) using the unemployment rate (UNEM) and number
#of spring high school graduates (HGRAD)
twoPredictorModel <- lm(ROLL ~ UNEM + HGRAD, datavar)

#display model
twoPredictorModel


#Output of twoPredictorModel:

From this output, we can determine that the intercept is -8255.8, the coefficient for the unemployment rate is 698.2, and the coefficient for number of spring high school graduates is 0.9. Therefore, the complete regression equation is Fall Enrollment = -8255.8 + 698.2 * Unemployment Rate + 0.9 * Number of Spring High School Graduates. This equation tells us that the predicted fall enrollment for the University of New Mexico will increase by 698.2 students for every one percent increase in the unemployment rate and 0.9 students for every one high school graduate.

#predict the fall enrollment (ROLL) using the unemployment rate (UNEM), number of
#spring high school graduates (HGRAD), and per capita income (INC)
threePredictorModel <- lm(ROLL ~ UNEM + HGRAD + INC, datavar)

#display model
threePredictorModel



#Output of threePredictorModel:

From this output, we can determine that the intercept is -9153.3, the coefficient for the unemployment rate is 450.1, the coefficient for number of spring high school graduates is 0.4, and the coefficient for per capita income is 4.3. Therefore, the complete regression equation is Fall Enrollment = -9153.3 + 450.1 * Unemployment Rate + 0.4 * Number of Spring High School Graduates + 4.3 * Per Capita Income. This equation tells us that the predicted fall enrollment for the University of New Mexico will increase by 450.1 students for every one percent increase in the unemployment rate, 0.4 students for every one high school graduate, and 4.3 students for every one dollar of per capita income.
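
We can also use the fitted model to forecast enrollment for new predictor values with the predict() function. Here is a minimal sketch; the unemployment rate, number of graduates, and per capita income below are hypothetical values chosen only for illustration:

#hypothetical new values for the predictors
newdata <- data.frame(UNEM = 7, HGRAD = 25000, INC = 2000)

#forecast fall enrollment for these values
predict(threePredictorModel, newdata)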

#generate model summaries
summary(twoPredictorModel)


Summary of twoPredictorModel


summary(threePredictorModel)

 

Summary of threePredictorModel
 
#The meaning of these output values is the same as in the Linear Regression model. 
#Please refer to my previous post for more info.  
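
As a possible next step, the two models can be compared directly; one common way, sketched below, is an F-test on the nested models using the anova() function, which indicates whether adding per capita income significantly improves the fit:

#compare the nested models
anova(twoPredictorModel, threePredictorModel)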
 
Thanks for visiting my blog.  I always love to hear constructive feedback.  Please give your feedback in the comment section below or write to me personally here.

Friday, April 22, 2016

WEEK_10: Linear Regression

Hi there!

Regression analysis is a very widely used statistical tool to establish a relationship model between two variables. One of these variables is called the predictor variable, whose value is gathered through experiments. The other is called the response variable, whose value is derived from the predictor variable.

Ex: y=mx+a

In the above example, y is the response variable and x is the predictor variable.
m is the slope and a is called the y-intercept; a is the value of y when x = 0.

There are many types of regression models, such as linear regression, multiple linear regression, and logistic regression.

Today we will discuss linear regression.

In linear regression, the response and predictor variables are related through an equation in which the exponent (power) of both variables is 1. Mathematically, a linear relationship represents a straight line when plotted as a graph.

In general, we represent linear regression using the formula y = mx + a.
As an example, let us analyze the relationship between weight and average life span and try to predict life span.

Create data:

#create weight data
x <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

#create average lifespan data
y <- c(70, 67, 65, 62, 61, 68, 79, 80, 70, 62)



# Apply the lm() function.
relation <- lm(y~x)

print(relation)
 
#Output:
 
#Here the intercept is the value of the response variable when the predictor is zero. 
#In other words, it is the y-intercept, the point where the line meets the y axis. 
#The other coefficient is the slope of the line. 
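
#As a quick sketch, the intercept and slope can also be extracted programmatically
#with the coef() function (the values match the printed output above):
coefficients <- coef(relation)
print(coefficients["(Intercept)"])   #intercept a
print(coefficients["x"])             #slope m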
 
#print summary of the relation
print(summary(relation)) 

#Output:

#The difference between the observed value of the dependent variable and the 
#predicted value is called the residual. 
#R-squared = Explained variation / Total variation. 
#To know more about R-squared, follow this link.
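
#As a small sketch of this definition, R-squared can be computed by hand from the
#residuals and compared with the Multiple R-squared value reported by summary():
rss <- sum(residuals(relation)^2)    #unexplained (residual) variation
tss <- sum((y - mean(y))^2)          #total variation
print(1 - rss/tss)                   #should match the R-squared in the summary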
 
#let us predict lifespan of a person who weighs 75kgs
a <- data.frame(x = 75)
result <-  predict(relation,a)
print(result)
#Output:
 
#let us visualize the regression graphically
#open a graphics device so the chart is saved to a file (the file name is arbitrary)
png(file = "weight_lifespan_regression.png")

#plot the chart and add the fitted regression line
plot(x, y, col = "blue", main = "Weight & lifespan Regression",
     cex = 1.3, pch = 16, xlab = "Weight in Kg", ylab = "Age in years")
abline(relation)

#close the device to save the file
dev.off()
 
#Output:
Thanks for visiting my blog. I always love to hear constructive feedback. Please give your feedback in the comment section below or write to me personally here.