Introduction

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it.

In this project, I will use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, as illustrated in the image below.

source: Weight Lifting Exercises Dataset

They were asked to perform barbell lifts correctly and incorrectly in 5 different ways.

Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes: throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E). I will design a model that uses the data collected from these accelerometers to predict which class a participant was performing.

Data Source and Data Preparation

The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har. Before fitting a model, some data preparation is performed.

I created a data folder under the working directory and downloaded the dataset into this folder. A few simple exploratory checks are performed first.

traindata <- read.csv("data/pml-training.csv")
str(traindata)

From this exploration, we learn that there are 19622 observations and 160 variables. However, some values are “NA”, blank, or “#DIV/0!”. Since we have little information on how to handle these values, we will treat them all as missing data and reimport the file with the na.strings parameter. Also, the first few variables (X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, new_window, and num_window) do not seem useful for the analysis, so I decided to leave them out.

traindata <- read.csv("data/pml-training.csv",na.strings = c("NA", "", "#DIV/0!"))
traindata <- traindata[,-c(1:7)]
str(traindata)

Since we do not want to deal with variables containing missing data, we will fit the model using only the variables with no missing values.

usableColsTrain <- colSums(is.na(traindata)) == 0 # keep only variables with no missing values
traindata <- traindata[,usableColsTrain]
str(traindata)
## 'data.frame':    19622 obs. of  53 variables:
##  $ roll_belt           : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt          : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt            : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt    : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ gyros_belt_x        : num  0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
##  $ gyros_belt_y        : num  0 0 0 0 0.02 0 0 0 0 0 ...
##  $ gyros_belt_z        : num  -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
##  $ accel_belt_x        : int  -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
##  $ accel_belt_y        : int  4 4 5 3 2 4 3 4 2 4 ...
##  $ accel_belt_z        : int  22 22 23 21 24 21 21 21 24 22 ...
##  $ magnet_belt_x       : int  -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
##  $ magnet_belt_y       : int  599 608 600 604 600 603 599 603 602 609 ...
##  $ magnet_belt_z       : int  -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
##  $ roll_arm            : num  -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
##  $ pitch_arm           : num  22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
##  $ yaw_arm             : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm     : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ gyros_arm_x         : num  0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
##  $ gyros_arm_y         : num  0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
##  $ gyros_arm_z         : num  -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
##  $ accel_arm_x         : int  -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
##  $ accel_arm_y         : int  109 110 110 111 111 111 111 111 109 110 ...
##  $ accel_arm_z         : int  -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
##  $ magnet_arm_x        : int  -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
##  $ magnet_arm_y        : int  337 337 344 344 337 342 336 338 341 334 ...
##  $ magnet_arm_z        : int  516 513 513 512 506 513 509 510 518 516 ...
##  $ roll_dumbbell       : num  13.1 13.1 12.9 13.4 13.4 ...
##  $ pitch_dumbbell      : num  -70.5 -70.6 -70.3 -70.4 -70.4 ...
##  $ yaw_dumbbell        : num  -84.9 -84.7 -85.1 -84.9 -84.9 ...
##  $ total_accel_dumbbell: int  37 37 37 37 37 37 37 37 37 37 ...
##  $ gyros_dumbbell_x    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ gyros_dumbbell_y    : num  -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
##  $ gyros_dumbbell_z    : num  0 0 0 -0.02 0 0 0 0 0 0 ...
##  $ accel_dumbbell_x    : int  -234 -233 -232 -232 -233 -234 -232 -234 -232 -235 ...
##  $ accel_dumbbell_y    : int  47 47 46 48 48 48 47 46 47 48 ...
##  $ accel_dumbbell_z    : int  -271 -269 -270 -269 -270 -269 -270 -272 -269 -270 ...
##  $ magnet_dumbbell_x   : int  -559 -555 -561 -552 -554 -558 -551 -555 -549 -558 ...
##  $ magnet_dumbbell_y   : int  293 296 298 303 292 294 295 300 292 291 ...
##  $ magnet_dumbbell_z   : num  -65 -64 -63 -60 -68 -66 -70 -74 -65 -69 ...
##  $ roll_forearm        : num  28.4 28.3 28.3 28.1 28 27.9 27.9 27.8 27.7 27.7 ...
##  $ pitch_forearm       : num  -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.8 -63.8 -63.8 ...
##  $ yaw_forearm         : num  -153 -153 -152 -152 -152 -152 -152 -152 -152 -152 ...
##  $ total_accel_forearm : int  36 36 36 36 36 36 36 36 36 36 ...
##  $ gyros_forearm_x     : num  0.03 0.02 0.03 0.02 0.02 0.02 0.02 0.02 0.03 0.02 ...
##  $ gyros_forearm_y     : num  0 0 -0.02 -0.02 0 -0.02 0 -0.02 0 0 ...
##  $ gyros_forearm_z     : num  -0.02 -0.02 0 0 -0.02 -0.03 -0.02 0 -0.02 -0.02 ...
##  $ accel_forearm_x     : int  192 192 196 189 189 193 195 193 193 190 ...
##  $ accel_forearm_y     : int  203 203 204 206 206 203 205 205 204 205 ...
##  $ accel_forearm_z     : int  -215 -216 -213 -214 -214 -215 -215 -213 -214 -215 ...
##  $ magnet_forearm_x    : int  -17 -18 -18 -16 -17 -9 -18 -9 -16 -22 ...
##  $ magnet_forearm_y    : num  654 661 658 658 655 660 659 660 653 656 ...
##  $ magnet_forearm_z    : num  476 473 469 469 473 478 470 474 476 473 ...
##  $ classe              : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...

After this data cleaning process, we are left with 53 variables with no missing data, including our outcome variable: classe.
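As a quick sanity check on that claim (a sketch, assuming traindata has been prepared as in the chunk above), we can confirm the dimensions and that no missing values remain:

```r
# verify the cleaned training data: 19622 observations, 53 variables, no NAs
dim(traindata)    # 19622 53
anyNA(traindata)  # FALSE
```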

We prepare the test dataset the same way:

testdata <- read.csv("data/pml-testing.csv",na.strings = c("NA", "", "#DIV/0!"))
testdata <- testdata[,-c(1:7)]
usableColsTest <- colSums(is.na(testdata)) == 0 # keep only variables with no missing values
testdata <- testdata[,usableColsTest]
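Because the columns to keep are derived independently for each file, it is worth confirming that the two cleaned datasets end up with the same predictors. A quick check, assuming traindata and testdata as prepared above (the test file has a problem_id column in place of classe):

```r
# columns present in one cleaned dataset but not the other
setdiff(names(traindata), names(testdata))  # expect "classe"
setdiff(names(testdata), names(traindata))  # expect "problem_id"
```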

Constructing the Prediction Model

To compare prediction accuracy, I will fit 2 different predictive algorithms, Recursive Partitioning and Regression Trees (rpart) and random forest, and compare the accuracy of their outcomes.

Data Partition

To perform cross-validation, the data are split using the createDataPartition function. I will use the k-fold cross-validation trainControl resampling method, whose default is 10 folds. (Since the random forest model takes too long to train with 10 folds, I set 5 folds here to reduce the processing time, which on my laptop is about 9 minutes.)

set.seed(130308)
library(caret);library(rattle)
inTrain <- createDataPartition(traindata$classe, p = 0.7, list = FALSE)
trainPart <- traindata[inTrain,]
validPart <- traindata[-inTrain,]
kfoldCV <- trainControl(method = "cv", number = 5) # 5-fold cross-validation; reduce the number of folds if training takes too long

Use RPART Classification

modfit_rpart <- train(classe ~ ., data = trainPart, method = "rpart" , trControl = kfoldCV )
print(modfit_rpart)
## CART 
## 
## 13737 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 10989, 10991, 10990, 10989, 10989 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa       Accuracy SD  Kappa SD  
##   0.03651714  0.4998186  0.34672620  0.01665164   0.02212731
##   0.06004815  0.4150185  0.20754391  0.06560999   0.11144164
##   0.11565456  0.3326794  0.07361485  0.04404796   0.06721649
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.03651714.

Now I will use the trained model to predict on the validation set.

valid_rpart <- predict(modfit_rpart, newdata = validPart)
# get the confusionMatrix Results for rpart classification
(cmOutcome <-confusionMatrix(validPart$classe,valid_rpart))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1524   21  124    0    5
##          B  502  369  268    0    0
##          C  461   36  529    0    0
##          D  443  172  349    0    0
##          E  158  131  308    0  485
## 
## Overall Statistics
##                                           
##                Accuracy : 0.494           
##                  95% CI : (0.4811, 0.5068)
##     No Information Rate : 0.5247          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3383          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.4935   0.5062  0.33523       NA  0.98980
## Specificity            0.9464   0.8507  0.88461   0.8362  0.88934
## Pos Pred Value         0.9104   0.3240  0.51559       NA  0.44824
## Neg Pred Value         0.6286   0.9241  0.78411       NA  0.99896
## Prevalence             0.5247   0.1239  0.26814   0.0000  0.08326
## Detection Rate         0.2590   0.0627  0.08989   0.0000  0.08241
## Detection Prevalence   0.2845   0.1935  0.17434   0.1638  0.18386
## Balanced Accuracy      0.7199   0.6784  0.60992       NA  0.93957

Based on the outcome of the confusionMatrix, the rpart model's accuracy is 0.4939677, so the estimated out-of-sample error rate for this classification model is 0.5060323. This is far below expectations, so I will turn to a second prediction algorithm in the hope of better performance.
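Rather than copying the accuracy by hand, it can be pulled straight from the confusionMatrix object; a short sketch, assuming the cmOutcome object from the chunk above:

```r
# estimated out-of-sample error = 1 - validation accuracy
oosError <- 1 - cmOutcome$overall["Accuracy"]
unname(oosError)  # approximately 0.506 for the rpart model
```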

Use Random Forest

# ct <- proc.time() # get start time
modfit_rf <- train(classe ~ ., data = trainPart, method = "rf" , trControl = kfoldCV )
# proc.time() - ct # check the time elapsed for the rf model
# print(modfit_rf)
valid_rf <- predict(modfit_rf, newdata = validPart)
# get the confusionMatrix results for random forest classification
(cmOutcome <-confusionMatrix(validPart$classe,valid_rf))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1671    1    0    0    2
##          B   11 1126    2    0    0
##          C    0   12 1014    0    0
##          D    0    0   19  944    1
##          E    0    0    1    3 1078
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9912          
##                  95% CI : (0.9884, 0.9934)
##     No Information Rate : 0.2858          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9888          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9935   0.9886   0.9788   0.9968   0.9972
## Specificity            0.9993   0.9973   0.9975   0.9959   0.9992
## Pos Pred Value         0.9982   0.9886   0.9883   0.9793   0.9963
## Neg Pred Value         0.9974   0.9973   0.9955   0.9994   0.9994
## Prevalence             0.2858   0.1935   0.1760   0.1609   0.1837
## Detection Rate         0.2839   0.1913   0.1723   0.1604   0.1832
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9964   0.9929   0.9881   0.9964   0.9982

Based on the outcome of the confusionMatrix, the random forest model's accuracy is 0.991164, so the estimated out-of-sample error rate for this model is 0.008836.

This is a very impressive accuracy rate. Compared with the normalized confusion matrix chart published online for the Weight Lifting Exercises dataset, the confusion matrix from the random forest model is clearly better.

source: Weight Lifting Exercises Confusion Matrix

cmOutcome <- confusionMatrix(validPart$classe, valid_rf, dnn = c("Actual Class", "RF Prediction"))
# normalize each actual-class row to proportions, to make it comparable with the chart above
# (prop.table with margin = 1 divides each row by its row total; dividing the
# table by colSums would recycle the vector element-wise and mis-normalize)
round(prop.table(cmOutcome$table, 1), 2)
##             RF Prediction
## Actual Class    A    B    C    D    E
##            A 1.00 0.00 0.00 0.00 0.00
##            B 0.01 0.99 0.00 0.00 0.00
##            C 0.00 0.01 0.99 0.00 0.00
##            D 0.00 0.00 0.02 0.98 0.00
##            E 0.00 0.00 0.00 0.00 1.00

Predict Test Data

Since the model based on random forest results in a much better accuracy rate, I will use this model to predict the test dataset.

test_rf <- predict(modfit_rf, newdata = testdata)
test_rf
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
table(test_rf)
## test_rf
## A B C D E 
## 7 8 1 1 3