Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing people regularly quantify is how much of a particular activity they do, but they rarely quantify how well they do it.
In this project, I will use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, as illustrated in the image below.
source: Weight Lifting Exercises Dataset
They were asked to perform barbell lifts correctly and incorrectly in 5 different ways: Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes (throwing the elbows to the front, lifting the dumbbell only halfway, lowering the dumbbell only halfway, and throwing the hips to the front). I will design a model that uses the data collected from these accelerometers to predict which class a participant was performing.
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har. Before fitting a model, some data preparation is performed.
I created a data folder under the working directory and downloaded the dataset into it. A few simple exploratory steps are performed first.
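For reproducibility, here is a minimal sketch of that one-time setup; the download URLs are an assumption (the copies commonly used for this dataset) and may need adjusting if they have moved:
if (!dir.exists("data")) dir.create("data") # create the data folder if missing
trainURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists("data/pml-training.csv")) download.file(trainURL, "data/pml-training.csv")
if (!file.exists("data/pml-testing.csv")) download.file(testURL, "data/pml-testing.csv")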
traindata <- read.csv("data/pml-training.csv")
str(traindata)
From the data exploration, we learn that there are 19622 observations and 160 variables. However, we notice that some values are "NA", blank, or "#DIV/0!". Since we do not have much information on how to handle these values, we will treat them all as missing data and re-import the data with the na.strings parameter. Also, the first seven variables (X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, new_window, and num_window) do not seem useful for the analysis, so I decided to leave them out.
traindata <- read.csv("data/pml-training.csv",na.strings = c("NA", "", "#DIV/0!"))
traindata <- traindata[,-c(1:7)]
str(traindata)
Since we do not want to deal with variables that contain missing data, we will fit our model using only the variables with no missing values.
traindata <- read.csv("data/pml-training.csv",na.strings = c("NA", "", "#DIV/0!"))
traindata <- traindata[,-c(1:7)]
myNAColTrain <- function(i) { sum(is.na(traindata[, i])) } # count missing values in column i
usableColsTrain <- sapply(1:dim(traindata)[2], myNAColTrain) == 0 # keep only variables with no missing values
traindata <- traindata[,usableColsTrain]
str(traindata)
## 'data.frame': 19622 obs. of 53 variables:
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ gyros_arm_x : num 0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
## $ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
## $ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
## $ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
## $ magnet_arm_x : int -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
## $ magnet_arm_y : int 337 337 344 344 337 342 336 338 341 334 ...
## $ magnet_arm_z : int 516 513 513 512 506 513 509 510 518 516 ...
## $ roll_dumbbell : num 13.1 13.1 12.9 13.4 13.4 ...
## $ pitch_dumbbell : num -70.5 -70.6 -70.3 -70.4 -70.4 ...
## $ yaw_dumbbell : num -84.9 -84.7 -85.1 -84.9 -84.9 ...
## $ total_accel_dumbbell: int 37 37 37 37 37 37 37 37 37 37 ...
## $ gyros_dumbbell_x : num 0 0 0 0 0 0 0 0 0 0 ...
## $ gyros_dumbbell_y : num -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
## $ gyros_dumbbell_z : num 0 0 0 -0.02 0 0 0 0 0 0 ...
## $ accel_dumbbell_x : int -234 -233 -232 -232 -233 -234 -232 -234 -232 -235 ...
## $ accel_dumbbell_y : int 47 47 46 48 48 48 47 46 47 48 ...
## $ accel_dumbbell_z : int -271 -269 -270 -269 -270 -269 -270 -272 -269 -270 ...
## $ magnet_dumbbell_x : int -559 -555 -561 -552 -554 -558 -551 -555 -549 -558 ...
## $ magnet_dumbbell_y : int 293 296 298 303 292 294 295 300 292 291 ...
## $ magnet_dumbbell_z : num -65 -64 -63 -60 -68 -66 -70 -74 -65 -69 ...
## $ roll_forearm : num 28.4 28.3 28.3 28.1 28 27.9 27.9 27.8 27.7 27.7 ...
## $ pitch_forearm : num -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.8 -63.8 -63.8 ...
## $ yaw_forearm : num -153 -153 -152 -152 -152 -152 -152 -152 -152 -152 ...
## $ total_accel_forearm : int 36 36 36 36 36 36 36 36 36 36 ...
## $ gyros_forearm_x : num 0.03 0.02 0.03 0.02 0.02 0.02 0.02 0.02 0.03 0.02 ...
## $ gyros_forearm_y : num 0 0 -0.02 -0.02 0 -0.02 0 -0.02 0 0 ...
## $ gyros_forearm_z : num -0.02 -0.02 0 0 -0.02 -0.03 -0.02 0 -0.02 -0.02 ...
## $ accel_forearm_x : int 192 192 196 189 189 193 195 193 193 190 ...
## $ accel_forearm_y : int 203 203 204 206 206 203 205 205 204 205 ...
## $ accel_forearm_z : int -215 -216 -213 -214 -214 -215 -215 -213 -214 -215 ...
## $ magnet_forearm_x : int -17 -18 -18 -16 -17 -9 -18 -9 -16 -22 ...
## $ magnet_forearm_y : num 654 661 658 658 655 660 659 660 653 656 ...
## $ magnet_forearm_z : num 476 473 469 469 473 478 470 474 476 473 ...
## $ classe : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
After this data cleaning process, we are left with 53 variables with no missing data, including our outcome variable, classe.
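For reference, the same column filtering can be written more compactly with colSums; this sketch is equivalent to the helper-function approach used above:
traindata <- traindata[, colSums(is.na(traindata)) == 0] # keep only columns with no missing values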
We apply the same preparation to the test dataset:
testdata <- read.csv("data/pml-testing.csv",na.strings = c("NA", "", "#DIV/0!"))
testdata <- testdata[,-c(1:7)]
myNAColTest <- function(i) { sum(is.na(testdata[, i])) } # count missing values in column i
usableColsTest <- sapply(1:dim(testdata)[2], myNAColTest) == 0 # keep only variables with no missing values
testdata <- testdata[,usableColsTest]
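As a quick sanity check (a small sketch, not part of the original pipeline), we can confirm that the cleaned training and test sets keep the same predictor columns; they should differ only in their last column (classe vs. problem_id):
setdiff(names(traindata), names(testdata)) # expected: "classe"
setdiff(names(testdata), names(traindata)) # expected: "problem_id"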
In order to compare prediction accuracy, I will try 2 different predictive algorithms, Recursive Partitioning and Regression Trees (rpart) and random forest, and compare the accuracy of their outcomes.
In order to perform cross-validation, the data is first split (70% training, 30% validation) using the createDataPartition function. I will use the k-fold cross-validation resampling method via trainControl; the default is 10 folds, but since the random forest model takes too long to train with 10 folds, I set 5 folds here to reduce the processing time (on my laptop the model takes about 9 minutes).
set.seed(130308)
library(caret);library(rattle)
inTrain <- createDataPartition(traindata$classe, p = 0.7, list = FALSE)
trainPart <- traindata[inTrain,]
validPart <- traindata[-inTrain,]
kfoldCV <- trainControl(method = "cv",number = 5) # use 5-fold cross validation change to smaller if run too long
modfit_rpart <- train(classe ~ ., data = trainPart, method = "rpart" , trControl = kfoldCV )
print(modfit_rpart)
## CART
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10989, 10991, 10990, 10989, 10989
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa Accuracy SD Kappa SD
## 0.03651714 0.4998186 0.34672620 0.01665164 0.02212731
## 0.06004815 0.4150185 0.20754391 0.06560999 0.11144164
## 0.11565456 0.3326794 0.07361485 0.04404796 0.06721649
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03651714.
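Since the rattle package is loaded above, the final tree can also be visualized; this is an optional sketch and the resulting plot is not reproduced here.
fancyRpartPlot(modfit_rpart$finalModel) # plot the final CART tree selected by caret (cp = 0.0365)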
Now I will use the trained model to predict on the validation data.
valid_rpart <- predict(modfit_rpart, newdata = validPart)
# get the confusionMatrix Results for rpart classification
(cmOutcome <-confusionMatrix(validPart$classe,valid_rpart))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1524 21 124 0 5
## B 502 369 268 0 0
## C 461 36 529 0 0
## D 443 172 349 0 0
## E 158 131 308 0 485
##
## Overall Statistics
##
## Accuracy : 0.494
## 95% CI : (0.4811, 0.5068)
## No Information Rate : 0.5247
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3383
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.4935 0.5062 0.33523 NA 0.98980
## Specificity 0.9464 0.8507 0.88461 0.8362 0.88934
## Pos Pred Value 0.9104 0.3240 0.51559 NA 0.44824
## Neg Pred Value 0.6286 0.9241 0.78411 NA 0.99896
## Prevalence 0.5247 0.1239 0.26814 0.0000 0.08326
## Detection Rate 0.2590 0.0627 0.08989 0.0000 0.08241
## Detection Prevalence 0.2845 0.1935 0.17434 0.1638 0.18386
## Balanced Accuracy 0.7199 0.6784 0.60992 NA 0.93957
Based on the outcome of the confusionMatrix, the rpart model accuracy rate is 0.4939677, so the estimated out-of-sample error rate for this classification model is 0.5060323. This is far below my expectation, so I will try a second prediction algorithm in the hope of better performance.
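Rather than retyping the numbers, the accuracy and the implied error rate can be read directly from the confusionMatrix object; a quick sketch:
acc_rpart <- cmOutcome$overall["Accuracy"] # validation accuracy of the rpart model
unname(1 - acc_rpart) # estimated out-of-sample error, approximately 0.506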
# ct <- proc.time() # get start time
modfit_rf <- train(classe ~ ., data = trainPart, method = "rf" , trControl = kfoldCV )
# proc.time() - ct # check the time elapsed for the rf model
# print(modfit_rf)
valid_rf <- predict(modfit_rf, newdata = validPart)
# get the confusionMatrix results for the random forest classification
(cmOutcome <-confusionMatrix(validPart$classe,valid_rf))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1671 1 0 0 2
## B 11 1126 2 0 0
## C 0 12 1014 0 0
## D 0 0 19 944 1
## E 0 0 1 3 1078
##
## Overall Statistics
##
## Accuracy : 0.9912
## 95% CI : (0.9884, 0.9934)
## No Information Rate : 0.2858
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9888
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9935 0.9886 0.9788 0.9968 0.9972
## Specificity 0.9993 0.9973 0.9975 0.9959 0.9992
## Pos Pred Value 0.9982 0.9886 0.9883 0.9793 0.9963
## Neg Pred Value 0.9974 0.9973 0.9955 0.9994 0.9994
## Prevalence 0.2858 0.1935 0.1760 0.1609 0.1837
## Detection Rate 0.2839 0.1913 0.1723 0.1604 0.1832
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 0.9964 0.9929 0.9881 0.9964 0.9982
Based on the outcome of the confusionMatrix, the random forest model accuracy rate is 0.991164, so the estimated out-of-sample error rate for this model is 0.008836.
This is a very impressive accuracy rate. Compared with the normalized confusion matrix chart published online for the Weight Lifting Exercises dataset, the confusion matrix produced by this random forest model is clearly better.
source: Weight Lifting Exercises Confusion Matrix
cmOutcome <- confusionMatrix(validPart$classe,valid_rf,dnn=c("RF Prediction","Actual Class"))
# transpose and normalize the columns and rows to make the matrix comparable with the chart above
t(as.matrix(round(cmOutcome$table / colSums(cmOutcome$table),2)))
## RF Prediction
## Actual Class A B C D E
## A 0.99 0.01 0.00 0.00 0.00
## B 0.00 0.99 0.01 0.00 0.00
## C 0.00 0.00 0.98 0.02 0.00
## D 0.00 0.00 0.00 1.00 0.00
## E 0.00 0.00 0.00 0.00 1.00
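Before moving to the test set, it can also be informative to check which sensor measurements the random forest relies on most; caret's varImp function gives a quick view (an optional sketch, output omitted):
varImp(modfit_rf) # rank predictors by importance in the random forest model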
Since the random forest model achieves a much better accuracy rate, I will use it to predict on the test dataset.
test_rf <- predict(modfit_rf, newdata = testdata)
test_rf
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
table(test_rf)
## test_rf
## A B C D E
## 7 8 1 1 3
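If the 20 predictions need to be saved individually (e.g., for submission), one hypothetical approach is to write each one to its own text file; the file-naming scheme below is an assumption, not part of the original analysis:
for (i in seq_along(test_rf)) {
  writeLines(as.character(test_rf[i]), paste0("problem_", i, ".txt")) # hypothetical file names
}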