A dataset of sensor readings collected from six male participants, aged 20-28 years, performing unilateral dumbbell biceps curls in five different ways was used to fit a model that classifies which of the five forms was being performed. The dataset was cleaned, split into training and testing sets, and pre-processed with principal component analysis before a boosted trees model was fitted with the gbm method in caret. A 5-fold cross-validation scheme was used to gauge the performance of the model. The final model was estimated to have an accuracy of about 75%.
The dataset consisted of 19622 observations of 160 variables. An inspection of the data showed that around six of these variables were identification data, one tracked time, and nine were filled with #DIV/0! errors. These variables were discarded. The dataset also contained many NA values, probably arising from sensors that were not activated during a given movement. These NA values were imputed with zeroes to prevent errors in the training functions.
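A minimal sketch of this cleaning step is shown below. The file name, the exact column-name patterns used to identify the identification and time-tracking variables, and the heuristic of dropping columns that are entirely NA after reading the #DIV/0! cells are all assumptions for illustration, not the report's exact code.

```r
# Read the raw data; the file name is an assumption, and "#DIV/0!" and blank cells are read as NA
raw <- read.csv("pml-training.csv", na.strings = c("NA", "#DIV/0!", ""))

# Drop identification and time-tracking columns (exact column names assumed)
id_cols <- grepl("^X$|user_name|timestamp|window", names(raw))
clean <- raw[, !id_cols]

# Drop columns that were filled with #DIV/0! errors (entirely NA after reading)
all_na <- sapply(clean, function(col) all(is.na(col)))
clean <- clean[, !all_na]

# Impute the remaining NA values with zeroes so the training functions do not fail
clean[is.na(clean)] <- 0
```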
The dataset was split 70/30 into training and testing sets. Given the large number of features and limited computational power, principal component analysis was used to reduce the dimensionality of the data. The pre-processing was set to capture 80% of the variance, reducing the number of features from 144 to 30.
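The split and a stand-alone check of the PCA pre-processing can be reproduced roughly as follows, assuming the cleaned data frame from above with a `classe` outcome column; the seed value is arbitrary.

```r
library(caret)
set.seed(1234)  # assumed seed, chosen only for a reproducible split

# 70/30 split stratified on the outcome variable
in_train <- createDataPartition(clean$classe, p = 0.70, list = FALSE)
training <- clean[in_train, ]
testing  <- clean[-in_train, ]

# Stand-alone PCA on the predictors, retaining 80% of the variance,
# to check how many components are kept (30 in the report)
pre_proc <- preProcess(training[, names(training) != "classe"],
                       method = "pca", thresh = 0.80)
pre_proc$numComp
```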
A boosted trees model was chosen as a robust classifier. It was implemented through the caret gbm method with default settings. A 5-fold cross-validation scheme was used to estimate the accuracy of the model on unseen data. k-fold cross-validation was chosen because it gives an estimate with relatively low bias and without excessive variance, and k = 5 is widely regarded as an acceptable bias-variance tradeoff.
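A sketch of the model fit is shown below, assuming the training set from above. PCA is applied inside train() so the components are re-estimated within each resampling fold; the 80% variance threshold is passed through preProcOptions, and the seed is again an assumption. Printing the fitted object produces the resampling summary shown next.

```r
library(caret)
set.seed(1234)  # assumed seed

# 5-fold cross-validation; PCA capturing 80% of the variance is applied within each fold
ctrl <- trainControl(method = "cv", number = 5,
                     preProcOptions = list(thresh = 0.80))

fit_gbm <- train(classe ~ ., data = training,
                 method = "gbm",
                 preProcess = "pca",
                 trControl = ctrl,
                 verbose = FALSE)

fit_gbm  # prints the cross-validation results across the default tuning grid
```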
## Stochastic Gradient Boosting
##
## 13737 samples
## 144 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: principal component signal extraction (144),
## centered (144), scaled (144)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10990, 10989, 10990, 10989, 10990
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.5172891 0.3824744
## 1 100 0.5599477 0.4403546
## 1 150 0.5838251 0.4719326
## 2 50 0.6044268 0.4984237
## 2 100 0.6616442 0.5715237
## 2 150 0.6979689 0.6178074
## 3 50 0.6536363 0.5615387
## 3 100 0.7182069 0.6434025
## 3 150 0.7564975 0.6918088
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
Based on the cross-validation results, the model is estimated to have an accuracy of about 75.6% (n.trees = 150, interaction.depth = 3).
The model was then applied to the testing set to estimate the out-of-sample error, and a confusion matrix was computed from the model predictions and the actual classifications.
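This evaluation can be reproduced along these lines, assuming the fitted model and testing set defined above; predict() on a caret train object applies the stored PCA pre-processing automatically.

```r
# Predict on the held-out 30% and compare against the true classes
pred <- predict(fit_gbm, newdata = testing)
confusionMatrix(pred, factor(testing$classe))
```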
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1410 126 80 67 68
## B 65 789 93 45 99
## C 78 128 782 120 72
## D 86 60 41 706 72
## E 35 36 30 26 771
##
## Overall Statistics
##
## Accuracy : 0.7575
## 95% CI : (0.7464, 0.7684)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6929
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8423 0.6927 0.7622 0.7324 0.7126
## Specificity 0.9190 0.9364 0.9181 0.9474 0.9736
## Pos Pred Value 0.8053 0.7232 0.6627 0.7316 0.8586
## Neg Pred Value 0.9361 0.9270 0.9481 0.9476 0.9376
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2396 0.1341 0.1329 0.1200 0.1310
## Detection Prevalence 0.2975 0.1854 0.2005 0.1640 0.1526
## Balanced Accuracy 0.8807 0.8145 0.8401 0.8399 0.8431
Consistent with the accuracy estimated from the 5-fold cross-validation, the out-of-sample accuracy is about 75.8% (95% CI: 74.6% to 76.8%).
With the given dataset, it is possible to predict with acceptable accuracy the way in which a unilateral dumbbell biceps curl is performed. An accuracy of about 75% was achieved with a boosted trees model on data pre-processed with principal component analysis. Better accuracy might be achieved with other ensemble classification schemes or with more powerful computing equipment.