Abstract

A dataset of sensor readings collected from six male participants aged 20 to 28 years performing unilateral dumbbell biceps curls in five different ways was used to fit a model that classifies which of the five forms is being performed. The dataset was cleaned, split into training and testing sets, and pre-processed with principal component analysis before being fitted with a boosted trees model using the gbm method in caret. A 5-fold cross-validation scheme was used to gauge the performance of the model. The final model was estimated to have an accuracy of about 75%.

Exploratory Data Analysis and Cleaning

The dataset was composed of 19622 observations of 160 variables. An inspection of the data showed that around six of these variables were identification data, one was for time tracking, and nine were filled with #DIV/0! errors. These variables were discarded. The remaining data contained many NA values, probably because some sensors are not stimulated during certain activities. These NA values were imputed with zeroes to prevent errors in the training functions.
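The cleaning can be sketched as follows; the file name and the identification column names are assumptions (they are not listed above), and the zero imputation is restricted to numeric columns.

```r
# Read the raw data, treating "#DIV/0!" strings and empty cells as NA
raw <- read.csv("pml-training.csv", na.strings = c("NA", "#DIV/0!", ""))

# Drop identification and time-tracking columns (column names assumed)
id_cols <- c("X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2",
             "cvtd_timestamp", "new_window", "num_window")
clean <- raw[, !(names(raw) %in% id_cols)]

# Drop columns that held nothing but #DIV/0! errors (now entirely NA)
clean <- clean[, colSums(!is.na(clean)) > 0]

# Ensure the outcome is a factor with the five classes A-E
clean$classe <- factor(clean$classe)

# Impute remaining NA values in numeric columns with zeroes
num_cols <- sapply(clean, is.numeric)
clean[num_cols][is.na(clean[num_cols])] <- 0
```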

Fitting the Model

The dataset was split 70/30 into training and testing sets. Given the large number of features and limited computational power, principal component analysis was performed to reduce the dimensionality of the data. The pre-processing was set to capture 80% of the variance, reducing the number of features from 144 to 30.
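A sketch of the split and the dimensionality reduction, assuming the cleaned data frame from the previous section is named clean and the seed is arbitrary (in the actual fit the same transformation is applied through caret's train interface, as the model summary below shows):

```r
library(caret)

set.seed(1234)  # assumed seed for reproducibility

# 70/30 split stratified on the outcome
in_train <- createDataPartition(clean$classe, p = 0.70, list = FALSE)
training <- clean[in_train, ]
testing  <- clean[-in_train, ]

# PCA capturing 80% of the variance; predictors are centered and scaled first
predictors <- training[, names(training) != "classe"]
pre_proc   <- preProcess(predictors, method = c("center", "scale", "pca"),
                         thresh = 0.80)
pre_proc  # printing reports how many components capture 80% of the variance
```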

A boosted trees model was chosen as a robust classifier and implemented through the caret gbm method with default settings. A 5-fold cross-validation scheme was used to gauge the accuracy of the model on unseen data. k-fold cross-validation was chosen because it gives a relatively low-bias estimate without too high a variance, and k = 5 is widely regarded as giving an acceptable bias-variance tradeoff.
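A sketch of the model fit that would produce a summary like the one below, continuing with the training data frame assumed above; the object names and seed are assumptions, and the 80% PCA threshold is passed through trainControl's pre-processing options:

```r
set.seed(5678)  # assumed seed

# 5-fold cross-validation with the PCA variance threshold set to 80%
ctrl <- trainControl(method = "cv", number = 5,
                     preProcOptions = list(thresh = 0.80))

# Boosted trees (gbm) with centering, scaling, and PCA applied to the
# predictors inside each resample; gbm tuning grid left at caret defaults
fit <- train(classe ~ ., data = training,
             method = "gbm",
             trControl = ctrl,
             preProcess = c("center", "scale", "pca"),
             verbose = FALSE)

print(fit)
```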

## Stochastic Gradient Boosting 
## 
## 13737 samples
##   144 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## Pre-processing: principal component signal extraction (144),
##  centered (144), scaled (144) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 10990, 10989, 10990, 10989, 10990 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.5172891  0.3824744
##   1                  100      0.5599477  0.4403546
##   1                  150      0.5838251  0.4719326
##   2                   50      0.6044268  0.4984237
##   2                  100      0.6616442  0.5715237
##   2                  150      0.6979689  0.6178074
##   3                   50      0.6536363  0.5615387
##   3                  100      0.7182069  0.6434025
##   3                  150      0.7564975  0.6918088
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.

Based on the cross-validation results above, the final model (150 trees, interaction depth 3) is estimated to have an accuracy of about 75%.

Estimating Out-of-Sample Error

The model was applied to the testing set to estimate the out-of-sample error. A confusion matrix was computed from the model predictions and the actual classifications.
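A sketch of this evaluation, continuing with the fit and testing objects assumed above:

```r
# Predict on the held-out testing set; caret re-applies the centering,
# scaling, and PCA learned on the training data
predictions <- predict(fit, newdata = testing)

# Compare predictions with the actual classes
confusionMatrix(predictions, testing$classe)
```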

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1410  126   80   67   68
##          B   65  789   93   45   99
##          C   78  128  782  120   72
##          D   86   60   41  706   72
##          E   35   36   30   26  771
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7575          
##                  95% CI : (0.7464, 0.7684)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6929          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8423   0.6927   0.7622   0.7324   0.7126
## Specificity            0.9190   0.9364   0.9181   0.9474   0.9736
## Pos Pred Value         0.8053   0.7232   0.6627   0.7316   0.8586
## Neg Pred Value         0.9361   0.9270   0.9481   0.9476   0.9376
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2396   0.1341   0.1329   0.1200   0.1310
## Detection Prevalence   0.2975   0.1854   0.2005   0.1640   0.1526
## Balanced Accuracy      0.8807   0.8145   0.8401   0.8399   0.8431

Consistent with the accuracy estimated from 5-fold cross-validation, the out-of-sample accuracy on the testing set is approximately 75.8% (95% CI: 74.6% to 76.8%).

Conclusion

With the given dataset, it is possible to predict with acceptable accuracy the way in which a unilateral dumbbell biceps curl is performed. An accuracy of about 75% was achieved with a boosted trees model on data pre-processed with principal component analysis. Better accuracy might be achieved with other ensemble classification schemes or with more powerful computing equipment.