A simple example of building a Random Forest model
I wanted to write down a thought process one could follow to practice the Random Forest method on a dataset from a real research study. The intended audience is narrow and specific: someone who has learned how Random Forest works but is not sure about the actual steps to create a model and validate it. The reader can follow the steps by reading through the article, then refer to the source code [1] later to see how they are implemented in R.
The example dataset, the “Weight Lifting Exercise Dataset”, was made publicly available by Velloso et al. [2] I chose this dataset because it is easy to achieve high accuracy with a simple use of Random Forest, yet it still teaches some practical aspects of building machine learning models.
The dataset records measurements from sensors attached to the glove, armband, lumbar belt, and dumbbell while the participants lift the dumbbell in 5 different ways: one exactly according to the specification and 4 representing common mistakes. [3]
Given a training dataset, the goal is to build a machine learning model that accurately classifies each observation into one of the 5 classes.
The data file has 39242 observations from 6 participants, recorded while they performed 10 repetitions of the unilateral dumbbell biceps curl. Each weight lifting activity is labeled as Class A (according to the specification), Class B (throwing the elbows to the front), Class C (lifting the dumbbell only halfway), Class D (lowering the dumbbell only halfway), or Class E (throwing the hips to the front).
60% of the observations were picked at random to form the training dataset, and the rest were kept as the test dataset. Only the training dataset was used to build the model; the test dataset was used to benchmark the out-of-sample accuracy.
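The 60/40 split described above can be sketched in a few lines. This is a minimal illustration in Python using only the standard library, not the R code from the article; the function name `split_train_test`, the seed, and the stand-in rows are assumptions made for the example.

```python
import random

def split_train_test(rows, train_fraction=0.6, seed=42):
    """Randomly split rows into non-overlapping (train, test) lists."""
    rng = random.Random(seed)           # fixed seed so the split is reproducible
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_train = round(len(rows) * train_fraction)
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test

rows = list(range(1000))                # stand-in for the real observations
train, test = split_train_test(rows)
print(len(train), len(test))
```

Fixing the seed matters in practice: without it, every run would produce a different test set and the reported accuracy would not be reproducible.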
As soon as I glanced at the data, it became clear that some clean-up was necessary. There are 159 columns in the data table; however, 58 of them contain only N/A. In addition, the first 7 columns (user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, new_window, num_window, roll_belt) are IDs for users and observations, time stamps, or window information that should not be used as predictors.
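As a rough sketch of this clean-up step, the snippet below drops columns that contain only missing values together with an explicit list of ID-like columns. It is written in plain Python rather than R, represents the table as a dict of column lists, and treats `None`, the empty string, and the text `"NA"` as missing; all of these are assumptions for illustration. The columns `empty_sensor` and `label` in the toy table are hypothetical.

```python
MISSING = {None, "", "NA"}   # assumed encodings of a missing value

def drop_unusable_columns(table, id_columns):
    """Return a copy of `table` (column name -> list of values) without
    all-missing columns and without the columns named in `id_columns`."""
    kept = {}
    for name, values in table.items():
        if name in id_columns:
            continue                      # identifier, not a predictor
        if all(v in MISSING for v in values):
            continue                      # nothing but missing values
        kept[name] = values
    return kept

# Toy table; "empty_sensor" and "label" are hypothetical column names.
table = {
    "user_name": ["a", "b"],
    "num_window": [11, 12],
    "empty_sensor": ["NA", "NA"],
    "pitch_forearm": [-63.9, -63.8],
    "label": ["A", "B"],
}
cleaned = drop_unusable_columns(table, id_columns={"user_name", "num_window"})
print(list(cleaned))
```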
For the initial model, a Random Forest with 10 trees was chosen. Because of the memory implications, I decided to drop the factor columns that have more than 53 levels. [4]
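The filter on factor levels can be illustrated as follows. This Python sketch assumes that a column made up entirely of strings plays the role of an R factor, and it counts distinct values as levels; the helper name and the toy column `timestamp_text` are hypothetical.

```python
MAX_LEVELS = 53   # the limit mentioned in the text

def drop_high_cardinality_factors(table, max_levels=MAX_LEVELS):
    """Drop string-valued (factor-like) columns with more than `max_levels`
    distinct values; numeric columns are always kept."""
    kept = {}
    for name, values in table.items():
        is_factor = all(isinstance(v, str) for v in values)
        if is_factor and len(set(values)) > max_levels:
            continue
        kept[name] = values
    return kept

# Toy table with 60 rows; "timestamp_text" is a hypothetical column name.
table = {
    "timestamp_text": [f"t{i}" for i in range(60)],   # 60 distinct levels
    "new_window": ["no", "yes"] * 30,                 # 2 levels
    "pitch_forearm": [float(i) for i in range(60)],   # numeric
}
filtered = drop_high_cardinality_factors(table)
print(list(filtered))
```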
A total of 51 predictors were used to create the initial model. Figure 1 shows the mean decrease of Gini impurity [5] for each variable. When the variables are ranked by importance, the mean decrease of Gini impurity appears to fall off exponentially.
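For readers who want to see what is being measured, here is a small Python sketch of Gini impurity and of the impurity decrease produced by a single split. Roughly speaking, a variable's importance is built by accumulating such decreases over the splits that use the variable and averaging over the trees. The exact weighting inside the randomForest package may differ from this textbook form, so treat the functions below as an illustration rather than the package's algorithm.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity
    of its two child nodes."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted_children

parent = ["A", "A", "B", "B"]
# A perfect split separates the two classes completely.
print(gini_impurity(parent))                          # 0.5
print(gini_decrease(parent, ["A", "A"], ["B", "B"]))  # 0.5
```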
To avoid over-fitting to the training data and to reduce the computation, I wanted to pick only the most important variables. To have a clear rule for deciding where to cut off, Fig. 2 was created by taking the log of the mean decrease of Gini impurity and adding a regression line. The figure indicates that the top 5 variables contribute especially strongly to the decrease of Gini impurity.
The 5 most important variables were used to make the final model: pitch_forearm, yaw_belt, pitch_belt, magnet_dumbbell_z, and magnet_dumbbell_y.
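The log-and-regression-line idea can be sketched as below: rank the importance values, take their logarithm, fit a straight line by ordinary least squares, and look at which points sit above the line. The importance values here are made up for illustration and are not the article's results, and reading the cutoff from positive residuals is only one possible interpretation of the figure.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical importance values, NOT taken from the article.
importances = [900.0, 700.0, 650.0, 300.0, 120.0, 60.0, 30.0, 15.0]
ranked = sorted(importances, reverse=True)
ranks = list(range(1, len(ranked) + 1))
log_importance = [math.log(v) for v in ranked]

slope, intercept = fit_line(ranks, log_importance)
# Positive residual: the variable is more important than the trend predicts.
residuals = [y - (intercept + slope * x) for x, y in zip(ranks, log_importance)]
for rank, res in zip(ranks, residuals):
    print(rank, round(res, 3))
```

If importance really decays exponentially with rank, the log values fall on a straight line, which is why the regression line is a useful reference.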
K-fold cross-validation [6] was performed to measure the accuracy of the model (k=10). Table 1 shows the confusion matrix from the K-fold cross-validation. The overall accuracy was 97.21% (an estimated error rate of 2.79%).
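A hand-written K-fold cross-validation loop might look like the following Python sketch. It is not the article's R implementation: the `fit` and `predict` callables are placeholders for any classifier, and a trivial majority-class model stands in for Random Forest so that the example stays self-contained.

```python
import random
from collections import Counter

def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(features, labels, fit, predict, k=10, seed=0):
    """Train on k-1 folds, predict the held-out fold, and accumulate a
    confusion matrix keyed by (actual, predicted)."""
    confusion = Counter()
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        model = fit([features[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        for i in held_out:
            confusion[(labels[i], predict(model, features[i]))] += 1
    correct = sum(c for (actual, pred), c in confusion.items() if actual == pred)
    return confusion, correct / len(labels)

# Stand-in classifier (an assumption): always predict the majority class.
def fit_majority(train_features, train_labels):
    return Counter(train_labels).most_common(1)[0][0]

def predict_majority(model, feature):
    return model

features = list(range(100))
labels = ["A"] * 70 + ["B"] * 30
confusion, accuracy = cross_validate(features, labels, fit_majority, predict_majority)
print(dict(confusion), accuracy)
```

Every observation is predicted exactly once, by a model that never saw it during training, which is what makes the resulting accuracy an estimate of out-of-sample performance.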
Expected out of sample error
The high accuracy from K-fold cross-validation meant that no further search for models was required. Using the 5 chosen predictors, the final model was created from the entire training dataset.
The in-sample accuracy of the final model was 95.59%, so the out-of-sample error is expected to be higher than 4.41%. This accuracy seemed promising enough to use the model to classify the test dataset.
Predicting with the test dataset
The final model was used to classify the test dataset. Out of 15694 cases, the model classified 15227 correctly, achieving 97.02% accuracy.
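The accuracy figure follows directly from the two counts reported above, as this short Python check shows:

```python
correct, total = 15227, 15694          # counts reported for the test dataset
accuracy = correct / total
error_rate = 1 - accuracy
print(f"accuracy = {accuracy:.2%}, error rate = {error_rate:.2%}")
# prints: accuracy = 97.02%, error rate = 2.98%
```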
Notes on performance considerations
The model was built with R version 3.1.2 (2014-10-31) on x86_64-apple-darwin13.4.0 (64-bit). The hardware was an Apple MacBook Pro with 2.4GHz Intel Core i7 CPU with 8GB 1333 MHz DDR3 RAM.
With an order of magnitude fewer predictors than the initial model, it took only 0.37 seconds of elapsed time to build the final model.
It should also be noted that I found running the caret package with the rf option [7] to be significantly slower than running the randomForest package directly. For a similar reason, I coded the K-fold cross-validation myself instead of using the rfcv function from the randomForest package. See the source code [1] for details of the implementation.
In this article, I wrote down the steps to create a Random Forest model to classify weight lifting exercises. The model was built with the 5 most important predictors, determined through the analysis of the mean decrease of Gini impurity.
Because the data used in this example were recorded in a controlled experiment, the final model classified the test data with high accuracy without iterations of model improvement. In reality, we should expect many more iterations of data analysis before achieving a satisfactory out-of-sample error rate.
References and notes
1. The R source code that produced the analysis and this report is available from the GitHub page.
2. The dataset can be downloaded from here.
3. Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.
4. The randomForest package of R cannot handle factor variables with more than 53 levels.
5. See the Gini impurity subsection in the Metrics section of “Decision tree learning” on Wikipedia.
6. See the k-fold cross-validation section of “Cross-validation” on Wikipedia.
7. Random Forest Models - The caret Package.
Original post: Feb. 25, 2015 | Last updated: April 21, 2015