Seeing the Random Forest behind the Trees
*This article is part of a BAI series exploring 10 basic machine learning algorithms*
How would you predict which machine learning algorithm I am thinking of? In the parlor game of twenty questions, the goal is to find the answer by asking the minimum number of categorical yes/no questions.[i] Which question should you ask first, and on what criteria? Which question should be asked next? What is the minimum number of questions needed to ensure the correct prediction? Keep this game in mind as we explore how the Random Forest algorithm is used in machine learning, its basic assumptions and use scenarios, and the precautions we should take when relying on this methodology.
Random Forest is a supervised learning algorithm used in data science for both regression and classification tasks. As conceptualized by Leo Breiman and Adele Cutler, the algorithm combines a large number of decision trees into a single model to produce more accurate and stable predictions.[ii] Each decision tree in the forest is built from a random subset of the training data and a random set of features. Random Forests rely on an ensembling method called bagging (bootstrap aggregating), which creates a direct relationship between the number of decision trees used in the calculation and the quality of the final results: the larger the number of trees, the better the prediction.
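To make this concrete, here is a minimal sketch of fitting a Random Forest classifier with scikit-learn; the library, the synthetic dataset, and the parameter values are illustrative assumptions rather than anything prescribed by the algorithm's authors. Each of the `n_estimators` trees is trained on a bootstrap sample of the rows, and raising that number generally makes the ensemble's predictions more stable.

```python
# Illustrative sketch (assumes scikit-learn is installed); not from the original article.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# A synthetic dataset stands in for real training data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging: each of the 500 trees sees a bootstrap sample of the training rows.
forest = RandomForestClassifier(n_estimators=500, bootstrap=True, random_state=42)
forest.fit(X_train, y_train)

print("Test accuracy:", forest.score(X_test, y_test))
```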
The basic building block of the Random Forest algorithm is the decision tree, a decision support tool represented as a tree-like model of decisions, together with their chance event outcomes, utilities, and possible consequences.[iii] We use decision trees to sequence a series of questions to predict an outcome, each question narrowing the range of potential values until we can make an accurate prediction. Graphically, each internal node of a decision tree represents a test of a feature, each branch the outcome of that test, and each leaf a class label. When used in machine learning, decision trees can enumerate all potential alternatives to a given question. Decision trees also improve with experience, for each application of the tool provides new data points that refine the model.
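The "sequence of questions" can be made visible in code. In the hedged sketch below (scikit-learn and the classic iris dataset are assumed for illustration), a single decision tree is fit and its splits are printed as nested if/else questions, with each leaf ending in a class label.

```python
# Illustrative sketch of a single decision tree (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Limit the depth so the printed "questions" stay readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each internal node is a test on a feature; each leaf is a class label.
print(export_text(tree, feature_names=list(iris.feature_names)))
```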
Random Forest mirrors the logic of decision trees with one telling exception: rather than searching for the single best feature when splitting a node, it searches for the most important feature within a random subset of the data and features.[iv] Based on a logic similar to that which underpins the wisdom of crowds, the Random Forest algorithm builds a large number of trees from subsets of the training data to leverage the learning from multiple experiments. When used for regression, the algorithm averages the estimates of the individual decision trees. When used for classification, it takes a majority vote for the predicted class. Variants of Random Forests include Random MNL, which builds on multinomial logit models rather than decision trees, and Kernel Random Forests, which integrate kernel methods to improve the interpretability of the results.
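The averaging step for regression can be checked directly. In the sketch below (scikit-learn assumed), the forest's prediction matches the mean of its individual trees' predictions; note that scikit-learn's classifier variant averages predicted class probabilities, a soft vote, rather than counting raw hard votes.

```python
# Illustrative sketch: a Random Forest regression is the average of its trees.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=0.5, random_state=0)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

# Average the per-tree estimates by hand and compare with the forest's output.
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print(np.allclose(manual_average, forest.predict(X[:5])))  # True
```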
As one of the most popular machine learning algorithms today, Random Forests are used in a wide variety of industries and situations. In e-commerce, the algorithm is used to predict consumer preferences based on the experience of similar customers. In banking, Random Forest is used both to identify loyal customers and to isolate potentially fraudulent transactions. In finance, when studying investment scenarios, Random Forest is used to characterize an asset's behavior and its potential future performance. Finally, in health analytics, Random Forests are used to identify pathologies based on a patient's medical records.
As with other machine learning methods, there are advantages and disadvantages to using Random Forests for regression and classification. On the upside, the algorithm's use of bagging avoids much of the overfitting found in single decision trees, linear regression models, or even neural networks. Random Forests can be used to quickly identify the most important features of a training dataset. The algorithm also handles missing values with relative ease, and it incorporates methods for balancing errors across class populations and unbalanced data sets. Finally, it runs efficiently on large or even very large databases, as it can manage thousands of input variables. On the downside, the multiplicity of trees and features in Random Forests often makes them difficult to interpret, computationally expensive, and, for practical purposes, too slow for real-time predictions.
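The feature-importance point in particular is easy to demonstrate. The sketch below, using scikit-learn's built-in impurity-based importances as an assumed stand-in, ranks the features of a synthetic dataset after fitting a forest.

```python
# Illustrative sketch: ranking features by importance (assumes scikit-learn).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Only 5 of the 15 features are informative; the rest are noise.
X, y = make_classification(n_samples=1000, n_features=15, n_informative=5,
                           random_state=1)

forest = RandomForestClassifier(n_estimators=300, random_state=1)
forest.fit(X, y)

# Impurity-based importances, sorted from most to least important.
ranking = np.argsort(forest.feature_importances_)[::-1]
for idx in ranking[:5]:
    print(f"feature_{idx}: {forest.feature_importances_[idx]:.3f}")
```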
Coming back to our game of twenty questions, imagine which questions, or features, would have allowed you to predict that I was indeed thinking of Random Forests. If the sequence of questions is efficient, the resulting decision tree can act as a decision support tool for predicting the choices of each person playing the game. Now imagine that, rather than asking the questions yourself, you ran multiple experiments with friends, each choosing their own sequence of questions. By averaging the results, you may well be able to see the logic of how individuals choose their machine learning algorithms, and even catch a glimpse of the forest of concepts hidden behind the trees.