Kaggle remains a public center of big-data training through its public competitions. I participated a few years ago, despite a nearly complete absence of domain-specific knowledge, by working on their introductory problem of predicting Titanic survivors and, if I recall correctly, on one of the challenges with a prize. Between the lack of knowledge and an insistence on coding in Mythryl, I had a big hill to climb, and never really made it up the first ridge. A lack of time and other interests finally won the day, but if I were still a young programmer with lots of free time, I would have pursued it with great interest.
But I do occasionally take advantage of the tutorials offered by the leading Kagglers. This month the Kaggle blog has an interview with one of their top members, Marios Michailidis, and his description of using a stacking strategy when solving big data problems, an approach he believes has become dominant, is interesting:
Can you give a brief introduction to stacking and why it’s important?
Stacking or Stacked Generalization is the process of combining various machine learning algorithms using holdout data. It is attributed to Wolpert 1992. It normally involves a four-stage process. Consider 3 datasets A, B, C. For A and B we know the ground truth (or in other words the target variable y). We can use stacking as follows:
- We train various machine learning algorithms (regressors or classifiers) in dataset A
- We make predictions for each one of the algorithms for datasets B and C and we create new datasets B1 and C1 that contain only these predictions. So if we ran 10 models then B1 and C1 have 10 columns each.
- We train a new machine learning algorithm (often referred to as Meta learner or Super learner) using B1
- We make predictions using the Meta learner on C1
For a large scale implementation of stacking, one may further read or use stacked ensembles.
Stacking is important because (experimentally) it has been found to improve performance in various machine learning problems. I believe most winning solutions on Kaggle the past 4 years included some form of stacking.
Additionally, the advent of increased computing power and parallelism has made it possible to run many algorithms together. Most algorithms rely on certain parameters or assumptions to perform best, hence each one has advantages and disadvantages. Stacking is a mechanism that tries to leverage the benefits of each algorithm while disregarding (to some extent) or correcting for their disadvantages. In its most abstract form, stacking can be seen as a mechanism that corrects the errors of your algorithms.
So, given enough data with known answers, segment it randomly, build predictive functions using one segment, and use another segment to evaluate how each of those functions performs. Then correct for their errors as best you can.
And hope your data sample is representative of the entire dataset.
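To make that concrete, here is a minimal sketch of the four-step A/B/C recipe in Python with scikit-learn. The library, the toy dataset, and the particular base models and meta-learner are my own illustrative assumptions, not Michailidis's; treat it as a sketch of the idea rather than his implementation.

```python
# Rough sketch of the A/B/C stacking recipe quoted above.
# Dataset, models, and meta-learner are illustrative choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# A toy problem, split into A (base training), B (holdout for the
# meta-learner), and C (the data we ultimately want to predict).
X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
X_a, X_rest, y_a, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_b, X_c, y_b, y_c = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Step 1: train several base models on A.
base_models = [
    LogisticRegression(max_iter=1000),
    SVC(probability=True),
    KNeighborsClassifier(),
]
for model in base_models:
    model.fit(X_a, y_a)

# Step 2: build B1 and C1 from the base models' predictions
# (one column per model, as in the interview).
B1 = np.column_stack([m.predict_proba(X_b)[:, 1] for m in base_models])
C1 = np.column_stack([m.predict_proba(X_c)[:, 1] for m in base_models])

# Step 3: train the meta-learner on B1 against the known answers for B.
meta = LogisticRegression()
meta.fit(B1, y_b)

# Step 4: predict C through the meta-learner, and compare with the
# simple model averaging mentioned later in the post.
stacked = meta.predict(C1)
averaged = (C1.mean(axis=1) > 0.5).astype(int)
print("stacked :", accuracy_score(y_c, stacked))
print("averaged:", accuracy_score(y_c, averaged))
```

On a toy problem like this the two scores will be close; the point is only to show where the holdout set B and the meta-learner fit into the recipe.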
I particularly liked his example:
I often like to explain stacking on multiple levels with the following (albeit simplistic) example:
Let’s assume there are three students named LR, SVM, and KNN, who are arguing about a physics question because they have different opinions of what the right answer might be.
They decide there is no way to convince one another of their case, so they do the democratic thing and take an average of their estimates, which in this case is 14. They have used one of the simplest forms of ensembling–AKA model averaging.
Their teacher, Miss DL–a maths teacher–witnesses the argument the students are having and decides to help. She asks “What was the question?”, but the students refuse to tell her (because they know it is not to their benefit to give away all the information; besides, they think she might find it silly that they are arguing over such a trivial matter). However, they do tell her that it is a physics-related argument.
In this scenario, the teacher does not have access to the initial data, as she does not know what the question was. However, she does know the students very well–their strengths and weaknesses–and she decides she can still help settle the matter. Using historical information about how well the students have done in the past, plus the fact that she knows SVM loves physics and always does well in this subject (plus her father worked in a physics institute of excellence for young scientists), she thinks the most appropriate answer would be more like 17.
And there’s more to that example, but in the interests of fair use, I shan’t tell it. Go see the Kaggle blog link.