{"id":10206,"date":"2017-07-05T07:53:51","date_gmt":"2017-07-05T12:53:51","guid":{"rendered":"http:\/\/huewhite.com\/umb\/?p=10206"},"modified":"2017-07-05T07:53:51","modified_gmt":"2017-07-05T12:53:51","slug":"keeping-up-with-those-big-data-algorithms","status":"publish","type":"post","link":"https:\/\/huewhite.com\/umb\/2017\/07\/05\/keeping-up-with-those-big-data-algorithms\/","title":{"rendered":"Keeping Up With Those Big Data Algorithms"},"content":{"rendered":"<p><em><strong>Kaggle<\/strong><\/em> remains a public center of big data training through its public competitions. I participated a few years ago, despite a nearly complete absence of domain-specific knowledge, by working on their introductory problem of predicting Titanic survivors and, if I recall correctly, working on one of the challenges with a prize. Between my lack of knowledge and an insistence on coding in <em><strong>Mythryl<\/strong><\/em>, I had a big hill to climb, and never really made it up the first ridge. A lack of time and other interests finally won the day, but if I were still a young programmer, with lots of free time, I would have pursued it with great interest.<\/p>\n<p>But I do occasionally take advantage of the tutorials offered by the leading Kagglers. 
This month the <em><strong>Kaggle<\/strong><\/em> blog has an <a href=\"http:\/\/blog.kaggle.com\/2017\/06\/15\/stacking-made-easy-an-introduction-to-stacknet-by-competitions-grandmaster-marios-michailidis-kazanova\/?utm_source=Mailing+list&amp;utm_campaign=479f35fbcc-Kaggle_Newsletter_06-29-2017&amp;utm_medium=email&amp;utm_term=0_f42f9df1e1-479f35fbcc-398728533\" target=\"_blank\" rel=\"noopener\">interview<\/a> with one of their top members,\u00a0Marios Michailidis, and his description of using a <em>stacking<\/em> strategy when solving big data problems, an approach he believes has become dominant, is interesting:<\/p>\n<blockquote>\n<h3 id=\"can-you-give-a-brief-introduction-to-stacking-and-why-its-important\">Can you give a brief introduction to stacking and why it\u2019s important?<\/h3>\n<p><strong>Stacking<\/strong> or <em>Stacked Generalization<\/em> is the process of combining various machine learning algorithms using holdout data. It is attributed to <a href=\"http:\/\/www.machine-learning.martinsewell.com\/ensembles\/stacking\/Wolpert1992.pdf\" target=\"_blank\" rel=\"noopener\" data-vivaldi-spatnav-clickable=\"1\">Wolpert 1992<\/a>. It normally involves a four-stage process. Consider 3 datasets A, B, C. For A and B we know the ground truth (or in other words the target variable y). We can use stacking as follows:<\/p>\n<ol>\n<li>We train various machine learning algorithms (regressors or classifiers) in dataset A<\/li>\n<li>We make predictions for each one of the algorithms for datasets B and C and we create new datasets B1 and C1 that contain only these predictions. 
So if we ran 10 models then B1 and C1 have 10 columns each.<\/li>\n<li>We train a new machine learning algorithm (often referred to as <em>Meta learner<\/em> or <em>Super learner<\/em>) using B1<\/li>\n<li>We make predictions using the Meta learner on C1<\/li>\n<\/ol>\n<p>For a <strong>large scale<\/strong> implementation of stacking, one may further read or use <a href=\"https:\/\/h2o-release.s3.amazonaws.com\/h2o\/rel-ueno\/2\/docs-website\/h2o-docs\/data-science\/stacked-ensembles.html\" target=\"_blank\" rel=\"noopener\" data-vivaldi-spatnav-clickable=\"1\">stacked ensembles<\/a>.<\/p>\n<p>Stacking is important because (experimentally) it has been found to <strong>improve performance<\/strong> in various machine learning problems. I believe most winning solutions on Kaggle the past 4 years included some form of stacking.<\/p>\n<p>Additionally, the advent of <strong>increased computing power<\/strong> and parallelism has made it possible to run many algorithms together. Most algorithms rely on certain parameters or assumptions to perform best, hence each one has advantages and disadvantages. Stacking is a mechanism that tries to leverage the benefits of each algorithm while disregarding (to some extent) or <em>correcting<\/em> for their disadvantages. In its most abstract form, stacking can be seen as a mechanism that <em>corrects the errors of your algorithms<\/em>.<\/p><\/blockquote>\n<p>So, given enough test data with a known solution, segment it randomly, build predictive functions using one segment, and use another segment to evaluate the performance of each of the functions. 
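<\/p>
<p>The four-stage recipe quoted above is straightforward to sketch in code. The following is my own illustrative version, not code from the interview; it assumes Python with scikit-learn available, the dataset names A, B, and C follow the quoted description, and the particular model choices are arbitrary:<\/p>

```python
# Minimal sketch of the four-stage stacking procedure (illustrative only).
# Assumes scikit-learn; A trains the base models, B1 trains the meta learner,
# and C1 receives the final predictions, following the quoted description.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# A toy problem split into A (train), B (holdout), and C (test).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_a, X_rest, y_a, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_b, X_c, y_b, y_c = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Stage 1: train several base learners on A.
base_models = [LogisticRegression(max_iter=1000),
               SVC(probability=True),
               KNeighborsClassifier()]
for model in base_models:
    model.fit(X_a, y_a)

# Stage 2: predict on B and C; the predictions alone form the new
# datasets B1 and C1, one column per base model.
B1 = np.column_stack([m.predict_proba(X_b)[:, 1] for m in base_models])
C1 = np.column_stack([m.predict_proba(X_c)[:, 1] for m in base_models])

# Stage 3: train the meta learner on B1, using B's known labels.
meta = LogisticRegression()
meta.fit(B1, y_b)

# Stage 4: final predictions come from the meta learner applied to C1.
final = meta.predict(C1)
print(B1.shape)  # 3 base models ran, so B1 has 3 columns
```

<p>Because the meta learner only ever sees predictions, never the original features, it plays the same role as the teacher in the example further below.<\/p>
<p>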
Correct for the errors as best you can.<\/p>\n<p>And hope your data sample is representative of the entire dataset.<\/p>\n<p>I particularly liked his example:<\/p>\n<blockquote><p>I often like to explain stacking on multiple levels with the following (albeit simplistic) example:<\/p>\n<p>Let\u2019s assume there are three students named <strong>LR<\/strong>, <strong>SVM<\/strong>, <strong>KNN<\/strong> and they argue about a physics question where they have different opinions of what the right answer might be:<\/p>\n<p><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-6769\" src=\"https:\/\/i0.wp.com\/5047-presscdn.pagely.netdna-cdn.com\/wp-content\/uploads\/2017\/06\/image1.png?resize=560%2C203\" sizes=\"auto, (max-width: 560px) 100vw, 560px\" srcset=\"http:\/\/5047-presscdn.pagely.netdna-cdn.com\/wp-content\/uploads\/2017\/06\/image1.png 560w, http:\/\/5047-presscdn.pagely.netdna-cdn.com\/wp-content\/uploads\/2017\/06\/image1-300x109.png 300w, http:\/\/5047-presscdn.pagely.netdna-cdn.com\/wp-content\/uploads\/2017\/06\/image1-100x36.png 100w\" alt=\"Stacking illustration\" width=\"560\" height=\"203\" \/><\/p>\n<p>They decide there is no way to convince one another about their case and they do the democratic thing by taking an <strong>average<\/strong> of their estimates, which in this case is <strong>14.<\/strong> They used one of the simplest forms of ensembling\u2013AKA <strong>model averaging<\/strong>.<\/p>\n<p>Their teacher, <strong>Miss DL<\/strong>\u2013a maths teacher\u2013bears witness to the argument the students are having and decides to help. She asks \u201cwhat was the question?\u201d, but the students refuse to tell her (because they know it is not to their benefit to give all the information, besides they think she might find it silly that they are arguing over such a trivial matter). 
However, they do tell her that it is a physics-related argument.<\/p>\n<p>In this scenario, the teacher does not have access to the initial data as she does not know what the question was. However, she does know the students very well\u2013their <strong>strengths<\/strong> and <strong>weaknesses<\/strong>\u2013and she decides she can still help in solving this matter. Using historical information about how well the students have done in the past, plus the fact that she knows SVM loves physics and always does well in this subject (plus her father worked in a physics institute of excellence for young scientists), she thinks the most appropriate answer would be more like <strong>17<\/strong>.<\/p><\/blockquote>\n<p>And there&#8217;s more to that example, but in the interests of fair use, I shan&#8217;t tell it. Go see the <em><strong>Kaggle<\/strong><\/em> blog <a href=\"http:\/\/blog.kaggle.com\/2017\/06\/15\/stacking-made-easy-an-introduction-to-stacknet-by-competitions-grandmaster-marios-michailidis-kazanova\/?utm_source=Mailing+list&amp;utm_campaign=479f35fbcc-Kaggle_Newsletter_06-29-2017&amp;utm_medium=email&amp;utm_term=0_f42f9df1e1-479f35fbcc-398728533\" target=\"_blank\" rel=\"noopener\">link<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Kaggle remains a public center of big data training through public competitions. I participated a few years ago, despite a nearly complete absence of domain-specific knowledge, by working on their introductory problem of predicting Titanic survivors and, if I recall correctly, working on one of the challenges with a prize. 
\u2026 <a class=\"continue-reading-link\" href=\"https:\/\/huewhite.com\/umb\/2017\/07\/05\/keeping-up-with-those-big-data-algorithms\/\"> Continue reading <span class=\"meta-nav\">&rarr; <\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-10206","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/huewhite.com\/umb\/wp-json\/wp\/v2\/posts\/10206","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/huewhite.com\/umb\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/huewhite.com\/umb\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/huewhite.com\/umb\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/huewhite.com\/umb\/wp-json\/wp\/v2\/comments?post=10206"}],"version-history":[{"count":2,"href":"https:\/\/huewhite.com\/umb\/wp-json\/wp\/v2\/posts\/10206\/revisions"}],"predecessor-version":[{"id":10208,"href":"https:\/\/huewhite.com\/umb\/wp-json\/wp\/v2\/posts\/10206\/revisions\/10208"}],"wp:attachment":[{"href":"https:\/\/huewhite.com\/umb\/wp-json\/wp\/v2\/media?parent=10206"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/huewhite.com\/umb\/wp-json\/wp\/v2\/categories?post=10206"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/huewhite.com\/umb\/wp-json\/wp\/v2\/tags?post=10206"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}