[{"data":1,"prerenderedAt":609},["ShallowReactive",2],{"content-query-Wb4tc1FDgi":3},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"title":8,"description":9,"date":10,"draft":6,"tags":11,"thumbnail":13,"alt_description":14,"slug":15,"body":16,"_type":603,"_id":604,"_source":605,"_file":606,"_stem":607,"_extension":608},"/posts/two-approaches-to-create-a-dummy-variable","posts",false,"","Two approaches to creating a dummy variable","Learn about the difference between supervised and unsupervised machine learning and the importance of choosing the right approach to creating dummy variables","2026-02-21T00:00:00.000Z",[12],"the shape of data","/img/two_approaches_to_create_a_dummy_data.png","Creating dummy variables for different purposes","two-approaches-to-create-a-dummy-variable",{"type":17,"children":18,"toc":594},"root",[19,37,42,49,54,59,64,70,75,80,86,91,257,262,295,300,472,477,483,488,498,503,511,520,526,531,539,544,552,557,565,571,576,584,589],{"type":20,"tag":21,"props":22,"children":23},"element","blockquote",{},[24,31],{"type":20,"tag":25,"props":26,"children":27},"p",{},[28],{"type":29,"value":30},"text","Perfection is a myth. The best solution is simply the one with tradeoffs you can live with.",{"type":20,"tag":25,"props":32,"children":34},{"align":33},"right",[35],{"type":29,"value":36},"-- Smartest guy in the room",{"type":20,"tag":25,"props":38,"children":39},{},[40],{"type":29,"value":41},"New machine learning algorithms are created every day. 
For the sake of simplicity, let's say they fall into just two categories: supervised and unsupervised.",{"type":20,"tag":43,"props":44,"children":46},"h2",{"id":45},"supervised-learning",[47],{"type":29,"value":48},"Supervised Learning",{"type":20,"tag":25,"props":50,"children":51},{},[52],{"type":29,"value":53},"Supervised learning is what we use when we have a specific number or target to predict. The target is usually called the dependent variable, as opposed to the data points we use to predict it, which are called independent variables or predictors. When we predict a numerical value it's called regression; when we predict a categorical variable it's called classification.",{"type":20,"tag":25,"props":55,"children":56},{},[57],{"type":29,"value":58},"You may have heard of k-nearest neighbours (k-NN), support vector machines, random forests, etc. Using a supervised learning algorithm usually involves splitting your data into two sets: training data, on which the algorithm adjusts the parameters of its function so that the predicted values are as close as possible to the actual values, and test data, held back to evaluate how well the trained model generalizes.",{"type":20,"tag":25,"props":60,"children":61},{},[62],{"type":29,"value":63},"After a supervised learning algorithm has been trained, we can use it to make new predictions and to estimate the impact of each independent variable on the dependent variable.",{"type":20,"tag":43,"props":65,"children":67},{"id":66},"unsupervised-learning",[68],{"type":29,"value":69},"Unsupervised Learning",{"type":20,"tag":25,"props":71,"children":72},{},[73],{"type":29,"value":74},"These algorithms tend to focus on data exploration - for example, reducing the dimensionality of a dataset to better visualize it, finding how the data points are related to each other, or detecting anomalous data points. 
In unsupervised learning there is no dependent or independent variable, and there is no split of the dataset; instead the entire dataset is fed to the unsupervised algorithm.",{"type":20,"tag":25,"props":76,"children":77},{},[78],{"type":29,"value":79},"My favourite techniques - k-means, hierarchical clustering, and principal component analysis - are good examples of unsupervised learning, used for market segmentation or for dimensionality reduction as a preprocessing step to improve a supervised learning algorithm.",{"type":20,"tag":43,"props":81,"children":83},{"id":82},"structured-data-and-dummy-variables",[84],{"type":29,"value":85},"Structured Data and Dummy Variables",{"type":20,"tag":25,"props":87,"children":88},{},[89],{"type":29,"value":90},"Data science and statistical methods usually operate on structured data such as dataframes, which are basically Excel-like tables with columns, rows and strict rules on how the observed data points are inserted.",{"type":20,"tag":92,"props":93,"children":94},"table",{},[95,125],{"type":20,"tag":96,"props":97,"children":98},"thead",{},[99],{"type":20,"tag":100,"props":101,"children":102},"tr",{},[103,109,115,120],{"type":20,"tag":104,"props":105,"children":106},"th",{},[107],{"type":29,"value":108},"Variable 1",{"type":20,"tag":104,"props":110,"children":112},{"align":111},"center",[113],{"type":29,"value":114},"Variable 2",{"type":20,"tag":104,"props":116,"children":117},{"align":111},[118],{"type":29,"value":119},"Variable 
3",{"type":20,"tag":104,"props":121,"children":122},{"align":111},[123],{"type":29,"value":124},"Outcome",{"type":20,"tag":126,"props":127,"children":128},"tbody",{},[129,153,175,196,216,236],{"type":20,"tag":100,"props":130,"children":131},{},[132,138,143,148],{"type":20,"tag":133,"props":134,"children":135},"td",{},[136],{"type":29,"value":137},"0",{"type":20,"tag":133,"props":139,"children":140},{"align":111},[141],{"type":29,"value":142},"1",{"type":20,"tag":133,"props":144,"children":145},{"align":111},[146],{"type":29,"value":147},"Male",{"type":20,"tag":133,"props":149,"children":150},{"align":111},[151],{"type":29,"value":152},"3.21",{"type":20,"tag":100,"props":154,"children":155},{},[156,160,165,170],{"type":20,"tag":133,"props":157,"children":158},{},[159],{"type":29,"value":142},{"type":20,"tag":133,"props":161,"children":162},{"align":111},[163],{"type":29,"value":164},"5",{"type":20,"tag":133,"props":166,"children":167},{"align":111},[168],{"type":29,"value":169},"Female",{"type":20,"tag":133,"props":171,"children":172},{"align":111},[173],{"type":29,"value":174},"3.99",{"type":20,"tag":100,"props":176,"children":177},{},[178,182,187,191],{"type":20,"tag":133,"props":179,"children":180},{},[181],{"type":29,"value":142},{"type":20,"tag":133,"props":183,"children":184},{"align":111},[185],{"type":29,"value":186},"4",{"type":20,"tag":133,"props":188,"children":189},{"align":111},[190],{"type":29,"value":169},{"type":20,"tag":133,"props":192,"children":193},{"align":111},[194],{"type":29,"value":195},"2.70",{"type":20,"tag":100,"props":197,"children":198},{},[199,203,207,211],{"type":20,"tag":133,"props":200,"children":201},{},[202],{"type":29,"value":137},{"type":20,"tag":133,"props":204,"children":205},{"align":111},[206],{"type":29,"value":142},{"type":20,"tag":133,"props":208,"children":209},{"align":111},[210],{"type":29,"value":147},{"type":20,"tag":133,"props":212,"children":213},{"align":111},[214],{"type":29,"value":215},"3.19",{"type":20,"tag":10
0,"props":217,"children":218},{},[219,223,227,231],{"type":20,"tag":133,"props":220,"children":221},{},[222],{"type":29,"value":142},{"type":20,"tag":133,"props":224,"children":225},{"align":111},[226],{"type":29,"value":164},{"type":20,"tag":133,"props":228,"children":229},{"align":111},[230],{"type":29,"value":169},{"type":20,"tag":133,"props":232,"children":233},{"align":111},[234],{"type":29,"value":235},"4.00",{"type":20,"tag":100,"props":237,"children":238},{},[239,243,247,252],{"type":20,"tag":133,"props":240,"children":241},{},[242],{"type":29,"value":142},{"type":20,"tag":133,"props":244,"children":245},{"align":111},[246],{"type":29,"value":186},{"type":20,"tag":133,"props":248,"children":249},{"align":111},[250],{"type":29,"value":251},"Non-binary",{"type":20,"tag":133,"props":253,"children":254},{"align":111},[255],{"type":29,"value":256},"1.29",{"type":20,"tag":25,"props":258,"children":259},{},[260],{"type":29,"value":261},"This is an example of three independent variables and the dependent variable that we are trying to predict, which is the \"Outcome\" column. 
Let's examine the data types of each column:",{"type":20,"tag":263,"props":264,"children":265},"ul",{},[266,277,286],{"type":20,"tag":267,"props":268,"children":269},"li",{},[270,275],{"type":20,"tag":271,"props":272,"children":273},"strong",{},[274],{"type":29,"value":108},{"type":29,"value":276}," - a binary numerical column containing only 1s and 0s",{"type":20,"tag":267,"props":278,"children":279},{},[280,284],{"type":20,"tag":271,"props":281,"children":282},{},[283],{"type":29,"value":114},{"type":29,"value":285}," - a discrete numerical column that takes only whole numbers",{"type":20,"tag":267,"props":287,"children":288},{},[289,293],{"type":20,"tag":271,"props":290,"children":291},{},[292],{"type":29,"value":119},{"type":29,"value":294}," - a categorical column representing gender",{"type":20,"tag":25,"props":296,"children":297},{},[298],{"type":29,"value":299},"Some algorithms accept categorical values, but most do not, so data manipulation is required to convert categorical text data into numerical values.",{"type":20,"tag":92,"props":301,"children":302},{},[303,331],{"type":20,"tag":96,"props":304,"children":305},{},[306],{"type":20,"tag":100,"props":307,"children":308},{},[309,313,317,322,327],{"type":20,"tag":104,"props":310,"children":311},{},[312],{"type":29,"value":108},{"type":20,"tag":104,"props":314,"children":315},{"align":111},[316],{"type":29,"value":114},{"type":20,"tag":104,"props":318,"children":319},{"align":111},[320],{"type":29,"value":321},"Variable Male",{"type":20,"tag":104,"props":323,"children":324},{"align":111},[325],{"type":29,"value":326},"Variable 
Female",{"type":20,"tag":104,"props":328,"children":329},{"align":111},[330],{"type":29,"value":124},{"type":20,"tag":126,"props":332,"children":333},{},[334,357,380,403,426,449],{"type":20,"tag":100,"props":335,"children":336},{},[337,341,345,349,353],{"type":20,"tag":133,"props":338,"children":339},{},[340],{"type":29,"value":137},{"type":20,"tag":133,"props":342,"children":343},{"align":111},[344],{"type":29,"value":142},{"type":20,"tag":133,"props":346,"children":347},{"align":111},[348],{"type":29,"value":142},{"type":20,"tag":133,"props":350,"children":351},{"align":111},[352],{"type":29,"value":137},{"type":20,"tag":133,"props":354,"children":355},{"align":111},[356],{"type":29,"value":152},{"type":20,"tag":100,"props":358,"children":359},{},[360,364,368,372,376],{"type":20,"tag":133,"props":361,"children":362},{},[363],{"type":29,"value":142},{"type":20,"tag":133,"props":365,"children":366},{"align":111},[367],{"type":29,"value":164},{"type":20,"tag":133,"props":369,"children":370},{"align":111},[371],{"type":29,"value":137},{"type":20,"tag":133,"props":373,"children":374},{"align":111},[375],{"type":29,"value":142},{"type":20,"tag":133,"props":377,"children":378},{"align":111},[379],{"type":29,"value":174},{"type":20,"tag":100,"props":381,"children":382},{},[383,387,391,395,399],{"type":20,"tag":133,"props":384,"children":385},{},[386],{"type":29,"value":142},{"type":20,"tag":133,"props":388,"children":389},{"align":111},[390],{"type":29,"value":186},{"type":20,"tag":133,"props":392,"children":393},{"align":111},[394],{"type":29,"value":137},{"type":20,"tag":133,"props":396,"children":397},{"align":111},[398],{"type":29,"value":142},{"type":20,"tag":133,"props":400,"children":401},{"align":111},[402],{"type":29,"value":195},{"type":20,"tag":100,"props":404,"children":405},{},[406,410,414,418,422],{"type":20,"tag":133,"props":407,"children":408},{},[409],{"type":29,"value":137},{"type":20,"tag":133,"props":411,"children":412},{"align":111},[413],{"type":29,"
value":142},{"type":20,"tag":133,"props":415,"children":416},{"align":111},[417],{"type":29,"value":142},{"type":20,"tag":133,"props":419,"children":420},{"align":111},[421],{"type":29,"value":137},{"type":20,"tag":133,"props":423,"children":424},{"align":111},[425],{"type":29,"value":215},{"type":20,"tag":100,"props":427,"children":428},{},[429,433,437,441,445],{"type":20,"tag":133,"props":430,"children":431},{},[432],{"type":29,"value":142},{"type":20,"tag":133,"props":434,"children":435},{"align":111},[436],{"type":29,"value":164},{"type":20,"tag":133,"props":438,"children":439},{"align":111},[440],{"type":29,"value":137},{"type":20,"tag":133,"props":442,"children":443},{"align":111},[444],{"type":29,"value":142},{"type":20,"tag":133,"props":446,"children":447},{"align":111},[448],{"type":29,"value":235},{"type":20,"tag":100,"props":450,"children":451},{},[452,456,460,464,468],{"type":20,"tag":133,"props":453,"children":454},{},[455],{"type":29,"value":142},{"type":20,"tag":133,"props":457,"children":458},{"align":111},[459],{"type":29,"value":186},{"type":20,"tag":133,"props":461,"children":462},{"align":111},[463],{"type":29,"value":137},{"type":20,"tag":133,"props":465,"children":466},{"align":111},[467],{"type":29,"value":137},{"type":20,"tag":133,"props":469,"children":470},{"align":111},[471],{"type":29,"value":256},{"type":20,"tag":25,"props":473,"children":474},{},[475],{"type":29,"value":476},"Notice how the Non-binary observation has 0 in both columns - Variable Male and Variable Female. This is where the two approaches diverge: there are two different ways to feed gender values to the algorithm.",{"type":20,"tag":43,"props":478,"children":480},{"id":479},"n-value-dummy-variables",[481],{"type":29,"value":482},"n-value dummy variables",{"type":20,"tag":25,"props":484,"children":485},{},[486],{"type":29,"value":487},"We have three distinct values - Male, Female and Non-binary - hence n equals 3. 
With n-value dummy variables we map gender in the following manner:",{"type":20,"tag":489,"props":490,"children":492},"pre",{"code":491},"Male       → (1, 0, 0)\nFemale     → (0, 1, 0)\nNon-binary → (0, 0, 1)\n",[493],{"type":20,"tag":494,"props":495,"children":496},"code",{"__ignoreMap":7},[497],{"type":29,"value":491},{"type":20,"tag":25,"props":499,"children":500},{},[501],{"type":29,"value":502},"This ensures that each category is equally distant from the others. The Euclidean distance between any two categories is:",{"type":20,"tag":489,"props":504,"children":506},{"code":505},"distance(Male, Female)       = sqrt((1-0)^2 + (0-1)^2 + (0-0)^2) = sqrt(2) ≈ 1.41\ndistance(Male, Non-binary)   = sqrt((1-0)^2 + (0-0)^2 + (0-1)^2) = sqrt(2) ≈ 1.41\ndistance(Female, Non-binary) = sqrt((0-0)^2 + (1-0)^2 + (0-1)^2) = sqrt(2) ≈ 1.41\n",[507],{"type":20,"tag":494,"props":508,"children":509},{"__ignoreMap":7},[510],{"type":29,"value":505},{"type":20,"tag":25,"props":512,"children":513},{},[514],{"type":20,"tag":515,"props":516,"children":519},"img",{"alt":517,"src":518},"Three value dummy variable","/img/img44.png",[],{"type":20,"tag":43,"props":521,"children":523},{"id":522},"n-1-dummy-variables",[524],{"type":29,"value":525},"n-1 dummy variables",{"type":20,"tag":25,"props":527,"children":528},{},[529],{"type":29,"value":530},"Using only 2 dummy variables means we drop one category:",{"type":20,"tag":489,"props":532,"children":534},{"code":533},"Male       → (1, 0)\nFemale     → (0, 1)\nNon-binary → (0, 0)\n",[535],{"type":20,"tag":494,"props":536,"children":537},{"__ignoreMap":7},[538],{"type":29,"value":533},{"type":20,"tag":25,"props":540,"children":541},{},[542],{"type":29,"value":543},"Now let's check the distances on a two-dimensional plane:",{"type":20,"tag":489,"props":545,"children":547},{"code":546},"distance(Male, Female)       = sqrt((1-0)^2 + (0-1)^2) = sqrt(2) ≈ 1.41\ndistance(Male, Non-binary)   = sqrt((1-0)^2 + (0-0)^2) = sqrt(1) = 1.00\ndistance(Female, 
Non-binary) = sqrt((0-0)^2 + (1-0)^2) = sqrt(1) = 1.00\n",[548],{"type":20,"tag":494,"props":549,"children":550},{"__ignoreMap":7},[551],{"type":29,"value":546},{"type":20,"tag":25,"props":553,"children":554},{},[555],{"type":29,"value":556},"Both Male and Female are closer to Non-binary than they are to each other, distorting the true relationships between categories.",{"type":20,"tag":25,"props":558,"children":559},{},[560],{"type":20,"tag":515,"props":561,"children":564},{"alt":562,"src":563},"Two value dummy variable","/img/img43.png",[],{"type":20,"tag":43,"props":566,"children":568},{"id":567},"tradeoff-between-n-and-n-1-approaches",[569],{"type":29,"value":570},"Tradeoff between n and n-1 approaches",{"type":20,"tag":25,"props":572,"children":573},{},[574],{"type":29,"value":575},"Have you heard of multicollinearity? That's exactly what happens with n-value dummy variables — each variable is completely and linearly determined by the others. Algebraically it's a linear dependence, meaning one column is a linear combination of the other columns.",{"type":20,"tag":489,"props":577,"children":579},{"code":578},"Variable_Male + Variable_Female + Variable_NonBinary = 1  (always)\n",[580],{"type":20,"tag":494,"props":581,"children":582},{"__ignoreMap":7},[583],{"type":29,"value":578},{"type":20,"tag":25,"props":585,"children":586},{},[587],{"type":29,"value":588},"Multicollinearity causes computational problems for linear and logistic regression, so for those algorithms we should use n-1 dummy variables rather than all n.",{"type":20,"tag":25,"props":590,"children":591},{},[592],{"type":29,"value":593},"On the other hand, for algorithms like k-NN where distances between data points are crucial, we don't want to drop any category, as that would skew the distances and lead to suboptimal 
performance.",{"title":7,"searchDepth":595,"depth":595,"links":596},2,[597,598,599,600,601,602],{"id":45,"depth":595,"text":48},{"id":66,"depth":595,"text":69},{"id":82,"depth":595,"text":85},{"id":479,"depth":595,"text":482},{"id":522,"depth":595,"text":525},{"id":567,"depth":595,"text":570},"markdown","content:posts:two-approaches-to-create-a-dummy-variable.md","content","posts/two-approaches-to-create-a-dummy-variable.md","posts/two-approaches-to-create-a-dummy-variable","md",1775831731491]