Flirt With Julia

Learn the Julia programming language through real-world examples.

Random Forest In Julia

21 Jan 2019

In this post we’ll learn to build a random forest 🌲🌳 in Julia and use it to make simple predictions about income levels. We’ll pull in some data through an HTTP get request, view it as a DataFrame, do some minor cleanup work, then build & test our model. Get ready to embrace your inner (decision) tree-hugger! 🌎🌺✌️

Install/Load Dependencies

# using Pkg
# Pkg.add(["CSV", "DataFrames", "DecisionTree", "HTTP"])
using CSV, DataFrames, DecisionTree, HTTP


  • The dataset for this post is in CSV format so we’ll use the CSV package to parse it
  • As usual, I’ll be using DataFrames to view the data that we’ll be working with in a nice format (this step isn’t imperative, I just like to inspect my data in this format before going further)
  • DecisionTree is the awesome package that we’ll use to build our random forest model
  • HTTP is the package that will allow us to pull in the data from the URL

Okay, now that we have everything we need, let’s get the data we’ll be working with and take a look at it. We’ll be using our random forest to solve a classification problem, so I’ve selected a dataset from the UC Irvine Machine Learning Repository that is derived from Census Bureau data and includes 14 features for a list of individuals, along with labels indicating whether each individual’s income is >50K or <=50K (U.S. dollars).

Get The Data

res = HTTP.get("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data")
training_data = CSV.read(IOBuffer(res.body), datarow = 1)


Output:

32,562 rows × 15 columns

Column1 | Column2 | Column3 | Column4 | Column5 | Column6 | Column7 | Column8 | Column9 | Column10 | Column11 | Column12 | Column13 | Column14 | Column15
Int64⍰ | String⍰ | Int64⍰ | String⍰ | Int64⍰ | String⍰ | String⍰ | String⍰ | String⍰ | String⍰ | Int64⍰ | Int64⍰ | Int64⍰ | String⍰ | String⍰
39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K
50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K
38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K
53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K
28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K
37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K
49 | Private | 160187 | 9th | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica | <=50K
52 | Self-emp-not-inc | 209642 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States | >50K
31 | Private | 45781 | Masters | 14 | Never-married | Prof-specialty | Not-in-family | White | Female | 14084 | 0 | 50 | United-States | >50K
42 | Private | 159449 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 5178 | 0 | 40 | United-States | >50K
⋮


In this block, we defined a variable res to store the response from the HTTP GET request that we made to the supplied URL. Then, we defined another variable, training_data, to store our DataFrame, which we constructed using the CSV package. We did so by making use of the IOBuffer() function, passing it the body of our response object (a vector of raw bytes), and reading the result in via CSV.read. We included the datarow = 1 argument to let the read function know that the data starts on the first row (there is no header row in this dataset). To learn more about IOBuffer, check out the Julia docs for a complete explanation and plenty of examples. For this post, all you need to know is that IOBuffer() wraps an in-memory string or byte vector in an I/O stream, so you can read it into an Array/DataFrame just as you would a file.
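If the IOBuffer pattern is new to you, here’s a minimal sketch of the same idea on a hand-written string. The CSV content below is made up purely for illustration, and the CSV.read call assumes the same CSV.jl version used above (newer releases expect a sink argument, e.g. CSV.read(io, DataFrame)).

# A tiny, hypothetical CSV held entirely in memory
csv_string = "39,State-gov,Bachelors\n50,Self-emp-not-inc,Bachelors"
io = IOBuffer(csv_string)             # wrap the string in an in-memory I/O stream
mini_df = CSV.read(io, datarow = 1)   # CSV.read consumes it just like a file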

Rename Columns

Since our dataset didn’t come with a header row, this next step adds meaningful column names to our training_data DataFrame. As with the creation of the DataFrame itself, this step isn’t necessary; it’s just something that I like to do to be able to inspect the data before working with it further.


column_names = [
    :age,
    :work_class,
    :final_weight,
    :education,
    :education_num,
    :marital_status,
    :occupation,
    :relationship,
    :race,
    :sex,
    :capital_gain,
    :capital_loss,
    :hours_per_week,
    :native_country,
    :threshold
]
# Rename each auto-generated column (Column1, Column2, ...) to its descriptive name
for (i, name) in enumerate(names(training_data))
    rename!(training_data, name => column_names[i])
end
training_data # this line is here just to get the output to print in Jupyter


Output:

32,562 rows × 15 columns

age | work_class | final_weight | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | threshold
Int64⍰ | String⍰ | Int64⍰ | String⍰ | Int64⍰ | String⍰ | String⍰ | String⍰ | String⍰ | String⍰ | Int64⍰ | Int64⍰ | Int64⍰ | String⍰ | String⍰
39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K
50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K
38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K
53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K
28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K
37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K
49 | Private | 160187 | 9th | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica | <=50K
52 | Self-emp-not-inc | 209642 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States | >50K
31 | Private | 45781 | Masters | 14 | Never-married | Prof-specialty | Not-in-family | White | Female | 14084 | 0 | 50 | United-States | >50K
42 | Private | 159449 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 5178 | 0 | 40 | United-States | >50K
⋮


First, we define an array of appropriately named Symbols that will become our column names. The names come from the description of the dataset, available at the URL above. Next, we loop through the current column names of our training_data DataFrame (obtained via the names function) using the enumerate function, which gives us both the index i and the value name of each item. Inside the loop we call the rename! function on our DataFrame, changing each column name to the corresponding (by index) name from the column_names array.
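Incidentally, the loop can usually be collapsed into a single call. Treat the line below as a hedged sketch rather than a drop-in replacement, since the exact function depends on your DataFrames.jl version:

# Newer DataFrames.jl releases accept the whole vector of new names at once:
rename!(training_data, column_names)
# On the older releases that were current when this post was written, the
# equivalent call was names!(training_data, column_names).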

Now that we have our data dressed up and looking sexy 👗💅💄👄👠, let’s just admire it a bit, get to know it, and then make our move…towards building the model! 😜

Build Model

Before we can actually build the model, we need to separate our data into one Array containing the features, and another containing the labels. This is pretty straightforward in Julia with the convert function. I chose to drop rows that have missing values, but you can choose to leave them in if you’d like.
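If you’re curious how many rows that decision actually affects, here’s a quick sketch (assuming your DataFrames.jl version provides eachrow and nrow, which recent releases do):

# Count the rows that contain at least one missing value before dropping them
n_incomplete = count(row -> any(ismissing, row), eachrow(training_data))
println("Rows with missing values: ", n_incomplete, " of ", nrow(training_data))

Either way, the conversion itself looks like this: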


features = convert(Array, dropmissing(training_data)[1:14])
labels = convert(Array, dropmissing(training_data)[15])
model = build_forest(labels, features)


Output:

Ensemble of Decision Trees
Trees:      10
Avg Leaves: 3376.1
Avg Depth:  48.2


As you can see from this code block, we convert the first 14 columns of training_data to an Array called features and the last column to a labels Array. Then we call the build_forest function from the DecisionTree package, passing in the labels and features. build_forest accepts several additional parameters, but I chose to use the default values for this post; check out the docs to learn more, and see the sketch at the end of this section for an example with explicit values.

The output from build_forest informs us of the primary characteristics of our random forest, which we will put to the test in the next step!
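If you do want to experiment beyond the defaults, build_forest takes its hyperparameters as positional arguments. The values below are arbitrary choices for illustration, not tuned settings:

# build_forest(labels, features, n_subfeatures, n_trees, partial_sampling, max_depth)
# e.g. ~sqrt(14) candidate features per split, 100 trees, 70% subsampling, unlimited depth
tuned_model = build_forest(labels, features, 4, 100, 0.7, -1)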

Test Model

One of the nice things about this dataset is that the test data is already separated out into its own file. If you’re new to machine learning, check out the Wikipedia article on training, validation, and test datasets to learn more. Let’s pull in our test dataset using the same functions that we used for the training data (with the caveat that the first line of this file isn’t data, so we pass datarow = 2 and header = false to skip it) and then rename the columns.


res_test = HTTP.get("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test")
test_data = CSV.read(IOBuffer(String(res_test.body)), datarow = 2, header = false)
# Apply the same descriptive column names to the test set
for (i, name) in enumerate(names(test_data))
    rename!(test_data, name => column_names[i])
end
test_data # this line is here just to get the output to print in Jupyter


Output:

16,282 rows × 15 columns

age | work_class | final_weight | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | threshold
Int64⍰ | String⍰ | Int64⍰ | String⍰ | Int64⍰ | String⍰ | String⍰ | String⍰ | String⍰ | String⍰ | Int64⍰ | Int64⍰ | Int64⍰ | String⍰ | String⍰
25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K.
38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K.
28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K.
44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K.
18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K.
34 | Private | 198693 | 10th | 6 | Never-married | Other-service | Not-in-family | White | Male | 0 | 0 | 30 | United-States | <=50K.
29 | ? | 227026 | HS-grad | 9 | Never-married | ? | Unmarried | Black | Male | 0 | 0 | 40 | United-States | <=50K.
63 | Self-emp-not-inc | 104626 | Prof-school | 15 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 3103 | 0 | 32 | United-States | >50K.
24 | Private | 369667 | Some-college | 10 | Never-married | Other-service | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K.
55 | Private | 104996 | 7th-8th | 4 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 10 | United-States | <=50K.
⋮


Now that we have our test dataset, we need to again create two Arrays, one for our features and another for our labels, and then make use of the apply_forest function to obtain our predictions. Since I used dropmissing for our training data, I did the same for our test data. Note that our test labels include a period at the end, which I guess someone thought would be funny way back in 1996 when the dataset was created. As a result, we have to perform the extra step of chopping it off 🔪.

After that, we can construct an Array of predictions with the apply_forest function. We do this with Julia’s array comprehension feature: we define an Array predictions and build it by looping over every row of our two-dimensional features_test Array, grabbing all of the columns for the given row, and passing them (along with our model from the previous step) to apply_forest. Note that size(features_test)[1] gives us the number of rows to loop over: the size function returns the dimensions of the Array, and the first of those (the row count) is what we need.

features_test = convert(Array, dropmissing(test_data)[1:14])
labels_test = convert(Array, dropmissing(test_data)[15])
labels_test_formatted = [chop(x) for x in labels_test]   # strip the trailing "." from each label

predictions = [apply_forest(model, features_test[i, :]) for i = 1:size(features_test)[1]]


Output:

16281-element Array{String,1}:
 " <=50K"
 " <=50K"
 " <=50K"
 " >50K" 
 " <=50K"
 " <=50K"
 " <=50K"
 " >50K" 
 " <=50K"
 " <=50K"
 " >50K" 
 " >50K" 
 " <=50K"
 ⋮       
 " <=50K"
 " <=50K"
 " <=50K"
 " <=50K"
 " <=50K"
 " >50K" 
 " <=50K"
 " <=50K"
 " <=50K"
 " >50K" 
 " <=50K"
 " >50K" 


You can see from the output above that we now have an Array of predicted labels for our test features. To check how well our model did, we’ll simply compute the percentage of correct predictions as follows:


corrects = predictions .== labels_test_formatted
percent_correct = count(i -> i == true, corrects) / length(corrects)


Output:

0.8514219028315214


In the code above, we check whether each item in predictions matches its corresponding item in labels_test_formatted by broadcasting the == equality operator via dot syntax. This produces a BitArray (not a new cryptocurrency), which we then use to compute the percentage of correct predictions. That last part is achieved by calling the count function on the corrects BitArray, passing it a predicate that checks whether i == true; count returns the number of elements that pass the test, which we divide by the length of the BitArray. On this run, you can see that we achieved roughly 85% accuracy with just the default settings!! 👍
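If you’d like a richer breakdown than a single accuracy number, DecisionTree also provides a confusion_matrix helper. A hedged sketch (the exact printout depends on your package version):

# Compare actual vs. predicted labels; the result includes the matrix,
# overall accuracy, and Cohen's kappa.
cm = confusion_matrix(labels_test_formatted, predictions)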

With that, I’m going to consider this post finished 🏁 Until next time, I hope you’ve enjoyed flirting with Julia! 💘