Flirt With Julia

Learn the Julia programming language through real-world examples.

K Means Clustering

12 Jan 2019

Install & Import Dependencies


using Pkg
Pkg.add(["Clustering", "CSV", "DataFrames", "StatPlots"])

using Clustering, CSV, DataFrames, StatPlots
plotly() # set Plotly as the StatPlots backend
  • Clustering is the package we will use to do our k-means clustering
  • CSV will help us read in our data from the .csv file we will be working with
  • DataFrames will allow us to work with our data as a Data Frame
  • We’ll use StatPlots to generate scatter plots of our data

For this project, we’ll be using the Iris Species dataset, which you can download here. Once you’ve saved the dataset, go ahead and read it into a Data Frame and set normalizenames to true so that the spaces in the column titles don’t cause you problems later on.

Read in the data


iris_dataset = CSV.read("iris.csv", normalizenames = true)


Output:

150 rows × 6 columns (first 30 rows shown)

│ Row │ Id     │ SepalLengthCm │ SepalWidthCm │ PetalLengthCm │ PetalWidthCm │ Species     │
│     │ Int64⍰ │ Float64⍰      │ Float64⍰     │ Float64⍰      │ Float64⍰     │ String⍰     │
├─────┼────────┼───────────────┼──────────────┼───────────────┼──────────────┼─────────────┤
│ 1   │ 1      │ 5.1           │ 3.5          │ 1.4           │ 0.2          │ Iris-setosa │
│ 2   │ 2      │ 4.9           │ 3.0          │ 1.4           │ 0.2          │ Iris-setosa │
│ 3   │ 3      │ 4.7           │ 3.2          │ 1.3           │ 0.2          │ Iris-setosa │
│ 4   │ 4      │ 4.6           │ 3.1          │ 1.5           │ 0.2          │ Iris-setosa │
│ 5   │ 5      │ 5.0           │ 3.6          │ 1.4           │ 0.2          │ Iris-setosa │
│ 6   │ 6      │ 5.4           │ 3.9          │ 1.7           │ 0.4          │ Iris-setosa │
│ 7   │ 7      │ 4.6           │ 3.4          │ 1.4           │ 0.3          │ Iris-setosa │
│ 8   │ 8      │ 5.0           │ 3.4          │ 1.5           │ 0.2          │ Iris-setosa │
│ 9   │ 9      │ 4.4           │ 2.9          │ 1.4           │ 0.2          │ Iris-setosa │
│ 10  │ 10     │ 4.9           │ 3.1          │ 1.5           │ 0.1          │ Iris-setosa │
│ 11  │ 11     │ 5.4           │ 3.7          │ 1.5           │ 0.2          │ Iris-setosa │
│ 12  │ 12     │ 4.8           │ 3.4          │ 1.6           │ 0.2          │ Iris-setosa │
│ 13  │ 13     │ 4.8           │ 3.0          │ 1.4           │ 0.1          │ Iris-setosa │
│ 14  │ 14     │ 4.3           │ 3.0          │ 1.1           │ 0.1          │ Iris-setosa │
│ 15  │ 15     │ 5.8           │ 4.0          │ 1.2           │ 0.2          │ Iris-setosa │
│ 16  │ 16     │ 5.7           │ 4.4          │ 1.5           │ 0.4          │ Iris-setosa │
│ 17  │ 17     │ 5.4           │ 3.9          │ 1.3           │ 0.4          │ Iris-setosa │
│ 18  │ 18     │ 5.1           │ 3.5          │ 1.4           │ 0.3          │ Iris-setosa │
│ 19  │ 19     │ 5.7           │ 3.8          │ 1.7           │ 0.3          │ Iris-setosa │
│ 20  │ 20     │ 5.1           │ 3.8          │ 1.5           │ 0.3          │ Iris-setosa │
│ 21  │ 21     │ 5.4           │ 3.4          │ 1.7           │ 0.2          │ Iris-setosa │
│ 22  │ 22     │ 5.1           │ 3.7          │ 1.5           │ 0.4          │ Iris-setosa │
│ 23  │ 23     │ 4.6           │ 3.6          │ 1.0           │ 0.2          │ Iris-setosa │
│ 24  │ 24     │ 5.1           │ 3.3          │ 1.7           │ 0.5          │ Iris-setosa │
│ 25  │ 25     │ 4.8           │ 3.4          │ 1.9           │ 0.2          │ Iris-setosa │
│ 26  │ 26     │ 5.0           │ 3.0          │ 1.6           │ 0.2          │ Iris-setosa │
│ 27  │ 27     │ 5.0           │ 3.4          │ 1.6           │ 0.4          │ Iris-setosa │
│ 28  │ 28     │ 5.2           │ 3.5          │ 1.5          │ 0.2          │ Iris-setosa │
│ 29  │ 29     │ 5.2           │ 3.4          │ 1.4          │ 0.2          │ Iris-setosa │
│ 30  │ 30     │ 4.7           │ 3.2          │ 1.6          │ 0.2          │ Iris-setosa │


You can see from the output above that the dataset includes an Id column that we don’t need. Let’s delete it:

deletecols!(iris_dataset, 1)


In Julia, functions whose names end in ! modify their arguments in place, rather than returning a modified copy. The ! serves as a warning to the programmer that the contents of whatever the function is applied to will be changed (there's a quick sort/sort! aside after the output below).

Output:

150 rows × 5 columns (first 30 rows shown)

│ Row │ SepalLengthCm │ SepalWidthCm │ PetalLengthCm │ PetalWidthCm │ Species     │
│     │ Float64⍰      │ Float64⍰     │ Float64⍰      │ Float64⍰     │ String⍰     │
├─────┼───────────────┼──────────────┼───────────────┼──────────────┼─────────────┤
│ 1   │ 5.1           │ 3.5          │ 1.4           │ 0.2          │ Iris-setosa │
│ 2   │ 4.9           │ 3.0          │ 1.4           │ 0.2          │ Iris-setosa │
│ 3   │ 4.7           │ 3.2          │ 1.3           │ 0.2          │ Iris-setosa │
│ 4   │ 4.6           │ 3.1          │ 1.5           │ 0.2          │ Iris-setosa │
│ 5   │ 5.0           │ 3.6          │ 1.4           │ 0.2          │ Iris-setosa │
│ 6   │ 5.4           │ 3.9          │ 1.7           │ 0.4          │ Iris-setosa │
│ 7   │ 4.6           │ 3.4          │ 1.4           │ 0.3          │ Iris-setosa │
│ 8   │ 5.0           │ 3.4          │ 1.5           │ 0.2          │ Iris-setosa │
│ 9   │ 4.4           │ 2.9          │ 1.4           │ 0.2          │ Iris-setosa │
│ 10  │ 4.9           │ 3.1          │ 1.5           │ 0.1          │ Iris-setosa │
│ 11  │ 5.4           │ 3.7          │ 1.5           │ 0.2          │ Iris-setosa │
│ 12  │ 4.8           │ 3.4          │ 1.6           │ 0.2          │ Iris-setosa │
│ 13  │ 4.8           │ 3.0          │ 1.4           │ 0.1          │ Iris-setosa │
│ 14  │ 4.3           │ 3.0          │ 1.1           │ 0.1          │ Iris-setosa │
│ 15  │ 5.8           │ 4.0          │ 1.2           │ 0.2          │ Iris-setosa │
│ 16  │ 5.7           │ 4.4          │ 1.5           │ 0.4          │ Iris-setosa │
│ 17  │ 5.4           │ 3.9          │ 1.3           │ 0.4          │ Iris-setosa │
│ 18  │ 5.1           │ 3.5          │ 1.4           │ 0.3          │ Iris-setosa │
│ 19  │ 5.7           │ 3.8          │ 1.7           │ 0.3          │ Iris-setosa │
│ 20  │ 5.1           │ 3.8          │ 1.5           │ 0.3          │ Iris-setosa │
│ 21  │ 5.4           │ 3.4          │ 1.7           │ 0.2          │ Iris-setosa │
│ 22  │ 5.1           │ 3.7          │ 1.5           │ 0.4          │ Iris-setosa │
│ 23  │ 4.6           │ 3.6          │ 1.0           │ 0.2          │ Iris-setosa │
│ 24  │ 5.1           │ 3.3          │ 1.7           │ 0.5          │ Iris-setosa │
│ 25  │ 4.8           │ 3.4          │ 1.9           │ 0.2          │ Iris-setosa │
│ 26  │ 5.0           │ 3.0          │ 1.6           │ 0.2          │ Iris-setosa │
│ 27  │ 5.0           │ 3.4          │ 1.6           │ 0.4          │ Iris-setosa │
│ 28  │ 5.2           │ 3.5          │ 1.5           │ 0.2          │ Iris-setosa │
│ 29  │ 5.2           │ 3.4          │ 1.4           │ 0.2          │ Iris-setosa │
│ 30  │ 4.7           │ 3.2          │ 1.6           │ 0.2          │ Iris-setosa │


That’s better 👍
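
As a quick aside on the ! convention mentioned above, here's a tiny illustration using Julia's built-in sort and sort! (nothing to do with our iris data):

v = [3, 1, 2]

sort(v)   # returns a new sorted array; v is still [3, 1, 2]
sort!(v)  # sorts v in place; v is now [1, 2, 3]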

Plot the petal lengths/widths

Let’s go ahead and generate a scatter plot to see what our petal data look like. We make use of the @df macro available via StatPlots to generate our scatter plot from the iris_dataset Data Frame, passing in the two column names (Symbols) for the data that we want to plot:

@df iris_dataset scatter(:PetalLengthCm, :PetalWidthCm)


[Scatter plot of PetalLengthCm vs. PetalWidthCm: 2019-01-12-scatter.png]

We can see from the above scatter plot that there are definitely at least two distinct clusters (in reality, we already know that there are 3 clusters/iris types because the data are labeled but, shhhh, don’t tell anyone 🤐).
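
If you'd like to see those labels for yourself, one quick check is to ask for the unique values of the Species column (depending on your DataFrames version, iris_dataset.Species works here too):

unique(iris_dataset[:Species])  # should list the three species: setosa, versicolor, virginica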

We’re getting closer to being able to unleash the power of k-means clustering on our data, but there are still a couple of additional steps that we must take.

Format our data

First, the kmeans function from our Clustering package is expecting a Matrix as the first argument, and the number of clusters as the second argument. The second part we can handle, but we don’t have a Matrix yet - we have a Data Frame. No worries though, we can convert our Data Frame to a Matrix like so (note that it really just ends up as a two-dimensional Array):

iris_matrix = convert(Matrix, iris_dataset)

Output:

    150×5 Array{Any,2}:
     5.1  3.5  1.4  0.2  "Iris-setosa"   
     4.9  3.0  1.4  0.2  "Iris-setosa"   
     4.7  3.2  1.3  0.2  "Iris-setosa"   
     4.6  3.1  1.5  0.2  "Iris-setosa"   
     5.0  3.6  1.4  0.2  "Iris-setosa"   
     5.4  3.9  1.7  0.4  "Iris-setosa"   
     4.6  3.4  1.4  0.3  "Iris-setosa"   
     5.0  3.4  1.5  0.2  "Iris-setosa"   
     4.4  2.9  1.4  0.2  "Iris-setosa"   
     4.9  3.1  1.5  0.1  "Iris-setosa"   
     5.4  3.7  1.5  0.2  "Iris-setosa"   
     4.8  3.4  1.6  0.2  "Iris-setosa"   
     4.8  3.0  1.4  0.1  "Iris-setosa"   
     ⋮                                   
     6.0  3.0  4.8  1.8  "Iris-virginica"
     6.9  3.1  5.4  2.1  "Iris-virginica"
     6.7  3.1  5.6  2.4  "Iris-virginica"
     6.9  3.1  5.1  2.3  "Iris-virginica"
     5.8  2.7  5.1  1.9  "Iris-virginica"
     6.8  3.2  5.9  2.3  "Iris-virginica"
     6.7  3.3  5.7  2.5  "Iris-virginica"
     6.7  3.0  5.2  2.3  "Iris-virginica"
     6.3  2.5  5.0  1.9  "Iris-virginica"
     6.5  3.0  5.2  2.0  "Iris-virginica"
     6.2  3.4  5.4  2.3  "Iris-virginica"
     5.9  3.0  5.1  1.8  "Iris-virginica"


Before we go any further, let’s have a quick look at the Clustering docs (you don’t really have to, I already did it for you 😉). The docs state that each column of our Matrix is treated as a sample.

Oh 💩!! In our Matrix, each row is a sample. What ever shall we do?!? 😱 We shall permute the dimensions of the Matrix - that’s what. WTF does permute the dimensions mean? It’s just a geeky way of saying that we need to re-order the dimensions of the Matrix. In other words, we just need to change the shape of our Matrix. The next block of code handles this:

iris_matrix_reshaped = permutedims(convert(Matrix{Float64}, iris_matrix[:, 1:4]), [2, 1])

Output:

    4×150 Array{Float64,2}:
     5.1  4.9  4.7  4.6  5.0  5.4  4.6  5.0  …  6.8  6.7  6.7  6.3  6.5  6.2  5.9
     3.5  3.0  3.2  3.1  3.6  3.9  3.4  3.4     3.2  3.3  3.0  2.5  3.0  3.4  3.0
     1.4  1.4  1.3  1.5  1.4  1.7  1.4  1.5     5.9  5.7  5.2  5.0  5.2  5.4  5.1
     0.2  0.2  0.2  0.2  0.2  0.4  0.3  0.2     2.3  2.5  2.3  1.9  2.0  2.3  1.8

You might be tempted to use transpose in this situation - don't. transpose is intended specifically for linear algebra; you should use permutedims for general data manipulation.

So what’s happening here? We are converting the first four columns (as the last column contains Strings of the species name) of every row to a Matrix that holds data of type Float64, and then passing this Matrix to permutedims. The : in iris_matrix[:, 1:4] just means that for every row, we want columns 1:4 (as an aside, if I wrote it like this [1:4, :], it would mean ‘give me every column for rows 1:4’). The second argument that we’re feeding to permutedims is a vector specifying a permutation of length ndims(iris_matrix) (the number of dimensions of iris_matrix). Since iris_matrix is a 2D array, we pass the vector [2, 1] as we want to reverse our rows/columns.
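
If the slicing and the [2, 1] permutation still feel abstract, here's a small, self-contained sketch on a toy matrix (nothing to do with the iris data) that mirrors what we just did:

A = [1 2 3 4;
     5 6 7 8]             # a 2×4 Matrix{Int64}

A[:, 1:2]                 # every row, columns 1:2 -> a 2×2 Matrix
A[1:1, :]                 # row 1 only, every column -> a 1×4 Matrix

permutedims(A, [2, 1])    # eagerly builds a new 4×2 Matrix with rows and columns swapped
transpose(A)              # a lazy linear-algebra wrapper around A, not a plain reordered copy

permutedims hands us back an ordinary Matrix, which is exactly what we want for general data shuffling like this.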

Get your cluster on 🙈

Now that our Matrix is shaped correctly, let’s do some k-means clustering. We’ll pass our Matrix into the kmeans function, specify 3 as the number of clusters (we’re cheating since we already know the right k to choose), and then plot the results:

results = kmeans(iris_matrix_reshaped, 3)
@df iris_dataset scatter(:PetalLengthCm, :PetalWidthCm, color = results.assignments, hover = iris_dataset[5])


[Scatter plot of PetalLengthCm vs. PetalWidthCm, colored by cluster assignment: 2019-01-12-scatter.png]

Pretty sweet, right? The algorithm did a pretty good job, with the only errors occurring in the area where the virginica and versicolor measurements start to become blurred (this is to be expected).
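
If you want to poke around beyond the plot, the object returned by kmeans carries a few handy pieces - a quick look, using the field and accessor names from the Clustering package:

results.centers          # a 4×3 matrix: one column of feature values per cluster center
counts(results)          # how many samples landed in each of the 3 clusters
assignments(results)     # the per-sample cluster labels we used to color the plot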

There are a lot of great resources online that explain how to choose the right value for k when doing k-means clustering. In a future post, I’ll explore how to do this with Julia.
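
Before that post lands, here's a rough sketch of one popular heuristic, the "elbow" method: run kmeans over a range of k values and look for the point where the total within-cluster cost stops dropping quickly. (totalcost is a field on the result object; the plot call assumes the StatPlots setup from earlier.)

ks = 2:6
costs = [kmeans(iris_matrix_reshaped, k).totalcost for k in ks]
plot(ks, costs, xlabel = "k", ylabel = "total within-cluster cost", marker = :circle, legend = false)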

Until next time, I hope you’ve enjoyed flirting with Julia! 💘