Flirt With Julia

Learn the Julia programming language through real-world examples.

CSV to JSON

28 Jan 2019

In this post we’ll produce a JSON file with Julia. “Why?”, you ask. Maybe you are writing an API in Julia that returns JSON responses, or maybe you just need to convert some data into JSON format so that you can use it with a cool JavaScript visualization library (like me 😎). To be honest, I would pretty much always use JavaScript (Node.js) to format data that I am feeding to a JavaScript library, but I thought it would be a lot of fun to do it with Julia 🙊.

If you’re not interested in building a JSON object with Julia, allow me to let you in on a little secret - what we’re really doing here is manipulating data from a CSV file into a nested Dictionary. Surely you want to learn about that, right!?

For this post I grabbed some 2016 employment data from the state of Rhode Island. The data live in a CSV file, so we’ll use Julia to produce a JSON file in a very specific format (more on this in a minute).

Let’s get started!

Install & Load Dependencies

# using Pkg
# Pkg.add(["CSV", "DataFrames", "JSON", "Statistics"])
using CSV, DataFrames, JSON, Statistics


  • CSV will allow us to easily read our CSV file
  • DataFrames will let us view our data as a DataFrame, and we’ll use several functions from this package to work with our data
  • No surprises here, JSON will allow us to view/write Dicts as JSON objects
  • Statistics will be used sparingly to sum some values for us (strictly speaking, sum lives in Base, but we’ll load Statistics anyway)


df = CSV.read("./rhode_island_employment.csv", DataFrame; normalizenames = true)
df.RI_Employment = map(x -> parse(Int64, replace(x, "," => "")), df.RI_Employment)
df

Output:

34 rows × 3 columns (first 30 shown)

   | NAICS2 | NAICS3 | RI_Employment
   | String⍰ | String⍰ | Int64
 1 | Agriculture, Forestry, Fishing and Hunting | Crop Production | 508
 2 | Agriculture, Forestry, Fishing and Hunting | Animal Production and Aquaculture | 145
 3 | Agriculture, Forestry, Fishing and Hunting | Fishing, Hunting and Trapping | 74
 4 | Agriculture, Forestry, Fishing and Hunting | Support Activities for Agriculture and Forestry | 43
 5 | Construction | Construction of Buildings | 4345
 6 | Construction | Heavy and Civil Engineering Construction | 1822
 7 | Construction | Specialty Trade Contractors | 11967
 8 | Manufacturing | Food Manufacturing | 3508
 9 | Manufacturing | Wood Product Manufacturing | 481
10 | Manufacturing | Printing and Related Support Activities | 1667
11 | Wholesale Tradae | Merchant Wholesalers, Nondurable Goods | 5274
12 | Retail Trade | Motor Vehicle and Parts Dealers | 5684
13 | Retail Trade | Food and Beverage Stores | 11990
14 | Retail Trade | Gasoline Stations | 1862
15 | Retail Trade | Clothing and Clothing Accessories Stores | 4251
16 | Retail Trade | General Merchandise Stores | 7585
17 | Retail Trade | Miscellaneous Store Retailers | 2674
18 | Retail Trade | Nonstore Retailers | 1778
19 | Transportation & Warehousing | Truck Transportation | 2375
20 | Transportation & Warehousing | Support Activities for Transportation | 1116
21 | Transportation & Warehousing | Warehousing and Storage | 1273
22 | Information | Data Processing, Hosting, and Related Services | 2436
23 | Professional, Scientific & Technical Services | Professional, Scientific, and Technical Services | 24392
24 | Administrative & Support & Waste Mgmt/Remediation Services | Administrative and Support Services | 26095
25 | Health Care and Social Assistance | Ambulatory Health Care Services | 25829
26 | Health Care and Social Assistance | Hospitals | 23234
27 | Health Care and Social Assistance | Nursing and Residential Care Facilities | 18352
28 | Health Care and Social Assistance | Social Assistance | 12609
29 | Arts, Entertainment & Recreation | Performing Arts, Spectator Sports, and Related Industries | 1252
30 | Arts, Entertainment & Recreation | Amusement, Gambling, and Recreation Industries | 7093
⋮

So what’s going on here? First, we read in our CSV file with CSV.read, passing in the normalizenames = true argument to eliminate any spaces in our column headers. CSV.read conveniently takes our data from the CSV file and hands us back a Data Frame. Next, we parsed our employment data values to integers (Int64), removing the commas which were present in the Strings in the original file. This was accomplished by using the map and replace functions on the values in the :RI_Employment column of our Data Frame.
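If you want to see that parsing trick in isolation, here’s the same map/replace/parse pattern on a small, made-up vector of comma-formatted strings (no CSV file needed):

```julia
# hypothetical employment counts, formatted with commas as in the CSV
raw = ["1,234", "56", "7,890"]

# strip the commas, then parse each String into an Int64
parsed = map(x -> parse(Int64, replace(x, "," => "")), raw)
```

parsed is now a Vector{Int64} containing [1234, 56, 7890], ready for arithmetic.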

Okay, now that we have our values parsed and our data in a Data Frame, let’s talk about what we’re going to do with it. As mentioned in the opening section of this post, the goal is to convert this data into a very specific JSON format (as required by the JavaScript visualization library). The goal is to end up with a group of nodes and links. Our nodes are our industries (columns 1 and 2 in the Data Frame), and the links will be created by us and will show the relationship between the nodes, as well as the associated values (column 3). In short, what we want to end up with is a JSON object like this:

{
    "nodes": [
        { "name": "2016 Employment" },
        { "name": "Agriculture, Forestry, Fishing and Hunting" },
        { "name": "Construction" },
        // more nodes here
    ],
    "links": [
        { "source": 0, "target": 1, "value": 770 },
        { "source": 0, "target": 2, "value": 18134 },
        // more links here
    ]
}

Let’s take a closer look at this object to understand what it is that we want to build. nodes should be an array of JSON objects with name as the key, and our industries as the values. Our links should also be an array, but each object within it should contain three key/value pairs. source is the source node, target is the target node, and value represents the amount ‘flowing’ from the source to the target. Since JavaScript indexes from zero (not from one, as Julia does), we are using 0 as the source value above to represent ‘flows’ between 2016 Employment and the NAICS2 industries (targets). Flows from NAICS2 to NAICS3 must also be represented in this JSON object, and we’ll do that very soon.

Now that your head is spinning 😵, let’s just dive into this and it will all make sense by the end 🍀.


emp = Dict(
    "nodes" => ["name" => "2016 Employment"],
    "links" => []
)


What we’ve done in this block is create a ‘skeleton’ Dictionary that we will build upon as we move through the rest of this post. ‘2016 Employment’ is our root node, so I went ahead and added that to our Dict, and then created a ‘links’ key that holds an empty Array as its value. With that out of the way, let’s move forward.


industry_parent = unique(convert(Array, df[!, 1]))
industry_child = unique(convert(Array, df[!, 2]))


The code above makes use of the unique and convert functions to get arrays of only the unique NAICS2 and NAICS3 values from our Data Frame. First, we convert a column of our Data Frame to an Array, and then pass that to unique to ensure there are no duplicate values. We’ll need this to be sure that we don’t have duplicate nodes in our final JSON object.
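On a plain Vector (with hypothetical labels), the same two functions look like this:

```julia
# a toy column with repeated NAICS2-style values
col = ["Construction", "Construction", "Manufacturing", "Construction"]

# convert is a no-op on a plain Vector; unique drops the duplicates,
# preserving first-occurrence order
industries = unique(convert(Array, col))
```

industries ends up as ["Construction", "Manufacturing"], duplicates gone.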

Now we are ready to start building out our ‘skeleton’ Dict from above!


map(x -> push!(emp["nodes"], ("name" => x)), industry_parent)
JSON.print(emp, 2)


Output:

    {
      "nodes": [
        {
          "name": "2016 Employment"
        },
        {
          "name": "Agriculture, Forestry, Fishing and Hunting"
        },
        {
          "name": "Construction"
        },
        {
          "name": "Manufacturing"
        },
        {
          "name": "Wholesale Tradae"
        },
        {
          "name": "Retail Trade"
        },
        {
          "name": "Transportation & Warehousing"
        },
        {
          "name": "Information"
        },
        {
          "name": "Professional, Scientific & Technical Services"
        },
        {
          "name": "Administrative & Support & Waste Mgmt/Remediation Services"
        },
        {
          "name": "Health Care and Social Assistance"
        },
        {
          "name": "Arts, Entertainment & Recreation"
        },
        {
          "name": "Accommodation & Food Services"
        },
        {
          "name": "Other Services"
        }
      ],
      "links": []
    }

Instead of trying to handle everything at once, we’re going to make this an iterative process so that it’s easier to understand what’s happening. For the first iteration (above), we simply make use of the map function, which iterates through each element in our industry_parent Array, and pushes a key/value pair to our emp Dict (more specifically, to the Array that is the value of the nodes key). We are always pushing the same key, name, but the value is the element in the industry_parent Array.
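Here’s that push-via-map pattern on its own, with a couple of hypothetical labels:

```julia
# start from a root node, as in our emp Dict (toy labels)
nodes = Any["name" => "root"]
parents = ["Ag", "Construction"]

# push a ("name" => value) Pair for each parent industry
map(x -> push!(nodes, "name" => x), parents)
```

After the map call, nodes holds three Pairs: the root plus one per parent.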

Finally, we use JSON to print our Dict to have a look at its current state. The second argument in JSON.print simply defines the level of indentation that we want, making our JSON object easier to read.

Now that we have our NAICS2 nodes taken care of, let’s build links between our root node (2016 Employment) and these NAICS2 nodes.


for (i, industry) in enumerate(industry_parent)
    value = Statistics.sum([r[3] for r in eachrow(df) if r[1] == industry])
    push!(emp["links"], Dict("source" => 0, "target" => i, "value" => value))
end
JSON.print(emp, 2)


In this block, we use a for loop, along with enumerate, to loop through industry_parent again, this time building our links Dicts, and pushing them to emp. Using enumerate allows us to access not only the index of the item being iterated, but also its value.

Inside of our loop we define a variable value that is the sum of all :RI_Employment values where the NAICS2 industry (column 1 in our Data Frame) equals the value of the item currently being iterated in our loop. How did we do this? We used an array comprehension: [r[3] for r in eachrow(df) if r[1] == industry] along with the Statistics.sum function. Just looking at the array comprehension piece, on each iteration of our for loop, we build our Array with values from column 3 in the Data Frame, where the value in column 1 is equal to industry, which is the value of the current element in the iteration. We then assign the sum of this array to our value variable, to be used in the next line.

In the next line, we build a Dict with 3 key/value pairs and then push it to the Array that is the value of the links key in our emp Dict. Since all of these ‘flows’ are from our root node, the value of source will be 0 each time (since JavaScript indexes from 0). The value for target will simply be the current index in our loop, since this will correspond to the index in the Array that is the value for our nodes key in emp. Lastly, the value of our value key is the sum that we computed in our value variable.
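To see the logic of this loop in isolation, here’s a sketch that swaps the Data Frame for a few hand-made tuples (the abbreviated labels are hypothetical; the numbers come from the first rows of our table):

```julia
using Statistics

# hand-made rows standing in for eachrow(df): (NAICS2, NAICS3, employment)
rows = [("Ag", "Crop", 508), ("Ag", "Animal", 145), ("Con", "Bldg", 4345)]
parents = ["Ag", "Con"]

links = Any[]
for (i, industry) in enumerate(parents)
    # sum column 3 over the rows whose column 1 matches this parent industry
    value = Statistics.sum([r[3] for r in rows if r[1] == industry])
    push!(links, Dict("source" => 0, "target" => i, "value" => value))
end
```

The first link gets value 508 + 145 = 653, and each target matches the parent’s position in the nodes Array.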

If you’re feeling like this 😫, don’t worry. We’re getting close to being done and you’re getting closer to earning your black belt in Julia jiu-jitsu!!

At this stage, let’s see what emp is looking like:

{
  "nodes": [
    {
      "name": "2016 Employment"
    },
    {
      "name": "Agriculture, Forestry, Fishing and Hunting"
    },
    {
      "name": "Construction"
    },
    {
      "name": "Manufacturing"
    },
    {
      "name": "Wholesale Tradae"
    },
    {
      "name": "Retail Trade"
    },
    {
      "name": "Transportation & Warehousing"
    },
    {
      "name": "Information"
    },
    {
      "name": "Professional, Scientific & Technical Services"
    },
    {
      "name": "Administrative & Support & Waste Mgmt/Remediation Services"
    },
    {
      "name": "Health Care and Social Assistance"
    },
    {
      "name": "Arts, Entertainment & Recreation"
    },
    {
      "name": "Accommodation & Food Services"
    },
    {
      "name": "Other Services"
    }
  ],
  "links": [
    {
      "source": 0,
      "target": 1,
      "value": 770
    },
    {
      "source": 0,
      "target": 2,
      "value": 18134
    },
    {
      "source": 0,
      "target": 3,
      "value": 5656
    },
    {
      "source": 0,
      "target": 4,
      "value": 5274
    },
    {
      "source": 0,
      "target": 5,
      "value": 35824
    },
    {
      "source": 0,
      "target": 6,
      "value": 4764
    },
    {
      "source": 0,
      "target": 7,
      "value": 2436
    },
    {
      "source": 0,
      "target": 8,
      "value": 24392
    },
    {
      "source": 0,
      "target": 9,
      "value": 26095
    },
    {
      "source": 0,
      "target": 10,
      "value": 80024
    },
    {
      "source": 0,
      "target": 11,
      "value": 8345
    },
    {
      "source": 0,
      "target": 12,
      "value": 48204
    },
    {
      "source": 0,
      "target": 13,
      "value": 9875
    }
  ]
}


Alright, if up until this point we were playing the overture, this next block is the crescendo 🎼:


for (i, industry) in enumerate(industry_child)
    parent = df[df.NAICS3 .== industry, 1][1]
    parent_index = findall(x -> x == parent, industry_parent)[1]
    value = df[df.NAICS3 .== industry, 3][1]
    push!(emp["nodes"], ("name" => industry))
    push!(emp["links"], Dict("source" => parent_index, "target" => length(emp["links"]) + 1, "value" => value))
end
JSON.print(emp, 2)
JSON.print(emp, 2)


All that’s left at this point is to add our NAICS3 industries to the nodes Array, and then create the rest of the Dicts that will become the values of our final links. We start by looping through the industry_child Array and creating three variables. parent holds the value of the NAICS2 industry for the given NAICS3 industry. Let’s dissect it further. Looking at just this piece, df.NAICS3 .== industry, you can see that we are broadcasting the == equality operator across the :NAICS3 column of our Data Frame (with dot syntax), checking whether or not each value is equal to the industry in our iterator. This returns a BitArray of Booleans. We then use that BitArray as the row index of df, along with the number of the column that we want returned. We are essentially saying, “Look in this Data Frame and, for the rows where df.NAICS3 .== industry returned true, return to me an Array that contains the value(s) from column 1 of those rows.” It does just this, so we add the [1] to the end, as we want the value itself, not the Array.
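Here is the same masking trick on two plain, hand-made arrays, so you can watch the BitArray do its job without a Data Frame in the way:

```julia
# toy columns standing in for the NAICS3 and NAICS2 columns
naics3 = ["Crop Production", "Construction of Buildings"]
naics2 = ["Agriculture", "Construction"]

# broadcast == across the column: a BitArray of Booleans
mask = naics3 .== "Construction of Buildings"

# index with the mask, then take the value out of the resulting Array
parent = naics2[mask][1]
```

mask is [false, true], so naics2[mask] is a one-element Array and parent is "Construction".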

To determine the value of parent_index, we used the findall function to search the industry_parent Array, which returns the index of the element(s) that is(are) equal to parent (inside of an Array, hence the [1] at the end).
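A quick toy example of findall (with hypothetical labels) shows why that trailing [1] is needed:

```julia
parents = ["Ag", "Construction", "Manufacturing"]

# findall returns an Array of every matching index...
matches = findall(x -> x == "Construction", parents)

# ...so we take the first (and here, only) element to get the index itself
parent_index = matches[1]
```

matches is the one-element Array [2], and parent_index is the bare integer 2.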

For our value variable, we used the same technique that we did to compute parent, but returned the value from column 3 instead of column 1.

Finally, the last two lines inside our for loop are where we actually push new key/value pairs & Dicts to emp. The only subtle part is the value for target, which is the length of the links Array, plus one. Why does that work? Entering the loop we have 13 links and 14 nodes (JavaScript indices 0 through 13), and each iteration adds exactly one node and one link, so length(emp["links"]) + 1 is always the zero-based index of the node we just pushed.
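You can convince yourself of that bookkeeping with a tiny simulation (check_targets is a hypothetical helper, not part of our script; 14 nodes, 13 links, and 34 children match our data):

```julia
# simulate the child loop's index bookkeeping: starting from n_nodes nodes
# and n_links links, check that (links + 1) always equals the zero-based
# index of the node about to be pushed, then add one node and one link
function check_targets(n_nodes, n_links, n_children)
    for _ in 1:n_children
        target = n_links + 1          # the rule used in the loop above
        @assert target == n_nodes     # node count == next zero-based index
        n_nodes += 1                  # each iteration adds one node...
        n_links += 1                  # ...and one link, keeping them in sync
    end
    return n_links
end

final_links = check_targets(14, 13, 34)
```

With 34 children we end up with 47 links, matching the final target of 47 in the JSON below.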

Okay, I know this is kind of abrupt but……we’re basically done! Let’s look at this beautiful JSON object and then write it to a file!!

{
  "nodes": [
    {
      "name": "2016 Employment"
    },
    {
      "name": "Agriculture, Forestry, Fishing and Hunting"
    },
    {
      "name": "Construction"
    },
    {
      "name": "Manufacturing"
    },
    {
      "name": "Wholesale Tradae"
    },
    {
      "name": "Retail Trade"
    },
    {
      "name": "Transportation & Warehousing"
    },
    {
      "name": "Information"
    },
    {
      "name": "Professional, Scientific & Technical Services"
    },
    {
      "name": "Administrative & Support & Waste Mgmt/Remediation Services"
    },
    {
      "name": "Health Care and Social Assistance"
    },
    {
      "name": "Arts, Entertainment & Recreation"
    },
    {
      "name": "Accommodation & Food Services"
    },
    {
      "name": "Other Services"
    },
    {
      "name": "Crop Production"
    },
    {
      "name": "Animal Production and Aquaculture"
    },
    {
      "name": "Fishing, Hunting and Trapping"
    },
    {
      "name": "Support Activities for Agriculture and Forestry"
    },
    {
      "name": "Construction of Buildings"
    },
    {
      "name": "Heavy and Civil Engineering Construction"
    },
    {
      "name": "Specialty Trade Contractors"
    },
    {
      "name": "Food Manufacturing"
    },
    {
      "name": "Wood Product Manufacturing"
    },
    {
      "name": "Printing and Related Support Activities"
    },
    {
      "name": "Merchant Wholesalers, Nondurable Goods"
    },
    {
      "name": "Motor Vehicle and Parts Dealers"
    },
    {
      "name": "Food and Beverage Stores"
    },
    {
      "name": "Gasoline Stations"
    },
    {
      "name": "Clothing and Clothing Accessories Stores"
    },
    {
      "name": "General Merchandise Stores"
    },
    {
      "name": "Miscellaneous Store Retailers"
    },
    {
      "name": "Nonstore Retailers"
    },
    {
      "name": "Truck Transportation"
    },
    {
      "name": "Support Activities for Transportation"
    },
    {
      "name": "Warehousing and Storage"
    },
    {
      "name": "Data Processing, Hosting, and Related Services"
    },
    {
      "name": "Professional, Scientific, and Technical Services"
    },
    {
      "name": "Administrative and Support Services"
    },
    {
      "name": "Ambulatory Health Care Services"
    },
    {
      "name": "Hospitals"
    },
    {
      "name": "Nursing and Residential Care Facilities"
    },
    {
      "name": "Social Assistance"
    },
    {
      "name": "Performing Arts, Spectator Sports, and Related Industries"
    },
    {
      "name": "Amusement, Gambling, and Recreation Industries"
    },
    {
      "name": "Accommodation"
    },
    {
      "name": "Food Services and Drinking Places"
    },
    {
      "name": "Repair and Maintenance"
    },
    {
      "name": "Personal and Laundry Services"
    }
  ],
  "links": [
    {
      "source": 0,
      "target": 1,
      "value": 770
    },
    {
      "source": 0,
      "target": 2,
      "value": 18134
    },
    {
      "source": 0,
      "target": 3,
      "value": 5656
    },
    {
      "source": 0,
      "target": 4,
      "value": 5274
    },
    {
      "source": 0,
      "target": 5,
      "value": 35824
    },
    {
      "source": 0,
      "target": 6,
      "value": 4764
    },
    {
      "source": 0,
      "target": 7,
      "value": 2436
    },
    {
      "source": 0,
      "target": 8,
      "value": 24392
    },
    {
      "source": 0,
      "target": 9,
      "value": 26095
    },
    {
      "source": 0,
      "target": 10,
      "value": 80024
    },
    {
      "source": 0,
      "target": 11,
      "value": 8345
    },
    {
      "source": 0,
      "target": 12,
      "value": 48204
    },
    {
      "source": 0,
      "target": 13,
      "value": 9875
    },
    {
      "source": 1,
      "target": 14,
      "value": 508
    },
    {
      "source": 1,
      "target": 15,
      "value": 145
    },
    {
      "source": 1,
      "target": 16,
      "value": 74
    },
    {
      "source": 1,
      "target": 17,
      "value": 43
    },
    {
      "source": 2,
      "target": 18,
      "value": 4345
    },
    {
      "source": 2,
      "target": 19,
      "value": 1822
    },
    {
      "source": 2,
      "target": 20,
      "value": 11967
    },
    {
      "source": 3,
      "target": 21,
      "value": 3508
    },
    {
      "source": 3,
      "target": 22,
      "value": 481
    },
    {
      "source": 3,
      "target": 23,
      "value": 1667
    },
    {
      "source": 4,
      "target": 24,
      "value": 5274
    },
    {
      "source": 5,
      "target": 25,
      "value": 5684
    },
    {
      "source": 5,
      "target": 26,
      "value": 11990
    },
    {
      "source": 5,
      "target": 27,
      "value": 1862
    },
    {
      "source": 5,
      "target": 28,
      "value": 4251
    },
    {
      "source": 5,
      "target": 29,
      "value": 7585
    },
    {
      "source": 5,
      "target": 30,
      "value": 2674
    },
    {
      "source": 5,
      "target": 31,
      "value": 1778
    },
    {
      "source": 6,
      "target": 32,
      "value": 2375
    },
    {
      "source": 6,
      "target": 33,
      "value": 1116
    },
    {
      "source": 6,
      "target": 34,
      "value": 1273
    },
    {
      "source": 7,
      "target": 35,
      "value": 2436
    },
    {
      "source": 8,
      "target": 36,
      "value": 24392
    },
    {
      "source": 9,
      "target": 37,
      "value": 26095
    },
    {
      "source": 10,
      "target": 38,
      "value": 25829
    },
    {
      "source": 10,
      "target": 39,
      "value": 23234
    },
    {
      "source": 10,
      "target": 40,
      "value": 18352
    },
    {
      "source": 10,
      "target": 41,
      "value": 12609
    },
    {
      "source": 11,
      "target": 42,
      "value": 1252
    },
    {
      "source": 11,
      "target": 43,
      "value": 7093
    },
    {
      "source": 12,
      "target": 44,
      "value": 4450
    },
    {
      "source": 12,
      "target": 45,
      "value": 43754
    },
    {
      "source": 13,
      "target": 46,
      "value": 4313
    },
    {
      "source": 13,
      "target": 47,
      "value": 5562
    }
  ]
}


json_string = JSON.json(emp)
open("julia.json", "w") do x
    write(x, json_string)
end


Here we simply store our emp Dict as a JSON string in the variable json_string (careful not to name it string, which would shadow Julia’s built-in string function), using JSON.json, and then write it to a file with the open and write functions, along with a good ol’ do block. open takes the name of the file as the first argument and the "w" flag as the second argument (indicating that we want to write to the file). Then we use a do block to write our JSON string to the file. Note that instead of using do block syntax, we could have just written:

open(x -> write(x, json_string), "julia.json", "w")

Personally, I prefer the one-liner, but most examples of writing to files in Julia use the do block syntax, so I followed suit for this post. Anyway, we’ve accomplished what we set out to accomplish so, until next time, I hope you’ve enjoyed flirting with Julia! 💘
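If you’d like to convince yourself that the two forms are equivalent, here’s the one-liner writing a throwaway string to a temporary file and reading it back (tempname just gives us a scratch path):

```julia
data = "flirting with Julia"
path = tempname()                        # a temporary file path

# one-liner form: open applies the function to the opened file handle,
# then closes it for us
open(x -> write(x, data), path, "w")

roundtrip = read(path, String)           # read the file back as a String
rm(path)                                 # clean up the scratch file
```

roundtrip comes back exactly equal to data, just as it would with the do block.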

P.S. You didn’t think I was going to leave you without showing the data visualization, did you?! Here it is:

[Image: visualization.png — the finished data visualization]