Flirt With Julia

Learn the Julia programming language through real-world examples.

Regression From Api Data

29 Dec 2018 »

Install dependencies


using Pkg
Pkg.add(["DataFrames", "GLM", "HTTP", "JSON"])

using DataFrames, GLM, HTTP, JSON


  • DataFrames will allow us to view our data as a Data Frame
  • GLM is part of the JuliaStats family of packages, and we will use it for our simple linear regression
  • HTTP is the package that we will use to make our request to the API for our data
  • We will use the JSON package to parse the response that we get from the API

Call the API, parse the response, store the parsed response in a variable


res = HTTP.get(
    # the url to call
    "https://api.census.gov/data/timeseries/qwi/sa?"; 
    query = [
        "get" => "EarnS",
        "for" => "county:*",
        "in" => "state:40",
        "year" => "2017",
        "quarter" => "3",
        "sex" => "0",
        "agegrp" => "A01",
        "agegrp" => "A02",
        "agegrp" => "A03",
        "agegrp" => "A04",
        "agegrp" => "A05",
        "agegrp" => "A06",
        "agegrp" => "A07",
        "agegrp" => "A08",
        "ownercode" => "A05",
        "firmsize" => "0",
        "seasonadj" => "U",
        "industry" => "00",
        "key" => "YOUR-CENSUS-API-KEY"
    ]
)

employment_data = JSON.parse(String(res.body))


Output:

617-element Array{Any,1}:
 Any["EarnS", "year", "quarter", "sex", "agegrp", "ownercode", "firmsize", "seasonadj", "industry", "state", "county"]
 Any["821", "2017", "3", "0", "A01", "A05", "0", "U", "00", "40", "001"]                 
 ⋮                                                                                                                    
 Any["4575", "2017", "3", "0", "A05", "A05", "0", "U", "00", "40", "151"]                                             
 Any["4978", "2017", "3", "0", "A06", "A05", "0", "U", "00", "40", "151"] 

Here we are making a simple get request by calling the HTTP get function, passing in a URL and a query array to fetch some Census data. Head over to the Census Bureau website to read more about the Quarterly Workforce Indicators endpoint that we are calling here, and to obtain a free API key so that you can execute this code yourself!

Note that we are using an array of Pairs, rather than a Dict, to supply the query arguments. Dicts don’t preserve the order of the keys/values as we add them, which is important when calling our API. Our endpoint is expecting certain arguments to be in specific locations in the query string and will return an error if the required arguments aren’t in the appropriate order.

Convert response to a Data Frame (this is where the magic happens 🎩🐰)


headers = Symbol.(employment_data[1])
df = DataFrame()
for (i, header) in enumerate(headers)
    df[header] =  [employment_data[j][i] for j in 2:length(employment_data)] 
end
df


There’s alot going on here so let’s take it line by line. First, we create a headers variable where we will store the values that we want to serve as our column names in the Data Frame. Column names in Julia Data Frames are stored as Symbols, so we must convert the values (currently strings) to Symbols, and we do this by broadcasting (with dot syntax) the Symbol function across the first array in employment_data (note that the response we got from the API is an array of arrays, the first of which contains the information that we want to serve as our column names).

Next, we construct an empty Data Frame and then write a loop to fill it. More precisely, we use the enumerate function, just one of several of Julia’s iteration utilities, to iterate the newly-created headers array, and we create a new column in the Data Frame for each header in headers. The enumerator function gives us the index i, as well as the value of the item at that index, which we need in order to assign the data from the remaining arrays in employment_data to their respective columns in the Data Frame.

We achieve this last part by using what’s known as an array comprehension in Julia. Pay special attention to the line [employment_data[j][i] for j in 2:length(employment_data)], which is the array that is assigned to the column that’s created on the left-hand side of this line. Since employment_data is an array of arrays, we need to iterate through each array and ultimately end up with just the ith value from each array in our final array (say that six times fast), which will then become its own column in the Data Frame. We start at the second array in employment_data (since the first array contains the column names) and we get the ith value, which corresponds to the ith value in the headers array, and we loop through all of the arrays in employment_data, doing the same thing.

Output:

616 rows × 11 columns

EarnSyearquartersexagegrpownercodefirmsizeseasonadjindustrystatecounty
StringStringStringStringStringStringStringStringStringStringString
1821201730A01A050U0040001
21379201730A02A050U0040001
31944201730A03A050U0040001
42389201730A04A050U0040001
52993201730A05A050U0040001
63083201730A06A050U0040001
73016201730A07A050U0040001
83486201730A08A050U0040001
9870201730A01A050U0040003
101329201730A02A050U0040003
112505201730A03A050U0040003
123631201730A04A050U0040003
133964201730A05A050U0040003
144354201730A06A050U0040003
153782201730A07A050U0040003
163644201730A08A050U0040003
17722201730A01A050U0040005
181645201730A02A050U0040005
192035201730A03A050U0040005
202893201730A04A050U0040005
212967201730A05A050U0040005
223272201730A06A050U0040005
232668201730A07A050U0040005
242315201730A08A050U0040005
25709201730A01A050U0040007
263138201730A02A050U0040007
272870201730A03A050U0040007
284829201730A04A050U0040007
295570201730A05A050U0040007
305154201730A06A050U0040007

Modify EarnS and agegrp so that we can model their relationship


What I want to model here is the relationship between earnings (EarnS) and age group (agegrp). We have a couple of problems left to solve though: 1) our EarnS data are strings, 2) the agegrp data are also strings with the form “A0X”. To solve the latter problem, we need to convert each data point to a dummy variable (making sure it’s an Integer) so that we can model the relationship. To solve the first problem, we simply need to parse the strings into integer type. The code below does just that:

age_variables = Dict("A01" => 1, "A02" => 2, "A03" => 3, "A04" => 4, "A05" => 5, "A06" => 6, "A07" => 7, "A08" => 8)
df[:agegrp] = map(x -> age_variables[x], df[:agegrp])
df[:EarnS] = map(x -> parse(Int64, x), df[:EarnS])
df


The first thing that we did was create a new Dict that will store the agegrp categories as keys, and their corresponding dummy variables as values. We’ll use this to then replace all of the values in the agegrp column with the appropriate dummy variables, using the map function. Map iterates through an array and applies a function to each item in that array. In this case, we iterate through our agegrp column and replace each value with the key from our age_variables Dict.

Next, we again use map to parse each value in EarnS into type Int64. This block closes with df so that we can see the new-and-improved df in a Jupyter notebook (if you’re following along).

Output:

616 rows × 11 columns

EarnSyearquartersexagegrpownercodefirmsizeseasonadjindustrystatecounty
Int64StringStringStringInt64StringStringStringStringStringString
18212017301A050U0040001
213792017302A050U0040001
319442017303A050U0040001
423892017304A050U0040001
529932017305A050U0040001
630832017306A050U0040001
730162017307A050U0040001
834862017308A050U0040001
98702017301A050U0040003
1013292017302A050U0040003
1125052017303A050U0040003
1236312017304A050U0040003
1339642017305A050U0040003
1443542017306A050U0040003
1537822017307A050U0040003
1636442017308A050U0040003
177222017301A050U0040005
1816452017302A050U0040005
1920352017303A050U0040005
2028932017304A050U0040005
2129672017305A050U0040005
2232722017306A050U0040005
2326682017307A050U0040005
2423152017308A050U0040005
257092017301A050U0040007
2631382017302A050U0040007
2728702017303A050U0040007
2848292017304A050U0040007
2955702017305A050U0040007
3051542017306A050U0040007

Finally, we can run the regression 🤘


Our last step is to use the GLM package to run the regression. For this example, we are doing a simple linear model where the formula is EarnS ~ agegrp (EarnS is our dependent variable, agegrp is our independent variable), and passing in our Data Frame df, so that the function knows where to get the two variables from.

regression = lm(@formula(EarnS ~ agegrp), df)
# names(GLM)[1:end]
# r_squared = r2(regression)


Output:

StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: EarnS ~ 1 + agegrp

Coefficients:
             Estimate Std.Error t value Pr(>|t|)
(Intercept)   1059.75   73.5937 14.4001   <1e-39
agegrp        384.985   14.5737 26.4164   <1e-99

And that’s it! If you want to make predictions with your model, simply use the predict function, which takes the model (regression in this case) as its first argument, and the value(s) you want the prediction(s) to be made from as the second argument.

As a bonus, I’ve included a couple of extra lines that are commented out above. The names function can be used to see all of the names exported by a module (package). This is really helpful when working with packages that might not have the most complete documentation. Running names(GLM)[1:end] will print all of the names exported by GLM, one of which is r2, which I used to compute the r-squared value for our regression!

Until next time, I hope you’ve enjoyed flirting with Julia! 💘