### Install dependencies

```
using Pkg
Pkg.add(["DataFrames", "GLM", "HTTP", "JSON"])
using DataFrames, GLM, HTTP, JSON
```

- **DataFrames** will allow us to view our data as a Data Frame
- **GLM** is part of the JuliaStats family of packages, and we will use it for our simple linear regression
- **HTTP** is the package that we will use to make our request to the API for our data
- We will use the **JSON** package to parse the response that we get from the API

### Call the API, parse the response, store the parsed response in a variable

```
res = HTTP.get(
    # the URL to call
    "https://api.census.gov/data/timeseries/qwi/sa?";
    query = [
        "get" => "EarnS",
        "for" => "county:*",
        "in" => "state:40",
        "year" => "2017",
        "quarter" => "3",
        "sex" => "0",
        "agegrp" => "A01",
        "agegrp" => "A02",
        "agegrp" => "A03",
        "agegrp" => "A04",
        "agegrp" => "A05",
        "agegrp" => "A06",
        "agegrp" => "A07",
        "agegrp" => "A08",
        "ownercode" => "A05",
        "firmsize" => "0",
        "seasonadj" => "U",
        "industry" => "00",
        "key" => "YOUR-CENSUS-API-KEY"
    ]
)
employment_data = JSON.parse(String(res.body))
```

Output:

```
617-element Array{Any,1}:
Any["EarnS", "year", "quarter", "sex", "agegrp", "ownercode", "firmsize", "seasonadj", "industry", "state", "county"]
Any["821", "2017", "3", "0", "A01", "A05", "0", "U", "00", "40", "001"]
⋮
Any["4575", "2017", "3", "0", "A05", "A05", "0", "U", "00", "40", "151"]
Any["4978", "2017", "3", "0", "A06", "A05", "0", "U", "00", "40", "151"]
```

Here we are making a simple *GET* request by calling `HTTP.get`, passing in a URL and a query array to fetch some Census data. Head over to the Census Bureau website to read more about the Quarterly Workforce Indicators endpoint that we are calling here, and to obtain a free API key so that you can execute this code yourself!
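To see the shape that `JSON.parse` hands back, here's a small standalone sketch (the sample string below is made up, shrunk down from the real response):

```julia
using JSON

# A made-up miniature of the API's response body: an array of arrays,
# where the first inner array holds the column names.
sample_body = """[["EarnS","year"],["821","2017"],["1379","2017"]]"""
parsed = JSON.parse(sample_body)

parsed[1]     # the header row: ["EarnS", "year"]
parsed[2][1]  # the first data value: "821"
```

This is exactly the structure we'll unpack into a Data Frame below.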

Note that we are using an array of Pairs, rather than a Dict, to supply the query arguments. Dicts don’t preserve the order of keys and values as we add them, which matters when calling this API: the endpoint expects certain arguments in specific positions in the query string and will return an error if the required arguments aren’t in the appropriate order.
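A quick illustration of the difference, using only base Julia (the parameter names are just placeholders):

```julia
# A Vector of Pairs preserves the order in which we wrote the entries,
# and even allows repeated keys (as with "agegrp" above).
params = ["get" => "EarnS", "for" => "county:*", "agegrp" => "A01", "agegrp" => "A02"]
first.(params)   # ["get", "for", "agegrp", "agegrp"] -- order and duplicates intact

# A Dict makes no ordering guarantee and silently collapses duplicate keys.
d = Dict(params)
length(d)        # 3 -- the second "agegrp" overwrote the first
```

So beyond ordering, the array of Pairs is the only way to send `agegrp` eight times.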

### Convert response to a Data Frame (this is where the magic happens 🎩🐰)

```
headers = Symbol.(employment_data[1])
df = DataFrame()
for (i, header) in enumerate(headers)
    df[header] = [employment_data[j][i] for j in 2:length(employment_data)]
end
df
```

There’s a lot going on here, so let’s take it line by line. First, we create a `headers` variable where we will store the values that we want to serve as our column names in the Data Frame. Column names in Julia Data Frames are stored as Symbols, so we must convert the values (currently strings) to Symbols, and we do this by broadcasting (with dot syntax) the Symbol function across the first array in `employment_data` (note that the response we got from the API is an array of arrays, the first of which contains the information that we want to serve as our column names).

Next, we construct an empty Data Frame and then write a loop to fill it. More precisely, we use the enumerate function, just one of several of Julia’s iteration utilities, to iterate over the newly created `headers` array, and we create a new column in the Data Frame for each `header` in `headers`. The enumerate function gives us the index *i*, as well as the value of the item at that index, which we need in order to assign the data from the remaining arrays in `employment_data` to their respective columns in the Data Frame.

We achieve this last part by using what’s known as an *array comprehension* in Julia. Pay special attention to the line `[employment_data[j][i] for j in 2:length(employment_data)]`, which is the array that is assigned to the column created on the left-hand side. Since `employment_data` is an array of arrays, we need to iterate through each array and ultimately end up with just the *i*th value from each array in our final array (say that six times fast), which will then become its own column in the Data Frame. We start at the second array in `employment_data` (since the first array contains the column names), get the *i*th value, which corresponds to the *i*th value in the `headers` array, and loop through all of the remaining arrays in `employment_data`, doing the same thing.
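The same pattern on a tiny, made-up array of arrays may make this clearer:

```julia
# A miniature version of employment_data: header row first, then data rows.
data = [["name", "score"], ["Ada", "90"], ["Bo", "85"]]

# Broadcast Symbol across the header row to get column names.
cols = Symbol.(data[1])                            # [:name, :score]

# For column i, the comprehension pulls the i-th value out of each data row,
# skipping the header row at index 1.
score_col = [data[j][2] for j in 2:length(data)]   # ["90", "85"]
```

Each such comprehension becomes one column of the Data Frame.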

Output:

616 rows × 11 columns

| | EarnS | year | quarter | sex | agegrp | ownercode | firmsize | seasonadj | industry | state | county |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | String | String | String | String | String | String | String | String | String | String | String |
| 1 | 821 | 2017 | 3 | 0 | A01 | A05 | 0 | U | 00 | 40 | 001 |
| 2 | 1379 | 2017 | 3 | 0 | A02 | A05 | 0 | U | 00 | 40 | 001 |
| 3 | 1944 | 2017 | 3 | 0 | A03 | A05 | 0 | U | 00 | 40 | 001 |
| 4 | 2389 | 2017 | 3 | 0 | A04 | A05 | 0 | U | 00 | 40 | 001 |
| 5 | 2993 | 2017 | 3 | 0 | A05 | A05 | 0 | U | 00 | 40 | 001 |
| 6 | 3083 | 2017 | 3 | 0 | A06 | A05 | 0 | U | 00 | 40 | 001 |
| 7 | 3016 | 2017 | 3 | 0 | A07 | A05 | 0 | U | 00 | 40 | 001 |
| 8 | 3486 | 2017 | 3 | 0 | A08 | A05 | 0 | U | 00 | 40 | 001 |
| 9 | 870 | 2017 | 3 | 0 | A01 | A05 | 0 | U | 00 | 40 | 003 |
| 10 | 1329 | 2017 | 3 | 0 | A02 | A05 | 0 | U | 00 | 40 | 003 |
| 11 | 2505 | 2017 | 3 | 0 | A03 | A05 | 0 | U | 00 | 40 | 003 |
| 12 | 3631 | 2017 | 3 | 0 | A04 | A05 | 0 | U | 00 | 40 | 003 |
| 13 | 3964 | 2017 | 3 | 0 | A05 | A05 | 0 | U | 00 | 40 | 003 |
| 14 | 4354 | 2017 | 3 | 0 | A06 | A05 | 0 | U | 00 | 40 | 003 |
| 15 | 3782 | 2017 | 3 | 0 | A07 | A05 | 0 | U | 00 | 40 | 003 |
| 16 | 3644 | 2017 | 3 | 0 | A08 | A05 | 0 | U | 00 | 40 | 003 |
| 17 | 722 | 2017 | 3 | 0 | A01 | A05 | 0 | U | 00 | 40 | 005 |
| 18 | 1645 | 2017 | 3 | 0 | A02 | A05 | 0 | U | 00 | 40 | 005 |
| 19 | 2035 | 2017 | 3 | 0 | A03 | A05 | 0 | U | 00 | 40 | 005 |
| 20 | 2893 | 2017 | 3 | 0 | A04 | A05 | 0 | U | 00 | 40 | 005 |
| 21 | 2967 | 2017 | 3 | 0 | A05 | A05 | 0 | U | 00 | 40 | 005 |
| 22 | 3272 | 2017 | 3 | 0 | A06 | A05 | 0 | U | 00 | 40 | 005 |
| 23 | 2668 | 2017 | 3 | 0 | A07 | A05 | 0 | U | 00 | 40 | 005 |
| 24 | 2315 | 2017 | 3 | 0 | A08 | A05 | 0 | U | 00 | 40 | 005 |
| 25 | 709 | 2017 | 3 | 0 | A01 | A05 | 0 | U | 00 | 40 | 007 |
| 26 | 3138 | 2017 | 3 | 0 | A02 | A05 | 0 | U | 00 | 40 | 007 |
| 27 | 2870 | 2017 | 3 | 0 | A03 | A05 | 0 | U | 00 | 40 | 007 |
| 28 | 4829 | 2017 | 3 | 0 | A04 | A05 | 0 | U | 00 | 40 | 007 |
| 29 | 5570 | 2017 | 3 | 0 | A05 | A05 | 0 | U | 00 | 40 | 007 |
| 30 | 5154 | 2017 | 3 | 0 | A06 | A05 | 0 | U | 00 | 40 | 007 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |

### Modify `EarnS` and `agegrp` so that we can model their relationship

What I want to model here is the relationship between earnings (EarnS) and age group (agegrp). We have a couple of problems left to solve, though: 1) our EarnS data are strings, and 2) the agegrp data are also strings, with the form “A0X”. To solve the latter problem, we need to convert each data point to a dummy variable (making sure it’s an Integer) so that we can model the relationship. To solve the first problem, we simply need to parse the strings into integers. The code below does just that:

```
age_variables = Dict("A01" => 1, "A02" => 2, "A03" => 3, "A04" => 4, "A05" => 5, "A06" => 6, "A07" => 7, "A08" => 8)
df[:agegrp] = map(x -> age_variables[x], df[:agegrp])
df[:EarnS] = map(x -> parse(Int64, x), df[:EarnS])
df
```

The first thing that we did was create a new Dict that will store the agegrp categories as keys, and their corresponding dummy variables as values. We’ll use this to then replace all of the values in the *agegrp* column with the appropriate dummy variables, using the `map` function. Map iterates through an array and applies a function to each item in that array. In this case, we iterate through our *agegrp* column and replace each value with the corresponding value from our age_variables Dict.

Next, we again use map to parse each value in EarnS into type Int64. This block closes with `df` so that we can see the new-and-improved `df` in a Jupyter notebook (if you’re following along).
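Here’s the same map-then-parse pattern on a couple of made-up values:

```julia
# Miniature of the age_variables Dict from above.
lookup = Dict("A01" => 1, "A02" => 2)

# Replace each category string with its value from the Dict...
codes = map(x -> lookup[x], ["A02", "A01", "A02"])     # [2, 1, 2]

# ...and parse earnings strings into Int64.
earnings = map(x -> parse(Int64, x), ["821", "1379"])  # [821, 1379]
```

Both columns come out as plain integer arrays, ready for modeling.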

Output:

616 rows × 11 columns

| | EarnS | year | quarter | sex | agegrp | ownercode | firmsize | seasonadj | industry | state | county |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | String | String | String | Int64 | String | String | String | String | String | String |
| 1 | 821 | 2017 | 3 | 0 | 1 | A05 | 0 | U | 00 | 40 | 001 |
| 2 | 1379 | 2017 | 3 | 0 | 2 | A05 | 0 | U | 00 | 40 | 001 |
| 3 | 1944 | 2017 | 3 | 0 | 3 | A05 | 0 | U | 00 | 40 | 001 |
| 4 | 2389 | 2017 | 3 | 0 | 4 | A05 | 0 | U | 00 | 40 | 001 |
| 5 | 2993 | 2017 | 3 | 0 | 5 | A05 | 0 | U | 00 | 40 | 001 |
| 6 | 3083 | 2017 | 3 | 0 | 6 | A05 | 0 | U | 00 | 40 | 001 |
| 7 | 3016 | 2017 | 3 | 0 | 7 | A05 | 0 | U | 00 | 40 | 001 |
| 8 | 3486 | 2017 | 3 | 0 | 8 | A05 | 0 | U | 00 | 40 | 001 |
| 9 | 870 | 2017 | 3 | 0 | 1 | A05 | 0 | U | 00 | 40 | 003 |
| 10 | 1329 | 2017 | 3 | 0 | 2 | A05 | 0 | U | 00 | 40 | 003 |
| 11 | 2505 | 2017 | 3 | 0 | 3 | A05 | 0 | U | 00 | 40 | 003 |
| 12 | 3631 | 2017 | 3 | 0 | 4 | A05 | 0 | U | 00 | 40 | 003 |
| 13 | 3964 | 2017 | 3 | 0 | 5 | A05 | 0 | U | 00 | 40 | 003 |
| 14 | 4354 | 2017 | 3 | 0 | 6 | A05 | 0 | U | 00 | 40 | 003 |
| 15 | 3782 | 2017 | 3 | 0 | 7 | A05 | 0 | U | 00 | 40 | 003 |
| 16 | 3644 | 2017 | 3 | 0 | 8 | A05 | 0 | U | 00 | 40 | 003 |
| 17 | 722 | 2017 | 3 | 0 | 1 | A05 | 0 | U | 00 | 40 | 005 |
| 18 | 1645 | 2017 | 3 | 0 | 2 | A05 | 0 | U | 00 | 40 | 005 |
| 19 | 2035 | 2017 | 3 | 0 | 3 | A05 | 0 | U | 00 | 40 | 005 |
| 20 | 2893 | 2017 | 3 | 0 | 4 | A05 | 0 | U | 00 | 40 | 005 |
| 21 | 2967 | 2017 | 3 | 0 | 5 | A05 | 0 | U | 00 | 40 | 005 |
| 22 | 3272 | 2017 | 3 | 0 | 6 | A05 | 0 | U | 00 | 40 | 005 |
| 23 | 2668 | 2017 | 3 | 0 | 7 | A05 | 0 | U | 00 | 40 | 005 |
| 24 | 2315 | 2017 | 3 | 0 | 8 | A05 | 0 | U | 00 | 40 | 005 |
| 25 | 709 | 2017 | 3 | 0 | 1 | A05 | 0 | U | 00 | 40 | 007 |
| 26 | 3138 | 2017 | 3 | 0 | 2 | A05 | 0 | U | 00 | 40 | 007 |
| 27 | 2870 | 2017 | 3 | 0 | 3 | A05 | 0 | U | 00 | 40 | 007 |
| 28 | 4829 | 2017 | 3 | 0 | 4 | A05 | 0 | U | 00 | 40 | 007 |
| 29 | 5570 | 2017 | 3 | 0 | 5 | A05 | 0 | U | 00 | 40 | 007 |
| 30 | 5154 | 2017 | 3 | 0 | 6 | A05 | 0 | U | 00 | 40 | 007 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |

### Finally, we can run the regression 🤘

Our last step is to use the GLM package to run the regression. For this example, we are fitting a simple linear model where the formula is EarnS ~ agegrp (EarnS is our dependent variable, agegrp is our independent variable), and passing in our Data Frame `df` so that the function knows where to get the two variables from.

```
regression = lm(@formula(EarnS ~ agegrp), df)
# names(GLM)[1:end]
# r_squared = r2(regression)
```

Output:

```
StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Formula: EarnS ~ 1 + agegrp
Coefficients:
             Estimate Std.Error t value Pr(>|t|)
(Intercept)   1059.75   73.5937 14.4001   <1e-39
agegrp        384.985   14.5737 26.4164   <1e-99
```

And that’s it! If you want to make predictions with your model, simply use the `predict` function, which takes the model (`regression` in this case) as its first argument, and the value(s) you want the prediction(s) to be made from as the second argument.

As a bonus, I’ve included a couple of extra lines that are commented out above. The `names` function can be used to see all of the names exported by a module (package). This is really helpful when working with packages that might not have the most complete documentation. Running `names(GLM)[1:end]` will print all of the names exported by GLM, one of which is `r2`, which I used to compute the r-squared value for our regression!
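If you’d like to try `predict` and `r2` without pulling the Census data first, here’s a self-contained sketch on made-up numbers (it only needs DataFrames and GLM installed):

```julia
using DataFrames, GLM

# Made-up data that lies exactly on the line y = 2x.
toy = DataFrame(x = [1.0, 2.0, 3.0, 4.0], y = [2.0, 4.0, 6.0, 8.0])
model = lm(@formula(y ~ x), toy)

# predict takes the fitted model and a Data Frame of new values
# for the independent variable.
predict(model, DataFrame(x = [5.0]))   # ≈ [10.0]

# r2 returns the coefficient of determination (1.0 for a perfect fit).
r2(model)                              # ≈ 1.0
```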

Until next time, I hope you’ve enjoyed flirting with Julia! 💘