Linear Regression · TuringGLM.jl

Let's cover linear regression with a famous dataset called kidiq (Gelman & Hill, 2007), which contains data from a survey of adult American women and their children. The dataset dates from 2007 and has 434 observations and 4 variables:

kid_score: child's IQ
mom_hs: binary/dummy variable (0 or 1) indicating whether the child's mother has a high school diploma
mom_iq: mother's IQ
mom_age: mother's age

For the purposes of this tutorial, we download the dataset from the TuringGLM repository:

using CSV

using DataFrames

using TuringGLM

url = "https://github.com/TuringLang/TuringGLM.jl/raw/main/data/kidiq.csv";

kidiq = CSV.read(download(url), DataFrame)

	kid_score	mom_hs	mom_iq	mom_age
1	65	1	121.118	27
2	98	1	89.3619	25
3	85	1	115.443	27
4	83	1	99.4496	25
5	115	1	92.7457	27
6	98	0	107.902	18
7	69	1	138.893	20
8	106	1	125.145	23
9	102	1	81.6195	24
10	95	1	95.0731	19
...
434	70	1	91.2533	25

We use kid_score as the dependent variable and mom_hs along with mom_iq as independent variables, with a moderation (interaction) effect:

fm = @formula(kid_score ~ mom_hs * mom_iq)

FormulaTerm
Response:
  kid_score(unknown)
Predictors:
  mom_hs(unknown)
  mom_iq(unknown)
  mom_hs(unknown) & mom_iq(unknown)
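
In formula syntax, mom_hs * mom_iq expands to both main effects plus their product, which is what the Predictors list above shows. Written out as a regression equation, this could look like the following sketch. The symbols α, β₁, β₂, β₃, σ, and ε are illustrative names chosen here, not taken from the package, and the Gaussian error term is an assumption.

```latex
\text{kid\_score}_i = \alpha
  + \beta_1\,\text{mom\_hs}_i
  + \beta_2\,\text{mom\_iq}_i
  + \beta_3\,(\text{mom\_hs}_i \times \text{mom\_iq}_i)
  + \varepsilon_i,
\qquad \varepsilon_i \sim \text{Normal}(0, \sigma)
```

Under this parameterization, the slope of mom_iq is β₂ for children whose mothers have no high school diploma (mom_hs = 0) and β₂ + β₃ for those whose mothers do (mom_hs = 1), so β₃ measures how much the diploma moderates the effect of the mother's IQ.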

Next, we instantiate our model with turing_model. Since we do not specify the model argument, the default is used (model=Normal):

model = turing_model(fm, kidiq);

n_samples = 2_000;

This is a valid Turing model, which we can pass to the default sample function from Turing to get our parameter estimates. We use the NUTS sampler and draw 2,000 samples.

chns = sample(model, NUTS(), n_samples);

plot_chains(chns)
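
Besides plotting, it can help to look at the parameter estimates as numbers. The following is a minimal sketch that repeats the steps above and then prints summary tables for the chain. It assumes the MCMCChains package is installed and that sample returns an MCMCChains chain object. The names stats and quants are illustrative, and the estimates will differ between runs because sampling is random.

```julia
using CSV, DataFrames, TuringGLM
using MCMCChains  # assumption: installed; provides summarystats and quantile for chains

# Same data, formula, and model as in the steps above
url = "https://github.com/TuringLang/TuringGLM.jl/raw/main/data/kidiq.csv"
kidiq = CSV.read(download(url), DataFrame)

fm = @formula(kid_score ~ mom_hs * mom_iq)
model = turing_model(fm, kidiq)

n_samples = 2_000
chns = sample(model, NUTS(), n_samples)

# Posterior mean, standard deviation, and MCMC diagnostics for each parameter
stats = summarystats(chns)

# Posterior quantiles for each parameter (default quantile levels)
quants = quantile(chns)

display(stats)
display(quants)
```

The summary table is a quick way to check the sign and size of the interaction coefficient before reading too much into the plots.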

References

Gelman, A., & Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.