Logistic Regression · TuringGLM.jl

For our tutorial on logistic regression, let's use a well-known dataset called wells (Gelman & Hill, 2007), which comes from a survey of 3,200 residents in a small area of Bangladesh suffering from arsenic contamination of groundwater. Respondents with elevated arsenic levels in their wells had been encouraged to switch their water source to a safe public or private well nearby, and the survey was conducted several years later to learn which of the affected residents had switched wells. The dataset has 3,020 observations and the following variables:

switch – binary/dummy (0 or 1) for well-switching.
arsenic – arsenic level in respondent's well.
dist – distance (meters) from the respondent's house to the nearest well with safe drinking water.
assoc – binary/dummy (0 or 1) indicating whether member(s) of the household participate in community organizations.
educ – years of education (head of household).

using CSV
using DataFrames
using TuringGLM

url = "https://github.com/TuringLang/TuringGLM.jl/raw/main/data/wells.csv";

wells = CSV.read(download(url), DataFrame)

	switch	arsenic	dist	assoc	educ
1	1	2.36	16.826	0	0
2	1	0.71	47.322	0	0
3	0	2.07	20.967	0	10
4	1	1.15	21.486	0	12
5	1	1.1	40.874	1	14
6	1	3.9	69.518	1	9
7	1	2.97	80.711	1	4
8	1	3.24	55.146	0	10
9	1	3.28	52.647	1	0
10	1	2.52	75.072	1	0
...
3020	1	0.66	20.844	1	5

We use switch as the dependent variable and dist, arsenic, assoc, and educ as the independent variables:

fm = @formula(switch ~ dist + arsenic + assoc + educ)

FormulaTerm
Response:
  switch(unknown)
Predictors:
  dist(unknown)
  arsenic(unknown)
  assoc(unknown)
  educ(unknown)

Now we instantiate our model with turing_model, passing the keyword argument model=Bernoulli to indicate that the model is a logistic regression:

model = turing_model(fm, wells; model=Bernoulli);
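For reference, the model specified by the formula and the Bernoulli likelihood can be written out as follows. This is a sketch that assumes the canonical logit link, which is what makes a Bernoulli model a logistic regression. The symbols α (intercept), β₁ to β₄ (coefficients), and pᵢ (probability that household i switches wells) are notation introduced here for illustration, and the priors that turing_model places on α and β are left out.

```latex
\begin{aligned}
\text{switch}_i &\sim \operatorname{Bernoulli}(p_i) \\
\operatorname{logit}(p_i) = \log\frac{p_i}{1 - p_i}
  &= \alpha + \beta_1\,\text{dist}_i + \beta_2\,\text{arsenic}_i
     + \beta_3\,\text{assoc}_i + \beta_4\,\text{educ}_i
\end{aligned}
```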

chn = sample(model, NUTS(), 2_000);

plot_chains(chn)
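In a logistic regression with a logit link, the intercept and coefficients are on the log-odds scale, so they are usually transformed before being interpreted. The following is a minimal sketch of that transformation in plain Julia. The logistic helper and the values of alpha and beta are hypothetical and chosen only for illustration; they are not estimates from the wells data.

```julia
# Inverse-logit (logistic) function: maps log-odds to a probability in (0, 1).
logistic(x) = 1 / (1 + exp(-x))

# Hypothetical values for illustration only; NOT estimates from the wells data.
alpha = 0.0   # intercept on the log-odds scale
beta = 0.5    # coefficient of a single predictor on the log-odds scale

# exp(beta) is the multiplicative change in the odds per one-unit increase
# in the predictor, holding the other predictors fixed.
odds_ratio = exp(beta)

# Predicted probabilities when the predictor equals 0 and 1.
p0 = logistic(alpha)
p1 = logistic(alpha + beta)

println("odds ratio:     ", round(odds_ratio; digits=3))
println("P(y=1 | x = 0): ", round(p0; digits=3))
println("P(y=1 | x = 1): ", round(p1; digits=3))
```

The same transformation can be applied to each posterior draw of a coefficient, giving a posterior distribution for the odds ratio rather than a single number.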

References

Gelman, A., & Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.