For our tutorial on logistic regression, let's use a famous dataset called wells (Gelman & Hill, 2007), which comes from a survey of 3,020 residents in a small area of Bangladesh suffering from arsenic contamination of groundwater. Respondents with elevated arsenic levels in their wells had been encouraged to switch their water source to a safe public or private well nearby, and the survey was conducted several years later to learn which of the affected residents had switched wells. It has 3,020 observations and the following variables:

  • switch – binary/dummy (0 or 1) for well-switching.

  • arsenic – arsenic level in respondent's well.

  • dist – distance (meters) from the respondent's house to the nearest well with safe drinking water.

  • assoc – binary/dummy (0 or 1) indicating whether any member of the household participates in community organizations.

  • educ – years of education (head of household).

using CSV
using DataFrames
using TuringGLM
url = "https://github.com/TuringLang/TuringGLM.jl/raw/main/data/wells.csv";
wells = CSV.read(download(url), DataFrame)
 Row   switch  arsenic  dist    assoc  educ
 1     1       2.36     16.826  0      0
 2     1       0.71     47.322  0      0
 3     0       2.07     20.967  0      10
 4     1       1.15     21.486  0      12
 5     1       1.1      40.874  1      14
 6     1       3.9      69.518  1      9
 7     1       2.97     80.711  1      4
 8     1       3.24     55.146  0      10
 9     1       3.28     52.647  1      0
 10    1       2.52     75.072  1      0
 ...
 3020  1       0.66     20.844  1      5

We use switch as the dependent variable and dist, arsenic, assoc, and educ as the independent variables:

fm = @formula(switch ~ dist + arsenic + assoc + educ)
FormulaTerm
Response:
  switch(unknown)
Predictors:
  dist(unknown)
  arsenic(unknown)
  assoc(unknown)
  educ(unknown)
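
In a logistic regression, a formula like this one corresponds to a linear predictor that is passed through the logistic (inverse-logit) function to give the probability that switch equals 1. The following is a minimal sketch in plain Julia, not TuringGLM's internals. The intercept α and the coefficients β are hypothetical values chosen only for illustration; the predictor values are those of the first row of wells.

```julia
# Logistic (inverse-logit) function: maps any real number to (0, 1).
logistic(x) = 1 / (1 + exp(-x))

# Hypothetical intercept and coefficients, for illustration only
# (NOT estimates from the wells data).
α = 0.0
β = (dist = -0.01, arsenic = 0.5, assoc = -0.1, educ = 0.04)

# Predictor values for one respondent (first row of wells).
x = (dist = 16.826, arsenic = 2.36, assoc = 0, educ = 0)

# Linear predictor implied by switch ~ dist + arsenic + assoc + educ.
η = α + β.dist * x.dist + β.arsenic * x.arsenic + β.assoc * x.assoc + β.educ * x.educ

# Probability of switching wells under these hypothetical coefficients.
p = logistic(η)
println("η = ", η, ", p = ", p)
```

Whatever values the coefficients take, p stays strictly between 0 and 1, which is what makes this link suitable for a binary outcome such as switch.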

Now we instantiate our model with turing_model, passing the keyword argument model=Bernoulli to indicate that the model is a logistic regression. We then draw 2,000 samples with NUTS and plot the resulting chains:

model = turing_model(fm, wells; model=Bernoulli);
chn = sample(model, NUTS(), 2_000);
plot_chains(chn)
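
Choosing model=Bernoulli means each observed switch value is treated as a Bernoulli outcome whose success probability is the logistic transform of the linear predictor. As a rough sketch of that likelihood, written in plain Julia, independent of TuringGLM's actual implementation and leaving out the priors, the log-likelihood for a given set of coefficients could be computed as below. The helper names log1pexp and bernoulli_loglik and the coefficient values are assumptions made for illustration; the three data rows are the first three rows of wells.

```julia
# Numerically stable log(1 + exp(x)).
log1pexp(x) = x > 0 ? x + log1p(exp(-x)) : log1p(exp(x))

# Bernoulli log-likelihood with a logit link (hypothetical helper).
# y[i] is 0 or 1, X is an n×k predictor matrix, α the intercept, β the coefficients.
function bernoulli_loglik(α, β, X, y)
    η = α .+ X * β                      # linear predictor, one value per observation
    # log p(y | η) = y * η - log(1 + exp(η))
    return sum(y .* η .- log1pexp.(η))
end

# First three rows of wells; columns are dist, arsenic, assoc, educ.
X = [16.826 2.36 0 0;
     47.322 0.71 0 0;
     20.967 2.07 0 10]
y = [1, 1, 0]

# Hypothetical coefficients, for illustration only (NOT posterior estimates).
α = 0.0
β = [-0.01, 0.5, -0.1, 0.04]

println(bernoulli_loglik(α, β, X, y))
```

A sampler such as NUTS explores coefficient values in proportion to this likelihood multiplied by the prior, so coefficient settings that explain the observed switching decisions better are visited more often.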

References

Gelman, A., & Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.