Gamma regression for blood clotting#

As a preliminary example of this package’s functionality, we provide an example of performing a Gamma regression, which is used when the response variable is continuous and positive. We have adapted the following canonical example of a Gamma regression from McCullagh & Nelder (1989).

Nine different percentage concentrations with prothrombin-free plasma (\(u\)) and clotting was induced via two lots of thromboplastin. Previous researchers had fitted a hyperbolic model, using an inverse transformation of the data for both lots \(1\) and \(2\), but we will analyze both lots using the inverse link and Gamma family.

The following initial plots hint at using a log scale for \(u\) to achieve inverse linearity, as well as the fact that the two lots have different regression and intercept coefficients.

[1]:

from scikit_stan import GLM

import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt

[3]:

# ATTRIBUTION: McCullagh & Nelder (1989), chapter 8.4.2 p 301-302
bcdata_dict = {
    "u": np.array([5, 10, 15, 20, 30, 40, 60, 80, 100]),
    "lot1": np.array([118, 58, 42, 35, 27, 25, 21, 19, 18]),
    "lot2": np.array([69, 35, 26, 21, 18, 16, 13, 12, 12]),
}
bc_data_X = np.log(bcdata_dict["u"])
bc_data_lot1 = bcdata_dict["lot1"]
bc_data_lot2 = bcdata_dict["lot2"]

[4]:

l1, = plt.plot(bcdata_dict["u"], bcdata_dict["lot1"], "o", label="lot 1")
l2, = plt.plot(bcdata_dict["u"], bcdata_dict["lot2"], "o", label="lot 2")

plt.suptitle("Mean Clotting Times vs Plasma Concentration")
plt.xlabel('Normal Plasma Concentration')
plt.ylabel('Blood Clotting Time')

plt.legend(handles=[l1, l2])

/home/docs/checkouts/readthedocs.org/user_builds/scikit-stan/conda/v0.1.0/lib/python3.9/site-packages/traitlets/traitlets.py:3258: FutureWarning: --rc={'figure.dpi': 96} for dict-traits is deprecated in traitlets 5.0. You can pass --rc <key=value> ... multiple times to add items to a dict.
  warn(

[4]:

<matplotlib.legend.Legend at 0x7f866f546f70>

../_images/examples_Gamma_Bloodclotting_5_2.svg

[5]:

l1, = plt.plot(bc_data_X, bc_data_lot1, "o", label="lot 1")
l2, = plt.plot(bc_data_X, bc_data_lot2, "o", label="lot 2")

plt.suptitle("Mean Clotting Times vs Plasma Concentration")
plt.xlabel('Normal Plasma Concentration')
plt.ylabel('Blood Clotting Time')

plt.legend(handles=[l1, l2])

[5]:

<matplotlib.legend.Legend at 0x7f866f4cedf0>

../_images/examples_Gamma_Bloodclotting_6_1.svg

After this preliminary data analysis, we fit two lines to the two lots of data. Using \(x = \log u\), we fit a GLM to the data.

The original results were as follows, and we recreate regression coefficients within a standard deviation of these values:

\[\text{lot 1:} \quad \hat{\mu} ^{-1} = - 0.01655(\pm 0.00086) + 0.01534(\pm 0.00143)x\]

\[\text{lot 2:} \quad \hat{\mu} ^{-1} = - 0.02391(\pm 0.00038) + 0.02360(\pm 0.00062)x\]

As in previous work, we will fit two different linear models for each lot in the dataset. As usual, the \(\alpha\) parameter is the regression intercept and \(\mathbf{\beta}\) is vector of regression coefficients and the parameter \(\sigma\) represents an auxiliary variable for the model. In this case, \(\sigma\) is the shape parameter for the Gamma distribution.

[6]:

# Initialize two different GLM objects, one for each lot.
glm_gamma1 = GLM(family="gamma", link="inverse", seed=1234)
glm_gamma2 = GLM(family="gamma", link="inverse", seed=1234)

# Fit the model. Note that default priors are used without autoscaling, see the
# API to see how to change these.
glm_gamma1.fit(bc_data_X, bc_data_lot1, show_console=False)
glm_gamma2.fit(bc_data_X, bc_data_lot2, show_console=False)

print(glm_gamma1.alpha_, glm_gamma1.beta_)
print(glm_gamma2.alpha_, glm_gamma2.beta_)

/home/docs/checkouts/readthedocs.org/user_builds/scikit-stan/checkouts/v0.1.0/scikit_stan/utils/validation.py:229: UserWarning: Passed data is one-dimensional, while estimator expects it to be at at least two-dimensional.
  warnings.warn(
/home/docs/checkouts/readthedocs.org/user_builds/scikit-stan/checkouts/v0.1.0/scikit_stan/generalized_linear_regression/glm.py:498: UserWarning: Prior on intercept not specified. Using default prior.
                alpha ~ normal(mu(y), 2.5 * sd(y)) if Gaussian family else normal(0, 2.5)
  warnings.warn(
/home/docs/checkouts/readthedocs.org/user_builds/scikit-stan/checkouts/v0.1.0/scikit_stan/generalized_linear_regression/glm.py:543: UserWarning: Prior on auxiliary parameter not specified. Using default unscaled prior
                        sigma ~ exponential(1)

  warnings.warn(
20:26:12 - cmdstanpy - INFO - CmdStan start processing

20:26:13 - cmdstanpy - INFO - CmdStan done processing.
20:26:13 - cmdstanpy - WARNING - Some chains may have failed to converge.
        Chain 1 had 20 divergent transitions (2.0%)
        Chain 2 had 39 divergent transitions (3.9%)
        Chain 3 had 26 divergent transitions (2.6%)
        Chain 4 had 34 divergent transitions (3.4%)
        Use function "diagnose()" to see further information.
20:26:13 - cmdstanpy - INFO - CmdStan start processing

20:26:13 - cmdstanpy - INFO - CmdStan done processing.
20:26:13 - cmdstanpy - WARNING - Some chains may have failed to converge.
        Chain 1 had 8 divergent transitions (0.8%)
        Chain 2 had 37 divergent transitions (3.7%)
        Chain 3 had 17 divergent transitions (1.7%)
        Chain 4 had 8 divergent transitions (0.8%)
        Use function "diagnose()" to see further information.


-0.014230614903001692 [0.01500947]
-0.019889316430406125 [0.02298595]

As can be seen above, the fitted model has the following parameters, which are within one standard deviation of the results from past studies.

\[\text{lot 1:} \quad \hat{\mu} ^{-1} = - 0.01437 + 0.01511 \cdot x\]

\[\text{lot 2:} \quad \hat{\mu} ^{-1} = - 0.02016 + 0.02301 \cdot x\]

As a verification of the accuracy of the fitted model, we can plot the fitted lines and the data.

[7]:

mu_inv1 = 1 /( glm_gamma1.alpha_ + glm_gamma1.beta_ * bc_data_X)
mu_inv2 = 1 /( glm_gamma2.alpha_ + glm_gamma2.beta_ * bc_data_X)

[8]:

mlot1, = plt.plot(bc_data_X, mu_inv1, "r", label="mu_inv lot 1")
mlot2, = plt.plot(bc_data_X, mu_inv2, "b", label="mu_inv lot 2")
l1, = plt.plot(bc_data_X, bc_data_lot1, "o", label="lot1")
l2, = plt.plot(bc_data_X, bc_data_lot2, "o", label="lot2")

plt.suptitle("Mean Clotting Times vs Plasma Concentration")
plt.xlabel('Normal Plasma Concentration')
plt.ylabel('Blood Clotting Time')

plt.legend(handles=[mlot1, mlot2, l1, l2])

[8]:

<matplotlib.legend.Legend at 0x7f866dbb8100>

../_images/examples_Gamma_Bloodclotting_11_1.svg

As this package is a wrapper around CmdStanPy, we can gather additional statistics about the fitted model with methods from that package. In particular, we can consider further statistics about the model by using CmdStanPy’s summary method on the results of the fit.

Notice that \(\mu\) (“mu”) and the link-inverted \(\mu\) (“mu unlinked”) are included as part of the model summary.

[9]:

glm_gamma1.fitted_samples_.summary()

[9]:

	Mean	MCSE	StdDev	5%	50%	95%	N_Eff	N_Eff/s	R_hat
name
lp__	-37.778600	0.046785	1.403670	-40.522200	-37.400600	-36.293900	900.1620	2557.28000	1.00412
alpha	-0.014231	0.000337	0.010197	-0.029048	-0.015165	0.003691	914.9120	2599.18000	1.00151
beta[1]	0.015010	0.000148	0.004417	0.007862	0.015049	0.022065	889.3810	2526.65000	1.00240
sigma	4.691710	0.053930	2.090850	1.836280	4.359500	8.560740	1502.9555	4269.75993	0.99991
mu[1]	0.009926	0.000119	0.004447	0.003997	0.009162	0.018356	1386.4200	3938.70000	1.00042
mu[2]	0.020330	0.000080	0.003828	0.014891	0.019936	0.027054	2308.7500	6558.94000	1.00129
mu[3]	0.026416	0.000107	0.004510	0.019901	0.026111	0.033674	1764.6300	5013.17000	1.00227
mu[4]	0.030734	0.000140	0.005317	0.022905	0.030426	0.039585	1443.2200	4100.07000	1.00264
mu[5]	0.036820	0.000193	0.006703	0.026990	0.036428	0.048149	1208.6500	3433.66000	1.00285
mu[6]	0.041138	0.000232	0.007787	0.029577	0.040712	0.054328	1122.6700	3189.40000	1.00289
mu[7]	0.047223	0.000290	0.009395	0.033157	0.046676	0.063299	1052.0200	2988.70000	1.00289
mu[8]	0.051541	0.000331	0.010571	0.035570	0.050966	0.069343	1020.8900	2900.24000	1.00288
mu[9]	0.054891	0.000363	0.011498	0.037506	0.054330	0.074204	1003.0600	2849.60000	1.00286
mu_unlinked[1]	124.261000	1.504720	68.359200	54.425400	109.158000	250.085000	2063.8600	5863.24000	1.00065
mu_unlinked[2]	50.864500	0.167089	9.347230	36.961500	50.161500	67.148200	3129.4700	8890.55000	1.00113
mu_unlinked[3]	38.934400	0.135521	6.603890	29.687800	38.303300	50.211100	2374.5700	6745.93000	1.00172
mu_unlinked[4]	33.495100	0.134257	5.800810	25.250000	32.875800	43.609700	1866.8300	5303.50000	1.00191
mu_unlinked[5]	28.056300	0.133558	5.192950	20.766700	27.453900	37.045600	1511.7900	4294.85000	1.00196
mu_unlinked[6]	25.187500	0.132405	4.919700	18.397700	24.564600	33.805600	1380.6000	3922.16000	1.00191
mu_unlinked[7]	22.038500	0.129934	4.634350	15.784500	21.424800	30.159800	1272.1300	3614.00000	1.00182
mu_unlinked[8]	20.254300	0.127863	4.472150	14.411100	19.620900	28.086300	1223.3300	3475.36000	1.00174
mu_unlinked[9]	19.062300	0.126239	4.361420	13.473300	18.406200	26.647000	1193.6300	3391.01000	1.00168

[10]:

glm_gamma2.fitted_samples_.summary()

[10]:

	Mean	MCSE	StdDev	5%	50%	95%	N_Eff	N_Eff/s	R_hat
name
lp__	-33.593000	0.052207	1.441860	-36.442000	-33.230700	-32.049700	762.7800	1745.49000	1.00250
alpha	-0.019889	0.000554	0.016820	-0.045271	-0.020698	0.009217	921.2220	2108.06000	1.00569
beta[1]	0.022986	0.000231	0.007030	0.011729	0.022965	0.034812	923.2150	2112.62000	1.00428
sigma	4.656330	0.059330	2.086980	1.825410	4.360940	8.579550	1237.3025	2831.35585	1.00036
mu[1]	0.017105	0.000219	0.007794	0.006798	0.015984	0.030622	1267.7900	2901.12000	1.00470
mu[2]	0.033038	0.000143	0.006629	0.023800	0.032344	0.044242	2137.4100	4891.11000	1.00120
mu[3]	0.042358	0.000163	0.007492	0.031572	0.041706	0.055453	2116.3100	4842.81000	1.00020
mu[4]	0.048970	0.000208	0.008639	0.036295	0.048330	0.064112	1726.4900	3950.79000	1.00032
mu[5]	0.058290	0.000287	0.010705	0.042183	0.057501	0.077193	1396.1200	3194.78000	1.00088
mu[6]	0.064903	0.000346	0.012362	0.045948	0.063959	0.086757	1273.8900	2915.09000	1.00129
mu[7]	0.074223	0.000434	0.014852	0.051542	0.073258	0.100046	1171.5800	2680.95000	1.00178
mu[8]	0.080836	0.000497	0.016688	0.055468	0.079888	0.109560	1126.7500	2578.38000	1.00206
mu[9]	0.085965	0.000547	0.018140	0.058355	0.085109	0.117108	1101.1200	2519.71000	1.00224
mu_unlinked[1]	72.768500	0.918299	43.936200	32.610500	62.596000	146.955000	2289.1700	5238.37000	1.00183
mu_unlinked[2]	31.432500	0.107813	6.150220	22.577300	30.921400	41.978700	3254.1600	7446.60000	1.00082
mu_unlinked[3]	24.341900	0.080036	4.341970	18.028400	23.977600	31.659300	2943.0500	6734.67000	1.00002
mu_unlinked[4]	21.060600	0.081197	3.791850	15.591400	20.692200	27.537500	2180.8700	4990.54000	1.00018
mu_unlinked[5]	17.746900	0.082769	3.371840	12.942500	17.391700	23.700700	1659.5800	3797.66000	1.00089
mu_unlinked[6]	15.985100	0.083394	3.185690	11.515800	15.636900	21.760300	1459.2800	3339.32000	1.00146
mu_unlinked[7]	14.039800	0.083320	2.996320	9.994920	13.651800	19.383800	1293.2400	2959.36000	1.00220
mu_unlinked[8]	12.932100	0.082892	2.891750	9.127060	12.518900	18.023100	1217.0200	2784.94000	1.00266
mu_unlinked[9]	12.189800	0.082430	2.821760	8.538730	11.752700	17.130600	1171.8400	2681.55000	1.00296

Additional information about the model and various visualizations can be revealed by Arviz, which seamlessly integrates with CmdStanPy components. Consider the following.

[11]:

import arviz as az
az.style.use("arviz-darkgrid")

[12]:

infdata = az.from_cmdstanpy(glm_gamma1.fitted_samples_)
infdata

[13]:

az.plot_posterior(infdata, var_names=['alpha', 'beta', 'sigma']);

../_images/examples_Gamma_Bloodclotting_18_0.svg

[14]:

az.plot_trace(infdata, var_names=['alpha', 'beta', 'sigma'], compact=True);

../_images/examples_Gamma_Bloodclotting_19_0.svg