Gamma regression for blood clotting#

As a preliminary example of this package’s functionality, we provide an example of performing a Gamma regression, which is used when the response variable is continuous and positive. We have adapted the following canonical example of a Gamma regression from McCullagh & Nelder (1989).

Nine different percentage concentrations with prothrombin-free plasma (\(u\)) and clotting was induced via two lots of thromboplastin. Previous researchers had fitted a hyperbolic model, using an inverse transformation of the data for both lots \(1\) and \(2\), but we will analyze both lots using the inverse link and Gamma family.

The following initial plots hint at using a log scale for \(u\) to achieve inverse linearity, as well as the fact that the two lots have different regression and intercept coefficients.

[1]:
from scikit_stan import GLM

import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt
[3]:
# ATTRIBUTION: McCullagh & Nelder (1989), chapter 8.4.2 p 301-302
bcdata_dict = {
    "u": np.array([5, 10, 15, 20, 30, 40, 60, 80, 100]),
    "lot1": np.array([118, 58, 42, 35, 27, 25, 21, 19, 18]),
    "lot2": np.array([69, 35, 26, 21, 18, 16, 13, 12, 12]),
}
bc_data_X = np.log(bcdata_dict["u"])
bc_data_lot1 = bcdata_dict["lot1"]
bc_data_lot2 = bcdata_dict["lot2"]
[4]:
l1, = plt.plot(bcdata_dict["u"], bcdata_dict["lot1"], "o", label="lot 1")
l2, = plt.plot(bcdata_dict["u"], bcdata_dict["lot2"], "o", label="lot 2")

plt.suptitle("Mean Clotting Times vs Plasma Concentration")
plt.xlabel('Normal Plasma Concentration')
plt.ylabel('Blood Clotting Time')

plt.legend(handles=[l1, l2])
/home/docs/checkouts/readthedocs.org/user_builds/scikit-stan/conda/v0.1.0/lib/python3.9/site-packages/traitlets/traitlets.py:3258: FutureWarning: --rc={'figure.dpi': 96} for dict-traits is deprecated in traitlets 5.0. You can pass --rc <key=value> ... multiple times to add items to a dict.
  warn(
[4]:
<matplotlib.legend.Legend at 0x7f866f546f70>
../_images/examples_Gamma_Bloodclotting_5_2.svg
[5]:
l1, = plt.plot(bc_data_X, bc_data_lot1, "o", label="lot 1")
l2, = plt.plot(bc_data_X, bc_data_lot2, "o", label="lot 2")

plt.suptitle("Mean Clotting Times vs Plasma Concentration")
plt.xlabel('Normal Plasma Concentration')
plt.ylabel('Blood Clotting Time')

plt.legend(handles=[l1, l2])
[5]:
<matplotlib.legend.Legend at 0x7f866f4cedf0>
../_images/examples_Gamma_Bloodclotting_6_1.svg

After this preliminary data analysis, we fit two lines to the two lots of data. Using \(x = \log u\), we fit a GLM to the data.

The original results were as follows, and we recreate regression coefficients within a standard deviation of these values:

\[\text{lot 1:} \quad \hat{\mu} ^{-1} = - 0.01655(\pm 0.00086) + 0.01534(\pm 0.00143)x\]
\[\text{lot 2:} \quad \hat{\mu} ^{-1} = - 0.02391(\pm 0.00038) + 0.02360(\pm 0.00062)x\]

As in previous work, we will fit two different linear models for each lot in the dataset. As usual, the \(\alpha\) parameter is the regression intercept and \(\mathbf{\beta}\) is vector of regression coefficients and the parameter \(\sigma\) represents an auxiliary variable for the model. In this case, \(\sigma\) is the shape parameter for the Gamma distribution.

[6]:
# Initialize two different GLM objects, one for each lot.
glm_gamma1 = GLM(family="gamma", link="inverse", seed=1234)
glm_gamma2 = GLM(family="gamma", link="inverse", seed=1234)

# Fit the model. Note that default priors are used without autoscaling, see the
# API to see how to change these.
glm_gamma1.fit(bc_data_X, bc_data_lot1, show_console=False)
glm_gamma2.fit(bc_data_X, bc_data_lot2, show_console=False)

print(glm_gamma1.alpha_, glm_gamma1.beta_)
print(glm_gamma2.alpha_, glm_gamma2.beta_)
/home/docs/checkouts/readthedocs.org/user_builds/scikit-stan/checkouts/v0.1.0/scikit_stan/utils/validation.py:229: UserWarning: Passed data is one-dimensional, while estimator expects it to be at at least two-dimensional.
  warnings.warn(
/home/docs/checkouts/readthedocs.org/user_builds/scikit-stan/checkouts/v0.1.0/scikit_stan/generalized_linear_regression/glm.py:498: UserWarning: Prior on intercept not specified. Using default prior.
                alpha ~ normal(mu(y), 2.5 * sd(y)) if Gaussian family else normal(0, 2.5)
  warnings.warn(
/home/docs/checkouts/readthedocs.org/user_builds/scikit-stan/checkouts/v0.1.0/scikit_stan/generalized_linear_regression/glm.py:543: UserWarning: Prior on auxiliary parameter not specified. Using default unscaled prior
                        sigma ~ exponential(1)

  warnings.warn(
20:26:12 - cmdstanpy - INFO - CmdStan start processing

20:26:13 - cmdstanpy - INFO - CmdStan done processing.
20:26:13 - cmdstanpy - WARNING - Some chains may have failed to converge.
        Chain 1 had 20 divergent transitions (2.0%)
        Chain 2 had 39 divergent transitions (3.9%)
        Chain 3 had 26 divergent transitions (2.6%)
        Chain 4 had 34 divergent transitions (3.4%)
        Use function "diagnose()" to see further information.
20:26:13 - cmdstanpy - INFO - CmdStan start processing


20:26:13 - cmdstanpy - INFO - CmdStan done processing.
20:26:13 - cmdstanpy - WARNING - Some chains may have failed to converge.
        Chain 1 had 8 divergent transitions (0.8%)
        Chain 2 had 37 divergent transitions (3.7%)
        Chain 3 had 17 divergent transitions (1.7%)
        Chain 4 had 8 divergent transitions (0.8%)
        Use function "diagnose()" to see further information.

-0.014230614903001692 [0.01500947]
-0.019889316430406125 [0.02298595]

As can be seen above, the fitted model has the following parameters, which are within one standard deviation of the results from past studies.

\[\text{lot 1:} \quad \hat{\mu} ^{-1} = - 0.01437 + 0.01511 \cdot x\]
\[\text{lot 2:} \quad \hat{\mu} ^{-1} = - 0.02016 + 0.02301 \cdot x\]

As a verification of the accuracy of the fitted model, we can plot the fitted lines and the data.

[7]:
mu_inv1 = 1 /( glm_gamma1.alpha_ + glm_gamma1.beta_ * bc_data_X)
mu_inv2 = 1 /( glm_gamma2.alpha_ + glm_gamma2.beta_ * bc_data_X)
[8]:
mlot1, = plt.plot(bc_data_X, mu_inv1, "r", label="mu_inv lot 1")
mlot2, = plt.plot(bc_data_X, mu_inv2, "b", label="mu_inv lot 2")
l1, = plt.plot(bc_data_X, bc_data_lot1, "o", label="lot1")
l2, = plt.plot(bc_data_X, bc_data_lot2, "o", label="lot2")

plt.suptitle("Mean Clotting Times vs Plasma Concentration")
plt.xlabel('Normal Plasma Concentration')
plt.ylabel('Blood Clotting Time')

plt.legend(handles=[mlot1, mlot2, l1, l2])
[8]:
<matplotlib.legend.Legend at 0x7f866dbb8100>
../_images/examples_Gamma_Bloodclotting_11_1.svg

As this package is a wrapper around CmdStanPy, we can gather additional statistics about the fitted model with methods from that package. In particular, we can consider further statistics about the model by using CmdStanPy’s summary method on the results of the fit.

Notice that \(\mu\) (“mu”) and the link-inverted \(\mu\) (“mu unlinked”) are included as part of the model summary.

[9]:
glm_gamma1.fitted_samples_.summary()
[9]:
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
name
lp__ -37.778600 0.046785 1.403670 -40.522200 -37.400600 -36.293900 900.1620 2557.28000 1.00412
alpha -0.014231 0.000337 0.010197 -0.029048 -0.015165 0.003691 914.9120 2599.18000 1.00151
beta[1] 0.015010 0.000148 0.004417 0.007862 0.015049 0.022065 889.3810 2526.65000 1.00240
sigma 4.691710 0.053930 2.090850 1.836280 4.359500 8.560740 1502.9555 4269.75993 0.99991
mu[1] 0.009926 0.000119 0.004447 0.003997 0.009162 0.018356 1386.4200 3938.70000 1.00042
mu[2] 0.020330 0.000080 0.003828 0.014891 0.019936 0.027054 2308.7500 6558.94000 1.00129
mu[3] 0.026416 0.000107 0.004510 0.019901 0.026111 0.033674 1764.6300 5013.17000 1.00227
mu[4] 0.030734 0.000140 0.005317 0.022905 0.030426 0.039585 1443.2200 4100.07000 1.00264
mu[5] 0.036820 0.000193 0.006703 0.026990 0.036428 0.048149 1208.6500 3433.66000 1.00285
mu[6] 0.041138 0.000232 0.007787 0.029577 0.040712 0.054328 1122.6700 3189.40000 1.00289
mu[7] 0.047223 0.000290 0.009395 0.033157 0.046676 0.063299 1052.0200 2988.70000 1.00289
mu[8] 0.051541 0.000331 0.010571 0.035570 0.050966 0.069343 1020.8900 2900.24000 1.00288
mu[9] 0.054891 0.000363 0.011498 0.037506 0.054330 0.074204 1003.0600 2849.60000 1.00286
mu_unlinked[1] 124.261000 1.504720 68.359200 54.425400 109.158000 250.085000 2063.8600 5863.24000 1.00065
mu_unlinked[2] 50.864500 0.167089 9.347230 36.961500 50.161500 67.148200 3129.4700 8890.55000 1.00113
mu_unlinked[3] 38.934400 0.135521 6.603890 29.687800 38.303300 50.211100 2374.5700 6745.93000 1.00172
mu_unlinked[4] 33.495100 0.134257 5.800810 25.250000 32.875800 43.609700 1866.8300 5303.50000 1.00191
mu_unlinked[5] 28.056300 0.133558 5.192950 20.766700 27.453900 37.045600 1511.7900 4294.85000 1.00196
mu_unlinked[6] 25.187500 0.132405 4.919700 18.397700 24.564600 33.805600 1380.6000 3922.16000 1.00191
mu_unlinked[7] 22.038500 0.129934 4.634350 15.784500 21.424800 30.159800 1272.1300 3614.00000 1.00182
mu_unlinked[8] 20.254300 0.127863 4.472150 14.411100 19.620900 28.086300 1223.3300 3475.36000 1.00174
mu_unlinked[9] 19.062300 0.126239 4.361420 13.473300 18.406200 26.647000 1193.6300 3391.01000 1.00168
[10]:
glm_gamma2.fitted_samples_.summary()
[10]:
Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat
name
lp__ -33.593000 0.052207 1.441860 -36.442000 -33.230700 -32.049700 762.7800 1745.49000 1.00250
alpha -0.019889 0.000554 0.016820 -0.045271 -0.020698 0.009217 921.2220 2108.06000 1.00569
beta[1] 0.022986 0.000231 0.007030 0.011729 0.022965 0.034812 923.2150 2112.62000 1.00428
sigma 4.656330 0.059330 2.086980 1.825410 4.360940 8.579550 1237.3025 2831.35585 1.00036
mu[1] 0.017105 0.000219 0.007794 0.006798 0.015984 0.030622 1267.7900 2901.12000 1.00470
mu[2] 0.033038 0.000143 0.006629 0.023800 0.032344 0.044242 2137.4100 4891.11000 1.00120
mu[3] 0.042358 0.000163 0.007492 0.031572 0.041706 0.055453 2116.3100 4842.81000 1.00020
mu[4] 0.048970 0.000208 0.008639 0.036295 0.048330 0.064112 1726.4900 3950.79000 1.00032
mu[5] 0.058290 0.000287 0.010705 0.042183 0.057501 0.077193 1396.1200 3194.78000 1.00088
mu[6] 0.064903 0.000346 0.012362 0.045948 0.063959 0.086757 1273.8900 2915.09000 1.00129
mu[7] 0.074223 0.000434 0.014852 0.051542 0.073258 0.100046 1171.5800 2680.95000 1.00178
mu[8] 0.080836 0.000497 0.016688 0.055468 0.079888 0.109560 1126.7500 2578.38000 1.00206
mu[9] 0.085965 0.000547 0.018140 0.058355 0.085109 0.117108 1101.1200 2519.71000 1.00224
mu_unlinked[1] 72.768500 0.918299 43.936200 32.610500 62.596000 146.955000 2289.1700 5238.37000 1.00183
mu_unlinked[2] 31.432500 0.107813 6.150220 22.577300 30.921400 41.978700 3254.1600 7446.60000 1.00082
mu_unlinked[3] 24.341900 0.080036 4.341970 18.028400 23.977600 31.659300 2943.0500 6734.67000 1.00002
mu_unlinked[4] 21.060600 0.081197 3.791850 15.591400 20.692200 27.537500 2180.8700 4990.54000 1.00018
mu_unlinked[5] 17.746900 0.082769 3.371840 12.942500 17.391700 23.700700 1659.5800 3797.66000 1.00089
mu_unlinked[6] 15.985100 0.083394 3.185690 11.515800 15.636900 21.760300 1459.2800 3339.32000 1.00146
mu_unlinked[7] 14.039800 0.083320 2.996320 9.994920 13.651800 19.383800 1293.2400 2959.36000 1.00220
mu_unlinked[8] 12.932100 0.082892 2.891750 9.127060 12.518900 18.023100 1217.0200 2784.94000 1.00266
mu_unlinked[9] 12.189800 0.082430 2.821760 8.538730 11.752700 17.130600 1171.8400 2681.55000 1.00296

Additional information about the model and various visualizations can be revealed by Arviz, which seamlessly integrates with CmdStanPy components. Consider the following.

[11]:
import arviz as az
az.style.use("arviz-darkgrid")
[12]:
infdata = az.from_cmdstanpy(glm_gamma1.fitted_samples_)
infdata
[12]:
arviz.InferenceData
    • <xarray.Dataset>
      Dimensions:            (chain: 4, draw: 1000, beta_dim_0: 1, mu_dim_0: 9,
                              mu_unlinked_dim_0: 9)
      Coordinates:
        * chain              (chain) int64 0 1 2 3
        * draw               (draw) int64 0 1 2 3 4 5 6 ... 994 995 996 997 998 999
        * beta_dim_0         (beta_dim_0) int64 0
        * mu_dim_0           (mu_dim_0) int64 0 1 2 3 4 5 6 7 8
        * mu_unlinked_dim_0  (mu_unlinked_dim_0) int64 0 1 2 3 4 5 6 7 8
      Data variables:
          alpha              (chain, draw) float64 -0.01532 -0.006448 ... -0.009792
          beta               (chain, draw, beta_dim_0) float64 0.01482 ... 0.01252
          sigma              (chain, draw) float64 5.693 4.149 6.202 ... 3.381 8.383
          mu                 (chain, draw, mu_dim_0) float64 0.008525 ... 0.04785
          mu_unlinked        (chain, draw, mu_unlinked_dim_0) float64 117.3 ... 20.9
      Attributes:
          created_at:                 2022-08-05T20:26:14.832783
          arviz_version:              0.12.1
          inference_library:          cmdstanpy
          inference_library_version:  1.0.4

    • <xarray.Dataset>
      Dimensions:          (chain: 4, draw: 1000)
      Coordinates:
        * chain            (chain) int64 0 1 2 3
        * draw             (draw) int64 0 1 2 3 4 5 6 ... 993 994 995 996 997 998 999
      Data variables:
          lp               (chain, draw) float64 -36.11 -36.88 ... -36.73 -37.15
          acceptance_rate  (chain, draw) float64 0.9991 0.9936 ... 0.9803 0.9989
          step_size        (chain, draw) float64 0.1811 0.1811 ... 0.2019 0.2019
          tree_depth       (chain, draw) int64 3 3 4 2 3 3 4 3 3 ... 3 4 4 4 3 2 3 3 4
          n_steps          (chain, draw) int64 15 11 15 5 7 7 15 ... 15 15 7 15 15 15
          diverging        (chain, draw) bool False False False ... False False False
          energy           (chain, draw) float64 36.58 37.17 38.51 ... 37.23 37.33
      Attributes:
          created_at:                 2022-08-05T20:26:14.837437
          arviz_version:              0.12.1
          inference_library:          cmdstanpy
          inference_library_version:  1.0.4

[13]:
az.plot_posterior(infdata, var_names=['alpha', 'beta', 'sigma']);
../_images/examples_Gamma_Bloodclotting_18_0.svg
[14]:
az.plot_trace(infdata, var_names=['alpha', 'beta', 'sigma'], compact=True);
../_images/examples_Gamma_Bloodclotting_19_0.svg