Introduction

This document explore the longitudinal progression of emphysema subtypes (sLTPs/QES) over time - from either baseline or any other QES form to after 5 or 6 years later and so on. The dataset is comprised of CT scans from COPD patients.

Different patients have different proportions on emphysema, so that is why also different pixels (corresponding to the number of observations) were recorded - What I call multiple measures. In other words, different patients had different number of QES pixels labeled for their lungs.

Data Description:

copd_df = fread("Data/spatial-temporal_mesa_copd.csv", data.table = F) %>%
  janitor::clean_names() %>%
  dplyr::select(-v8,-dice) %>%
  rename(sLTP_baseline = s_ltp_copd1_ne_1,
         sLTP_follow_up = s_ltp_copd2_ne_1) %>%
  mutate(
    sLTP_baseline = factor(sLTP_baseline, levels = c("-1","1","2","3","4",
                                                     "5","6","7","8","9","10")),
    sLTP_follow_up = factor(sLTP_follow_up, levels = c("-1","1","2","3","4",
                                                        "5","6","7","8","9","10")),
    x = as.numeric(unlist(x)),
    y = as.numeric(unlist(y)),
    z = as.numeric(unlist(z)),
    qes_baseline = recode(sLTP_baseline, "-1" = "NE", "4" = "Diffuse", "6" = "Diffuse", "7" = "Senile",
                          "3" = "Apical", "5" = "Apical", "9" = "Apical",
                          "8" = "oCPFE", "10" = "rCPFE",
                          "1" = "Vanishing", "2" = "Vanishing"),
    qes_follow_up = recode(sLTP_follow_up, "-1" = "NE", "4" = "Diffuse", "6" = "Diffuse", "7" = "Senile",
                          "3" = "Apical", "5" = "Apical", "9" = "Apical",
                          "8" = "oCPFE", "10" = "rCPFE",
                          "1" = "Vanishing", "2" = "Vanishing")
  )
## Warning in fread("Data/spatial-temporal_mesa_copd.csv", data.table = F):
## Detected 8 column names but the data has 7 columns. Filling rows automatically.
## Set fill=TRUE explicitly to avoid this warning.

Exploratory Data Analysis

No Emphysema (NE) is Included

The IDNO variable

Number of patients (including those with NE):

I will count the number of patients by counting distinct IDs.

copd_df %>% 
  summarise(n = n_distinct(idno))
##     n
## 1 126

There are 126 patients with differing CT scans measurements.

Next, we explore patients who were eliminated due to the elimination of unwanted multiple lung locations in the dataset. In brief, transitions from QES to NE constituted the entirety of the multiple measurements we did not want to include in the research study. Therefore, this research only included valid transitions, meaning NE to NE, NE to QES, and QES to QES.

# the dataset we want
copd_wewant = copd_df %>%
  filter(!(qes_baseline != "NE" & qes_follow_up == "NE")) %>% 
  mutate(idno = as_factor(idno))

Basic Statistics

# Descriptive stats
tab1_controls = tableby.control(
               total = T,
               test = T,
               numeric.stats = c("meansd", "medianq1q3", "range"),
               cat.stats = "countpct",
               stats.labels = list(
               meansd = "Mean (SD)",
               medianq1q3 = "Median (Q1, Q3)",
               range = "Min - Max",
               countpct = "N (%)"))
tab1 = tableby(~ x + y + z + qes_baseline + qes_follow_up,
                data = copd_wewant, control = tab1_controls)
summary(tab1, title = 'Descriptive Statistics', digits = 2, text = T)
Descriptive Statistics
Overall (N=2808544)
x
- Mean (SD) 211.70 (114.25)
- Median (Q1, Q3) 237.00 (102.00, 311.00)
- Min - Max 21.00 - 462.00
y
- Mean (SD) 132.15 (65.32)
- Median (Q1, Q3) 124.00 (79.00, 180.00)
- Min - Max 18.00 - 342.00
z
- Mean (SD) 256.74 (110.92)
- Median (Q1, Q3) 255.00 (169.00, 343.00)
- Min - Max 18.00 - 607.00
qes_baseline
- NE 2355152 (83.9%)
- Vanishing 9203 (0.3%)
- Apical 80523 (2.9%)
- Diffuse 177548 (6.3%)
- Senile 109652 (3.9%)
- oCPFE 41034 (1.5%)
- rCPFE 35432 (1.3%)
qes_follow_up
- NE 2146336 (76.4%)
- Vanishing 14798 (0.5%)
- Apical 136795 (4.9%)
- Diffuse 218556 (7.8%)
- Senile 163094 (5.8%)
- oCPFE 71878 (2.6%)
- rCPFE 57087 (2.0%)
  • From this Table 1. we can identify descriptive statistics of the 3D coordinates (x,y,z) and also the rarest of QES at both baseline and follow-up.

  • At baseline, two of the rarest QES are Vanishing Lung and Retrictive CPFE which only appears 1.4% and 5.4% of the times respectively.

  • At follow-up, two of the rarest QES are also Vanishing Lung and Restrictive CPFE which appears 2.2% and 8.6% of the times respectively.

  • Another important observation was that the order with respect to prevalence of QES did not change from baseline to follow up where the diffuse QES occured the most followed by Senile then Apical.

# changing the initial dataset (Excluding NE on both sides)
copd_wewant_no = copd_wewant %>% 
  filter(qes_baseline != "NE", qes_follow_up != "NE") %>% 
  mutate(
    qes_baseline = factor(qes_baseline, levels = c("Vanishing","Apical","Diffuse",
                                                   "Senile","oCPFE","rCPFE")),
    qes_follow_up = factor(qes_follow_up, levels = c("Vanishing","Apical","Diffuse",
                                                     "Senile","oCPFE","rCPFE")))
# Descriptive stats
tab1_controls = tableby.control(
               total = T,
               test = T,
               numeric.stats = c("meansd", "medianq1q3", "range"),
               cat.stats = "countpct",
               stats.labels = list(
               meansd = "Mean (SD)",
               medianq1q3 = "Median (Q1, Q3)",
               range = "Min - Max",
               countpct = "N (%)"))
tab1 = tableby(~ qes_baseline + qes_follow_up,
                data = copd_wewant_no, control = tab1_controls)
summary(tab1, title = 'Descriptive Statistics - Excluding No Emphysema', digits = 2, text = T)
Descriptive Statistics - Excluding No Emphysema
Overall (N=453392)
qes_baseline
- Vanishing 9203 (2.0%)
- Apical 80523 (17.8%)
- Diffuse 177548 (39.2%)
- Senile 109652 (24.2%)
- oCPFE 41034 (9.1%)
- rCPFE 35432 (7.8%)
qes_follow_up
- Vanishing 14768 (3.3%)
- Apical 122071 (26.9%)
- Diffuse 166596 (36.7%)
- Senile 88983 (19.6%)
- oCPFE 29380 (6.5%)
- rCPFE 31594 (7.0%)

The table above represents percent frequencies of transitions from QES to QES, excluding No Emphysema at both baseline and follow up.

Number of measured lung locations per patient (Excluding NE):

I will use the group_by function to count how many times a particular ID appears, meaning how many times a single patient’s CT scan was taken.

Table 3: Measured Locations per Patient

copd_wewant %>% 
  group_by(idno) %>% 
  summarise(multiple_locations = n()) %>%
  mutate_at("multiple_locations", funs(prettyNum(., big.mark = ","))) %>%
  kable()
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or
##   lambdas:
## 
## # Simple named list: list(mean = mean, median
##   = median)
## 
## # Auto named with `tibble::lst()`:
##   tibble::lst(mean, median)
## 
## # Using lambdas list(~ mean(., trim = .2), ~
##   median(., na.rm = TRUE))
idno multiple_locations
4010100 26,384
4010700 14,162
4010800 12,992
4011000 32,308
4011200 8,698
4011700 13,748
4012000 18,249
4012200 15,750
4012500 10,657
4013200 12,978
4014400 13,249
4014800 10,955
4014900 25,000
4015000 12,475
4015300 16,846
4015500 12,892
4015900 11,549
4016100 15,065
4016800 21,555
4017200 17,049
4017500 41,971
4017600 36,159
4017700 12,533
4017900 12,388
4018000 16,847
4018700 12,996
4019100 20,945
4019300 16,943
4019400 12,801
4021300 12,602
4021900 10,520
4022100 29,040
4022200 12,637
4022300 6,591
4022500 13,511
4022600 11,903
4022900 12,419
4023000 28,724
4400000 8,449
4400100 77,850
4400200 83,262
4400300 102,419
4400400 47,588
4400500 97,408
4400600 77,744
4400700 61,381
4400800 76,524
4400900 83,427
4401000 45,258
4401100 51,444
4401300 43,457
4401400 31,919
4401500 44,606
4401600 46,492
4401700 37,495
5010100 13,073
5010200 13,779
5010400 12,940
5010800 13,491
5011100 14,085
5011200 14,926
5011300 11,552
5011500 12,574
5011900 14,286
5012000 7,884
5012200 14,615
5012300 17,550
5012600 18,573
5012700 17,692
5013100 14,767
5013600 12,696
5013700 18,741
5014200 18,463
5015800 13,900
5016700 14,902
5017600 30,305
5018400 13,320
5018700 12,750
5018900 16,204
5021800 13,472
5021900 33,230
5022000 9,947
5022600 10,678
5022800 13,961
7011600 13,529
7012200 19,972
7013600 12,690
7014000 17,095
7014900 11,052
7015400 34,355
7015800 12,550
7015900 19,024
7016200 16,727
7016400 9,342
7017800 21,409
7018400 10,093
7018500 20,197
7018600 20,917
7019200 15,841
7019300 16,760
7019400 9,734
7019500 20,318
7019800 17,939
7042100 15,697
7043300 13,305
7043900 13,670
7044000 20,965
8011100 18,913
8011200 20,239
8012300 18,158
8012800 12,301
8013600 18,177
8013900 15,181
8014200 17,562
8014900 22,011
8015200 12,550
8016300 13,535
8017000 13,387
8017300 20,157
8018400 19,116
8018500 13,344
8021400 12,763
8021900 17,850
8022500 14,635
8023000 20,363
8023100 19,951
copd_wewant %>% 
  ggplot(aes(idno, fill = qes_baseline)) +
  scale_y_log10() +
   geom_bar() +
  labs(
    x = "Number of Patients",
    caption = "Figure 1. Number of Pixels per Patient"
  ) +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())

The bar chart above represents the number of pixels measured per patient labeled by the proportion of QES for each patient. We can see that in most of the patients the NE is dominant.

Note: The x-axis was cleared because patients’ IDs made the plot impossible to read

Ranking QES according to occurrence at baseline and Follow up

rank_baseline = copd_wewant %>% 
  group_by(qes_baseline) %>% 
  summarise(bas_num = n()) %>% 
  arrange(desc(bas_num))

# # ranking QES at baseline
# rank_baseline = sqldf("SELECT qes_baseline, COUNT(qes_baseline) AS bas_num
#                       FROM copd_wewant
#                       GROUP BY qes_baseline
#                       ORDER BY bas_num DESC")

p1 = rank_baseline %>% 
  rename(`QES at Baseline` = "qes_baseline") %>%
  mutate(
    `QES at Baseline` = fct_inorder(`QES at Baseline`)) %>% 
  ggplot(aes(`QES at Baseline`, bas_num, fill = `QES at Baseline`)) + 
  geom_bar(stat = "identity") + 
  scale_y_log10(breaks = c(0,10,100,1000,10000, 100000)) +
  labs(
    y = "Baseline Count",
    caption = "Figure 2. Frequencies of NE and QES at Baseline"
  ) +
  theme_bw() +
  theme(legend.position = "none")

# ranking QES at follow up
rank_follow_up = copd_wewant %>% 
  group_by(qes_follow_up) %>% 
  summarise(follow_num = n()) %>% 
  arrange(desc(follow_num))

# rank_follow_up = sqldf("SELECT qes_follow_up, COUNT(qes_follow_up) AS follow_num
#                        FROM copd_wewant
#                        GROUP BY qes_follow_up
#                        ORDER BY follow_num DESC")

p2 = rank_follow_up %>%
  rename(`QES at follow up` = "qes_follow_up") %>%
  mutate(
    `QES at follow up` = fct_inorder(`QES at follow up`)) %>% 
  ggplot(aes(`QES at follow up`, follow_num, fill = `QES at follow up`)) +
  geom_bar(stat = "identity") + 
  scale_y_log10(breaks = c(0,10,100,1000,10000, 100000)) +
  labs(
    y = "Follow up Count",
    caption = "Figure 3. Frequencies of NE and QES at Follow Up"
  ) +
  theme_bw() +
  theme(legend.position = "none")

p1 / p2

Both bar charts above depicts QES frequencies at both baseline and follow up. Excluding the No emphysema (NE), the Diffuse QES appears more often while Vanishing Lung QES appears the least.

Sankey Diagram

A Sankey diagram is a type of flow diagram in which the width of the arrows is proportional to the flow rate. Therefore, this plot illustrates a sankey diagram that represents the transition from baseline QES flowing into QES at follow up. Notice that the transition from No Emphysema to No emphysema or from QES to No Emphysema are not considered as they do not clinically make sense to be looked at.

# importing the dataset
sankey_df = read.csv("Data/data_sankey.csv")
sankey_df = sankey_df %>% 
  dplyr::select(-X) %>% 
  mutate(
    source = as.character(source),
    target = as.character(target),
    value = as.numeric(value)
  )
sankey_df$target = paste(sankey_df$target, " ", sep = "")

# From these flows we need to create a node data frame: it lists every entities involved in the flow
nodes = data.frame(name = c(as.character(sankey_df$source), as.character(sankey_df$target)) %>% unique())
 
# With networkD3, connection must be provided using id, not using real name like in the links dataframe.. So we need to reformat it.
sankey_df$IDsource = match(sankey_df$source, nodes$name) - 1 
sankey_df$IDtarget = match(sankey_df$target, nodes$name) - 1

# prepare colour scale
ColourScal = 'd3.scaleOrdinal() .range(["#FDE725FF","#B4DE2CFF","#6DCD59FF","#35B779FF","#1F9E89FF","#26828EFF","#31688EFF"])'

# Make the Network
sankeyNetwork(Links = sankey_df, Nodes = nodes,
                     Source = "IDsource", Target = "IDtarget",
                     Value = "value", NodeID = "name", iterations = 32, 
              sinksRight = FALSE, colourScale = ColourScal, 
              nodeWidth = 40, fontSize = 12, nodePadding = 20, LinkGroup = nodes)

Plotting the 3D coordinates

Below, I explore a 3D plot of randomly selected ID numbers using the x,y,z pixel coordinates in the dataset. Please note that the colors merely represent different locations of the lung measured for this particular patient.

p1 = copd_wewant[1:4] %>% 
  filter(idno == "4010100") %>% 
  scatterplot3d(main = "Figure 5. 3D Plot of a Randomly Selected Patient")

The 3D plots explores the patient’s (2 randomly selected patients) emphysema severity. The CT scan shows these patients have a wide ranging emphysema subtype all over their lungs.

Hypothesis Testing: Contigency Tables for QES - Test of Independence

This test will be used for testing independence or association between QES at baseline and QES at follow up. The question of interest is whether knowledge of one variable’s value provides any information about the value of the other variable, meaning are QES at baseline and follow up independent?

\[H_0=Baseline\:and\:follow\:up\:QES\:are\:independent.\]

\[Vs\]

\[H_1=Baseline\:and\:follow\:up\:QES\:are\:dependent\:or\:associated.\]

we will then compute the test statistic with the formula below:

\[\chi^2=\sum_{i=1}^R\sum_{j=1}^C(\frac{(O-E)^2}{E})\sim\chi^2_{(R-1)\times(C-1)}\:where\:df=(R-1)\times(C-1)\]

\[Reject\:H_0:\:if\:\chi^2>\chi^2_{(R-1)\times(C-1),1-\alpha}\]

\[Fail\:to\:Reject\:H_0:\:if\:\chi^2\leq\chi^2_{(R-1)\times(C-1),1-\alpha}\]

Observed Frequencies:

obs_df = table(copd_wewant$qes_baseline,copd_wewant$qes_follow_up)
obs_df %>% kable()
NE Vanishing Apical Diffuse Senile oCPFE rCPFE
NE 2146336 30 14724 51960 74111 42498 25493
Vanishing 0 7549 1545 57 15 13 24
Apical 0 6369 61096 6169 2319 943 3627
Diffuse 0 760 30522 121251 21694 1955 1366
Senile 0 52 14499 34015 48456 8377 4253
oCPFE 0 19 4945 4113 13346 14098 4513
rCPFE 0 19 9464 991 3153 3994 17811
test = chisq.test(table(copd_wewant$qes_baseline,copd_wewant$qes_follow_up))
test

Expected Frequencies:

exp_df = test$expected
exp_df %>% kable()
NE Vanishing Apical Diffuse Senile oCPFE rCPFE
NE 1799846.299 12409.11280 114711.7574 183273.8246 136765.2279 60274.5107 47871.2679
Vanishing 7033.086 48.48989 448.2481 716.1614 534.4243 235.5289 187.0619
Apical 61537.015 424.26943 3922.0122 6266.1595 4676.0236 2060.7946 1636.7258
Diffuse 135685.132 935.48661 8647.7829 13816.4760 10310.3293 4543.9185 3608.8744
Senile 83797.881 577.74786 5340.7906 8532.9276 6367.5639 2806.2820 2228.8074
oCPFE 31358.865 216.20496 1998.6320 3193.1944 2382.8714 1050.1676 834.0649
rCPFE 27077.723 186.68845 1725.7769 2757.2565 2057.5596 906.7977 720.1976

Chi-squared Test results:

test = chisq.test(table(copd_wewant$qes_baseline,copd_wewant$qes_follow_up))
test
## 
##  Pearson's Chi-squared test
## 
## data:  table(copd_wewant$qes_baseline, copd_wewant$qes_follow_up)
## X-squared = 4712291, df = 36, p-value < 2.2e-16

Therefore with 36 degrees of freedom (formula provided above), the Chi-squared was calculated to be 4.7122909^{6} with a P-value of 0 which is far lower than the significance level of 0.05.

Conclusion:

\[\:Since\:\chi^2>\chi^2_{36,0.95}=51\]

We reject the null hypothesis at 0.05 significance level and conclude that there is enough evidence that baseline QES is significantly associated with QES at follow up.

Note: Even though a permutation test can only be conducted if anyone of the Expected cell counts is less than or equal to 5, I will still conduct a Fisher’s Exact test which is a permutation test and check whether the Chi-squared test above approximate the actual sampling distribution.

Fisher’s Exact Test - A Permutation Test

The Fisher’s Exact test is a form of Permutation test considered as a sanity check to see if the association between baseline and follow up is from pure random chance. We do this by calculating all possible values of the test statistic under all possible rearrangements of the observed data points. This test is similar to the Chi-square test in terms of hypothesis and interpretation of the results.

Model Building - Association of Baseline and Follow up QES

The dataset below was used in model building. It was constructed from counting the frequencies of each QES transitions from baseline to follow up. By default, the Apical is considered as the reference QES category, thus all model interpretations are based on it.

Table 4: Frequencies of Transitions Between QES

mult_qes_right = tibble(qes_baseline = c("NE", "Diffuse", "Apical", "Senile", 
                                    "rCPFE", "oCPFE", "Vanishing"),
                   Diffuse = c(51960,121251,6169,34015,991,4113,57),
                   NE = c(2146336,0,0,0,0,0,0),
                   Senile = c(74111,21694,2319,48456,3153,13346,15),
                   Apical = c(14724,30522,61096,14499,9464,4945,1545),
                   rCPFE = c(25493,1366,3627,4253,17811,4513,24),
                   oCPFE = c(42498,1955,943,8377,3994,14098,13),
                   Vanishing = c(30,760,6369,52,19,19,7549)) %>%
  mutate(qes_baseline = factor(qes_baseline, levels = c("Diffuse","NE","Senile","Apical",
                                                        "rCPFE","oCPFE","Vanishing")))

mult_qes_right %>%
  mutate_at(c("Apical","Diffuse","Senile","rCPFE","oCPFE","Vanishing","NE"), funs(prettyNum(., big.mark = ","))) %>%
  kable()
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or
##   lambdas:
## 
## # Simple named list: list(mean = mean, median
##   = median)
## 
## # Auto named with `tibble::lst()`:
##   tibble::lst(mean, median)
## 
## # Using lambdas list(~ mean(., trim = .2), ~
##   median(., na.rm = TRUE))
qes_baseline Diffuse NE Senile Apical rCPFE oCPFE Vanishing
NE 51,960 2,146,336 74,111 14,724 25,493 42,498 30
Diffuse 121,251 0 21,694 30,522 1,366 1,955 760
Apical 6,169 0 2,319 61,096 3,627 943 6,369
Senile 34,015 0 48,456 14,499 4,253 8,377 52
rCPFE 991 0 3,153 9,464 17,811 3,994 19
oCPFE 4,113 0 13,346 4,945 4,513 14,098 19
Vanishing 57 0 15 1,545 24 13 7,549
# grouping by qes at baseline and counting the frequencies
# copd_wewant %>%
  # filter(qes_baseline == "Vanishing" & qes_follow_up == "Vanishing")

# creating the dataframe
mult_qes = tibble(qes_baseline = c("NE", "Diffuse", "Apical", "Senile", 
                                    "rCPFE", "oCPFE", "Vanishing"),
                   Diffuse = c(51960,121251,6169,34015,991,4113,57),
                   NE = c(2146336,0,0,0,0,0,0),
                   Senile = c(74111,21694,2319,48456,3153,13346,15),
                   Apical = c(14724,30522,61096,14499,9464,4945,1545),
                   rCPFE = c(25493,1366,3627,4253,17811,4513,24),
                   oCPFE = c(42498,1955,943,8377,3994,14098,13),
                   Vanishing = c(30,760,6369,52,19,19,7549)) %>%
  mutate(qes_baseline = factor(qes_baseline, levels = c("Diffuse","NE","Senile","Apical",
                                                        "rCPFE","oCPFE","Vanishing")))

Multinomial Log-linear Model for QES

Assuming observations are independent, meaning correlation between individual measures per patient is irrelevant in the study, we can model QES at follow up (response variable) using the Nominal regression model (from the generalized linear model) that looks at nominal responses (where no order is assigned), which is what our response is.

  • The code below also builds the nominal regression model for the association between QES at baseline and QES at follow up. Reference categories for the model will be Diffuse for both the QES at baseline variables and QES at follow up.

Note that any of the categories can be used as the reference category. Regardless of the chosen reference category, the model will fit equally well and achieve the same likelihood.

This model is flawed in that it does not treat this dataset as longitidinal data. It considers each observation as independent, therefore does not account for within-subject correlations that might be present due to multiple measures taken. It merely studies QES transitions from baseline to follow up.

Table 5: Model estimates and (Standard Errors) for QES are shown in the table below

# confidence intervals for each QES at follow up
ci_qes_nom = confint(qes_nom)

ci_qes_NE = ci_qes_nom[,,"qes_NE"] %>% 
  as.tibble() %>% 
  rename(ci_lb = "2.5 %",
         ci_ub = "97.5 %") %>% 
  mutate(Response = "qes_NE",
         term = c("ci_intercept", "ci_NE", "ci_senile", 
                  "ci_apical", "ci_rcpfe", "ci_ocpfe", 
                  "ci_vanishing"))
## Warning: `as.tibble()` was deprecated in tibble 2.0.0.
## ℹ Please use `as_tibble()` instead.
## ℹ The signature and semantics have changed,
##   see `?as_tibble`.
ci_qes_senile = ci_qes_nom[,,"qes_senile"] %>% 
  as.tibble() %>% 
  rename(ci_lb = "2.5 %",
         ci_ub = "97.5 %") %>%
  mutate(Response = "qes_senile",
         term = c("ci_intercept", "ci_NE", "ci_senile", 
                  "ci_apical", "ci_rcpfe", "ci_ocpfe", 
                  "ci_vanishing"))

ci_qes_apical = ci_qes_nom[,,"qes_apical"] %>% 
  as.tibble() %>% 
  rename(ci_lb = "2.5 %",
         ci_ub = "97.5 %") %>% 
  mutate(Response = "qes_apical",
         term = c("ci_intercept", "ci_NE", "ci_senile", 
                  "ci_apical", "ci_rcpfe", "ci_ocpfe", 
                  "ci_vanishing"))

ci_qes_rCPFE = ci_qes_nom[,,"qes_rcpfe"] %>% 
  as.tibble() %>% 
  rename(ci_lb = "2.5 %",
         ci_ub = "97.5 %") %>% 
  mutate(Response = "qes_rcpfe",
         term = c("ci_intercept", "ci_NE", "ci_senile", 
                  "ci_apical", "ci_rcpfe", "ci_ocpfe", 
                  "ci_vanishing"))

ci_qes_oCPFE = ci_qes_nom[,,"qes_ocpfe"] %>% 
  as.tibble() %>% 
  rename(ci_lb = "2.5 %",
         ci_ub = "97.5 %") %>% 
  mutate(Response = "qes_ocpfe",
         term = c("ci_intercept", "ci_NE", "ci_senile", 
                  "ci_apical", "ci_rcpfe", "ci_ocpfe", 
                  "ci_vanishing"))

ci_qes_vanishing = ci_qes_nom[,,"qes_vanishing"] %>% 
  as.tibble() %>% 
  rename(ci_lb = "2.5 %",
         ci_ub = "97.5 %") %>% 
  mutate(Response = "qes_vanishing",
         term = c("ci_intercept", "ci_NE", "ci_senile", 
                  "ci_apical", "ci_rcpfe", "ci_ocpfe", 
                  "ci_vanishing"))


# Dealing with standard errors

qes_sum$standard.errors %>% 
  as.tibble() %>% 
  mutate(Response = c("qes_NE","qes_senile", "qes_apical", "qes_rcpfe", 
                   "qes_ocpfe", "qes_vanishing")) %>%
  rename("se_intercept" = "(Intercept)",
         "se_NE" = qes_baselineNE,
         "se_senile" = qes_baselineSenile,
         "se_apical" = qes_baselineApical,
         "se_rcpfe" = qes_baselinerCPFE,
         "se_ocpfe" = qes_baselineoCPFE,
         "se_vanishing" = qes_baselineVanishing) %>% 
  cbind(., coefficients(qes_nom)) %>%
  rename("intercept" = "(Intercept)",
         "bas_NE" = qes_baselineNE,
         "bas_senile" = qes_baselineSenile,
         "bas_apical" = qes_baselineApical,
         "bas_rcpfe" = qes_baselinerCPFE,
         "bas_ocpfe" = qes_baselineoCPFE,
         "bas_vanishing" = qes_baselineVanishing) %>%
  mutate(intercept = paste0(round(intercept,2)," (", 
                            round(se_intercept,2), ")"),
         bas_NE = paste0(round(bas_NE,2)," (", 
                              round(se_NE,2),")"),
         bas_senile = paste0(round(bas_senile,2)," (", 
                         round(se_senile,2),")"),
         bas_apical = paste0(round(bas_apical,2)," (", 
                            round(se_apical,2),")"),
         bas_rcpfe = paste0(round(bas_rcpfe,2)," (", 
                            round(se_rcpfe,2),")"),
         bas_ocpfe = paste0(round(bas_ocpfe,2)," (", 
                             round(se_ocpfe,2),")"),
         bas_vanishing = paste0(round(bas_vanishing,2)," (", 
                                round(se_vanishing,2),")")) %>% 
  dplyr::select(Response, intercept, bas_NE, bas_senile, bas_apical, bas_rcpfe,
         bas_ocpfe, bas_vanishing) %>% 
  kable()
Response intercept bas_NE bas_senile bas_apical bas_rcpfe bas_ocpfe bas_vanishing
qes_NE qes_NE -8.91 (0.25) 12.63 (0.25) -0.05 (0.54) -4.54 (10.64) 0.2 (2.45) -2.61 (4.99) -4.26 (0)
qes_senile qes_senile -1.72 (0.01) 2.07 (0.01) 2.08 (0.01) 0.72 (0.03) 2.84 (0.04) 2.91 (0.02) 0.54 (0.28)
qes_apical qes_apical -1.38 (0.01) 0.12 (0.01) 0.53 (0.01) 3.67 (0.01) 3.61 (0.03) 1.59 (0.02) 4.71 (0.14)
qes_rcpfe qes_rcpfe -4.49 (0.03) 3.77 (0.03) 2.4 (0.03) 3.97 (0.03) 7.34 (0.04) 4.59 (0.03) 3.57 (0.26)
qes_ocpfe qes_ocpfe -4.13 (0.02) 3.93 (0.02) 2.72 (0.03) 2.3 (0.04) 5.5 (0.04) 5.36 (0.03) 2.5 (0.34)
qes_vanishing qes_vanishing -4.95 (0.03) -27.4 (0) -1.39 (0.13) 4.98 (0.04) -1.04 (0.63) -3.21 (0.93) 9.92 (0.14)
# Goodness of fit 
pihat = predict(qes_nom, type = "probs")
m = rowSums(mult_qes[,2:8])
res_pearson = (mult_qes[, 2:8] - pihat * m) / sqrt(pihat * m) # pearson residual

# Goodness of fit
G_stat = sum(res_pearson^2) # Generalized pearson Chi-Squared
pval_G = 1 - pchisq(G_stat, qes_nom$edf)

QES Model Interpretations:

We have six indicator/dummy variables all with Diffuse as their reference category:

X1: No Emphysema at baseline

X2: Senile QES at baseline

X3: Apical QES at baseline

X4: rCPFE QES at baseline

X5: oCPFE QES at baseline

X6: Vanishing Lung QES at baseline

Comparing the NE Vs Diffuse QES at follow up:

  • The odds ratio of NE is 3.0660526^{5} (95% CI: [1.8869834^{5}, 4.9818554^{5}]) in patients with No Emphysema at baseline.

  • The odds ratio of NE is 0.95 (95% CI: [0.33, 2.73]) in patients with Senile at baseline.

  • The odds ratio of NE is 0.01 (95% CI: [0, 1.2060164^{7}]) in patients with Apical at baseline.

  • The odds ratio of NE is 1.23 (95% CI: [0.01, 148.25]) in patients with rCPFE at baseline.

  • The odds ratio of NE is 0.07 (95% CI: [0, 1283.95]) in patients with oCPFE at baseline.

  • The odds ratio of NE is 0.01 (95% CI: [0.01, 0.01]) in patients with Vanishing Lungs at baseline.

Comparing the Senile Vs Diffuse QES at follow up:

  • The odds ratio of senile is 7.96 (95% CI: [7.82, 8.11]) in patients with No Emphysema at baseline.

  • The odds ratio of senile is 7.96 (95% CI: [7.81, 8.13]) in patients with Senile at baseline.

  • The odds ratio of senile is 2.05 (95% CI: [1.95, 2.16]) in patients with Apical at baseline.

  • The odds ratio of senile is 17.09 (95% CI: [15.9, 18.37]) in patients with rCPFE at baseline.

  • The odds ratio of senile is 18.3 (95% CI: [17.62, 19.01]) in patients with oCPFE at baseline.

  • The odds ratio of senile is 1.71 (95% CI: [0.98, 2.98]) in patients with Vanishing Lungs at baseline.

Comparing the Apical Vs Diffuse QES at follow up:

  • The odds ratio of Apical is 1.13 (95% CI: [1.11, 1.16]) in patients with No Emphysema at baseline.

  • The odds ratio of Apical is 1.7 (95% CI: [1.66, 1.74]) in patients with Senile at baseline.

  • The odds ratio of Apical is 39.28 (95% CI: [38.15, 40.43]) in patients with Apical at baseline.

  • The odds ratio of Apical is 37.07 (95% CI: [34.71, 39.58]) in patients with rCPFE at baseline.

  • The odds ratio of Apical is 4.89 (95% CI: [4.68, 5.11]) in patients with oCPFE at baseline.

  • The odds ratio of Apical is 111.43 (95% CI: [84.73, 146.55]) in patients with Vanishing Lungs at baseline.

Comparing the rCPFE Vs Diffuse QES at follow up:

  • The odds ratio of rCPFE is 43.43 (95% CI: [41.09, 45.91]) in patients with No Emphysema at baseline.

  • The odds ratio of rCPFE is 11.06 (95% CI: [10.39, 11.77]) in patients with Senile at baseline.

  • The odds ratio of rCPFE is 52.94 (95% CI: [49.5, 56.62]) in patients with Apical at baseline.

  • The odds ratio of rCPFE is 1540.63 (95% CI: [1418.44, 1673.34]) in patients with rCPFE at baseline.

  • The odds ratio of rCPFE is 98.83 (95% CI: [92.32, 105.79]) in patients with oCPFE at baseline.

  • The odds ratio of rCPFE is 35.48 (95% CI: [21.38, 58.89]) in patients with Vanishing Lungs at baseline.

Comparing the oCPFE Vs Diffuse QES at follow up:

  • The odds ratio of oCPFE is 51.01 (95% CI: [48.69, 53.44]) in patients with No Emphysema at baseline.

  • The odds ratio of oCPFE is 15.16 (95% CI: [14.41, 15.95]) in patients with Senile at baseline.

  • The odds ratio of oCPFE is 9.94 (95% CI: [9.17, 10.78]) in patients with Apical at baseline.

  • The odds ratio of oCPFE is 244.03 (95% CI: [224.83, 264.88]) in patients with rCPFE at baseline.

  • The odds ratio of oCPFE is 213.27 (95% CI: [201.5, 225.72]) in patients with oCPFE at baseline.

  • The odds ratio of oCPFE is 12.21 (95% CI: [6.28, 23.76]) in patients with Vanishing Lungs at baseline.

Comparing the Vanishing Lung Vs Diffuse QES at follow up:

  • The odds ratio of Vanishing is 0 (95% CI: [0, 0]) in patients with No Emphysema at baseline.

  • The odds ratio of Vanishing is 0.25 (95% CI: [0.19, 0.32]) in patients with Senile at baseline.

  • The odds ratio of Vanishing is 144.91 (95% CI: [134.32, 156.33]) in patients with Apical at baseline.

  • The odds ratio of Vanishing is 0.35 (95% CI: [0.1, 1.21]) in patients with rCPFE at baseline.

  • The odds ratio of Vanishing is 0.04 (95% CI: [0.01, 0.25]) in patients with oCPFE at baseline.

  • The odds ratio of Vanishing is 2.028392^{4} (95% CI: [1.535959^{4}, 2.678699^{4}]) in patients with Vanishing Lungs at baseline.

Model Building - Studying Prediction of QES

Linear Discriminant Analysis (P = 1)

The Linear Discriminant Analysis (LDA) algorithm starts by finding directions that maximizes the separation between classes, then use these directions to predict the class of individuals. These directions, called linear discriminants, are a linear combinations of predictor variables.

LDA generates posterior probabilities that a certain observations belongs to a certain QES categories given the predictor value for that observation. LDA assumes that observations within each QES category come from a normal distribution with a category-specific mean vector and a common covariance matrix for all categories.

# splitting dataset
training = copd_wewant[1:2246835,]
testing = copd_wewant[2246836:2808544,]

# LDA using MASS
lda.fit = lda(qes_follow_up~qes_baseline, data = training)

# evaluate the test set performance using ROC
lda.pred = data.frame(predict(lda.fit, newdata = testing))
lda.pred = cbind(testing, lda.pred)

# lda model accuracy
lda.pred %>% 
  summarise(score = mean(class == qes_follow_up))
##       score
## 1 0.8973472

Table 8: Model Output of Count of correctly classified Vs misclassfied Responses

# visualization
lda.pred %>%
  count(class, qes_follow_up) %>%
  rename(Predicted = class, Observed = qes_follow_up, Count = n) %>% 
  dplyr::select(Observed, Predicted, Count) %>%
  group_by(Observed, Predicted) %>%
  arrange(-Count) %>%
  mutate_at(c("Count"), funs(prettyNum(., big.mark = ","))) %>%
  kable()
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or
##   lambdas:
## 
## # Simple named list: list(mean = mean, median
##   = median)
## 
## # Auto named with `tibble::lst()`:
##   tibble::lst(mean, median)
## 
## # Using lambdas list(~ mean(., trim = .2), ~
##   median(., na.rm = TRUE))
Observed Predicted Count
NE NE 456,149
Diffuse Diffuse 26,683
Apical Apical 13,458
Diffuse NE 9,755
Apical Diffuse 8,808
Diffuse Senile 7,622
Senile NE 7,344
Apical NE 5,453
Senile Senile 5,431
Apical Senile 3,107
rCPFE NE 2,270
oCPFE NE 2,247
Senile Diffuse 2,144
Vanishing Apical 1,458
Diffuse Apical 1,201
oCPFE oCPFE 1,201
Apical oCPFE 1,163
oCPFE Senile 926
Senile oCPFE 844
rCPFE rCPFE 762
Apical rCPFE 757
rCPFE oCPFE 529
Vanishing Vanishing 364
oCPFE Diffuse 305
rCPFE Senile 287
Senile Apical 284
Diffuse oCPFE 218
rCPFE Apical 199
rCPFE Diffuse 194
Apical Vanishing 171
oCPFE Apical 109
Senile rCPFE 83
Vanishing Diffuse 82
oCPFE rCPFE 72
Diffuse rCPFE 11
Vanishing NE 6
Senile Vanishing 5
Diffuse Vanishing 4
rCPFE Vanishing 2
oCPFE Vanishing 1

Prediction of Percent Number of QES Pixels at Follow up

Table 9: DataFrame for number of pixels per subject

The DataFrame below was used to predict the percent number of Emphysema pixels at follow up as it relates to the individual QES and their count per subject at baseline. I have 7 features and percent_pixel as the continuous response variable. There are 118 patients in this dataset as a total of 8 patients (totaling 126 from the orifinal dataset) only experienced the transition from No Emphysema to No Emphysema, therefore, they were eliminated from this modeling section. The percent number of Emphysema pixels at follow up is defined as the number of pixels which resulted into actual QES divided by the total number of recorded pixels, which also includes No Emphysema pixels.80% of the dataset was used to train the model while 20% was used to evaluate model performance.

# total number of QES + NE
qes_ne = copd_wewant %>%
  dplyr::select(-c(x:sLTP_follow_up)) %>%
  group_by(idno) %>%
  summarise(total_pixel = n())
qes_ne = qes_ne %>% 
  filter(!(idno %in% c("4021300","5012200","5022600","7011600",
                    "7016400","7017800","7018400","8021400")))

# total number of QES
qes_df = copd_wewant %>%
  filter(qes_follow_up != "NE") %>% 
  dplyr::select(-c(x:sLTP_follow_up)) %>%
  group_by(idno) %>%
  summarise(qes_pixel = n())

# Number of unique QES per patient
uni_df = copd_wewant %>%
  dplyr::select(-c(x:sLTP_follow_up)) %>%
  group_by(idno, qes_baseline) %>%
  summarise(uni = n()) %>%
  pivot_wider(
    names_from = "qes_baseline",
    values_from = "uni") %>%
  mutate(
    Vanishing = replace_na(Vanishing,0),
    Apical = replace_na(Apical,0),
    Diffuse = replace_na(Diffuse,0),
    Senile = replace_na(Senile,0),
    oCPFE = replace_na(oCPFE,0),
    rCPFE = replace_na(rCPFE,0)
  )
## `summarise()` has grouped output by 'idno'.
## You can override using the `.groups` argument.
uni_df = uni_df %>%
  as.tibble() %>% 
  filter(!(idno %in% c("4021300","5012200","5022600","7011600",
                    "7016400","7017800","7018400","8021400")))

# combining both dataframes then normalize
volume_total = 
  left_join(uni_df, qes_ne, by = "idno") %>% 
  mutate(total_pixel = as.numeric(total_pixel)) %>%
  mutate(
    NE = round(NE/total_pixel, 2),
    Vanishing = round(Vanishing/total_pixel, 2),
    Apical = round(Apical/total_pixel, 2),
    Diffuse = round(Diffuse/total_pixel, 2),
    Senile = round(Senile/total_pixel, 2),
    oCPFE = round(oCPFE/total_pixel, 2),
    rCPFE = round(rCPFE/total_pixel, 2))

# combining with total qes column then calculating response variable
volume_df = 
  left_join(volume_total, qes_df, by = "idno") %>% 
  mutate(
    percent_pixel = round(qes_pixel/total_pixel, 2)
  )


volume_df2 = volume_df %>% head(10)
volume_df2 %>% kable()
idno NE Vanishing Apical Diffuse Senile oCPFE rCPFE total_pixel qes_pixel percent_pixel
4010100 0.64 0.02 0.15 0.04 0.07 0.01 0.08 26384 10313 0.39
4010700 1.00 0.00 0.00 0.00 0.00 0.00 0.00 14162 446 0.03
4010800 0.99 0.00 0.00 0.00 0.00 0.01 0.00 12992 845 0.07
4011000 0.74 0.00 0.00 0.08 0.12 0.05 0.01 32308 15756 0.49
4011200 1.00 0.00 0.00 0.00 0.00 0.00 0.00 8698 272 0.03
4011700 1.00 0.00 0.00 0.00 0.00 0.00 0.00 13748 100 0.01
4012000 0.18 0.00 0.29 0.26 0.16 0.04 0.08 18249 15466 0.85
4012200 0.94 0.00 0.00 0.00 0.02 0.03 0.01 15750 3774 0.24
4012500 0.96 0.00 0.00 0.00 0.01 0.01 0.03 10657 633 0.06
4013200 0.99 0.00 0.00 0.00 0.00 0.00 0.00 12978 1755 0.14
# changing the dataset
volume_df1 = volume_df %>% as.tibble() %>% 
  dplyr::select(-c(idno,total_pixel,qes_pixel))

# splitting the dataset
training = volume_df1[1:94,]
testing = volume_df1[95:118,]
# matrix of predictors for training
x = model.matrix(percent_pixel~.,training)[,-1]
# vector of response 
y = training$percent_pixel

# matrix of predictors for testing
x2 = model.matrix(percent_pixel~.,testing)[,-1]
# vector of response 
y2 = testing$percent_pixel

Since this is a regression problem, I will first check for linear model assumptions such as linearity between features and response or if the response is normally distributed.

theme1 <- trellis.par.get()
theme1$plot.symbol$col <- rgb(.2, .4, .2, .5)
theme1$plot.symbol$pch <- 16
theme1$plot.line$col <- rgb(.8, .1, .1, 1)
theme1$plot.line$lwd <- 2
theme1$strip.background$col <- rgb(.0, .2, .6, .2)
trellis.par.set(theme1)
featurePlot(x, y, plot = "scatter", labels = c("","Y"),
            type = c("p"), layout = c(4, 2))

As seen on the plot, the linearity assumption is not met, therefore I will explore non-linear models, i.e models that are more flexible such as the Generalized Additive Model (GAM), the Multivariate Adaptive Regression Spline (MARS), and a Tree-based method which will be performed in Python. The (Root) Mean Squared Error will be the evaluation metric. Both GAM and MARS will be 5-fold cross-validated to prevent overfitting and due to the fact that we have limited training examples.

Generalized Additive Model (GAM)

GAM fits a generalized Additive Model (GAM) to the data, the term ‘GAM’ being take to include any quadratically penalized Generalized Linear Model (GLM) and a variety of other models estimated by a quadratically penalized likelihood type of approach. GAM can be thought of as an extension of the Multiple Linear Regression in that it provides a general framework for extending a standard linear model by allowing non-linear functions of each of the variables, while maintaining additivity. The main limitation of the GAM model is that the model is restricted to be additive. With many variables, important interaction can be missed.

# fitting a GAM model
ctrl1 = trainControl(method = "repeatedcv", number = 5)

set.seed(2)
gam.fit = train(x, y,
                method = "gam",
                tuneGrid = data.frame(method = "GCV.Cp", select = c(TRUE,FALSE)),
                trControl = ctrl1)
## Loading required package: mgcv
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.8-41. For overview type 'help("mgcv-package")'.
## 
## Attaching package: 'mgcv'
## The following object is masked from 'package:nnet':
## 
##     multinom
## The following object is masked from 'package:arsenal':
## 
##     negbin
gam.fit$bestTune
##   select method
## 2   TRUE GCV.Cp

This GAM model has a 0.8429968 r-squared.

Model Evaluation using testing data:

# Predicting solubilty from the test dataset
predict_gam = predict(gam.fit, newdata = x2)

# TEST ERROR
mse(y2, predict_gam)
## [1] 0.02336465

For model evaluation, the Mean Squared Error for the GAM model was found to be 0.0233647

Multivariate Adaptive Regression Spline (MARS)

For the Multivariate Adaptive Regression Spline, I will use the caret package since no flexibility is lost at this stage. Two tuning parameters degree of interaction and number of retained terms are defined. With the degree tuning parameter, we are allowed to have the sum of functions and also the product of functions (interaction terms). With the nprune tuning parameter, we can determine how many predictors are allowed in the final model.

# fitting a MARS model
mars_grid = expand.grid(degree = 1:2, # 1: sum of functions and 2: product of functions
                        nprune = 2:7) # How many predictors are allowed in the final model

set.seed(2)
mars.fit = train(x, y,
                 method = "earth",
                 tuneGrid = mars_grid,
                 trControl = ctrl1)
ggplot(mars.fit)

coef(mars.fit$finalModel)
## (Intercept)  h(NE-0.87)  h(0.87-NE) 
##   0.2649058  -1.8995381   0.8633877

Thus, this model chose to only include NE and Diffuse as they the most important variables in determining the percent emphysema pixels at follow up. This MARS model has a 0.9042869 r-squared.

Model Evaluation using testing data:

# Predicting solubilty from the test dataset
predict_mars = predict(mars.fit, newdata = x2)

# TEST ERROR
mse(y2, predict_mars)
## [1] 0.004487725

For model evaluation, the Mean Squared Error for the MARS model was found to be 0.0044877

the plot below shows the best cross-validated model between GAM and MARS, which is the model with the lowest cross-validated Root Mean Squared Error.

bwplot(resamples(list(mars = mars.fit,
                      gam = gam.fit)), metric = "RMSE")