A post about the presentation I gave over Twitter for the American Public Health Association's Physical Activity Section.
Visualizations
Analysis
Published
April 30, 2021
I thought I'd talk about the analyses I conducted for my submission to the American Public Health Association Physical Activity Section's Twitter Conference. It was a fun opportunity to disseminate some analyses I conducted using public data from the County Health Rankings & Roadmaps to examine which factors are associated with leisure-time physical activity (LTPA) in California counties from 2016 to 2020. If you are interested in the presentation itself, you can follow it here.
Physical activity is important because it is related to many physical and mental health conditions. It is also a health behavior that can be modified somewhat more easily than other health behaviors. While not as beneficial as extended periods of exercise, even walking for leisure can benefit one's health. I was predominantly interested in California because I wanted to know how much variation there was between its counties. For instance, areas like San Francisco County and Los Angeles County may be seen as hubs for cultures of being physically active, but what about counties throughout Central California? I'm also interested in LTPA engagement because several social determinants of health shape how much LTPA individuals can engage in. The social determinant I'm most interested in is the role that access to recreational facilities and parks plays in counties' LTPA engagement. Since I was interested in looking at variation between counties while also examining the longitudinal association between access and LTPA, I decided to fit a two-level multilevel model with time (level 1) nested within counties (level 2).
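In multilevel notation, this is a random-intercept growth model. A generic sketch of its form (my notation here, not taken from the presentation itself) is:

$$
\begin{aligned}
\text{Level 1:} \quad & \text{LTPA}_{ti} = \beta_{0i} + \beta_{1i}\,\text{Year}_{ti} + e_{ti}, \qquad e_{ti} \sim N(0, \sigma^2) \\
\text{Level 2:} \quad & \beta_{0i} = \gamma_{00} + u_{0i}, \qquad \beta_{1i} = \gamma_{10}, \qquad u_{0i} \sim N(0, \tau_{00})
\end{aligned}
$$

where $t$ indexes years (2016–2020) and $i$ indexes California counties; only the intercept varies across counties.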
Packages used
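The attach messages and warnings below come from loading the packages used throughout the post. A minimal sketch of the library calls implied by that output (not necessarily the exact chunk I ran) looks like this:

```r
library(tidyverse)    # dplyr, ggplot2, tidyr, stringr, purrr, readr, etc.
library(inspectdf)    # quick data-frame inspection
library(psych)        # descriptive statistics
library(lme4)         # multilevel (mixed-effects) models
library(lmerTest)     # Satterthwaite p-values for lmer() fixed effects
library(optimx)       # additional optimizers for lmer()
library(ggmap)        # basemaps
library(maps)         # county-level map polygons
library(RColorBrewer) # color palettes
library(ggrepel)      # non-overlapping text labels
library(gganimate)    # animated ggplots
library(transformr)   # smooth transitions for gganimate
```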
Warning: package 'ggplot2' was built under R version 4.3.3
Warning: package 'purrr' was built under R version 4.3.3
Warning: package 'stringr' was built under R version 4.3.2
Warning: package 'lubridate' was built under R version 4.3.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Warning: package 'inspectdf' was built under R version 4.3.2
Warning: package 'psych' was built under R version 4.3.1
Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':
%+%, alpha
Warning: package 'lme4' was built under R version 4.3.1
Loading required package: Matrix
Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':
expand, pack, unpack
Warning: package 'lmerTest' was built under R version 4.3.3
Attaching package: 'lmerTest'
The following object is masked from 'package:lme4':
lmer
The following object is masked from 'package:stats':
step
Warning: package 'optimx' was built under R version 4.3.3
Warning: package 'ggmap' was built under R version 4.3.3
ℹ Google's Terms of Service: <https://mapsplatform.google.com>
Stadia Maps' Terms of Service: <https://stadiamaps.com/terms-of-service/>
OpenStreetMap's Tile Usage Policy: <https://operations.osmfoundation.org/policies/tiles/>
ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
Warning: package 'maps' was built under R version 4.3.3
Attaching package: 'maps'
The following object is masked from 'package:purrr':
map
Warning: package 'RColorBrewer' was built under R version 4.3.1
Warning: package 'ggrepel' was built under R version 4.3.3
Warning: package 'gganimate' was built under R version 4.3.3
Warning: package 'transformr' was built under R version 4.3.3
Functions
Before beginning, I created a simple function to compute the intraclass correlation coefficient (ICC), along with a function to pull the five years of County Health Rankings data. For a random-intercept model, the ICC is the between-county variance divided by the total variance (the between-county variance plus the within-county residual variance). It tells you how much of the variation in the outcome can be attributed to differences between counties. I only tested random-intercept models because I was interested in the variation in counties' LTPA engagement, and there were not enough time points per county to estimate a separate slope for each county (a random-slopes model).
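A sketch of what such an ICC helper could look like, using ICC = τ₀₀ / (τ₀₀ + σ²); the function name is illustrative rather than the exact code from the post:

```r
# ICC for a random-intercept lmer model: between-group variance over total variance
icc_lmer <- function(model) {
  vc <- as.data.frame(lme4::VarCorr(model))      # variance components
  between <- vc$vcov[vc$grp != "Residual"][1]    # between-county (intercept) variance
  within  <- vc$vcov[vc$grp == "Residual"]       # within-county (residual) variance
  between / (between + within)
}
```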
Warning: package 'curl' was built under R version 4.3.3
Using libcurl 8.3.0 with Schannel
Attaching package: 'curl'
The following object is masked from 'package:readr':
parse_date
I also made some slight changes to the data. The first was to drop the state-level estimates and keep only the county-level estimates. I also wanted to treat year as a continuous variable in my models while keeping a factor version of the year variable as well. Then, after filtering down to California counties, I used the str_replace_all function from the stringr package to remove the word "County" from each county name, which made it easier to join with the map data from the maps package. Lastly, I converted the county names to title case, also to make the join easier.
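A sketch of those cleaning steps; the input object (chr_data) and column names (state, county, year) are assumptions about the County Health Rankings extract, not necessarily the names in the original code:

```r
ca <- chr_data %>%
  filter(!is.na(county)) %>%                            # drop the state-level rows
  filter(state == "California") %>%                     # keep only California counties
  mutate(
    year_num = as.numeric(as.character(year)),          # continuous year for the models
    year     = factor(year),                            # keep a factor version too
    county   = str_replace_all(county, " County", ""),  # strip the " County" suffix
    county   = str_to_title(county)                     # title case to match the map data
  )
```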
Models
For the first model, I was interested in whether LTPA engagement increased across all California counties from 2016 to 2020. From the output below, it appears that LTPA in California actually decreased over that period. It's also worth noting that lmerTest and lme4 both export an lmer function; by namespacing them with two colons, you can see that the summary output differs slightly.
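A sketch of the call that produces the summary below; the formula matches the output, the object name is just a placeholder, and REML = FALSE reflects the maximum likelihood fit that is reported:

```r
# Unconditional growth model: linear change in LTPA from 2016 to 2020,
# with a random intercept for each county
ltpa_year <- lme4::lmer(ltpa_percent ~ year_num + (1 | county_fips_code),
                        data = ca, REML = FALSE)
summary(ltpa_year)
```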
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: ltpa_percent ~ year_num + (1 | county_fips_code)
Data: ca
AIC BIC logLik deviance df.resid
1427.9 1442.5 -709.9 1419.9 286
Scaled residuals:
Min 1Q Median 3Q Max
-5.6854 -0.3709 0.0361 0.5377 1.7084
Random effects:
Groups Name Variance Std.Dev.
county_fips_code (Intercept) 7.187 2.681
Residual 5.174 2.275
Number of obs: 290, groups: county_fips_code, 58
Fixed effects:
Estimate Std. Error t value
(Intercept) 1923.40172 190.60225 10.091
year_num -0.91276 0.09445 -9.664
Correlation of Fixed Effects:
(Intr)
year_num -1.000
Along with the fixed effects, we also get the random effects: the differences between counties in LTPA engagement and the differences within counties over time. These show a good amount of variation between counties (τ₀₀ = 7.19) as well as a large amount of variation within each California county (σ² = 5.17). Using the ICC function, county differences accounted for 58% of the variation in LTPA engagement. Something that should be considered is the potential for heteroscedastic residual variance at level 1, and the residuals could also suggest spatial autocorrelation or clustering among these counties. Maybe I'll write something on those soon. But first, a quick check of that 58% figure, and then on to what was found for the Twitter Conference.
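Using the variance components reported in the output above (the same numbers the ICC helper would use):

```r
# ICC for the unconditional model: between-county variance / total variance
7.187 / (7.187 + 5.174)   # ≈ 0.58
```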
Warning: Some predictor variables are on very different scales: consider rescaling
With several predictors included as fixed effects, a likelihood ratio test was conducted to see whether adding them produced a significantly better-fitting model, and it did. It would arguably be more informative to test whether adding the one variable of interest, access, improves fit over a model that already includes the other social determinants of health. As can be seen here, the likelihood ratio test of adding only access still resulted in a significantly better-fitting model.
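A sketch of how those comparisons could be run with anova(); only ltpa_long_access appears later in the post, so the other object names here are placeholders:

```r
# LRT 1: unconditional growth model vs. model with all of the SDOH predictors
anova(ltpa_year, ltpa_long_access)

# LRT 2: model with the other SDOH predictors vs. the same model plus access
anova(ltpa_long_no_access, ltpa_long_access)
```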
```r
summary(ltpa_long_access)
```
Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
method [lmerModLmerTest]
Formula:
ltpa_percent ~ year_num + violent_crime + obesity_percent + median_household_income +
rural_percent + access_pa_percent + (1 | county_fips_code)
Data: ca
AIC BIC logLik deviance df.resid
1237.4 1270.4 -609.7 1219.4 281
Scaled residuals:
Min 1Q Median 3Q Max
-4.8388 -0.5546 0.0010 0.6179 2.4921
Random effects:
Groups Name Variance Std.Dev.
county_fips_code (Intercept) 1.286 1.134
Residual 3.139 1.772
Number of obs: 290, groups: county_fips_code, 58
Fixed effects:
Estimate Std. Error df t value
(Intercept) 1138.319788532 193.185926311 289.534084648 5.892
year_num -0.517087679 0.096244053 289.365968612 -5.373
violent_crime -0.001434185 0.001190066 82.732734919 -1.205
obesity_percent -0.602338792 0.043760246 262.840283632 -13.765
median_household_income 0.000004822 0.000014701 102.255262863 0.328
rural_percent -0.012178273 0.007671490 63.272282975 -1.587
access_pa_percent 0.027194696 0.012124625 153.692013407 2.243
Pr(>|t|)
(Intercept) 0.0000000106 ***
year_num 0.0000001597 ***
violent_crime 0.2316
obesity_percent < 0.0000000000000002 ***
median_household_income 0.7436
rural_percent 0.1174
access_pa_percent 0.0263 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) yer_nm vlnt_c obsty_ mdn_h_ rrl_pr
year_num -1.000
violent_crm 0.116 -0.118
obsty_prcnt 0.489 -0.494 -0.100
mdn_hshld_n 0.586 -0.589 0.203 0.452
rural_prcnt 0.260 -0.264 0.135 0.203 0.447
accss_p_prc -0.147 0.142 -0.018 0.054 -0.341 0.158
fit warnings:
Some predictor variables are on very different scales: consider rescaling
The model summary suggests that access was significantly associated with LTPA engagement. What stands out the most is that including the predictors left more variation within counties than between counties, so let's look into that more closely.
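A quick check using the variance components from the summary above (the same calculation the ICC helper performs):

```r
# ICC for the conditional model: between-county variance / total variance
1.286 / (1.286 + 3.139)   # ≈ 0.29
```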
The ICC now suggests that about 29% of the remaining variation is attributable to differences between counties. It is also helpful to look at all of this through visuals.
Visuals Prep
Below, we start by using the maps package to get county-level map data for the contiguous United States. The steps that follow make sure this data frame joins cleanly with the County Health Rankings data we prepared previously.
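A sketch of that preparation; the join key and object names are assumptions:

```r
# County polygons for the contiguous United States (long, lat, group, region, subregion)
county_map <- ggplot2::map_data("county") %>%
  filter(region == "california") %>%            # keep the California polygons
  mutate(county = str_to_title(subregion))      # match the title-cased CHR county names

# Join the polygons to the County Health Rankings data prepared earlier
ca_map <- left_join(county_map, ca, by = "county")
```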
Visualizing the model
One way to visualize the between-county variation in our final model (ltpa_long_access) is a caterpillar plot, which shows each county's estimated random intercept (its level-2 residual) for the outcome. From the visual, you can see the differences between, for example, Humboldt County and Tehama County.
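One way such a plot could be produced, using the dotplot method that lme4 provides for random-effects objects (the exact plotting code in the original may differ):

```r
# Extract the conditional modes (county-level intercept deviations) with their variances
re <- ranef(ltpa_long_access, condVar = TRUE)

# Caterpillar plot: one point with an interval per county; counties are labeled
# by FIPS code because that is the grouping factor in the model
lattice::dotplot(re)
```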