A post about the presentation I gave over Twitter for the American Public Health Association's Physical Activity Section.
Visualizations
Analysis
Published
April 30, 2021
I thought I'd talk about the analyses I conducted for my submission to the American Public Health Association Physical Activity Section's Twitter Conference. It was a fun opportunity to disseminate some analyses I conducted using public data from the County Health Rankings & Roadmaps to examine which factors are associated with leisure-time physical activity (LTPA) in California counties from 2016 to 2020. If you are interested in the presentation itself, you can follow it here.
Physical activity is important because it is related to many physical and mental health conditions. It is also a health behavior that can be modified somewhat more easily than other health behaviors. While not as beneficial as extended periods of exercise, even walking for leisure can benefit one's health. I was predominantly interested in California because I wanted to know how much variation there was between its counties. For instance, areas like San Francisco County and Los Angeles County may be seen as hubs for cultures of being physically active, but what about counties throughout Central California? I'm also interested in LTPA engagement because several social determinants of health shape how much LTPA individuals can engage in. The social determinant I'm most interested in is the role that access to recreational facilities and parks plays in counties' LTPA engagement. Since I was interested in looking at variation between counties while also examining the longitudinal association between access and LTPA, I decided to fit a two-level multilevel model with time (level 1) nested within counties (level 2).
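In multilevel notation, this is a random-intercept growth model. A generic sketch of its form (my notation here, not taken from the presentation itself) is:

$$
\begin{aligned}
\text{Level 1:} \quad & \text{LTPA}_{ti} = \beta_{0i} + \beta_{1i}\,\text{Year}_{ti} + e_{ti}, \qquad e_{ti} \sim N(0, \sigma^2) \\
\text{Level 2:} \quad & \beta_{0i} = \gamma_{00} + u_{0i}, \qquad \beta_{1i} = \gamma_{10}, \qquad u_{0i} \sim N(0, \tau_{00})
\end{aligned}
$$

where $t$ indexes years (2016–2020) and $i$ indexes California counties; only the intercept varies across counties.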
Packages used
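The attach messages and warnings below come from loading the packages used throughout the post. A minimal sketch of the library calls implied by that output (not necessarily the exact chunk I ran) looks like this:

```r
library(tidyverse)    # dplyr, ggplot2, tidyr, stringr, purrr, readr, etc.
library(inspectdf)    # quick data-frame inspection
library(psych)        # descriptive statistics
library(lme4)         # multilevel (mixed-effects) models
library(lmerTest)     # Satterthwaite p-values for lmer() fixed effects
library(optimx)       # additional optimizers for lmer()
library(ggmap)        # basemaps
library(maps)         # county-level map polygons
library(RColorBrewer) # color palettes
library(ggrepel)      # non-overlapping text labels
library(gganimate)    # animated ggplots
library(transformr)   # smooth transitions for gganimate
```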
Warning: package 'ggplot2' was built under R version 4.3.3
Warning: package 'purrr' was built under R version 4.3.3
Warning: package 'stringr' was built under R version 4.3.2
Warning: package 'lubridate' was built under R version 4.3.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Warning: package 'inspectdf' was built under R version 4.3.2
Warning: package 'psych' was built under R version 4.3.1
Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':
%+%, alpha
Warning: package 'lme4' was built under R version 4.3.1
Loading required package: Matrix
Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':
expand, pack, unpack
Warning: package 'lmerTest' was built under R version 4.3.3
Attaching package: 'lmerTest'
The following object is masked from 'package:lme4':
lmer
The following object is masked from 'package:stats':
step
Warning: package 'optimx' was built under R version 4.3.3
Warning: package 'ggmap' was built under R version 4.3.3
ℹ Google's Terms of Service: <https://mapsplatform.google.com>
Stadia Maps' Terms of Service: <https://stadiamaps.com/terms-of-service/>
OpenStreetMap's Tile Usage Policy: <https://operations.osmfoundation.org/policies/tiles/>
ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
Warning: package 'maps' was built under R version 4.3.3
Attaching package: 'maps'
The following object is masked from 'package:purrr':
map
Warning: package 'RColorBrewer' was built under R version 4.3.1
Warning: package 'ggrepel' was built under R version 4.3.3
Warning: package 'gganimate' was built under R version 4.3.3
Warning: package 'transformr' was built under R version 4.3.3
Functions
Before beginning, I created a simple function to compute the intraclass correlation coefficient (ICC), along with a function to pull the five years of County Health Rankings data. For a random-intercept model, the ICC is the between-county variance divided by the total variance (the between-county variance plus the within-county residual variance). It tells you how much of the variation in the outcome can be attributed to differences between counties. I only tested random-intercept models because I was interested in the variation in counties' LTPA engagement, and there were not enough time points per county to estimate a separate slope for each county (a random-slopes model).
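A sketch of what such an ICC helper could look like, using ICC = τ₀₀ / (τ₀₀ + σ²); the function name is illustrative rather than the exact code from the post:

```r
# ICC for a random-intercept lmer model: between-group variance over total variance
icc_lmer <- function(model) {
  vc <- as.data.frame(lme4::VarCorr(model))      # variance components
  between <- vc$vcov[vc$grp != "Residual"][1]    # between-county (intercept) variance
  within  <- vc$vcov[vc$grp == "Residual"]       # within-county (residual) variance
  between / (between + within)
}
```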
Warning: package 'curl' was built under R version 4.3.3
Using libcurl 8.3.0 with Schannel
Attaching package: 'curl'
The following object is masked from 'package:readr':
parse_date
I also made some slight changes to the data. The first was to drop the state-level estimates and keep only the county-level estimates. I also wanted to treat year as a continuous variable in my models while keeping a factor version of the year variable as well. Then, after filtering down to California counties, I used the str_replace_all function from the stringr package to remove the word "County" from each county name, which made it easier to join with the map data from the maps package. Lastly, I converted the county names to title case, also to make the join easier.
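A sketch of those cleaning steps; the input object (chr_data) and column names (state, county, year) are assumptions about the County Health Rankings extract, not necessarily the names in the original code:

```r
ca <- chr_data %>%
  filter(!is.na(county)) %>%                            # drop the state-level rows
  filter(state == "California") %>%                     # keep only California counties
  mutate(
    year_num = as.numeric(as.character(year)),          # continuous year for the models
    year     = factor(year),                            # keep a factor version too
    county   = str_replace_all(county, " County", ""),  # strip the " County" suffix
    county   = str_to_title(county)                     # title case to match the map data
  )
```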
Models
For the first model, I was interested in whether LTPA engagement increased across all California counties from 2016 to 2020. From the output below, it appears that LTPA in California actually decreased over that period. It's also worth noting that lmerTest and lme4 both export an lmer function; by namespacing them with two colons, you can see that the summary output differs slightly.
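A sketch of the call that produces the summary below; the formula matches the output, the object name is just a placeholder, and REML = FALSE reflects the maximum likelihood fit that is reported:

```r
# Unconditional growth model: linear change in LTPA from 2016 to 2020,
# with a random intercept for each county
ltpa_year <- lme4::lmer(ltpa_percent ~ year_num + (1 | county_fips_code),
                        data = ca, REML = FALSE)
summary(ltpa_year)
```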
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: ltpa_percent ~ year_num + (1 | county_fips_code)
Data: ca
AIC BIC logLik deviance df.resid
1427.9 1442.5 -709.9 1419.9 286
Scaled residuals:
Min 1Q Median 3Q Max
-5.6854 -0.3709 0.0361 0.5377 1.7084
Random effects:
Groups Name Variance Std.Dev.
county_fips_code (Intercept) 7.187 2.681
Residual 5.174 2.275
Number of obs: 290, groups: county_fips_code, 58
Fixed effects:
Estimate Std. Error t value
(Intercept) 1923.40172 190.60225 10.091
year_num -0.91276 0.09445 -9.664
Correlation of Fixed Effects:
(Intr)
year_num -1.000
Along with the fixed effects, we also get the random effects: the differences between counties in LTPA engagement and the differences within counties over time. These show a good amount of variation between counties (τ₀₀ = 7.19) as well as a large amount of variation within each California county (σ² = 5.17). Using the ICC function, county differences accounted for 58% of the variation in LTPA engagement. Something that should be considered is the potential for heteroscedastic residual variance at level 1, and the residuals could also suggest spatial autocorrelation or clustering among these counties. Maybe I'll write something on those soon. But first, a quick check of that 58% figure, and then on to what was found for the Twitter Conference.
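Using the variance components reported in the output above (the same numbers the ICC helper would use):

```r
# ICC for the unconditional model: between-county variance / total variance
7.187 / (7.187 + 5.174)   # ≈ 0.58
```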
Warning: Some predictor variables are on very different scales: consider rescaling
With several predictors included as fixed effects, a likelihood ratio test was conducted to see whether adding them produced a significantly better-fitting model, and it did. It would arguably be more informative to test whether adding the one variable of interest, access, improves fit over a model that already includes the other social determinants of health. As can be seen here, the likelihood ratio test of adding only access still resulted in a significantly better-fitting model.
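A sketch of how those comparisons could be run with anova(); only ltpa_long_access appears later in the post, so the other object names here are placeholders:

```r
# LRT 1: unconditional growth model vs. model with all of the SDOH predictors
anova(ltpa_year, ltpa_long_access)

# LRT 2: model with the other SDOH predictors vs. the same model plus access
anova(ltpa_long_no_access, ltpa_long_access)
```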
```r
summary(ltpa_long_access)
```
Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
method [lmerModLmerTest]
Formula:
ltpa_percent ~ year_num + violent_crime + obesity_percent + median_household_income +
rural_percent + access_pa_percent + (1 | county_fips_code)
Data: ca
AIC BIC logLik deviance df.resid
1237.4 1270.4 -609.7 1219.4 281
Scaled residuals:
Min 1Q Median 3Q Max
-4.8388 -0.5546 0.0010 0.6179 2.4921
Random effects:
Groups Name Variance Std.Dev.
county_fips_code (Intercept) 1.286 1.134
Residual 3.139 1.772
Number of obs: 290, groups: county_fips_code, 58
Fixed effects:
Estimate Std. Error df t value
(Intercept) 1138.319788532 193.185926311 289.534084648 5.892
year_num -0.517087679 0.096244053 289.365968612 -5.373
violent_crime -0.001434185 0.001190066 82.732734919 -1.205
obesity_percent -0.602338792 0.043760246 262.840283632 -13.765
median_household_income 0.000004822 0.000014701 102.255262863 0.328
rural_percent -0.012178273 0.007671490 63.272282975 -1.587
access_pa_percent 0.027194696 0.012124625 153.692013407 2.243
Pr(>|t|)
(Intercept) 0.0000000106 ***
year_num 0.0000001597 ***
violent_crime 0.2316
obesity_percent < 0.0000000000000002 ***
median_household_income 0.7436
rural_percent 0.1174
access_pa_percent 0.0263 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) yer_nm vlnt_c obsty_ mdn_h_ rrl_pr
year_num -1.000
violent_crm 0.116 -0.118
obsty_prcnt 0.489 -0.494 -0.100
mdn_hshld_n 0.586 -0.589 0.203 0.452
rural_prcnt 0.260 -0.264 0.135 0.203 0.447
accss_p_prc -0.147 0.142 -0.018 0.054 -0.341 0.158
fit warnings:
Some predictor variables are on very different scales: consider rescaling
The model summary suggests that access was significantly associated with LTPA engagement. What stands out the most is that including the predictors left more variation within counties than between counties, so let's look into that more closely.
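A quick check using the variance components from the summary above (the same calculation the ICC helper performs):

```r
# ICC for the conditional model: between-county variance / total variance
1.286 / (1.286 + 3.139)   # ≈ 0.29
```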
The ICC now suggests that about 29% of the remaining variation is attributable to differences between counties. It is also helpful to look at all of this through visuals.
Visuals Prep
Below, we start by using the maps package to get county-level map data for the contiguous United States. The steps that follow make sure this data frame joins cleanly with the County Health Rankings data we prepared previously.
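A sketch of that preparation; the join key and object names are assumptions:

```r
# County polygons for the contiguous United States (long, lat, group, region, subregion)
county_map <- ggplot2::map_data("county") %>%
  filter(region == "california") %>%            # keep the California polygons
  mutate(county = str_to_title(subregion))      # match the title-cased CHR county names

# Join the polygons to the County Health Rankings data prepared earlier
ca_map <- left_join(county_map, ca, by = "county")
```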
Visualizing the model
One way to visualize the between-county variation in our final model (ltpa_long_access) is a caterpillar plot, which shows each county's estimated random intercept (its level-2 residual) for the outcome. From the visual, you can see the differences between, for example, Humboldt County and Tehama County.
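One way such a plot could be produced, using the dotplot method that lme4 provides for random-effects objects (the exact plotting code in the original may differ):

```r
# Extract the conditional modes (county-level intercept deviations) with their variances
re <- ranef(ltpa_long_access, condVar = TRUE)

# Caterpillar plot: one point with an interval per county; counties are labeled
# by FIPS code because that is the grouping factor in the model
lattice::dotplot(re)
```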