Numerical Simulation to Determine the Association between the Coefficient of Determination (R2) in Individual and Ecological Studies

Publication Type report
School or College School of Medicine
Department Public Health Division
Project type Master of Statistics (MSTAT): Biostatistics Project
Author Yan, Bin
Title Numerical Simulation to Determine the Association between the Coefficient of Determination (R2) in Individual and Ecological Studies
Date 2022
Description The estimated coefficient of determination, R2, is a widely used summary statistic that quantifies the proportion of variance explained by a regression model and often serves as an important measure of the model's performance. Researchers report it in statistical analyses across many fields. Individual-level studies are well developed, and the corresponding individual R2 has proved useful. Ecological studies arise in many disciplines because researchers want to investigate individual-level behavior but can access only aggregate-level data easily and inexpensively. As ecological studies have expanded considerably across disciplines, ecological R2 is increasingly reported as a measure of explained variation. However, little work has been done on revising the definition of R2 for ecological studies, and the association between R2 in individual and ecological studies is unknown. Data simulation: The simulations were conducted in RStudio Version 1.2.1335 (2009-2019 RStudio, Inc.). In each of 300 simulation runs, two variables X and Y were generated from the normal distributions N(0,0.1) and N(1,0.1), respectively, each with 3000 (pseudo-)random numbers and an exact Pearson correlation coefficient between them, under a seed of 456456. The correlation ρ was specified before each simulation and varied from 0.05 to 0.95 in steps of 0.03 to obtain different individual R2 values. The individuals were randomly assigned to 30 mutually exclusive groups of equal size, 100 people each. After the individual dataset was standardized, averages of X and Y were computed for each group. Individual R-squared was measured by fitting a simple linear regression to the entire individual dataset. Ecological data are obtained at the aggregated group level by averaging the individuals within each group.
After the X and Y values were averaged across the 100 points in each randomly formed group, an ecological dataset was obtained and standardized, and an ecological R-squared was computed by simple linear regression on the aggregated dataset. Methods: This article summarizes the difficulties of constructing appropriate models, identifies some sources of bias in ecological research, and reviews the mathematical relationship among three types of correlations, all in an effort to avoid ecological bias and to make correct inferences about individual-level relationships. To address non-constant variance, weighted least squares was used in both linear and non-linear regression. Model diagnostic plots are used to evaluate the model assumptions. The Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are used to compare a set of statistical models; both penalize the number of parameters in the model. They are defined as $\mathrm{AIC} = 2\left[-\log L(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p, \hat{\sigma}^2 \mid Y) + K\right]$ and $\mathrm{BIC} = -2\log L(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p, \hat{\sigma}^2 \mid Y) + K\log(n)$, where K = p + 2 and p is the number of estimated parameters. For both criteria, a smaller value indicates a better model. The only difference between them is the penalty term: 2K for AIC and K log(n) for BIC [1].
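The simulation and the model-comparison criteria described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the author's RStudio code, and it rests on several assumptions: the second parameter of N(0,0.1) and N(1,0.1) is treated as a standard deviation (R2 does not depend on this choice), the exact correlation is imposed by orthogonalizing two standard-normal draws, and p is taken to be 1 for a simple linear regression so that K = 3. The function names (simulate_pair, ecological_means, r_squared, gaussian_aic_bic) are hypothetical, and Python's random stream under seed 456456 will not reproduce the R stream.

```python
import math
import random


def mean(v):
    return sum(v) / len(v)


def standardize(v):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    m = mean(v)
    sd = math.sqrt(sum((a - m) ** 2 for a in v) / len(v))
    return [(a - m) / sd for a in v]


def r_squared(x, y):
    """R^2 of a simple linear regression of y on x (squared Pearson r)."""
    zx, zy = standardize(x), standardize(y)
    r = mean([a * b for a, b in zip(zx, zy)])
    return r * r


def simulate_pair(rho, n=3000, sd=0.1, rng=None):
    """Draw X ~ N(0, sd) and Y ~ N(1, sd) with sample correlation exactly rho."""
    if rng is None:
        rng = random.Random(456456)
    z1 = standardize([rng.gauss(0, 1) for _ in range(n)])
    z2 = [rng.gauss(0, 1) for _ in range(n)]
    # Remove the part of z2 explained by z1, then rescale the residual,
    # so that the residual is exactly uncorrelated with z1 in this sample.
    m2 = mean(z2)
    slope = mean([a * (b - m2) for a, b in zip(z1, z2)])
    resid = standardize([b - m2 - slope * a for a, b in zip(z1, z2)])
    y_std = [rho * a + math.sqrt(1 - rho ** 2) * e for a, e in zip(z1, resid)]
    x = [sd * a for a in z1]
    y = [1 + sd * b for b in y_std]
    return x, y


def ecological_means(x, y, n_groups=30, rng=None):
    """Randomly split individuals into equal-size groups; return group means."""
    if rng is None:
        rng = random.Random(456456)
    idx = list(range(len(x)))
    rng.shuffle(idx)
    size = len(x) // n_groups
    gx, gy = [], []
    for g in range(n_groups):
        members = idx[g * size:(g + 1) * size]
        gx.append(mean([x[i] for i in members]))
        gy.append(mean([y[i] for i in members]))
    return gx, gy


def gaussian_aic_bic(x, y):
    """AIC and BIC of a Gaussian simple linear regression, with K = p + 2 = 3."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    b0 = my - b1 * mx
    rss = sum((b - b0 - b1 * a) ** 2 for a, b in zip(x, y))
    sigma2 = rss / n  # maximum-likelihood estimate of the error variance
    loglik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    k = 3  # assumption: p = 1 predictor, so K = p + 2
    aic = 2 * (-loglik + k)
    bic = -2 * loglik + k * math.log(n)
    return aic, bic


rng = random.Random(456456)
x, y = simulate_pair(rho=0.5, rng=rng)
gx, gy = ecological_means(x, y, rng=rng)
print("individual R^2:", r_squared(x, y))
print("ecological R^2:", r_squared(gx, gy))
print("AIC, BIC (individual fit):", gaussian_aic_bic(x, y))
```

Because the correlation is imposed exactly, the individual R2 equals the square of the chosen ρ up to floating-point error, while the ecological R2 from the 30 group means varies from run to run. Repeating the call over a grid of ρ values gives paired individual and ecological R2 values of the kind the project studies.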
Type Text
Publisher University of Utah
Subject Coefficient of Determination; Ecological studies
Dissertation Institution The project should be housed in the Division of Public Health's collection
Language eng
Rights Management (c) Bin Yan
Format Medium application/pdf
ARK ark:/87278/s6wzcfgh
Setname ir_dph
ID 2019534
Reference URL https://collections.lib.utah.edu/ark:/87278/s6wzcfgh