| School or College | School of Medicine |
| Department | Public Health Division |
| Project type | Master of Statistics (MSTAT): Biostatistics Project |
| Author | Tubilla, Alison |
| Title | A statistical framework for identifying outliers in repeated ED50 estimates from pharmacological studies |
| Date | 2025 |
| Description | Basic science experiments are fundamental in medical research. Specifically, drug experimentation on mice has been instrumental in the development of many drugs. Anticonvulsant drugs, which are used in the treatment of epilepsy (seizures), are a class of drugs in which animal experimentation has been crucial. Both the discovery and development of antiepileptic compounds have relied heavily on animal experimentation. The Anticonvulsant Drug Development (ADD) lab at the University of Utah is one research center that performs pharmacological studies with antiepileptic drugs used to treat seizures. Any new anticonvulsant drug introduced to clinical use in the past 40 years has been studied in this lab. |
| Type | Text |
| Publisher | University of Utah |
| Subject | ED50 estimates; outlier detection; meta-analysis |
| Language | eng |
| Rights Management | © Alison Tubilla |
| Format Medium | application/pdf |
| ARK | ark:/87278/s6rd22aq |
| Setname | ir_dph |
| ID | 2698203 |
| OCR Text | A Statistical Framework for Identifying Outliers in Repeated ED50 Estimates from Pharmacological Studies Alison Tubilla Master of Statistics – Biostatistics Final Project Spring 2025 Motivation and Objective Basic science experiments are fundamental in medical research. Specifically, drug experimentation on mice has been instrumental in the development of many drugs. Anticonvulsant drugs, which are used in the treatment of epilepsy (seizures), are a class of drugs in which animal experimentation has been crucial. Both the discovery and development of antiepileptic compounds have relied heavily on animal experimentation. The Anticonvulsant Drug Development (ADD) lab at the University of Utah is one research center that performs pharmacological studies with antiepileptic drugs used to treat seizures. Any new anticonvulsant drug introduced to clinical use in the past 40 years has been studied in this lab. In the research of these drug compounds over the years, researchers in this lab have done many experiments to calculate ED50 (Effective Dose 50) estimates. An ED50 estimate is the dose of a drug compound at which 50% of the population taking that dose will have a beneficial and therapeutic effect. Figure 1 provides an example of what a typical ED50 experiment looks like. There are four dosage groups with eight mice per group. Although four dosage groups are standard, studies may occasionally employ three or five groups depending on the specific research context or design considerations. Eight mice per group is also standard; however, over the years, mice per group have ranged from three to twelve. In these experiments, the total number of mice per group that have a beneficial and therapeutic effect is recorded. 
Figure 1: Structure of an ED50 Experiment. Once the effects have been recorded for each dosage group, a probit model is run through the data in order to calculate the dose where a predicted 50% of mice will have a beneficial and therapeutic effect. Probit analysis is used for binary data with the assumption that the underlying probability of response follows a cumulative normal distribution. Probit analysis is a standard statistical method used in toxicology and pharmacology to calculate ED50s, especially in regulatory settings. Figure 2 is a visual representation of how an ED50 calculation is made. Figure 2: ED50 calculation using probit analysis After many decades of research in the ADD lab, many ED50 experiments have been performed on a variety of anticonvulsant drug compounds. Therefore, each drug compound has more than one estimate of the ED50. Natural variation is expected to occur when multiple studies are performed on the same drug compound. Small variations are not just normal, but expected. Although similar dosages are used across experiments, they are rarely the exact same dosages. Figure 3 is a synthetic dataset that the researchers provided as an example of a drug compound with three different experiments. Variation in doses can be seen, which naturally will lead to variations in the estimated ED50 values. Figure 3: Synthetic ED50 experiment data A problem is introduced when the variation in ED50 values begins to exceed what would be considered natural variation. The researchers had more than one drug compound for which there appeared to be an ED50 estimate that was different from the others. Specifically, researchers provided me with ED50 estimates data on one of these drug compounds, which is contained in Figure 4. With a quick glance at the ED50 estimates of this drug compound, the 71.6 value seems much different from the other estimates, which are in the 40s to low 50s range. 
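To make the calculation concrete, the probit fit and ED50 solution can be sketched in a few lines. This is an illustrative Python sketch, not the lab's R code: the dose levels and response counts are hypothetical, and the model is fit by maximizing the binomial likelihood directly, then solved for the dose at which the predicted response probability is 0.5.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Hypothetical experiment: four dosage groups, eight mice per group.
doses = np.array([10.0, 20.0, 40.0, 80.0])   # illustrative dose levels
responders = np.array([0, 2, 6, 8])          # mice with a therapeutic effect
n_mice = np.full(4, 8)

log_dose = np.log10(doses)

def neg_log_lik(params):
    """Binomial negative log-likelihood under a probit dose-response curve."""
    a, b = params
    p = norm.cdf(a + b * log_dose).clip(1e-10, 1 - 1e-10)
    return -np.sum(responders * np.log(p) + (n_mice - responders) * np.log(1 - p))

# Start near the centre of the dose range with a unit slope.
fit = minimize(neg_log_lik, x0=[-np.mean(log_dose), 1.0],
               method="Nelder-Mead", options={"maxiter": 2000})
a_hat, b_hat = fit.x

# ED50: the dose where Phi(a + b*log10(dose)) = 0.5, i.e. a + b*log10(dose) = 0.
ed50 = 10 ** (-a_hat / b_hat)
print(f"Estimated ED50: {ed50:.1f}")
```

With these hypothetical counts the fitted ED50 lands between the second and third dose groups, mirroring how Figure 2's curve crosses the 50% line.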
Although the researchers can see that the variation appears large enough to be more than natural, they don’t have a method to definitively determine this. At this point, they sought out statistical advice for their data. They desired a method to determine if any ED50 estimates within their datasets should be considered outliers. Figure 4: ED50 estimates for a drug compound The objective for this project was to conduct a comparative analysis of outlier detection strategies to determine which is best for repeated ED50 experiments from pharmacological studies. The researchers were given a recommendation for their data based on this comparative analysis. Comparative Analysis There are many statistical outlier detection techniques used in common practice (Sullivan et al., 2021). There is not one specific method that is considered standard for all types of research contexts and study designs. Outlier detection is very dependent on the type of data being studied. Of all the outlier detection techniques researched, there were three that stood out as candidates for the ED50 estimates data. The first is a confidence interval overlap method. The second is a leave-one-out method. The third is a standard deviation cut-off method. These are three of the more common approaches to outlier detection in the statistical literature that were applicable to the ED50 data. Method 1 The first method to evaluate was the confidence interval overlap method. In a confidence interval overlap method, a value is considered an outlier if its confidence interval does not overlap with any other confidence interval estimate in the dataset. In the context of the data, an ED50 estimate would be considered an outlier if the confidence interval around it does not overlap with any other confidence interval of all the other ED50 estimates for the same drug compound. Looking back at Figure 4, the researchers have confidence intervals calculated for each ED50 estimate. 
The upper limit and lower limit columns are the upper and lower bounds of that confidence interval. In the case of the provided data, the 71.6 estimate that appears to be different from the rest is not considered an outlier with this method. This is because its confidence interval, whose lower bound is 45.9, overlaps with at least one other interval from the other ED50 estimates. Figure 5 visually shows the overlap of the interval around 71.6 (designated by the blue area) with other ED50 intervals in the dataset. Figure 5: ED50 confidence intervals The confidence interval overlap method has both strengths and drawbacks. One of the biggest strengths comes from using an interval range rather than a single point estimate. By leveraging a confidence interval around the ED50 estimate to determine if it's an outlier, rather than using just the ED50 estimate itself, it takes into account some of the individual experiment variability. There are many sources of variability within a single ED50 experiment for a drug compound. Some of these sources of variability include, but are not limited to, the drug manufacturer, the researchers conducting the experiment, how many dosage groups are used, the exact dosages used, the number of mice used per group, the type of mice being experimented on, and the number of mice responding in dosage groups. It is almost always advantageous to include sources of variability in analysis calculations where possible. A second strength of this method is based on the data the researchers have. In their data, confidence intervals have already been calculated for each ED50 estimate. This means implementation of this method would be very straightforward since their software has already calculated intervals. Although it has strengths, this method has many drawbacks. The first two drawbacks go hand in hand. The first one is that extreme values have very wide intervals in this data, which makes it difficult to ever flag an outlier. 
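The overlap rule itself is mechanical and easy to script. A minimal Python sketch follows; the interval bounds below are hypothetical placeholders, except that the last interval's lower bound of 45.9 (around the 71.6 estimate) is the one bound actually reported in the text.

```python
def overlaps(a, b):
    """True if closed intervals a = (lo, hi) and b = (lo, hi) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

def ci_overlap_outliers(intervals):
    """Flag any interval that overlaps with no other interval in the set."""
    flagged = []
    for i, ci in enumerate(intervals):
        if not any(overlaps(ci, other)
                   for j, other in enumerate(intervals) if j != i):
            flagged.append(i)
    return flagged

# Hypothetical confidence intervals for the nine ED50 estimates; only the
# lower bound 45.9 of the last interval is taken from the reported data.
cis = [(36.0, 45.2), (37.1, 46.0), (40.2, 49.5), (41.0, 50.1),
       (41.3, 50.0), (42.8, 51.6), (43.5, 52.3), (47.9, 57.7),
       (45.9, 97.3)]
print(ci_overlap_outliers(cis))   # → [] (71.6's wide interval overlaps others)
```

This illustrates the drawback discussed above: because the extreme estimate carries the widest interval, it is the hardest one for the rule to flag.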
A value being extreme should make it more likely to be detected as an outlier, which isn’t the case in this method. This leads to the second drawback, which is that with the data given, it is not possible to calculate or simulate the probability of error for this method. Individual subject-level data would be required to determine the probability of falsely flagging an outlier. This is a very notable disadvantage since knowing the probability of a false positive (saying a data point is an outlier when it actually is not) is crucial with any method. Having a constant probability of error across all sample sizes is important so that the method works the same no matter how many ED50 estimates a drug compound has. A third drawback, tied to the strength of incorporating individual ED50 experiment data, is that the variability of individual experiments and how it affects confidence interval calculation is unknown. More research would have to be done to determine if there are certain sources of variability in ED50 experiments that greatly impact the probit model, ED50 calculation, and the confidence interval calculation. Although using an interval around an ED50 estimate instead of just the point estimate sounds good in theory, there are a lot of unknowns that come with the variability and confidence interval calculation. Finally, the fourth drawback to this method is that using confidence intervals doesn’t appear to be appropriate with regard to how ED50 estimates from multiple experiments behave. In a confidence interval, the point estimate is a calculated mean, and the critical value uses a t-distribution, based on sample size. After simulating ED50 experiment estimates and comparing them to t-distribution probabilities, the estimates did not follow the same distribution characteristics. Rather, ED50 estimates behave as individual data points in a population. 
Figure 6 shows the code in R used to simulate ED50 experiments for an experiment with four dosage groups, with eight mice per group. The probabilities were chosen based on the researchers' remarks on ED50 experiments. Specifically, most of the time the smallest dosage group had zero of the mice with a beneficial effect, and most of the time the largest dosage group had all the mice with a beneficial effect (hence the 0.75 and 0.25 probabilities). The middle dosage groups were more likely to have a number of mice with a beneficial effect that was close to half of the group size, but rarely exactly half. These simulations created distributions of ED50 estimates, which helped determine that ED50s behave like individual data points instead of means. Figure 6: R code to simulate ED50 experiments Overall, the confidence interval method has a couple of critical weaknesses. The inability to calculate or simulate the probability of an error makes it impossible to ensure the method performs consistently across sample sizes. Another problematic weakness is that the true behavior of ED50 estimates is not consistent with a confidence interval calculation. ED50 estimates behave as individual data points in a population rather than as means. The limitations overpower any advantages of this method. Method 2 The second potential method for outlier detection of the ED50 estimates data is a leave-one-out method. In this method, there is a three-step process. First, determine the most extreme value by ordering the dataset and determining whether the highest or lowest value is farthest from the next nearest data point. Once the most extreme value is identified, the second step is to leave this value out of the dataset. The last step is to look at the distribution of the remaining values and determine if the left-out value belongs to the same distribution as the rest of the data. 
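Figure 6 survives here only as an image. The kind of simulation it describes can be approximated as follows; this is a hedged Python sketch, not the lab's R code. The dose levels, the per-group response probabilities, and the use of a simple probit-transform regression in place of a full probit MLE are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

doses = np.array([10.0, 20.0, 40.0, 80.0])   # four dosage groups (illustrative)
n_mice = 8                                   # eight mice per group
# Group response probabilities chosen so the lowest group usually has zero
# responders and the highest usually has all eight -- an assumption loosely
# mirroring the researchers' remarks described above.
p_true = np.array([0.05, 0.35, 0.65, 0.95])

def simulate_ed50():
    responders = rng.binomial(n_mice, p_true)
    # Crude probit fit: regress the probit of the continuity-corrected
    # observed proportions on log10(dose), then solve for probit = 0.
    p_hat = (responders + 0.5) / (n_mice + 1)
    z = norm.ppf(p_hat)
    b, a = np.polyfit(np.log10(doses), z, 1)
    return 10 ** (-a / b)

ed50s = np.array([simulate_ed50() for _ in range(200)])
print(np.median(ed50s))
```

Repeating the draw many times produces the kind of ED50 distribution the author examined when concluding that repeated ED50 estimates behave like individual data points rather than means.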
Since it has been determined previously that ED50 estimates behave like individual data points, a tolerance interval can be used on the remaining dataset, after the extreme value is left out. Calculating a tolerance interval gives a range of the distribution of the dataset that the left-out value can be compared to. A tolerance interval provides limits within which at least a certain proportion of the population falls, at a given level of confidence. For this method, a coverage of 0.999 is chosen so that only a proportion 0.001 of data points fall outside of the interval. In other words, the coverage of 0.999 is chosen so that the probability of error is 0.001. For the ED50 estimates of drug compounds, a low probability of error is desired to ensure that when outliers are detected, they truly are outliers. Experiments take a lot of time and money, so calling something an outlier when in reality it is a true estimate is very undesirable. There are multiple methods for calculating a tolerance interval. Howe's tolerance interval is among the most common. Similar to confidence interval structure, a Howe tolerance interval takes an estimate plus or minus a standard deviation that is multiplied by a critical value. This essentially takes an estimate and creates an upper and lower bound that are a certain number of standard deviations away from that estimate. Figure 7 shows the equations for a Howe tolerance interval as well as the equation for the critical value, or as it is commonly called in the literature, the k factor. Figure 7: Howe's tolerance intervals equations For the given dataset, this outlier detection technique would be used as follows: The ED50 estimates are ordered from smallest to largest: 40.59, 41.01, 44.86, 45.36, 45.55, 47.04, 47.79, 52.64, 71.60. The estimate of 71.6 is determined to be the most extreme value since 71.6 is farther in absolute distance from 52.64 than 40.59 is from 41.01. 
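Figure 7 is preserved only as an image. For reference, the widely used Howe-style approximation to the two-sided normal tolerance factor, assuming this is the form used in the project, is:

```latex
\[
\bar{x} \pm k s,
\qquad
k \;\approx\; z_{(1+P)/2}\,
\sqrt{\frac{\nu\,\bigl(1 + \tfrac{1}{n}\bigr)}{\chi^{2}_{\alpha,\,\nu}}},
\qquad \nu = n - 1,
\]
% P     : coverage proportion (0.999 in this project)
% 1-alpha : confidence level of the tolerance interval
% chi^2_{alpha,nu} : lower alpha-quantile of the chi-squared distribution
```

Here \(z_{(1+P)/2}\) is a standard normal quantile, and the k factor is the number of sample standard deviations spanned by each side of the interval.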
Then, for the next steps of this process, 71.6 is left out. For the remaining dataset, a tolerance interval is calculated. Using Howe’s method for tolerance intervals, this specific dataset produces a tolerance interval of (21.52, 69.69). The last step of this process is to compare the extreme value to the tolerance interval of the remaining values. Since 71.6 is not contained within the interval of (21.52, 69.69), it is considered an outlier. This leave-one-out method for outlier detection also comes with strengths and drawbacks. A strength of this method is that it implements intervals that use the true behavior of ED50 estimates. Rather than using confidence intervals, tolerance intervals are used since ED50 estimates more closely follow a distribution of individual data points in a population. A second strength of the leave-one-out method is that error probabilities can be simulated with this method. By simulating this method many times, approximate probabilities of falsely flagging an outlier can be accurately estimated. This is very important for testing if probabilities are constant across all sample sizes. A final strength of this method comes from leaving the most extreme data point out. By leaving the data point out, it does not affect the calculation of the tolerance interval for the rest of the data. Extreme values can greatly skew estimates, and therefore the tolerance interval. By leaving the extreme data point out, a more accurate picture of the distribution of the rest of the data can be obtained. An accurate distribution for the rest of the data allows for comparison of the extreme point to determine if it fits in with the distribution. Along with its many strengths, the leave-one-out method comes with drawbacks as well. The first drawback is that the individual variability of each ED50 experiment is not considered. 
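The three-step procedure applied to this dataset can be scripted directly. A Python sketch follows; the 95% confidence level for the tolerance interval is an assumption, but combined with 0.999 coverage it reproduces the (21.52, 69.69) interval reported above.

```python
import numpy as np
from scipy.stats import norm, chi2

def howe_interval(x, coverage=0.999, confidence=0.95):
    """Two-sided normal tolerance interval via Howe's approximation."""
    n = len(x)
    nu = n - 1
    k = norm.ppf((1 + coverage) / 2) * np.sqrt(
        nu * (1 + 1 / n) / chi2.ppf(1 - confidence, nu))
    m, s = np.mean(x), np.std(x, ddof=1)
    return m - k * s, m + k * s

ed50s = np.array([40.59, 41.01, 44.86, 45.36, 45.55,
                  47.04, 47.79, 52.64, 71.60])

# Step 1: the most extreme value is whichever end is farther from its neighbor.
x = np.sort(ed50s)
extreme = x[-1] if (x[-1] - x[-2]) > (x[1] - x[0]) else x[0]

# Step 2: leave it out.  Step 3: compare it to the interval of the rest.
rest = x[x != extreme]
lo, hi = howe_interval(rest)
is_outlier = not (lo <= extreme <= hi)
print(extreme, (round(lo, 2), round(hi, 2)), is_outlier)
# → 71.6 (21.52, 69.69) True
```

Since 71.6 falls above the upper limit, the leave-one-out rule flags it, matching the result stated in the text.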
Since a single point estimate of the ED50 is used for the extreme value that is left out, variability around this estimate isn’t taken into account. As mentioned before, using intervals around the ED50 estimates is difficult, as it is unknown what aspects of variations from ED50 experiments affect the intervals most, but not including variations in estimates is still always a drawback since variability is an important aspect of any data. Although probabilities can be simulated using this method, after performing simulations, the probability of error was not consistent across sample sizes. A probability of 0.001 for falsely flagging an outlier wasn’t even achieved asymptotically when sample sizes got large. This is a major drawback. It appeared that there was a nonlinear relationship in the error probabilities as sample sizes got larger. Further research would have to be performed to determine why the patterns in the error probabilities occurred. The Howe tolerance interval calculation, as well as the leave-one-out aspect of the method, were determined to be causes of inconsistent probabilities across sample sizes. Another drawback to this method that comes with simulating the probabilities is that it relies on the assumption that ED50 estimates are normally distributed. Figure 9 shows the R code that is used in simulations. The ED50s are simulated by randomly sampling from a normal distribution. Figure 8 shows graphs of simulated ED50 values (simulated using the code from Figure 6) and why the normality assumption is made in this data. Fully sampling ED50 values, as done in the code from Figure 6, couldn’t be done within the simulations to calculate error probability of the method. This is because it required thousands upon thousands of probit models to be run, and R would crash before it could finish the simulation. The normality assumption was made to simplify the computation and make simulating the probability of error feasible. 
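The simulation logic behind Figure 9 (an image in the original) can be approximated as follows. This Python sketch makes the same normality assumption described above; the normal distribution's parameters, the 95% tolerance-interval confidence level, and the choice of sample sizes are illustrative assumptions, not the author's settings.

```python
import numpy as np
from scipy.stats import norm, chi2

rng = np.random.default_rng(1)

def howe_k(n, coverage=0.999, confidence=0.95):
    """k factor of Howe's two-sided normal tolerance interval."""
    nu = n - 1
    return norm.ppf((1 + coverage) / 2) * np.sqrt(
        nu * (1 + 1 / n) / chi2.ppf(1 - confidence, nu))

def false_flag_rate(n, n_sims=20_000):
    """Fraction of simulated compounds (all n estimates drawn from one normal
    distribution, so any flag is a false positive) in which the leave-one-out
    rule flags the extreme value."""
    k = howe_k(n - 1)          # the interval is built on the remaining n-1 values
    flags = 0
    for _ in range(n_sims):
        x = np.sort(rng.normal(50, 5, size=n))
        extreme = x[-1] if (x[-1] - x[-2]) > (x[1] - x[0]) else x[0]
        rest = x[x != extreme]
        m, s = rest.mean(), rest.std(ddof=1)
        if not (m - k * s <= extreme <= m + k * s):
            flags += 1
    return flags / n_sims

rates = {n: false_flag_rate(n) for n in (5, 9, 15)}
print(rates)
```

Comparing the estimated rates across n against the nominal 0.001 is the check the author describes; the reported finding was that they do not stay at the nominal level.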
Figure 8: Histograms and QQ plots of ED50 estimates Figure 9: R code for simulating method 2 When assessing the advantages and disadvantages of this second method, the drawback of inconsistent probabilities is a big concern. As mentioned before, an outlier detection method should perform consistently across all sample sizes when used. There should be a very small probability of error, the desire being a constant 0.001 across sample sizes, for the method to prove useful for the ED50 estimates data. Method 3 The final method is a standard deviation cut-off rule. In statistical practice, it is common to see a three-standard-deviation rule for outlier detection. This means that any points outside three standard deviations are considered outliers. This is a common cut-off since there is only a 0.003 probability of data being outside three standard deviations (for a standard normal distribution). In order to incorporate this kind of rule into the ED50 data, it must be adjusted for sample size and alpha. Similar to Method 2, a tolerance interval will be used to incorporate the true behavior of ED50 estimates for anticonvulsant drug compounds. All data is included in the calculation of this tolerance interval. Howe’s tolerance interval method will be used with an added correction factor to ensure the probability of error is equal to 0.001. Any data that falls outside of the tolerance interval will be considered an outlier. Figure 10 is a table of error probabilities for outlier detection when Howe’s tolerance interval alone is used to calculate cut-offs. In this table it can be seen that, as n approaches infinity, the probability of falsely flagging an outlier indeed approaches 0.001. However, small sample sizes need a correction factor to preserve that probability. Figure 11 is the table of error probabilities when the added correction factor is used. 
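One way such a correction could be simulated, a sketch of the idea behind Figure 14 rather than the author's code, is to pick the cut-off as the 0.999 quantile of the most extreme standardized deviation in clean normal samples of size n, so that the probability of flagging anything in outlier-free data is about 0.001, and then divide by Howe's k to express it as a correction factor.

```python
import numpy as np
from scipy.stats import norm, chi2

rng = np.random.default_rng(2)

def howe_k(n, coverage=0.999, confidence=0.95):
    """k factor of Howe's two-sided normal tolerance interval."""
    nu = n - 1
    return norm.ppf((1 + coverage) / 2) * np.sqrt(
        nu * (1 + 1 / n) / chi2.ppf(1 - confidence, nu))

def simulated_cutoff(n, n_sims=200_000, alpha=0.001):
    """(1 - alpha) quantile of max_i |x_i - xbar| / s over clean normal
    samples: using this many sample standard deviations as the cut-off makes
    the chance of flagging anything in outlier-free data about alpha."""
    x = rng.standard_normal((n_sims, n))
    t = np.abs(x - x.mean(axis=1, keepdims=True)).max(axis=1) / x.std(axis=1, ddof=1)
    return np.quantile(t, 1 - alpha)

n = 9
c = simulated_cutoff(n)
print(round(c, 3), round(c / howe_k(n), 3))  # cut-off and implied correction factor
```

The result can be compared against the tabulated k*CF of 2.3932 for n = 9; how closely they agree depends on exactly how the author defined the error probability in the simulations, which the figures no longer show.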
Figure 12 shows the equations used to calculate a Howe’s tolerance interval with a correction factor in this method. Figure 13 shows the k factor from Howe’s tolerance interval, the correction factor, and the two multiplied together. The K*CF column values represent the number of standard deviations the cut-off value lies from the mean. Figure 14 contains R code for simulating the correction factors in this method. Figure 10: Howe's tolerance interval error probability Figure 11: Howe's tolerance with correction factor error probability Figure 12: Howe's tolerance interval with correction factor equations Figure 13: Howe's tolerance intervals with correction factor Figure 14: R code for correction factor standard deviation cut-off method Applying this method to the provided ED50 data, the tolerance interval with a correction factor is calculated as follows: First, the mean of the dataset (40.59, 41.01, 44.86, 45.36, 45.55, 47.04, 47.79, 52.64, 71.6) is calculated, which is equal to 48.4933. The standard deviation of the dataset is calculated as well, which is equal to 9.3799. Since this specific drug compound’s dataset has nine ED50 values, the k*CF (k factor multiplied by the correction factor) value of 2.3932 is taken from the table. A tolerance interval is then calculated by adding and subtracting (2.3932*9.3799) to/from the mean of 48.4933 to create the upper and lower limits. This interval is calculated as (26.045, 70.941). From this dataset, 71.6 is a detected outlier since it falls outside of this interval. This method has both strengths and weaknesses. The first strength is the same as that of the second method. This method implements the true behavior of ED50 values by leveraging tolerance intervals to detect outliers. As stated previously, the ED50s act like individual data points in a population distribution, which would require a tolerance interval rather than a confidence interval. 
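Applied to the nine ED50 values, the corrected cut-off calculation is short. A Python sketch using the tabulated k*CF value of 2.3932 from Figure 13:

```python
import numpy as np

ed50s = np.array([40.59, 41.01, 44.86, 45.36, 45.55,
                  47.04, 47.79, 52.64, 71.60])

k_cf = 2.3932                      # k factor * correction factor for n = 9 (Figure 13)
m, s = ed50s.mean(), ed50s.std(ddof=1)
lo, hi = m - k_cf * s, m + k_cf * s

outliers = ed50s[(ed50s < lo) | (ed50s > hi)]
print((round(lo, 3), round(hi, 3)), outliers)
# → (26.045, 70.941) [71.6]
```

This reproduces the interval and the flagged value reported in the text.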
The second strength is that this method has a consistent probability of error across all sample sizes. By utilizing a correction factor to adjust the standard deviation cut-off value, this standard deviation rule proves to perform consistently, no matter the number of ED50 estimates for a drug compound. When a value is flagged as an outlier, there is no more than a one in a thousand chance (0.001) that it actually belongs to the same distribution as the other ED50 estimates. This ensures that mistakes are rarely made in outlier detection and that an experiment's estimate is unlikely to be incorrectly treated as significantly different from the others. Both drawbacks to this method are also found in the leave-one-out method. The first is that individual ED50 variability is not completely taken into account. The second is that there is an assumption that the ED50s are normally distributed. Details of these drawbacks are the same as explained in the Method 2 section. Conclusions Figure 15: Comparative Analysis of Methods Overall, each of these three methods had both strengths and drawbacks. When doing a comparative analysis, it is important to understand which strengths are necessary for a good method as well as which drawbacks are unacceptable for a method. As far as strengths are concerned, there are two things that should be present in an outlier detection method. The first is that it correctly uses the specifics of the data type. The second is that the method performs consistently when used. If either is missing, that absence is an unacceptable drawback for a method. The only method to have both of these things as strengths was the third method, the standard deviation cut-off method. It correctly uses the true behavior of ED50 estimates, and it performs consistently, as seen in the constant probability of error across all sample sizes. 
Method 1 neither implemented a method based on the specific ED50 estimates data type, nor could the probability of error be simulated to ensure the model performed consistently. Although Method 2 implemented true ED50 behavior by leveraging tolerance intervals, the model proved to be inconsistent across sample sizes with varying probabilities of error in outlier detection. Method 3 is the method that was ultimately recommended to the researchers to use for outlier detection for their ED50 estimates of drug compounds. Limitations & Next Steps In the comparative analysis, Method 3 was found to be the best outlier detection method for the researchers to use on the ED50 estimates from pharmacological studies on anticonvulsant drugs. Although Method 3 was identified as the best option for the given information, it does have its drawbacks and limitations. One big limitation with outlier detection techniques is that no matter what method is used, these methods can be unreliable for small datasets. It is well understood that the larger the sample size, the more likely the sample estimates (e.g., mean and standard deviation) are good estimates of the population parameters. With large sample sizes, data more accurately depicts the true population distribution, making it easier to detect data points that do not belong to the distribution. When it comes to smaller sample sizes, like in the ED50 estimates data, it becomes more difficult to detect outliers because there is variability in how accurately the sample follows the true population distribution. The other limitations of this method are its drawbacks, discussed earlier. Both its lack of incorporating variability from the individual ED50 experiments and the assumption of normality of the ED50s are still disadvantages to this method, even though this standard deviation cut-off method was determined to be the best possible option. 
Something to be researched further that could remedy these drawbacks is meta-analysis of dose-response data. Meta-analysis is the process of combining individual study results to get an overall trend. In meta-analysis for dose-response data (data like the ED50 experiments), a model is fit to each experiment/study individually. Maximum likelihood is used to determine the best possible model for the subject-level data points. Oftentimes splines and knots are used to ensure the model fits the data as well as possible. Figures 16 and 17 are an example of meta-analysis. In Figure 16, each graph shows a different study and a prediction model run through it. Each model has a unique distribution based on the subject-level data. Then, once all the best models have been selected for each study, usually using maximum likelihood estimation techniques, all those models are aggregated into one overall model. This overall model can be seen in Figure 17. Figure 16: Meta-analysis for Median Dose Response: Individual Study Models Figure 17: Meta-Analysis for Median Dose Response: Overall Trend Model Figure 18: Meta-Analysis for Median Dose Response: Individual Study Models Graphed with Overall Trend Model Figure 18 is another example of meta-analysis, in which the individual studies are represented by the gray lines and the overall trend by the thick black line. Once all models have been fit for individual studies and the overall trend, hypothesis tests using likelihood can be used to determine if any studies are significantly different from the overall trend. Using a method like this would help remedy both the drawback of the normality assumption and the lack of consideration of individual experiment variability. By using maximum likelihood estimation to develop unique models for each study, rather than just running a standard probit model like ED50 studies usually do, it no longer uses the assumption of normality for the ED50s. 
There are no assumptions about the distribution; the likelihood is maximized to pick a model that predicts the data well. Individual experiment variability is taken into account because a unique model is selected for each study. This variation in studies is also taken into account in the overall trend model, since rather than looking at just the median dose, the behavior of the entire range of doses is examined. In order to research this type of method, subject-level data for all the ED50 experiments would be required. Further research on meta-analysis for outlier detection would also be required. While Method 3, the standard deviation cut-off method, is a suitable outlier detection technique for the researchers in the ADD lab, it still has limitations, which leave room for more research that might remedy them. Meta-analysis of median dose-response studies would be the next logical research step for a more rigorous outlier detection technique for ED50 estimates from pharmacological studies. References Crippa, A., & Orsini, N. (2016). Dose-response meta-analysis of differences in means. BMC Medical Research Methodology, 16(1). https://doi.org/10.1186/s12874-016-0189-0 Howe, W. G. (1969). Two-sided tolerance limits for normal populations, some improvements. Journal of the American Statistical Association, 64(326), 610. https://doi.org/10.2307/2283644 Orsini, N., & Spiegelman, D. (2020). Meta-analysis of dose-response relationships. Handbook of Meta-Analysis, 395–428. https://doi.org/10.1201/9781315119403-18 Sullivan, J. H., Warkentin, M., & Wallace, L. (2021). So many ways for assessing outliers: What really works and does it matter? Journal of Business Research, 132, 530–543. https://doi.org/10.1016/j.jbusres.2021.03.066 |
| Reference URL | https://collections.lib.utah.edu/ark:/87278/s6rd22aq |



