# Understanding the Surrogate Endpoint Dilemma in Research
Chapter 1: The Challenge of Correlation
The surrogate endpoint dilemma remains a significant challenge in research, one that has yet to be resolved. This issue often surprises researchers and can lead to misleading conclusions and decisions. In a previous discussion on correlation, I delved into Francis Galton’s insightful discoveries and highlighted certain pitfalls in correlation analysis, including intransitivity and non-linear relationships among correlated variables.
Now, I will focus on the more intricate surrogate endpoint problem. To start, we will examine a basic correlation scenario involving two binary variables. After that, we will briefly explore how this straightforward setup can lead to statistical pitfalls and how the surrogate endpoint issue emerges in practice. Let’s dive in.
Section 1.1: Correlation Among Binary Variables
Imagine a hypothetical situation where a vegan diet shows a positive correlation with COVID-19 infections. In this case, both factors can be treated as binary variables, where responses are limited to "yes" or "no." An individual is either fully vegan or not, and similarly, one is either infected with COVID-19 or not. Saying that vegans are more likely to contract COVID-19 than the average individual is logically equivalent to saying that someone infected with COVID-19 is more likely to be vegan than the average person.
To clarify this logical equivalence, we can express these statements mathematically. The first statement can be represented as follows:
(Vegans with COVID-19) / (Total vegans) > (Total COVID-19 cases) / (Total individuals)
The second statement can be expressed as:
(Vegans with COVID-19) / (Total COVID-19 cases) > (Total vegans) / (Total individuals)
Multiplying both sides of either inequality by its two denominators (both positive counts) yields the same expression, which shows that the two statements are equivalent:
(Vegans with COVID-19) * (Total individuals) > (Total COVID-19 cases) * (Total vegans)
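To make the equivalence concrete, here is a minimal Python sketch that evaluates both statements and the cross-multiplied form on a few count tables. The counts are hypothetical numbers chosen for illustration, and the function name `all_three` is my own; exact fractions are used so that ties are not blurred by floating-point rounding.

```python
from fractions import Fraction

def all_three(both, vegans, cases, total):
    """Evaluate the two statements and their cross-multiplied form.

    both   = vegans with COVID-19
    vegans = total vegans
    cases  = total COVID-19 cases
    total  = total individuals
    """
    # Statement 1: vegans are more likely to be infected than average.
    s1 = Fraction(both, vegans) > Fraction(cases, total)
    # Statement 2: infected people are more likely to be vegan than average.
    s2 = Fraction(both, cases) > Fraction(vegans, total)
    # Cross-multiplied form shared by both statements.
    s3 = both * total > cases * vegans
    return s1, s2, s3

# Hypothetical counts: (both, vegans, cases, total)
examples = [
    (30, 100, 200, 1000),  # positive correlation
    (10, 100, 200, 1000),  # negative correlation
    (20, 100, 200, 1000),  # exact tie: no correlation at all
]

for counts in examples:
    print(counts, all_three(*counts))
```

For every table, the three results agree: either all three are true or all three are false.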
This brings us to the next crucial aspect of our discussion.
Section 1.2: The Statistical Perspective on Correlation
When analyzing the last expression, a significant concern arises. In a sample of any realistic size, there is only a slim chance that the two sides of the expression will be exactly equal.
What this implies is that the two variables will almost always exhibit some nonzero correlation in the sample, positive or negative, irrespective of whether they pertain to veganism or COVID-19. In fact, any binary variable, such as gender or blood type, can show a positive or negative correlation with the risk of COVID-19.
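A small simulation can illustrate how rare an exact tie is. The sketch below draws two binary variables that are independent by construction and counts how often the two sides of the cross-multiplied expression come out exactly equal. The sample size, the probabilities, and the function name `exact_tie_fraction` are assumptions picked for illustration only.

```python
import random

def exact_tie_fraction(n_people=1000, p_a=0.3, p_b=0.1, n_trials=500, seed=0):
    """Fraction of simulated samples in which two independent binary
    variables show exactly zero correlation (both sides equal)."""
    rng = random.Random(seed)
    ties = 0
    for _ in range(n_trials):
        # Two unrelated yes/no traits, drawn independently for each person.
        a = [rng.random() < p_a for _ in range(n_people)]
        b = [rng.random() < p_b for _ in range(n_people)]
        both = sum(x and y for x, y in zip(a, b))
        # Same comparison as in the text, with equality instead of ">".
        if both * n_people == sum(a) * sum(b):
            ties += 1
    return ties / n_trials

print("Share of samples with an exact tie:", exact_tie_fraction())
```

Even though the two traits are unrelated, nearly every simulated sample lands on one side of the inequality or the other, which is why a raw correlation on its own says very little.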
So how can we address this issue? This is where statisticians play a vital role. Scientific studies often present statistically significant correlations. However, the concept of statistical significance can introduce its own set of challenges, which I have previously examined in detail. Even with statistically significant correlations, misinterpretation of causality can arise.
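As a rough illustration of what "statistically significant" can mean here, the sketch below runs a Pearson chi-square test of independence on a 2x2 table. The choice of test is my assumption (studies may use other methods), no continuity correction is applied, and the counts are hypothetical. With one degree of freedom, the p-value can be computed with the standard library via `math.erfc`.

```python
import math

def chi_square_2x2(both, vegans, cases, total):
    """Pearson chi-square test of independence for a 2x2 table
    (1 degree of freedom, no continuity correction).
    Returns (chi-square statistic, p-value)."""
    a = both                           # vegan, infected
    b = vegans - both                  # vegan, not infected
    c = cases - both                   # not vegan, infected
    d = total - vegans - cases + both  # not vegan, not infected
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("each row and column needs at least one person")
    chi2 = total * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom: P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: (both, vegans, cases, total)
print(chi_square_2x2(21, 100, 200, 1000))  # barely above the tie of 20
print(chi_square_2x2(40, 100, 200, 1000))  # far above the tie of 20
```

Both tables show a positive correlation in the sense of the inequality above, but only the second is unlikely to be a product of chance under the usual 0.05 threshold.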
Subsection 1.2.1: Misinterpretation of Causality
Suppose we manage to establish a statistically significant correlation between veganism and severe COVID-19 cases. This leads to the statement:
“If you follow a vegan diet, your risk of severe COVID-19 infection is higher than that of the average person.”
While this statement is factual, it lacks the sensationalism often found in mainstream media. Thus, you might encounter a more dramatic claim:
“If you aren’t vegan, your chances of severe COVID-19 infection are lower.”
Although subtle, this difference carries significant implications. The second statement reads as advice (avoid a vegan diet and your risk will drop), which suggests a causal relationship that statistical correlation alone does not establish.
Chapter 2: The Surrogate Endpoint Problem
The surrogate endpoint dilemma emerges naturally from various correlation scenarios. Taking the previous example, conducting scientific studies to quantify the risk of severe COVID-19 associated with veganism is both labor-intensive and time-consuming; researchers would have to wait for severe outcomes to occur.
Instead of that lengthy wait, researchers often seek surrogate endpoints—simpler measures that can stand in for more complex phenomena. For instance, they may use a biomarker like blood oxygen levels. If a vegan’s blood oxygen level drops below a specific threshold, researchers might classify this as a severe risk.
However, surrogate endpoints may not have a causal connection to the actual phenomenon. They might also stem from confounding factors that have gone untracked, leading to incorrect conclusions and misguided decisions.
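The following simulation sketches how an untracked confounder can make a surrogate look informative. Everything in it is hypothetical: a hidden "frail" trait raises both the chance of low blood oxygen (the surrogate) and the chance of a severe outcome, while low oxygen itself has no effect on the outcome in this toy model. The probabilities and function names are assumptions for illustration, not estimates from any study.

```python
import random

def simulate(n=40000, seed=1):
    """Return rows of (frail, low_oxygen, severe) for n simulated people."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        frail = rng.random() < 0.3                           # hidden confounder
        low_oxygen = rng.random() < (0.6 if frail else 0.1)  # surrogate
        severe = rng.random() < (0.4 if frail else 0.05)     # true outcome
        rows.append((frail, low_oxygen, severe))
    return rows

def severe_rate(rows, low_oxygen, frail=None):
    """Share of severe outcomes among people with the given surrogate
    value, optionally restricted to one level of the confounder."""
    selected = [r for r in rows
                if r[1] == low_oxygen and (frail is None or r[0] == frail)]
    return sum(r[2] for r in selected) / len(selected)

rows = simulate()
# Ignoring the confounder: the surrogate appears to predict severity.
print("all people:", severe_rate(rows, True), severe_rate(rows, False))
# Within each confounder group: the apparent link largely disappears.
for frail in (True, False):
    print("frail =", frail,
          severe_rate(rows, True, frail), severe_rate(rows, False, frail))
```

If the hidden trait is never recorded, only the first comparison is available, and the surrogate looks far more meaningful than it is.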
Section 2.1: Identifying the Surrogate Endpoint Problem
Whenever you come across phrases like “This product increases cancer risk…” or “This food heightens cardiovascular health risk…,” it’s essential to scrutinize what was measured to reach such conclusions. Terms like “cancer risk” or “cardiovascular health risk” are frequently quantified using surrogates or proxies. While this doesn’t always lead to the surrogate endpoint problem, the risk is ever-present in such studies.
More than 130 years after Galton introduced the concept of correlation, we continue to grapple with the relationship between correlation and causation, largely due to issues like statistical significance and the surrogate endpoint dilemma. The root of this challenge may lie in human nature rather than a lack of scientific advancement. Nevertheless, we must persist in our quest for solutions.