## Specification of M-Bias, and an Introduction to the Challenges of Causal Discovery

Causal Inference is a field that touches several domains and is of interest to a wide range of practitioners, including Statisticians, Data Scientists, Machine Learning Scientists, and other Computational Researchers. Recovering unbiased estimates of Causal Effects can be a tough task, particularly in non-randomized settings. I have written several technical pieces on leveraging G-Methods for the adjustment needed to recover causal inferences/contrasts of interest; these include pieces on **Efficient Study Sampling Designs in Causal Inference**, **Doubly Robust Estimation Techniques**, **G-Estimation of Structural Nested Models**, and **Marginal Structural Modeling for informative censoring adjustment in randomized A/B Tests**.

This piece is a bit shorter, but touches on an important topic. Specifically, we will discuss the structure and issues of “M-Bias”, and how confounding identification is technically unverifiable using empirical analysis alone.

The contents of this piece are as follows:

Let us postulate a simple toy example:

We have a binary Intervention ***A*** with support {0,1} and a continuous Outcome ***Y***, and we are working in an observational (i.e. non-randomized) setting. We would like to recover an unbiased estimate of the Mean Causal Effect Difference of Intervention ***A*** on Outcome ***Y***.

We want to check if there is “confounding” by binary variable ***L***. As per empirical analysis of our dataset, we would like to “prove” that ***L*** is in fact a confounder of the ***A***–***Y*** causal relationship. And after proving ***L*** is in fact a confounder, we can adjust for ***L*** in our analysis to recover an unbiased estimate of the Mean Causal Effect of ***A*** on ***Y***.

As per the “traditional” definition, a variable ***L*** is a confounder of the Intervention-Outcome relationship if it meets the following three criteria:

1. ***L*** is not a “downstream consequence” of Intervention ***A*** (i.e., ***L*** occurs in time before Intervention ***A***)
2. ***L*** is associated with Intervention ***A***
3. ***L*** is associated with the Outcome ***Y*** within levels of Intervention ***A*** (i.e. when conditioning on ***A***)

We would like to confirm these three criteria in our dataset empirically.
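As a rough illustration of what such an empirical check could look like, here is a minimal sketch. The function name `check_classical_criteria`, the toy data, and all parameter values are hypothetical and not taken from the original analysis. Only the second and third criteria are computed, since the first is a statement about time ordering that cannot be read off the data alone.

```python
import numpy as np

def check_classical_criteria(A, L, Y):
    """Empirical checks for criteria 2 and 3 (hypothetical helper).

    Criterion 1 (L precedes A in time) needs knowledge of how the data
    were collected, so it is not checked here.
    """
    A, L, Y = map(np.asarray, (A, L, Y))
    # Criterion 2: difference in P(A=1) between L=1 and L=0
    assoc_L_A = A[L == 1].mean() - A[L == 0].mean()
    # Criterion 3: difference in mean Y between L=1 and L=0, within each level of A
    assoc_L_Y_given_A = {
        a: Y[(A == a) & (L == 1)].mean() - Y[(A == a) & (L == 0)].mean()
        for a in (0, 1)
    }
    return assoc_L_A, assoc_L_Y_given_A

# Hypothetical toy data; every number below is an arbitrary assumption
rng = np.random.default_rng(0)
n = 100_000
L = rng.binomial(1, 0.5, size=n)      # binary candidate confounder
A = rng.binomial(1, 0.3 + 0.4 * L)    # intervention depends on L
Y = 1.0 * L + rng.normal(size=n)      # outcome depends on L

assoc_L_A, assoc_L_Y_given_A = check_classical_criteria(A, L, Y)
print("L-A association:", round(assoc_L_A, 3))
print("L-Y association within levels of A:", assoc_L_Y_given_A)
```

Non-zero values for both quantities are what “meeting criteria 2 and 3” amounts to in practice; a formal version would attach a hypothesis test or confidence interval to each difference.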

Given we have empirically confirmed the three criteria above, we can specify the causal structure of our problem. We have “proven” that the true causal structure is specified by the Causal Directed Acyclic Graph shown below in *Figure 1*:

Given the established Causal DAG in *Figure 1*, we would like to recover an unbiased estimate of the Mean Causal Effect of Intervention ***A*** on Outcome ***Y***. In the marginal DAG, there is an open back-door associational path ***A*** <- ***L*** -> ***Y***, meaning we need to “adjust” for ***L***.

So we’re done! …

Or are we? …

Is there possibly something amiss here??

So what is the issue with our analysis in *Section 2*? Well, it turns out the structure of our problem is in truth different from what we thought. *Figure 2* below shows the true causal structure of our problem:

In the Causal DAG in *Figure 2*, variables ***U1*** and ***U2*** are unmeasured (we do not have access to them in our dataset). And importantly, it’s evident that variable ***L*** is not a “confounder” of the ***A***–***Y*** relationship even though ***L*** meets all three criteria specified in the previous section. It turns out the three criteria listed in *Section 2* are not sufficient to guarantee a variable is a “confounder”.

In *Figure 2* variable ***L*** is a “collider” between unmeasured variables ***U1*** and ***U2***, and there is no open back-door associational path from ***A*** to ***Y*** in the marginal Causal DAG. Rather, absent a causal effect of ***A*** on ***Y***, ***A*** and ***Y*** are marginally d-separated. The causal structure presented in *Figure 2* is referred to as **M-Bias** in the academic literature.
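The d-separation claim can be checked mechanically. Below is a minimal sketch in plain Python, assuming the M-structure has edges ***U1*** -> ***L***, ***U2*** -> ***L***, ***U2*** -> ***A***, and ***U1*** -> ***Y***, with no edge from ***A*** to ***Y*** (the null case). The helper names `descendants`, `simple_paths`, `is_open`, and `open_paths` are hypothetical and written only for this illustration.

```python
# Assumed edges of the M-structure, written as (parent, child); no A -> Y edge
edges = [("U1", "L"), ("U2", "L"), ("U2", "A"), ("U1", "Y")]

def descendants(node, edges):
    """All nodes reachable from `node` by following arrows forward."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent, child in edges:
            if parent == current and child not in found:
                found.add(child)
                stack.append(child)
    return found

def simple_paths(start, end, edges):
    """Yield every path from start to end, ignoring arrow direction."""
    neighbours = {}
    for parent, child in edges:
        neighbours.setdefault(parent, set()).add(child)
        neighbours.setdefault(child, set()).add(parent)

    def walk(path):
        if path[-1] == end:
            yield path
            return
        for nxt in sorted(neighbours.get(path[-1], ())):
            if nxt not in path:
                yield from walk(path + [nxt])

    yield from walk([start])

def is_open(path, edges, given):
    """Apply the d-separation rules to each interior node of one path."""
    edge_set = set(edges)
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        is_collider = (prev, mid) in edge_set and (nxt, mid) in edge_set
        if is_collider:
            # A collider blocks unless it, or one of its descendants, is conditioned on
            if mid not in given and not (descendants(mid, edges) & given):
                return False
        elif mid in given:
            # A non-collider blocks when it is conditioned on
            return False
    return True

def open_paths(x, y, edges, given=frozenset()):
    given = set(given)
    return [p for p in simple_paths(x, y, edges) if is_open(p, edges, given)]

# Marginally (empty conditioning set) no path between A and Y is open
print(open_paths("A", "Y", edges))
```

The `given` argument lets the same check be rerun with any conditioning set, so the effect of adjusting for a particular variable can be inspected directly.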

And most importantly, what would happen if we were to condition on ***L***? Again, given the Causal DAG in *Figure 2*, ***L*** is a collider between unmeasured variables ***U1*** and ***U2***. If we were to condition on ***L***, we would be *opening* an associational backdoor path from ***A*** to ***Y*** via ***A*** <- ***U2*** -> [***L***] <- ***U1*** -> ***Y***. Implicitly this is a form of selection bias. Therefore, not only is there not confounding by ***L***, but conditioning on ***L*** would cause bias (i.e. selection bias) in our analysis, not prevent it.

## Aside on origins of the name “M-Bias”:

When presenting a Causal DAG visually, I like to rank order my variables in the graph from left to right in the order said variables occur in time. This is simply a preference of mine; I find Causal DAGs drawn in this manner easier to analyze and understand. However, because in a Causal DAG the direction of causality between any two variables is already mathematically encoded in the direction of the edge (i.e. arrow) between them, rank-ordering the variables visually in their causal-order is not required.

In the academic literature, it is far more common to see the Causal DAG shown in *Figure 2* visually presented as the figure to the bottom right:

The two Causal DAGs shown above are one and the same. Given the clear “M structure” of the figure on the right above, the origins of the name M-Bias are hopefully clear.

Let’s recap where we are. In *Section 2*, we performed a small empirical “exploration” of our dataset in an attempt to determine the causal structure of our problem. We determined variable ***L*** was a “confounder” and needed to be “adjusted” for. However, in *Section 3* we revealed that our understanding of the causal structure from *Section 2* was incorrect. Variable ***L*** is in fact not a confounder, and given the true Causal DAG in *Figure 2*, “adjusting” for ***L*** would *cause* bias in our analysis (i.e. selection bias), not prevent it.
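To make the recap concrete, here is a compact simulation sketch of the M-structure with a true null effect of ***A*** on ***Y***. The structural equations, coefficients, sample size, and seed are arbitrary assumptions chosen only for illustration, and `mean_diff` is a hypothetical helper. The adjusted estimate standardizes the stratum-specific mean differences over the distribution of ***L***.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000

# Unmeasured causes (assumed standard normal for illustration)
U1 = rng.normal(size=n)
U2 = rng.normal(size=n)

L = (U1 + U2 + rng.normal(size=n) > 0).astype(int)  # collider: U1 -> L <- U2
A = (U2 + rng.normal(size=n) > 0).astype(int)       # intervention caused by U2 only
Y = U1 + rng.normal(size=n)                         # outcome caused by U1 only; A has no effect

def mean_diff(y, a):
    """Difference in mean outcome between a == 1 and a == 0."""
    return y[a == 1].mean() - y[a == 0].mean()

# L satisfies the two empirically checkable "classical" criteria
assoc_L_A = A[L == 1].mean() - A[L == 0].mean()
assoc_L_Y_given_A = {
    a: Y[(A == a) & (L == 1)].mean() - Y[(A == a) & (L == 0)].mean()
    for a in (0, 1)
}

# Marginal (unadjusted) estimate of the effect of A on Y
marginal = mean_diff(Y, A)

# L-adjusted estimate: stratum-specific differences weighted by P(L = l)
adjusted = sum(
    mean_diff(Y[L == l], A[L == l]) * np.mean(L == l) for l in (0, 1)
)

print("L-A association:", round(assoc_L_A, 3))
print("L-Y association within levels of A:", assoc_L_Y_given_A)
print("Marginal estimate:", round(marginal, 3))
print("L-adjusted estimate:", round(adjusted, 3))
```

Under these assumptions the marginal estimate should sit close to the true value of zero, while the adjusted estimate is pulled away from zero: within a stratum of ***L***, ***U1*** and ***U2*** become negatively associated, which induces a spurious negative association between ***A*** and ***Y***.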

So what is the takeaway message here? The lesson is that empirical analysis of our dataset alone cannot definitively reveal the “true” causal structure of a problem. Rather, our understanding should be supplemented with subject-matter domain knowledge. The small “exploration” analysis we conducted in *Section 2* is a mini-example of a process called “Causal Discovery”, a topic I plan on writing a detailed piece about in the future. Causal Discovery is hard. Very hard, in fact. It relies on quite a number of assumptions that are often empirically untestable. M-Bias is but one example that throws a wrench into most Causal Discovery techniques.

In short, in the interest of recovering an unbiased estimate of the causal effect of an Intervention on an Outcome in a non-randomized setting, when determining if a variable should be “adjusted for” it is important to lean on domain and subject-matter expertise to inform the specification of the causal structure of a problem, and not rely on empirical analysis of the sample data alone.

We’re going to conduct a computational simulation in Python to investigate the M-Bias causal structure discussed in this piece. We will:

1. Create a simulated dataset with the true Causal DAG as shown below in *Figure 2*, with the true Causal Effect Difference of ***A*** on ***Y*** being null (i.e. equal to zero).
2. Show how variable ***L*** fulfills all three of the “classical confounder criteria” discussed in this article, even though it is not a confounder of the ***A***–***Y*** causal relationship.
3. Show how the marginal empirical estimate of ***A*** on ***Y*** recovers an unbiased estimate of the causal effect of ***A*** on ***Y***, and how conditioning on ***L*** causes bias instead of preventing it.
4. Repeat steps 1 through 3, but with a simulated dataset where the true Causal Effect Difference of ***A*** on ***Y*** is 1.62.

Let’s import our needed libraries: