\(\phi\) Correlation
The \(\phi\) correlation measures the relationship between two binary variables. It does not assume a latent continuous distribution underlying either variable.
At this point you might ask why you would use this when something like a Chi-Square test is available. The two are related, but they answer different questions.
Chi-Square: Is there a statistically significant association between two categorical variables? It is geared toward hypothesis testing.
\(\phi\) Correlation: How strong is the association between two binary variables? It measures effect size (magnitude).
The \(\phi\) correlation has a closed-form solution:
\(\phi = \frac{ad - bc}{\sqrt{(a + b)(c + d)(a + c)(b + d)}}\)
Where the values a, b, c, and d correspond to cells in a contingency table as seen below:
| | Y = 0 | Y = 1 |
|---|---|---|
| X = 0 | a | b |
| X = 1 | c | d |
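As a quick illustration of the formula, here is a minimal sketch in plain Python. The function name `phi_from_counts` and the cell counts are hypothetical, chosen only to keep the arithmetic easy to follow. Returning NaN when a row or column total is zero is an assumption of this sketch, since the formula is undefined in that case.

```python
import math

def phi_from_counts(a, b, c, d):
    """Phi coefficient for a 2x2 table laid out as [[a, b], [c, d]]."""
    # Product of the two row totals and the two column totals
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # Undefined when any margin is empty (assumption: report NaN)
        return float("nan")
    return (a * d - b * c) / math.sqrt(denom)

# Hypothetical counts: most observations fall on the main diagonal
print(phi_from_counts(30, 10, 10, 30))  # 0.5
```

Here ad - bc = 900 - 100 = 800 and every margin equals 40, so the denominator is 40 * 40 = 1600 and \(\phi\) = 0.5.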
Interpretation
\(\phi\) correlations are interpreted similarly to most other correlations covered so far.
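Values fall within the range [-1, 1].
A value of 0 indicates no correlation.
The sign indicates direction: a positive value means the two variables tend to take the same value, and a negative value means they tend to differ.
The table below gives interpretation values for the magnitude (absolute value) of \(\phi\):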
| Correlation Coefficient | Interpretation |
|---|---|
| 0.00 – 0.10 | Negligible or trivial |
| 0.10 – 0.30 | Weak |
| 0.30 – 0.50 | Moderate |
| 0.50 – 1.00 | Strong |
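The table can be turned into a small lookup helper. This is only a sketch: the name `interpret_phi` is hypothetical, and because the table's ranges share their endpoints, the code assumes each boundary value (0.10, 0.30, 0.50) belongs to the stronger category.

```python
def interpret_phi(phi):
    """Map a phi coefficient to the verbal label from the table above."""
    magnitude = abs(phi)  # the sign only gives the direction
    if magnitude < 0.10:
        return "Negligible or trivial"
    if magnitude < 0.30:
        return "Weak"
    if magnitude < 0.50:
        return "Moderate"
    return "Strong"

# A few illustrative values, positive and negative
for value in (-0.07, 0.25, -0.42, 0.80):
    print(value, interpret_phi(value))
```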
Assumptions
Both variables are binary.
There is no underlying latent continuous variable for either binary variable.
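One consequence of both variables being binary is that, when they are coded 0/1, \(\phi\) equals the Pearson correlation computed on those codes. The sketch below checks this numerically. The cell counts are hypothetical and used for illustration only.

```python
import numpy as np

# Hypothetical 2x2 cell counts (a, b, c, d), for illustration only
a, b, c, d = 30, 10, 10, 30

# Expand the counts into one 0/1 observation per row of raw data
x = np.repeat([0, 0, 1, 1], [a, b, c, d])
y = np.repeat([0, 1, 0, 1], [a, b, c, d])

# Phi from the closed-form expression, Pearson r from the raw 0/1 codes
phi = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))
pearson_r = np.corrcoef(x, y)[0, 1]

print(np.isclose(phi, pearson_r))  # True
```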
Python Example
\(\phi\) correlation is easy to calculate in Python. To show how, let's first generate some data for the following scenario:
You want to look at the strength of the association between using a grocery delivery service and subscribing to a streaming service. For both variables, 0 means not enrolled and 1 means enrolled.
# Import
import numpy as np
import pandas as pd
# Visualizations
import matplotlib.pyplot as plt
import seaborn as sns
# Set a seed
np.random.seed(123)
# Randomly generate a dataset
df = pd.DataFrame(data={'delivery': np.random.randint(0, 2, 100, dtype=int),
                        'streaming': np.random.randint(0, 2, 100, dtype=int)})
# Generate crosstabs
ct = pd.crosstab(df['delivery'], df['streaming'])
# Show
display(ct)
| streaming | 0 | 1 |
|---|---|---|
| delivery | ||
| 0 | 24 | 31 |
| 1 | 23 | 22 |
# Calculate phi correlation
a = ct.iloc[0,0]
b = ct.iloc[0,1]
c = ct.iloc[1,0]
d = ct.iloc[1,1]
phi = (a*d - b*c)/np.sqrt((a+b)*(c+d)*(a+c)*(b+d))
# Show
print(f'Phi correlation: {phi}')
Phi correlation: -0.07450703190575633
From the above, you can see that the correlation between grocery delivery service subscribers and streaming service subscribers is negligible: its magnitude of about 0.07 falls below the 0.10 threshold, and the sign is slightly negative. This is expected, because both variables were generated at random. If you have also run a Chi-Square test for this scenario, its p-value, compared against your chosen alpha level, tells you whether the relationship is statistically significant.
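To connect the two ideas, recall that for a 2x2 table the Pearson Chi-Square statistic (without a continuity correction) equals \(n\phi^2\), where n is the total count. The sketch below uses the counts from the crosstab above and plain Python. The 0.05 alpha level is an assumption made for illustration, and the p-value uses the fact that a Chi-Square variable with one degree of freedom has upper-tail probability erfc(sqrt(x / 2)).

```python
import math

# Cell counts from the crosstab above
a, b, c, d = 24, 31, 23, 22
n = a + b + c + d

phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Pearson chi-square statistic for a 2x2 table, no continuity correction
chi2 = n * phi ** 2

# Upper-tail probability of a chi-square variable with 1 degree of freedom
p_value = math.erfc(math.sqrt(chi2 / 2))

alpha = 0.05  # assumed significance level
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}, significant: {p_value < alpha}")
```

If SciPy is available, `scipy.stats.chi2_contingency(ct, correction=False)` should report the same statistic and p-value. By default SciPy applies Yates' continuity correction to 2x2 tables, which gives slightly different numbers.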