Point Biserial Correlation \((r_{pb})\)

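Here is where we start mixing data types. The point biserial correlation measures the strength and direction of the relationship between a dichotomous variable and a continuous variable. It is mathematically the same as a Pearson correlation computed with the dichotomous variable coded as 0 and 1, so it assumes a linear relationship between the two variables.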

As with most other correlations, the range of the point biserial correlation is [-1, 1]. Some instances where you might want to use point biserial correlation could be:

  • Strength of the relationship between task success (0 = failed, 1 = completed) and System Usability Scale (SUS) score.

  • Mobile OS (0 = iOS, 1 = Android) and error count during task completion.

Assumptions

  1. One variable is dichotomous (two categories) and the other is continuous.

  2. There are no outliers in the continuous variable within either category of the dichotomous variable.

  3. The continuous variable is approximately normally distributed within each category of the dichotomous variable.

  4. The continuous variable has equal variance in both categories of the dichotomous variable.

The formula for the point biserial correlation is:

\(r_{pb} = \frac{\mu_{1} - \mu_{0}}{s}\sqrt{\frac{n_{1}n_{0}}{n^{2}}}\)

  • \(\mu_{0}, \mu_{1}\) are the means of the continuous variable within category 0 and category 1 of the dichotomous variable.

  • \(s\) is the standard deviation of the entire continuous variable (not broken out by category). For this form of the formula, \(s\) is computed with \(n\) in the denominator rather than \(n - 1\).

  • \(n_{0}, n_{1}\) are the sample sizes for each category.

  • \(n = n_{0} + n_{1}\) is the size of the full sample.
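To make the formula concrete, here is a minimal sketch that computes \(r_{pb}\) term by term and compares it with SciPy. The data and the names group and score are made up for illustration and are not part of the example data set used later.

```python
import numpy as np
from scipy import stats

# Hypothetical data, for illustration only
rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=200)                 # dichotomous variable (0 or 1)
score = rng.normal(50, 10, size=200) + 5 * group     # continuous variable

# Pieces of the formula
m0 = score[group == 0].mean()        # mean of the continuous variable in category 0
m1 = score[group == 1].mean()        # mean of the continuous variable in category 1
n0 = int((group == 0).sum())
n1 = int((group == 1).sum())
n = n0 + n1
s = score.std(ddof=0)                # standard deviation of all values, dividing by n

r_manual = (m1 - m0) / s * np.sqrt(n1 * n0 / n**2)

# Library results for comparison (both return the coefficient first, then the p-value)
r_scipy, _ = stats.pointbiserialr(group, score)
r_pearson, _ = stats.pearsonr(group, score)

print(r_manual, r_scipy, r_pearson)
```

All three printed values should agree up to floating-point error, since the point biserial correlation is a Pearson correlation with one variable coded 0/1.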

How to Interpret (assumes absolute values of \(r_{pb}\))

Correlation Coefficient    Interpretation
0.00 – 0.10                Negligible or trivial
0.10 – 0.30                Weak
0.30 – 0.50                Moderate
0.50 – 1.00                Strong
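If you want to apply these bands in code, a small helper like the one below could work. The function name interpret_r_pb is hypothetical, and because the bands share their endpoints, I am assuming that a boundary value (for example exactly 0.10) belongs to the higher band.

```python
def interpret_r_pb(r):
    """Return the verbal label for a correlation, using the bands listed above."""
    size = abs(r)  # the bands apply to the absolute value
    if size > 1:
        raise ValueError("a correlation must lie in [-1, 1]")
    if size < 0.10:
        return "Negligible or trivial"
    if size < 0.30:
        return "Weak"
    if size < 0.50:
        return "Moderate"
    return "Strong"


for r in (-0.07, 0.25, -0.42, 0.80):
    print(r, interpret_r_pb(r))
```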

Python Code Example

For this example, I will generate a data set to use. The data set will have two variables:

  • state: 0 for off, 1 for on

  • msec: time in milliseconds

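Let’s imagine that msec is a person's reaction time on a task, measured under two conditions: a notification light either turns on to signal that the person needs to react (state = 1), or stays off even though the person needs to react (state = 0). Because the values are simulated from a normal distribution centered on zero, some of them are negative, so treat them as illustrative rather than as realistic reaction times.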

# Import
import numpy as np
import pandas as pd

# Correlation coefficient
import scipy.stats as stats

# Visualizations
import matplotlib.pyplot as plt
import seaborn as sns

# Set a seed
np.random.seed(123)

# Randomly generate a dataset
df = pd.DataFrame(data={'state': list(np.random.randint(0,2, 100, dtype=int)),
                        'msec':list(np.random.normal(0,0.1,100))})

# Explore each category
df.groupby('state').describe()

        msec
       count      mean       std       min       25%       50%       75%       max
state
0       55.0  0.018073  0.107047 -0.197789 -0.074583  0.013021  0.099574  0.259830
1       45.0  0.003063  0.106846 -0.212310 -0.068887  0.018104  0.080724  0.223814

# Test normality for each state - Shapiro-Wilk Test
print(stats.shapiro(df[df['state'] == 0]['msec']))
print(stats.shapiro(df[df['state'] == 1]['msec']))

ShapiroResult(statistic=np.float64(0.9724618426989159), pvalue=np.float64(0.23691688392959231))
ShapiroResult(statistic=np.float64(0.9699718490795727), pvalue=np.float64(0.28890857959306887))

Both p-values are greater than the Bonferroni-adjusted threshold of 0.05/2 = 0.025, so we cannot reject the null hypothesis that the values in each group came from a normal distribution.
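The no-outliers assumption can be checked per group as well. Here is a minimal sketch that rebuilds the same simulated data and flags values outside the 1.5 × IQR fences. The 1.5 × IQR rule is just one common convention that I am assuming here, and the helper name iqr_outliers is hypothetical.

```python
import numpy as np
import pandas as pd

# Rebuild the simulated data set with the same seed and the same call order
np.random.seed(123)
df = pd.DataFrame(data={'state': np.random.randint(0, 2, 100, dtype=int),
                        'msec': np.random.normal(0, 0.1, 100)})


def iqr_outliers(values, k=1.5):
    """Return the entries of a Series that lie outside the k * IQR fences."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return values[(values < q1 - k * iqr) | (values > q3 + k * iqr)]


# Apply the check separately to each category of the dichotomous variable
for state, group in df.groupby('state')['msec']:
    flagged = iqr_outliers(group)
    print(f"state={state}: {len(flagged)} value(s) outside the 1.5*IQR fences")
```

Working from the quartiles in the describe() output, the fences come out to roughly (-0.34, 0.36) for state 0 and (-0.29, 0.31) for state 1, and each group's minimum and maximum fall inside them.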

Now, to check equality of variance.

# Test of homoskedasticity
print(stats.levene(df[df['state'] == 0]['msec'], df[df['state'] == 1]['msec']))

LeveneResult(statistic=np.float64(0.062338174744849005), pvalue=np.float64(0.8033604123049517))

The large p-value gives no evidence that the group variances differ, so the equal-variance assumption looks reasonable. We may proceed with the point biserial correlation.

# Isolate the variables
x = df.state
y = df.msec

# Print the coefficient and its p-value
print(stats.pointbiserialr(x,y))

SignificanceResult(statistic=np.float64(-0.07034906630698473), pvalue=np.float64(0.48673672603530155))

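Looking at the results above, notice the p-value of 0.4867. This is far higher than a traditional alpha of 0.05, so we cannot reject the null hypothesis of no correlation. The coefficient itself (about -0.07) falls in the negligible range, so these data give no evidence of a relationship between state and milliseconds. Failing to reject the null does not prove that no relationship exists.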
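It may help to see why the normality and equal-variance checks matter: the significance test for \(r_{pb}\) is equivalent to an independent-samples t-test with pooled variance. The sketch below illustrates this on made-up data (the values and the seed are assumptions for illustration, not the data set above).

```python
import numpy as np
from scipy import stats

# Hypothetical data, for illustration only
rng = np.random.default_rng(1)
state = rng.integers(0, 2, size=100)       # 0 = off, 1 = on
msec = rng.normal(300, 40, size=100)       # made-up reaction times

# Point biserial correlation and its p-value
r, p_r = stats.pointbiserialr(state, msec)

# Equal-variance (pooled) t-test comparing the two groups
t, p_t = stats.ttest_ind(msec[state == 1], msec[state == 0], equal_var=True)

# The t statistic can be recovered from r with n - 2 degrees of freedom
n = len(msec)
t_from_r = r * np.sqrt((n - 2) / (1 - r**2))

print(f"r = {r:.4f}, p = {p_r:.4f}")
print(f"t = {t:.4f}, p = {p_t:.4f}, t from r = {t_from_r:.4f}")
```

The two p-values should match, and t_from_r should equal t, so a non-significant \(r_{pb}\) says the same thing as a non-significant difference in group means.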