Rank Biserial Correlation \(r_{rb}\)

The rank biserial correlation is a nonparametric correlation that describes the relationship between a dichotomous variable and an ordinal variable. If you have some passing familiarity with nonparametric statistical tests, you may recognize it as the effect size commonly reported alongside the Mann-Whitney U test. It expresses the difference between the proportion of favorable and the proportion of unfavorable rank comparisons between the two groups of the dichotomous variable.

An example use case is comparing the frustration ratings for opening a new packaging design between right-hand dominant and left-hand dominant people.

  • dichotomous variable: Dominant hand

  • ordinal variable: Frustration
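The idea of favorable and unfavorable rank comparisons can be made concrete by comparing every observation in one group with every observation in the other. The following is a minimal sketch under stated assumptions: the function name `rank_biserial_from_pairs` and the small rating lists are hypothetical and only serve as an illustration. Tied pairs count toward the total number of pairs but are neither favorable nor unfavorable.

```python
import numpy as np


def rank_biserial_from_pairs(group_a, group_b):
    """Proportion of favorable pairs minus proportion of unfavorable pairs.

    A pair is 'favorable' when the value from group_a is larger than the
    value from group_b, and 'unfavorable' when it is smaller.
    """
    a = np.asarray(group_a)[:, None]  # column vector
    b = np.asarray(group_b)[None, :]  # row vector
    favorable = np.sum(a > b)         # pairs where group_a ranks higher
    unfavorable = np.sum(a < b)       # pairs where group_b ranks higher
    total_pairs = a.size * b.size     # ties are included in the total
    return (favorable - unfavorable) / total_pairs


# Hypothetical frustration ratings (1 = not at all, 5 = very frustrated)
left = [4, 5, 3, 5]
right = [2, 3, 1, 4]

# 13 favorable pairs, 1 unfavorable pair, 16 pairs in total
print(rank_biserial_from_pairs(left, right))  # 0.75
```

A value of 1 would mean every comparison favors the first group, and -1 would mean every comparison favors the second.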

The general form of this correlation is:

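\(r_{rb} = \frac{2(Y_{1} - Y_{0})}{n}\)

  • \(Y_{1}, Y_{0}\) are the mean ranks of the ordinal variable within each group of the dichotomous variable, where ranks are assigned across all observations combined.

  • \(n\) is the total number of observations.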
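The formula can be evaluated directly from the mean ranks. The sketch below is illustrative only: the function name `rank_biserial_from_mean_ranks` and the sample values are hypothetical. It ranks all observations together with `scipy.stats.rankdata` (which assigns average ranks to ties by default) and then applies the formula.

```python
import numpy as np
from scipy.stats import rankdata


def rank_biserial_from_mean_ranks(group_1, group_0):
    """Compute r_rb = 2 * (Y1 - Y0) / n from the mean ranks of two groups."""
    g1 = np.asarray(group_1, dtype=float)
    g0 = np.asarray(group_0, dtype=float)
    n = g1.size + g0.size

    # Rank all observations together; ties receive their average rank
    ranks = rankdata(np.concatenate([g1, g0]))

    y1 = ranks[:g1.size].mean()  # mean rank of group 1
    y0 = ranks[g1.size:].mean()  # mean rank of group 0
    return 2 * (y1 - y0) / n


# Hypothetical ordinal ratings for the two groups
group_1 = [4, 5, 3, 5]
group_0 = [2, 3, 1, 4]

# Mean ranks are 6.0 and 3.0, so r_rb = 2 * (6 - 3) / 8
print(rank_biserial_from_mean_ranks(group_1, group_0))  # 0.75
```

Swapping the two groups flips the sign of the result, so it helps to state which group is treated as group 1 when reporting the value.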

Interpreting \(r_{rb}\)

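As with many other correlation values, \(r_{rb}\) falls within the range [-1, 1], where 0 indicates no correlation. The sign shows which group tends to have the higher ranks, and the magnitude (absolute value) can be interpreted as follows: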

Correlation Coefficient    Interpretation
0.00 – 0.10                Negligible or trivial
0.10 – 0.30                Weak
0.30 – 0.50                Moderate
0.50 – 1.00                Strong
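These thresholds can be turned into a small lookup helper. This is only a sketch: the function name `interpret_rank_biserial` is hypothetical, and because the ranges above share their endpoints, assigning a boundary value such as 0.10 to the higher category is an assumption made here, not something the table specifies.

```python
def interpret_rank_biserial(r_rb):
    """Map a rank biserial correlation to a verbal label.

    Assumption: boundary values (0.10, 0.30, 0.50) fall into the
    higher category, since the table's ranges overlap at the endpoints.
    """
    if not -1.0 <= r_rb <= 1.0:
        raise ValueError("r_rb must lie within [-1, 1]")

    magnitude = abs(r_rb)  # the sign only indicates direction
    if magnitude < 0.10:
        return "Negligible or trivial"
    if magnitude < 0.30:
        return "Weak"
    if magnitude < 0.50:
        return "Moderate"
    return "Strong"


print(interpret_rank_biserial(-0.42))  # Moderate
```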

Assumptions

  1. One variable is dichotomous, and the other is ordinal.

Python Example

To illustrate the use of the rank biserial correlation, I will generate 100 observations randomly.

# Import
import numpy as np
import pandas as pd

# Correlation coefficient
import scipy.stats as stats

# Visualizations
import matplotlib.pyplot as plt
import seaborn as sns
# Seed
np.random.seed(10)

# Create the dataframe
df = pd.DataFrame(data={'handedness': list(np.random.randint(0, 2, 100)),
                        'frustration': list(np.random.randint(1, 6, 100))})

In the above dataset assume the following:

  • Handedness: 0 is left hand dominant, 1 is right hand dominant.

  • Frustration: 1 is not at all frustrated, 5 is very frustrated.

# Turn handedness into a categorical variable
df['handedness'] = df['handedness'].astype('category')

# Look at the info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   handedness   100 non-null    category
 1   frustration  100 non-null    int64   
dtypes: category(1), int64(1)
memory usage: 1.1 KB
# Describe via crosstabs
pd.crosstab(df['handedness'], df['frustration'], margins=True)
frustration   1   2   3   4   5  All
handedness
0            12  16   8  10   9   55
1            10  12   8   9   6   45
All          22  28  16  19  15  100
# Visualize
sns.countplot(y='frustration', hue='handedness', data=df)
# Bar annotations
plt.bar_label(plt.gca().containers[0])
plt.bar_label(plt.gca().containers[1])

# Title
plt.title('Frustration Ratings per Dominant Hand')

# Legend
plt.legend(loc='lower right', labels=['Left', 'Right'], title='Hand Dominance')

# Display
plt.tight_layout()
plt.show()
[Figure: count plot titled 'Frustration Ratings per Dominant Hand', with frustration rating on the y-axis and annotated count bars split by hand dominance (Left, Right)]

In Python, the easiest way to get the rank biserial correlation is to run a Mann-Whitney U test and calculate the correlation manually from the resulting U statistic.

# subset left and right
left_hand = df[df['handedness'] == 0]['frustration']
right_hand = df[df['handedness'] == 1]['frustration']

# Set up the test
u, p = stats.mannwhitneyu(left_hand, right_hand, alternative='two-sided')
# Get sizes of each group
n_left = len(left_hand)
n_right = len(right_hand)

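# Calculate the r_rb from U and the number of possible left/right pairs.
# In recent SciPy versions, u is the U statistic of the first sample (left_hand),
# so a positive r_rb means left-hand ratings tend to be higher.
r_rb = 2 * u / (n_left * n_right) - 1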

# Show
print(f'Rank biserial correlation = {r_rb} | p-value: {p}')
Rank biserial correlation = 0.007676767676767726 | p-value: 0.9490732935909326

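As can be seen above, the correlation is practically non-existent. To see if it is significant, you can use the p-value generated by the Mann-Whitney U test. Unsurprisingly, with a p-value of about 0.95, this negligible correlation is not significant, which is what you would expect from two variables that were generated independently at random.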
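If the calculation is needed more than once, the steps above could be wrapped in a small helper. The following is a minimal sketch and not part of the original analysis: the function name `rank_biserial` and the sample ratings are hypothetical. It assumes SciPy 1.7 or later, where the statistic returned by `mannwhitneyu` corresponds to the first sample.

```python
from scipy import stats


def rank_biserial(x, y):
    """Return (r_rb, p_value) for two independent samples.

    Assumption: SciPy >= 1.7, where the returned U statistic belongs to
    the first sample, so a positive r_rb means x tends to rank higher.
    """
    result = stats.mannwhitneyu(x, y, alternative='two-sided')
    n_pairs = len(x) * len(y)                 # number of possible x/y pairs
    r_rb = 2 * result.statistic / n_pairs - 1
    return r_rb, result.pvalue


# Hypothetical frustration ratings for two groups
group_a = [4, 5, 3, 5]
group_b = [2, 3, 1, 4]

r_rb, p = rank_biserial(group_a, group_b)
print(f'Rank biserial correlation = {r_rb:.3f} | p-value: {p:.3f}')
```

Returning both values together keeps the effect size next to its significance test, which mirrors how the two are reported above.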