Tetrachoric Correlations


The tetrachoric correlation is a special case of the polychoric correlation. As a reminder, the polychoric correlation measures the strength of the relationship between two ordinal variables when:

  • Each variable has 7 or fewer categories

  • The latent variable underlying each ordinal measure is assumed to be continuous.

The tetrachoric correlation, by comparison, measures the relationship between two binary variables.

Like the polychoric correlation, though, it has no closed-form equation. It is estimated by maximum likelihood estimation (MLE) on the 2x2 table of counts for the two variables.
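To make the estimation step concrete, here is a minimal pure-Python sketch. It is not the polycor implementation. It assumes the usual latent-variable setup: each threshold comes from the inverse normal CDF of a marginal proportion, and the correlation is the value at which the bivariate normal probability of the (1, 1) cell matches its observed proportion. For a 2x2 table with no empty cells, that matching value should also be the maximum likelihood estimate, because the model has as many free parameters as the table has free proportions. The function names `bvn_upper` and `tetrachoric` are hypothetical, and the counts are taken from the crosstab in the Python example further down.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def bvn_upper(h, k, rho, steps=200):
    """Approximate P(X > h, Y > k) for a standard bivariate normal pair.

    Uses the identity P = P(X > h) * P(Y > k) plus the integral, from 0 to rho,
    of the bivariate normal density at (h, k). Simpson's rule does the integral.
    """
    def density(r):
        s = 1.0 - r * r
        return math.exp(-(h * h - 2.0 * h * k * r + k * k) / (2.0 * s)) / (
            2.0 * math.pi * math.sqrt(s))

    width = rho / steps  # signed, so a negative rho gives a negative integral
    total = density(0.0) + density(rho)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * density(i * width)
    independent = _STD_NORMAL.cdf(-h) * _STD_NORMAL.cdf(-k)
    return independent + total * width / 3.0


def tetrachoric(table, iterations=60):
    """Estimate the tetrachoric correlation from [[n00, n01], [n10, n11]]."""
    (n00, n01), (n10, n11) = table
    n = n00 + n01 + n10 + n11
    h = _STD_NORMAL.inv_cdf((n00 + n01) / n)  # threshold for the row variable
    k = _STD_NORMAL.inv_cdf((n00 + n10) / n)  # threshold for the column variable
    target = n11 / n  # observed share of cases with both variables equal to 1

    # The model probability grows with rho, so bisection finds the match.
    lo, hi = -0.999, 0.999
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if bvn_upper(h, k, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


rho = tetrachoric([[24, 31], [23, 22]])
print(f"Tetrachoric correlation (sketch): {rho:.4f}")
```

The sketch does no handling of empty cells or margins (the inverse CDF fails at 0 or 1), and it returns no standard error, which is one reason to rely on a tested library for real analyses.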

Interpretation

Tetrachoric correlations are interpreted similarly to polychoric correlations.

  • Within the range of [-1, 1]

  • A value of 0 indicates no correlation.

The table of interpretation values appears below:

Correlation Coefficient    Interpretation
0.00 – 0.10                Negligible or trivial
0.10 – 0.30                Weak
0.30 – 0.50                Moderate
0.50 – 1.00                Strong
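The interpretation bands can be wrapped in a small lookup helper. This is only a sketch: the function name `interpret_correlation` is hypothetical, it assumes the bands apply to the absolute value of the coefficient (so negative correlations get the same labels), and it assumes a value that sits exactly on a boundary falls into the higher band, which the table itself does not specify.

```python
def interpret_correlation(r):
    """Map a correlation in [-1, 1] to an interpretation label."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("correlation must lie in [-1, 1]")
    size = abs(r)  # assumption: the sign only indicates direction
    if size < 0.10:
        return "Negligible or trivial"
    if size < 0.30:
        return "Weak"
    if size < 0.50:
        return "Moderate"
    return "Strong"


print(interpret_correlation(-0.1172))
```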

Assumptions

  1. Both variables are binary (2 categories).

  2. The latent variable underlying each binary variable is assumed to be normally distributed.

  3. The two latent variables follow a joint bivariate normal distribution.

Python Example

Like polychoric correlations, tetrachoric correlations are calculated here from Python using R’s polycor package, after bridging between Python and R via rpy2.

# Install rpy2 if you need it
# !pip install rpy2

If you install packages with conda, use this line instead:

conda install conda-forge::rpy2

# Import
import numpy as np
import pandas as pd

# Visualizations
import matplotlib.pyplot as plt
import seaborn as sns

# Necessary imports
import rpy2
import rpy2.robjects as ro
from rpy2.robjects.vectors import IntVector
from rpy2.robjects.packages import importr, isinstalled
from rpy2.robjects import pandas2ri

# Import the needed R libraries
utils = importr('utils')

# Skip the install if polycor is already available
if not isinstalled('polycor'):
  utils.install_packages('polycor')

# Set the import
polycor = importr('polycor')

So, imagine that you’re evaluating a dashboard with 100 users. You have two binary variables:

  • Utilization: 0 = low use, 1 = high use

  • Frustration: 0 = low frustration, 1 = high frustration

Utilization reflects a latent, normally distributed motivation to use. Frustration reflects a latent, normally distributed negative emotional state.
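To illustrate the latent-variable idea, the sketch below draws two correlated standard normal scores per user and cuts each at a threshold to get the observed 0/1 values. Everything in it is an assumption for illustration: the function name `simulate_binary_pair`, the latent correlation of -0.4, and the thresholds at 0. It uses only the standard library and is separate from the dataset generated in the next step, which draws the two variables independently.

```python
import math
import random


def simulate_binary_pair(n, rho, cut_x=0.0, cut_y=0.0, seed=123):
    """Dichotomize n draws from a bivariate normal with correlation rho."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        latent_x = z1  # e.g. latent motivation to use
        latent_y = rho * z1 + math.sqrt(1.0 - rho * rho) * z2  # e.g. latent frustration
        # Only the side of the threshold is observed, never the latent score.
        rows.append((int(latent_x > cut_x), int(latent_y > cut_y)))
    return rows


pairs = simulate_binary_pair(n=100, rho=-0.4)
print(pairs[:5])
```

The tetrachoric correlation tries to recover the latent `rho` from the 0/1 pairs alone, which is why it relies on the normality assumptions listed earlier.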

# Set a seed
np.random.seed(123)

# Randomly generate a dataset
df = pd.DataFrame(data={'utilization': list(np.random.randint(0, 2, 100, dtype=int)),
                        'frustration': list(np.random.randint(0, 2, 100, dtype=int))})

# Check distribution
pd.crosstab(df.utilization, df.frustration, margins=True)

frustration   0   1  All
utilization
0            24  31   55
1            23  22   45
All          47  53  100

# Set as an R dataframe using context and a converter object
with (ro.default_converter + pandas2ri.converter).context():
  r_df = ro.conversion.get_conversion().py2rpy(df)

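To calculate the tetrachoric correlation, use polychor() from the polycor package as shown below. With two binary variables, the polychoric estimate it returns is the tetrachoric correlation.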

# Check the correlation
r_corr = polycor.polychor(r_df.rx2('utilization'), r_df.rx2('frustration'))

# Show
print(f'Polychoric correlation: {r_corr[0]}')
Polychoric correlation: -0.11716496854676184

# Coerce numbers to ordered factors (levels 0, 1 assumed)
r_df[r_df.names.index("utilization")] = ro.r["ordered"](r_df.rx2("utilization"), levels=IntVector([0,1]))
r_df[r_df.names.index("frustration")] = ro.r["ordered"](r_df.rx2("frustration"), levels=IntVector([0,1]))

# Check significance
r_hetcor = polycor.hetcor(r_df, use="complete.obs")

# Print
print(r_hetcor)
Two-Step Estimates

Correlations/Type of Correlation:
            utilization frustration
utilization           1  Polychoric
frustration     -0.1172           1

Standard Errors:
[1] ""       "0.1561"

n = 100 

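The two variables in our example have a weak inverse correlation (r = -0.1172). The 0.1561 in the output is the standard error of the estimate, not a p-value. The estimate lies less than one standard error from zero (z ≈ -0.1172 / 0.1561 ≈ -0.75, two-sided p ≈ 0.45 under a normal approximation), so the correlation is not statistically significant. That is expected here, since the two variables were generated independently at random.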
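The printed hetcor output reports a standard error rather than a p-value, so a rough significance check can be derived from the estimate and its standard error. The sketch below assumes a Wald-type z-test with a normal approximation, which is a common choice but not something the hetcor printout itself reports. The function name `wald_test` is hypothetical.

```python
from statistics import NormalDist


def wald_test(estimate, std_error):
    """Return the z statistic and two-sided p-value for H0: correlation = 0."""
    z = estimate / std_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


# Estimate and standard error copied from the hetcor output above
z, p = wald_test(-0.1172, 0.1561)
print(f"z = {z:.2f}, two-sided p = {p:.2f}")
```

With only 100 observations and an estimate this close to zero, the normal approximation is crude, so treat the p-value as a guide rather than an exact figure.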