Polychoric Correlation
Polychoric correlation quantifies the relationship between two ordinal variables that each have 7 or fewer categories, under the assumption that each ordinal measurement reflects an underlying continuous latent variable. It is most often used for Likert scale or Likert-type items.
Imagine that you have two Likert scale items whose correlation you want to estimate:
This product is easy to use.
This product is useful.
Both would be measured on the traditional scale of 1 to 5 assessing level of agreement with each of the above statements.
Interpretation
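Polychoric correlations fall within the range [-1, 1], and a value of 0 indicates no correlation. The sign gives the direction of the relationship, and the table below gives rule-of-thumb interpretations for the magnitude (absolute value) of the coefficient: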
| Correlation Coefficient | Interpretation |
|---|---|
| 0.00 – 0.10 | Negligible or trivial |
| 0.10 – 0.30 | Weak |
| 0.30 – 0.50 | Moderate |
| 0.50 – 1.00 | Strong |
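The interpretation bands can be applied programmatically. The helper below is a minimal sketch, not part of any library: the name `interpret_correlation` is hypothetical, and because the bands in the table share their boundary values, it assumes that each band includes its lower bound and excludes its upper bound.

```python
def interpret_correlation(r: float) -> str:
    """Map a correlation coefficient to its rule-of-thumb label (hypothetical helper)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("A correlation coefficient must lie in [-1, 1].")

    # The bands describe strength, so the sign is ignored
    magnitude = abs(r)

    # Assumption: each band includes its lower bound and excludes its upper bound
    if magnitude < 0.10:
        return "Negligible or trivial"
    if magnitude < 0.30:
        return "Weak"
    if magnitude < 0.50:
        return "Moderate"
    return "Strong"


print(interpret_correlation(-0.1538))
```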
Assumptions
Both variables are ordinal with 7 or fewer categories.
The latent variables underlying each ordinal measure are assumed to be continuous.
The joint distribution of the latent variables is bivariate normal.
Unlike Pearson’s \(r\) or the point-biserial correlation \(r_{pb}\), the polychoric correlation does not have a closed-form solution. It is estimated numerically via maximum likelihood estimation (MLE). Under the assumptions above, each observed category corresponds to an interval of the latent variable bounded by thresholds, so each cell of the contingency table between the two variables corresponds to a rectangle under the bivariate normal distribution. The estimate is the latent correlation that maximizes the log likelihood \(\ell(\rho) = \sum_{i,j} n_{ij} \log \pi_{ij}(\rho)\), where \(n_{ij}\) is the observed frequency in cell \((i, j)\) and \(\pi_{ij}(\rho)\) is the bivariate normal probability of the matching rectangle.
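To make the estimation procedure concrete, here is a minimal sketch of a two-step estimator using SciPy. It is an illustration under stated assumptions, not the implementation used by any package: the function name `polychoric_two_step` is hypothetical, the thresholds are fixed from the marginal proportions before the correlation is optimized, and ±8 stands in for infinite thresholds on the standard normal scale.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import multivariate_normal, norm


def polychoric_two_step(x, y):
    """Two-step polychoric estimate for two ordinal samples (illustrative sketch)."""
    x, y = np.asarray(x), np.asarray(y)
    x_levels, y_levels = np.unique(x), np.unique(y)

    # Contingency table of observed cell frequencies
    table = np.array(
        [[np.sum((x == a) & (y == b)) for b in y_levels] for a in x_levels],
        dtype=float,
    )
    n = table.sum()

    # Step 1: thresholds from the marginal cumulative proportions
    x_cuts = norm.ppf(np.cumsum(table.sum(axis=1))[:-1] / n)
    y_cuts = norm.ppf(np.cumsum(table.sum(axis=0))[:-1] / n)

    # Assumption: +/-8 approximates +/-infinity on the standard normal scale
    a = np.concatenate(([-8.0], x_cuts, [8.0]))
    b = np.concatenate(([-8.0], y_cuts, [8.0]))
    grid = np.stack(np.meshgrid(a, b, indexing="ij"), axis=-1)

    def neg_log_likelihood(rho):
        cov = [[1.0, rho], [rho, 1.0]]
        cdf = multivariate_normal.cdf(grid, mean=[0.0, 0.0], cov=cov)
        # Rectangle probabilities by inclusion-exclusion on the CDF grid
        probs = cdf[1:, 1:] - cdf[:-1, 1:] - cdf[1:, :-1] + cdf[:-1, :-1]
        return -np.sum(table * np.log(np.clip(probs, 1e-12, None)))

    # Step 2: maximize the likelihood over the correlation alone
    result = minimize_scalar(neg_log_likelihood, bounds=(-0.99, 0.99), method="bounded")
    return result.x


# Simulated check: discretize latent normals with a known correlation of 0.6
rng = np.random.default_rng(0)
latent = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=2000)
cuts = [-1.0, -0.3, 0.3, 1.0]
x = np.digitize(latent[:, 0], cuts) + 1
y = np.digitize(latent[:, 1], cuts) + 1
estimate = polychoric_two_step(x, y)
print(f"Estimated latent correlation: {estimate:.3f}")
```

Fixing the thresholds first reduces the problem to a one-dimensional search, which keeps the sketch short. A full maximum likelihood fit would estimate the thresholds and the correlation jointly.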
Python Example
Polychoric correlation is another measure that lacks a well-maintained Python package. Therefore, this tutorial bridges Python and R using rpy2.
# Install rpy2 if you need it
# !pip install rpy2
If you install packages with conda, use this line instead:
conda install conda-forge::rpy2
# Import
import numpy as np
import pandas as pd
# Visualizations
import matplotlib.pyplot as plt
import seaborn as sns
# Necessary imports
import rpy2
import rpy2.robjects as ro
from rpy2.robjects.vectors import IntVector
from rpy2.robjects.packages import importr, isinstalled
from rpy2.robjects import pandas2ri
# Import the needed R libraries
utils = importr('utils')
# Skip the install step if polycor is already installed
if not isinstalled('polycor'):
    utils.install_packages('polycor')
# Set the import
polycor = importr('polycor')
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: also installing the dependencies ‘mvtnorm’, ‘admisc’
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: trying URL 'https://cran.rstudio.com/src/contrib/mvtnorm_1.3-3.tar.gz'
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: trying URL 'https://cran.rstudio.com/src/contrib/admisc_0.38.tar.gz'
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: trying URL 'https://cran.rstudio.com/src/contrib/polycor_0.8-1.tar.gz'
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]:
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]:
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: The downloaded source packages are in
‘/tmp/RtmpLPaHdG/downloaded_packages’
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]:
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]:
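To start, let’s create a data set of two variables with 100 data points. Each column is drawn independently and uniformly from the integers 1 to 5, so the true correlation between them is zero.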
# Seed
np.random.seed(10)
# Create the dataframe
df = pd.DataFrame(data={'easy_to_use': list(np.random.randint(1, 6, 100)),
                        'usefulness': list(np.random.randint(1, 6, 100))})
# Set as an R dataframe using context and a converter object
with (ro.default_converter + pandas2ri.converter).context():
    r_df = ro.conversion.get_conversion().py2rpy(df)
# Check the correlation
r_corr = polycor.polychor(r_df.rx2('easy_to_use'), r_df.rx2('usefulness'))
# Show
print(f'Polychoric correlation: {r_corr[0]}')
Polychoric correlation: -0.15387111896077552
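The estimate is computed from the cell frequencies of the contingency table between the two items. As a small aside, the sketch below rebuilds the same simulated data and tabulates it with `pd.crosstab`. The variable name `table` is only illustrative.

```python
import numpy as np
import pandas as pd

# Recreate the simulated responses with the same seed and call order
np.random.seed(10)
df = pd.DataFrame(data={'easy_to_use': list(np.random.randint(1, 6, 100)),
                        'usefulness': list(np.random.randint(1, 6, 100))})

# Rows are easy_to_use categories, columns are usefulness categories
table = pd.crosstab(df['easy_to_use'], df['usefulness'])
print(table)
```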
# Coerce numbers to ordered factor (levels 1-5 assumed)
r_df[r_df.names.index("easy_to_use")] = ro.r["ordered"](r_df.rx2("easy_to_use"), levels=IntVector([1, 2, 3, 4, 5]))
r_df[r_df.names.index("usefulness")] = ro.r["ordered"](r_df.rx2("usefulness"), levels=IntVector([1, 2, 3, 4, 5]))
# Check significance
r_hetcor = polycor.hetcor(r_df, use="complete.obs")
# Print
print(r_hetcor)
Two-Step Estimates

Correlations/Type of Correlation:
            easy_to_use usefulness
easy_to_use           1 Polychoric
usefulness      -0.1538          1

Standard Errors:
[1] ""       "0.1102"

n = 100

P-values for Tests of Bivariate Normality:
[1] ""       "0.3857"
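We can see that our two variables in this example have a weak inverse relationship that is not statistically significant:
\(r\) = -0.1538 with a standard error of 0.1102. The estimate is about 1.4 standard errors from zero, which corresponds to an approximate two-sided p-value of roughly 0.16. That is greater than 0.05, meaning we cannot reject the null hypothesis of zero correlation.
p = 0.3857 is the p-value for the test of bivariate normality, not for the correlation itself. Because it is greater than 0.05, we cannot reject the assumption that the latent variables are bivariate normal.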
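The `hetcor` output above reports a standard error for the correlation, while the p-value it prints belongs to the test of bivariate normality. One rough way to judge the significance of the correlation itself is a Wald-type z-test built from the estimate and its standard error. The sketch below assumes the estimate is approximately normally distributed, which is a large-sample approximation rather than an exact test, and it copies the two numbers from the printed output.

```python
from scipy.stats import norm

# Values copied from the hetcor output above
estimate = -0.1538
std_error = 0.1102

# Wald statistic: how many standard errors the estimate lies from zero
z = estimate / std_error

# Two-sided p-value under the normal approximation
p_value = 2 * norm.sf(abs(z))

print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")
```

Under this approximation the p-value comes out at roughly 0.16, which is consistent with treating the correlation as not significantly different from zero at the 0.05 level.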