Language of Data

Language is fundamental to every aspect of data science and analytics. Therefore, it’s worth reviewing some basic vocabulary that can be broken out into two groups:

  1. Language of Data Features: refers to aspects of the data itself.

  2. Language of Methods: refers to ways of deriving information from data.

Don’t look at these items and worry, though. I guarantee that as a UX professional you’re well acquainted with the terms I’ll be mentioning. It might be, however, that you never put them all together to see how they all relate.

Language of Data Features

Often when talking about variables, analysts and data scientists mention ‘features.’ In that scenario they’re referring to variables, because variables represent the features we are interested in learning more about. However, in this brief lesson I’ll use ‘feature’ to refer to aspects of data, of which there are three I consider very important:

  1. Structure

  2. Content

  3. Level

Structure

Structure is a simple concept: Your data either has it, or it doesn’t.

Structured data is data that is organized according to some predefined model. Database tables and CSV files are two examples. Because structured data is well defined within its storage space, it is also easy to search.

Unstructured data, by comparison, lacks a predefined model. It may have its own internal structure, but it is often stored in its native format and may require additional processing before it is usable in analytics projects.
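To make the distinction concrete, here is a minimal Python sketch using only the standard library. The usability-test rows and the interview note are made up for illustration, and the column names are my own assumptions. The structured rows can be searched by column name right away, while the free-text note needs extra processing (here, a crude word count) before it yields anything you can analyze.

```python
import csv
import io
import re
from collections import Counter

# Structured: a predefined model (named columns) makes the data easy to search.
# The column names and values are hypothetical.
structured_csv = """participant,task,seconds
p1,checkout,42
p2,checkout,57
p3,search,18
"""
rows = list(csv.DictReader(io.StringIO(structured_csv)))
checkout_times = [int(r["seconds"]) for r in rows if r["task"] == "checkout"]

# Unstructured: free text has no predefined model, so it must be processed
# (split into lowercase words here) before it can feed an analysis.
interview_note = (
    "The checkout felt slow. I could not find the search box, "
    "and checkout failed once."
)
words = re.findall(r"[a-z]+", interview_note.lower())
word_counts = Counter(words)

print(checkout_times)
print(word_counts["checkout"])
```

The word count is only a stand-in for the "additional processing" step. Real projects would likely use something richer, such as coding themes or a text-analysis library.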

Content

Content refers, in a very broad sense, to what the data contains: Is your data composed of numbers? Or is your data composed of non-numeric content like text or speech?

Data composed of numbers is said to be quantitative. Data not composed of numbers is qualitative.

Level

Level relates to content in that it provides a more detailed description of data content. There are four levels of data, often called levels of measurement:

  • Nominal: categorical data without hierarchy

  • Ordinal: categorical data with hierarchy

  • Interval: continuous (numeric) data with equal intervals and no true zero point

  • Ratio: continuous (numeric) data with equal intervals and a true zero point

Let me provide examples of each.

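| Level | Example |
| --- | --- |
| Nominal | Type of animal: Cat, Dog, Elephant, Ferret |
| Ordinal | Frequency of use: Never, Rarely, Monthly, Weekly, Daily |
| Interval | Hours on a clock (1 through 12): each hour is an equal step, but there is no true zero |
| Ratio | Income: $0 means no income, and each dollar increase is an increase of the same magnitude |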
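One practical reason the level matters is that it limits which summaries make sense. Here is a small Python sketch of that idea using only the standard library. All of the values are invented for illustration, and the pairing of each level with a "typical" summary is a common rule of thumb rather than a hard rule.

```python
from statistics import mode

# Nominal: categories without order, so counting (the mode) is what makes sense.
animals = ["Cat", "Dog", "Cat", "Ferret"]
most_common_animal = mode(animals)

# Ordinal: categories with order, so we can rank responses and take a median,
# but the gaps between categories are not guaranteed to be equal.
frequency_scale = ["Never", "Rarely", "Monthly", "Weekly", "Daily"]
responses = ["Weekly", "Never", "Daily", "Monthly", "Weekly"]
ranks = sorted(frequency_scale.index(r) for r in responses)
median_response = frequency_scale[ranks[len(ranks) // 2]]  # odd count: middle rank

# Interval: equal steps, so differences are meaningful. Ratios are not:
# 4 o'clock is not "twice" 2 o'clock, because there is no true zero.
start_hour, end_hour = 2, 5
elapsed_hours = end_hour - start_hour

# Ratio: equal steps and a true zero, so ratios are meaningful too.
income_a, income_b = 30_000, 60_000
income_ratio = income_b / income_a

print(most_common_animal, median_response, elapsed_hours, income_ratio)
```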

Language of Methods

When referring to methods, I’m talking about what we do with data to extract insights. We can start by looking at methods at a very general level: unsupervised vs. supervised. Unsupervised methods derive insights solely from the patterns within the data they are applied to. Supervised methods rely on the data to provide a target feature (a known outcome) that the model’s parameters are trained against, with the goal of predicting something (a value or a group membership). I provide some examples below.

Unsupervised Methods

  • K-means clustering: given a set of quantitative data, groups data points into k clusters based on their closeness, typically measured by Euclidean distance.

  • Principal Component Analysis: given a quantitative data set, identifies the linear combinations of variables (the principal components) that explain the most observed variance in the data set.

  • Semantic clustering: given text data (qualitative data), uses similarity of meaning to group content together.
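To show what "no target feature" looks like in practice, here is a minimal, standard-library sketch of K-means. It is a simplified illustration and not how a production library implements it: the starting centroids are fixed by hand (an assumption I make so the result is repeatable), the points are invented, and the helper names `nearest` and `kmeans` are my own.

```python
import math

def nearest(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to centroids and moving the centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    labels = [nearest(p, centroids) for p in points]
    return labels, centroids

# Hypothetical 2-D data with two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = kmeans(points, centroids=[(0, 0), (10, 10)])
print(labels)
```

Notice that the function never sees a "correct answer" column. The groups come entirely from how close the points are to each other, which is what makes the method unsupervised.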

Supervised Methods

  • Logistic regression: given a set of quantitative predictors, estimates the probability that an unseen data point will fall into a specific category.

  • Polynomial regression: given a set of quantitative predictors, creates a model that can predict a target variable that has a non-linear relationship with those predictors.
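For contrast with the unsupervised case, here is a bare-bones sketch of logistic regression with a single predictor, fitted by plain gradient descent using only the standard library. The data, the learning rate, the number of passes, and the function names are all assumptions chosen for illustration. In real work you would more likely reach for an established statistics or machine learning library.

```python
import math

def sigmoid(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every training example under the current w, b.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b

# Hypothetical data: onboarding steps completed (predictor) and whether
# the user converted (target feature: 1 = yes, 0 = no).
steps_completed = [0, 1, 2, 3, 4, 5, 6, 7]
converted = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(steps_completed, converted)
p_low = sigmoid(w * 1 + b)   # estimated probability for a user with 1 step
p_high = sigmoid(w * 6 + b)  # estimated probability for a user with 6 steps
print(round(p_low, 2), round(p_high, 2))
```

The `converted` list is the target feature mentioned earlier: the model's parameters (`w` and `b`) are adjusted until its predictions line up with those known outcomes as closely as possible.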

We can also look at methods as they apply to the goals of analytics. In this case, the high-level split of unsupervised and supervised methods no longer holds, because each category can offer something to the more fine-grained analytics goals.

  1. Descriptive analytics: methods that answer the question, “What happened?” These are the backbone of analytics as a whole.

  • Measures of central tendency

  • Topic modeling

  2. Diagnostic analytics: methods that answer the question, “What is this?” or, as it is more commonly phrased, “Why did it happen?” These methods tell you what something is or could be.

  • Hypothesis testing

  • Correlation

  3. Predictive analytics: methods that answer the question, “What will happen?” As the name suggests, these methods predict some future event based on large amounts of previous data.

  • Linear regression

  • Time series analysis

  4. Prescriptive analytics: methods that help you create solutions or recommendations. They answer the question, “How can I address this?”

  • Combines descriptive, diagnostic, and/or predictive analytics

  • Optimization or simulation