A Brief Exploration of Exploratory Data Analysis (EDA)

You probably wouldn’t buy a new car before checking out some online reviews, reading up on its specs, and taking it for a test run. In a similar manner, it would be unwise to make critical business decisions on the basis of information and assumptions that haven’t been screened or tested in some way. That’s what exploratory data analysis is all about.

What Is Exploratory Data Analysis?

In essence, exploratory data analysis, or EDA, is a way of getting an overview of the quality and nature of the information available before you begin studying it in more detail. In the context of business intelligence (BI), EDA involves conducting initial investigations on data to discover existing patterns, spot anomalies, test hypotheses about the information, and check the validity of any assumptions made about the data prior to analysis.

The concept isn’t really a new one. Cautious investigators have been making use of the principle for decades, if not centuries. But in 1977, John W. Tukey coined the phrase in his book, Exploratory Data Analysis, and went on to develop the theory as it moved into formalized use.

EDA helps analysts to make sense of the data that they have, then figure out what questions to ask about it and how to frame them, as well as the best ways to manipulate available data sources to get the answers they need.

EDA Tools and Techniques

Using quantitative techniques and visual methods of representation, EDA looks to tell a story about your existing data based on a broad examination of patterns, trends, outliers, and unexpected results. Observations recorded during exploratory data analysis should suggest the next steps you logically take, the questions you’ll ask, and your possible areas of research.

At a higher level, data scientists use the visual and quantitative methods of EDA to understand and summarize a dataset, without making any assumptions about what it contains. This analysis may be a precursor to lines of investigation deploying more sophisticated statistical modeling or machine learning techniques. Exploratory data analysis typically involves a combination of mathematical/statistical methods with visual models used to represent the results as appropriate.

Univariate analysis describes data that consist of observations on only a single characteristic or attribute. Data scientists may conduct this analysis on each field in the raw dataset, producing summary statistics for each one. The output could resemble the figure below:

(Image source: svds.com)
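
As a rough illustration of univariate analysis, the sketch below summarizes a single numeric field using only Python's standard library. The field name, the sample values, and the 1.5 × IQR rule for flagging outliers are assumptions made for this example, not something the approach prescribes.

```python
import statistics

def summarize(values):
    """Return basic summary statistics for one numeric field."""
    # Quartile cut points (statistics.quantiles sorts the data internally).
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    # Assumed rule of thumb: flag values more than 1.5 * IQR outside the quartiles.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": median,
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
        "outliers": [v for v in values if v < low or v > high],
    }

# Hypothetical order values; 250 is a deliberately extreme entry.
order_values = [12, 15, 14, 13, 16, 15, 14, 250]
summary = summarize(order_values)

for name, value in summary.items():
    print(f"{name:>8}: {value}")
```

Running the same function over every numeric field gives a quick first look at ranges, typical values, and suspicious entries before any modeling starts.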

Bivariate analysis looks at two variables (often denoted as X, Y), for the purpose of determining the empirical relationship between them. Bivariate visualizations and summary statistics enable data scientists to assess the relationship between each variable in a dataset and the target variables they’re currently looking at. A typical plot would look like this:

(Image source: svds.com)
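
One common summary statistic for a bivariate relationship is the Pearson correlation coefficient, which measures how closely two variables follow a straight-line relationship. The sketch below computes it by hand; the variable names and sample figures are hypothetical, and Pearson correlation is only one of several measures an analyst might choose.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: X is advertising spend, Y is revenue.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
revenue = [2.1, 3.9, 6.2, 8.1, 9.8]

r = pearson(ad_spend, revenue)
print(f"Pearson r = {r:.3f}")
```

A value near 1 or -1 suggests a strong linear relationship, while a value near 0 suggests little linear relationship. A scatter plot is still worth checking, because correlation alone can hide curved patterns and outliers.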

When two or more variable quantities are relevant to an investigation, multivariate visualizations and analysis can map the interactions between different fields in the data, typically yielding graphical results like the figure below:

(Image source: svds.com)
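
A simple way to look at interactions between more than two fields is to average one field across every combination of two others, much like a pivot table. The sketch below does this with the standard library; the record layout (region, channel, order value) and the numbers are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (region, channel, order_value)
records = [
    ("north", "web", 120.0),
    ("north", "web", 80.0),
    ("north", "store", 60.0),
    ("south", "web", 40.0),
    ("south", "store", 90.0),
    ("south", "store", 110.0),
]

def pivot_mean(rows):
    """Average the third field for each (first, second) combination."""
    groups = defaultdict(list)
    for region, channel, value in rows:
        groups[(region, channel)].append(value)
    return {key: mean(values) for key, values in groups.items()}

table = pivot_mean(records)
for (region, channel), average in sorted(table.items()):
    print(f"{region:<6} {channel:<6} {average:6.1f}")
```

In this made-up data, the web channel outperforms stores in the north but not in the south, which is the kind of interaction that a one-variable or two-variable view would miss.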

Dimensionality reduction helps analysts identify which fields, or combinations of fields, account for the most variance between observations, and allows them to work with a reduced overall volume of data.
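
Principal component analysis (PCA) is one widely used dimensionality reduction technique, chosen here purely as an example. The sketch below finds only the first principal component, using power iteration on the covariance matrix so that it needs nothing beyond the standard library. The sample rows, the starting vector, and the iteration count are all assumptions; in practice a numerical library would normally do this work.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (unit direction, share of total variance) for the first component."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]

    def times_cov(v):
        return [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]

    # Power iteration: repeated multiplication converges to the eigenvector
    # with the largest eigenvalue, unless the start is orthogonal to it.
    v = [1.0] + [0.5] * (d - 1)
    for _ in range(iterations):
        w = times_cov(v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            raise ValueError("no variance along the starting direction")
        v = [x / norm for x in w]

    eigenvalue = sum(a * b for a, b in zip(v, times_cov(v)))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance

# Hypothetical rows with three fields: the second is roughly twice the first,
# and the third barely varies.
rows = [
    (2.0, 4.1, 1.0),
    (3.0, 6.0, 1.1),
    (4.0, 7.9, 0.9),
    (5.0, 10.2, 1.0),
    (6.0, 11.8, 1.1),
]

direction, share = first_principal_component(rows)
print("direction:", [round(x, 3) for x in direction])
print(f"share of variance explained: {share:.3f}")
```

Here a single direction captures nearly all of the variance, which suggests the three fields could be summarized by one derived value with little loss of information.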

Clustering assigns similar observations in a dataset to distinct groups, reducing the data to a small number of representative groupings so that patterns of behavior can be identified more easily. For example, K-Means clustering assigns each observation to the cluster whose “center” (the average of that cluster's members) is nearest, then recalculates the centers and repeats. A clustered distribution of data might look like this:

(Image source: svds.com)
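
To make the K-Means idea concrete, here is a minimal sketch of the usual assign-then-update loop (often called Lloyd's algorithm) in plain Python. The points and the hand-picked starting centers are hypothetical; real implementations typically choose starting centers randomly and run several times.

```python
import math

def k_means(points, centers, max_iterations=100):
    """Alternate assignment and center updates until the centers stop moving."""
    centers = [tuple(c) for c in centers]
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its assigned points.
        # A center with no points keeps its previous position.
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical 2-D observations forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# Assumed starting centers: one point taken from each group.
centers, clusters = k_means(points, centers=[points[0], points[3]])

for center, members in zip(centers, clusters):
    print(f"center {center} -> {len(members)} points")
```

The number of clusters is fixed in advance by how many starting centers are supplied, so part of the exploratory work is trying different values and seeing which grouping is most informative.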

Technology and Software

Since its “formal” introduction in the 1970s, EDA has given birth to its own generation of statistical programming tools and software. S-Plus and R are among the most commonly used statistical programming packages for conducting exploratory data analysis. R, in particular, is a powerful and versatile open-source programming language that can be integrated with many business intelligence platforms.

With the appropriate data connectors, you can incorporate EDA output directly into your BI software, with the connectors acting as a two-way analysis “bridge”. Besides performing initial analysis, statistical models built and run from an EDA package can tap into existing business intelligence data and automatically update as new information flows into a model. As an example, you might use EDA technology to map the lead-to-cash process across your full range of transactions and departments as an aid in streamlining the conversion of prospects to actual buyers.

Putting EDA into Context

Exploratory data analysis is primarily about getting to know and understand your data before you make any assumptions about it. It’s an important step in avoiding the risk of building flawed business models on inaccurate information, or of pursuing strategies founded on mistaken assumptions.

During EDA, various technical assumptions are usually assessed to help select the best model for the data and the work ahead. EDA technology helps during the feature engineering stage by suggesting relationships and how they might be efficiently incorporated into a model. A model based on EDA also guards against poor predictions and incorrect conclusions that could have negative consequences for an organization.

Assumptions based on flawed business logic are typically harder to detect, and are often deeply ingrained in the problem and how it’s initially presented. As a best practice, a data scientist will systematically assess the contents of each data field and its interactions with other variables, especially those key metrics which represent behaviors that the business wants to understand or predict.

Ultimately, exploratory data analysis gives the analyst an opportunity to get acquainted with the available data and develop an intuition for what it contains. EDA technologies and techniques allow for the easier identification of glaring errors and more subtle discrepancies that could have unfortunate results later on. This empowers data scientists to ensure that the results they produce are valid, applicable to the desired business objectives, and correctly interpreted.