Data Visualization Fundamentals
Introduction to Data Visualization
Why Is It Important to Visualize Data?
Here I have a set of eleven data points as well as some summary statistics:
Here is that same information plotted on a scatter plot with a line of best fit:
Now suppose I add three more data sets with exactly the same summary statistics:
You might expect that, since these data sets have the same mean, variance, correlation coefficient, and line of best fit, they will look very similar when visualized. However...
We can clearly see that curvature and outliers have drastically thrown off our summary statistics. This example is known as Anscombe's Quartet, and it demonstrates how important it is to always plot your data rather than relying on summary statistics alone.
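To make the claim concrete, here is a minimal sketch that computes the summary statistics for all four sets. The data values are those commonly published for Anscombe's Quartet (F. J. Anscombe, 1973); the function and variable names are my own, chosen for illustration.

```python
from math import sqrt

# Anscombe's Quartet as commonly published (Anscombe, 1973).
# Sets 1-3 share the same x values; Set 4 has its own.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "Set 1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "Set 2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "Set 3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "Set 4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
              [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return means, sample variances, Pearson r, and the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / (n - 1),
        "var_y": syy / (n - 1),
        "r": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, stats in summaries.items():
    # Rounded to two decimals, the four rows are nearly indistinguishable.
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The printed rows agree to about two decimal places, which is exactly why the numbers alone cannot tell the four sets apart; only a plot reveals the curve in Set 2 or the lone outlier in Set 4.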
The Power of Sight
For exploratory analysis like in the above plots, we can clearly see patterns (Set 2) or deviations in the data (Set 4) that we might miss from a table or summary statistics. But there's still an even more important reason for creating graphics like this in the first place.
The key lies in the immense power of our visual processing system.
Danish physicist Tor Nørretranders converted the bandwidth of our senses to computer terms to help us understand the power of this system.
In this visualization, we can see that sight takes up the majority of the frame. Based on research by Nørretranders and others, we can approximate that human sight processes information at roughly the speed of a computer network connection, such as an Ethernet cable.
The sense of taste has the bandwidth of a calculator, and the small white box in the bottom right corner of this frame is only 0.7% of our total bandwidth. That small white box represents what we are aware of while all this processing is happening.
For these reasons, it's beneficial to display information visually whenever possible.
Not only might a visualization help you catch an underlying pattern, trend, or outlier, but those interpreting the information can also gather more from seeing the data than from any other way you could present it.
Before we can begin to create or even recreate visualizations, we need to understand data and data types.
Almost anything can be turned into data, and that data will be classified as either quantitative or qualitative.
You can think of quantitative data as numeric data or any data point with an exact number.
This number could be a measurement, such as a baseball player's height or weight, or a count, such as a number of hits or home runs.
Quantitative data is not ordered by time; it's simply data that has been collected.
Continuous: time, height, weight, money, interest rates, temperature
Discrete: units sold, number of languages spoken, number of emails you received yesterday
Discrete data has distinct values, whereas continuous data can assume any value within a range. A player's number of home runs, for example, would be a discrete variable. Discrete variables like this count must take the form of whole numbers. A player can have 10, or 25, or 34 home runs but not 13.4 or 22.19. A player either hits a home run or they do not.
Continuous variables are numbers that can fall anywhere within a range. An example is a player's batting average, which falls between 0 and 1 (usually written as .000 to 1.000). A player could have a batting average of .250 or .357 or .511.
Categorical: gender, hair color, country, industry, dog breed
Ordinal: rankings, school letter grades (A, B, C, D, F), survey questions like "How do you feel about x?" (1 - hate, 2 - negative, 3 - neutral, 4 - positive, 5 - love)
Categorical data represents characteristics such as a player's position, team, hometown, or even handedness. Categorical data can take on numerical values, but those values don't have mathematical meaning. We can't add them together or take their average.
Ordinal data in some sense is a mix of numerical and categorical data. The data still falls into categories, but those categories have some order or ranking. For plotting purposes, ordinal data is treated in much the same way as categorical data, but the groups are usually ordered from lowest to highest so that we can preserve this ordering.
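As a small illustration of preserving that ordering, the sketch below sorts letter grades by an explicit rank instead of alphabetically. The grade list and the names `GRADE_ORDER` and `rank` are hypothetical, made up for this example.

```python
from collections import Counter

# The explicit list encodes the ordinal ranking, lowest to highest.
GRADE_ORDER = ["F", "D", "C", "B", "A"]
rank = {grade: position for position, grade in enumerate(GRADE_ORDER)}

# Hypothetical observations.
grades = ["B", "A", "F", "C", "A", "D", "B", "B"]

# Alphabetical sorting treats the grades as plain categories.
alphabetical = sorted(set(grades))

# Sorting by rank keeps the lowest-to-highest order for plotting.
ordinal = sorted(set(grades), key=rank.__getitem__)

# Counts per grade, listed in ordinal order (the bars of a bar chart).
counts = Counter(grades)
ordered_counts = [(grade, counts[grade]) for grade in GRADE_ORDER]

print(alphabetical)
print(ordinal)
print(ordered_counts)
```

Keeping the order in one list means every chart built from the data can reuse it, rather than relying on whatever order the plotting tool picks by default.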
Time series data is simply a collection of observations obtained through repeated measurements over time.
A time series is a sequence of numbers collected at regular intervals over some period of time. For example, we might measure the average number of home runs per player for many different years.
Time series data is not so different from numerical data. The real difference is that, rather than being a collection of numerical values with no time ordering, time series data has an implied ordering: there is a first data point collected and a last data point collected.
Good Vs. Bad Data Viz Examples
Example 1: Segmented Profits
Here is an example of a table that one might see at work.
This table contains the total profit for the U.S. region broken out by category and segment.
This table can be a little difficult to read. It's hard to tell how the values compare to each other, or if there are any patterns or correlations.
Here is the same data, but segments with negative profit are shown in red to make them stand out.
This helps, but it's still hard to see how categories compare to each other.
In the following chart, we kept the same information and color scheme but converted the numbers to a bar chart. Now the relationships become much more apparent.
Without knowing anything about the data we can quickly draw some conclusions:
Copiers are consistently the most profitable.
Bookcases and supplies are profitable (barely) in the corporate and home office segments, but not in the consumer segment.
Home office has the lowest overall profit of all three segments.
Example 2: A Visualization Is Worth a Thousand Rows
Another huge benefit to visualizing your data is that you can pack a lot of numbers into a small physical area. Consider the following graph:
This chart plots worldwide profit over time by category and segment. I also included the average monthly profit as a reference line.
This data contains total profit by month from January 2011 to December 2014. Let's think about how many line items this is:
Four years * 12 months per year = 48
48 * 9 segments = 432
432 + 9 average profits = 441
To replicate this chart as a table, we would need 441 cells. A table that large would be nearly incomprehensible. Converting it to a graph condenses the same information and makes it more intuitive and accessible.
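The cell count can be checked with a few lines of arithmetic. This is only a sketch; the variable names are mine, and `series` stands for the nine category-and-segment lines that the text calls "9 segments".

```python
years = 4              # January 2011 through December 2014
months_per_year = 12
series = 9             # the "9 segments" plotted in the chart

months = years * months_per_year        # monthly observations per series
monthly_cells = months * series         # one cell per month per series
total_cells = monthly_cells + series    # plus one average per series

print(months, monthly_cells, total_cells)
```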
Visual encodings are a mapping from data to display elements. Below is a screenshot of the Gapminder data visualization referenced above.
This chart shows the correlation between income per capita and life expectancy in many different countries.
This graph has several visual encodings. Here are the first two:
Life expectancy is encoded visually along the y-axis
Income per capita is encoded along the x-axis.
The intersection of a point on the x and y axes indicates a data point's position. Position is considered a planar variable because it locates points in space. It's likely the most common visual encoding that you'll see in data visualization because we can perceive it with great accuracy.
The greatest drawback of position is that it is only good at encoding two variables. If we wanted to visualize higher dimensions of the data, say a third variable, we could try to plot it on a z-axis. However, this is typically not a good idea.
Below is a graph that utilizes the z-axis.
Naomi B. Robbins, the author of Creating More Effective Graphs, says "Effective is not the same as beautiful."
The chart above may have data that is 100% accurate, but that doesn't mean anything if the chart can't be easily read.
As an experiment, try to answer the following questions based on the above chart:
Q1: What's Tim's grade in English?
Q2: What is Ryan's grade in History?
Q3: Which is higher, Ryan's grade in Biology or Jill's grade in History?
Answers: Q1: 53. Q2: 44. Q3: Jill's grade (93 vs. 91).
Problem 1: The depth of field makes it difficult to read the correct values. Our brains automatically estimate the values of the columns based on the grid in the background. Unfortunately, with a 3D chart this can be misleading.
Tim's grade in Algebra lines up with the gridline for 40 in the background. But his actual grade is 70.
Ryan's grade in Biology is one of the easiest bars to read, and it is still misleading. His grade is 91 but it looks like it falls below the 90% line.
Problem 2: It is more difficult to make quantitative comparisons between points. Jill's grade in History is two points higher than Ryan's grade in Biology but that is very hard to gather from this chart.
Problem 3: It is possible for data in the foreground to completely cover data in the background. This is demonstrated by the obstruction of Ryan's grade in History.
Finally, one of the main reasons to display data in a chart is to facilitate the identification of patterns and trends. This is much harder with 3D charts than with 2D charts.
Instead of using 3D charts, we can use what are called retinal variables to encode additional variables for our data set.
Size (above) is an example of a retinal variable and it's particularly good for ordered data.
Going back to the Gapminder data visualization above, the population of the country is encoded by the size of each of the circles (the area of each point).
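Because the encoding uses the area of each circle, the radius has to grow with the square root of the value, not with the value itself. The sketch below shows one way to compute such radii; the country names, populations, and the `radius_for` helper are hypothetical, and this is not a description of how Gapminder itself draws its circles.

```python
from math import pi, sqrt


def radius_for(value, max_value, max_radius=40.0):
    """Return a radius whose circle area is proportional to value.

    The largest value gets max_radius; area scales linearly with value,
    so the radius scales with the square root.
    """
    return max_radius * sqrt(value / max_value)


# Hypothetical populations, each four times the previous one.
populations = {
    "Country A": 1_000_000,
    "Country B": 4_000_000,
    "Country C": 16_000_000,
}

largest = max(populations.values())
radii = {name: radius_for(pop, largest) for name, pop in populations.items()}
areas = {name: pi * r ** 2 for name, r in radii.items()}

for name in populations:
    print(name, round(radii[name], 1), round(areas[name], 1))
```

Here a fourfold increase in population doubles the radius and quadruples the area. Scaling the radius directly with the value would instead make the largest country look sixteen times too big relative to the next one.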
Size, orientation, and color saturation are particularly effective for ordered data. However, it may be difficult to perceive quantitative differences using these encodings. For example, how can you quantify the difference between "light blue" and "light light blue"?
Color hue, shape and texture are great for encoding nominal variables. In the Gapminder visualization, color hue denotes the geographic region of the country. The choice of color makes it easier to compare income and life expectancy across developed and less developed countries of the world.
Rankings of Visual Encodings:
Now that you're familiar with several different kinds of retinal variables, how do you know which to use in different situations?
This paper was the first of its kind to validate the ranking of encodings using empirical evidence. Position was the most accurate of the encodings and color saturation was the least accurate.
For more on Naomi Robbins, read her book Creating More Effective Graphs.
For more on when to use different chart types, read Stephen Few's paper, Data Visualization: Rules for Encoding Values in Graphs.
Exploration vs. Explanation
Data visualizations come in at two points during the data analysis process: when you are exploring the data and when you are explaining the data.
Exploring involves digging through the data to find interesting relationships and questions.
Explaining is when you present those relationships and answers to the questions.
Extract and Clean Data:
The first thing to do is to gather the data. You'll commonly extract the data from a database or parse it from outside records. This is typically the stage that involves something like SQL or web scraping (collecting data from web pages).
A significant amount of time will be spent cleaning the data. In this case, cleaning the data means organizing the rows and columns, filling in any missing information, ensuring proper formatting, and checking for anything else that doesn't make sense. Data visualizations are only as good as the data that goes into them.
Garbage in = Garbage out
Once you've collected and cleaned the data, you'll need to explore it to gain an understanding of the data. This process is called EDA (exploratory data analysis).
This is where you'll look for patterns and trends. You'll see how data is distributed, look for correlations, and understand how categorical data is split.
Data visualization can help at this stage by allowing you to plot distributions of data, and create scatter plots to reveal correlations.
Bar graphs are great to help see how data is split between categorical variables:
Histograms help you see how data is distributed for continuous variables. Histograms are similar to bar charts, but the variable is "binned" into ranges and the values that fall into each bin are counted. Histograms' connected bars imply a continuous progression in values.
Histograms are great at showing outliers and displaying how the data is distributed.
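The binning step can be sketched in a few lines. The heights below are hypothetical, and the `histogram` helper is my own illustration using equal-width bins; plotting libraries offer their own binning options.

```python
def histogram(values, bin_width, start):
    """Count values into equal-width bins.

    Bin i covers [start + i * bin_width, start + (i + 1) * bin_width).
    Assumes every value is >= start. Empty bins are kept so that gaps
    in the distribution stay visible.
    """
    n_bins = int((max(values) - start) // bin_width) + 1
    counts = [0] * n_bins
    for value in values:
        counts[int((value - start) // bin_width)] += 1
    return [(start + i * bin_width, count) for i, count in enumerate(counts)]


# Hypothetical player heights in centimeters, with one unusually tall player.
heights_cm = [172, 168, 181, 175, 190, 177, 169, 174, 183, 176, 171, 205]

bins = histogram(heights_cm, bin_width=5, start=165)
for left_edge, count in bins:
    # A crude text histogram: one '#' per observation in the bin.
    print(f"{left_edge}-{left_edge + 5}: {'#' * count}")
```

Keeping the empty bins matters here: the run of empty ranges before the last bin is what makes the 205 cm value stand out as an outlier.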
Scatter plots help reveal relationships between variables such as correlations or other patterns.
This scatter plot shows the relationship between the 75th percentile of SAT scores and the university's tuition and fees for the school year 2013-2014. This relationship makes sense: schools that are more exclusive, as determined by SAT scores, tend to charge a premium to students for the access to presumably better professors and resources.
The final part of the data visualization process is to look deeper into the patterns you found, and share them with others. This is the explanatory part.
You can think of it as telling stories with data. You create a narrative to lead your viewers through your analysis. Your job here is to facilitate a conversation between your data and your readers.
Interactivity can be very powerful because it allows your readers to explore the data themselves. To see this in action, check out the public Tableau Dashboards found here.
The Spectrum of Data Visualization
There are far too many data visualization tools out there to summarize comprehensively here, so this is just a framework for thinking about your options.
Generally, tools have a trade-off between flexibility and productivity. You can use the most flexible tools to create infinitely-customizable graphics. However, the most flexible tools or languages are more difficult to learn, and the graphics can take longer to build. Conversely, using the most productive tools, you can create visualizations effortlessly, but only from a specified set of visualizations.
The productivity-vs-flexibility trade-off can be visualized in the pyramid below. The base of the pyramid is the amount of flexibility of the tool, while the height represents the productivity, or how easy it is to create visualizations.