Choosing a Chart Type

There are dozens of chart types, each with its own purpose. Although each type may suit a particular kind of data slightly better than the others, in practice you can get by with only a few.

In this post we'll discuss how to choose a chart type and the benefits and drawbacks of each. 




Choosing the Right Chart


The following graphic is from Andrew Abela's blog, Extreme Presentation.


As a reminder, a chart type is simply a set of visual encodings applied to data types, combined with some relationship among the data.

Choosing the Right Chart.jpg



Chart Types


This section will cover a few of the chart types that you can expect to use consistently. These graphs work well by themselves but also as building blocks for more complex figures.


Bar Charts, Line Charts, and Scatter Plots

I covered these charts at a basic level in the first post, data visualization fundamentals.

As a very brief refresher:

  • Bar charts are good for comparing groups,

  • Line charts work well for one variable over time, and

  • Scatter plots can help uncover correlations.

All three of these look at data points relatively, not absolutely. Take a look at the following bar chart and line chart. 

These are two charts that you might see in your office.

The first chart, the line chart, shows technology profit over time. From this we can see that profit is cyclical but increasing. Even if the next month in the chart (Jan 2016) were to drop dramatically to $15,000, it would still be 3x the 2011 numbers. This chart shows positive growth and is useful at a high level, but it's impossible to read off an exact dollar amount of profit for any given month.

The second chart, the bar chart, shows total sub-category sales from 2011 to 2015. This helps us see that approximately 50% of the company's sales come from four products and that the bottom 10 products make up only 25% of total sales. Bar charts work well because people intuitively understand differences in the length and area of bars. However, as with the line chart, it's impossible to read off an exact dollar amount for any of the bars.

When specific numbers are required, you can add number labels to the chart (we'll discuss the risks of that later) or you can create a table.

Sometimes it's more important to draw attention to the actual values than to comparisons, and in these cases tables work great. We interact with tables every day without realizing it. Movie show times, prices on a menu, store catalogs, sports scores, and phone lists are just a few examples of everyday tables.

Text tables have their place, but you must know their strengths and weaknesses. As discussed in the first post, tables come with the drawback of making trends and correlations hard to see. However, you can use color and formatting (bold/italics) to emphasize what you want the viewer to take away from the table. Be cautious when using tables: it's very easy to obscure things in the data and bombard the reader with too much information at the wrong level of detail.

You might also consider offering the reader both a chart and a table. This allows the reader to see the trends in the data, but have the details if they need them. See the chart/table combo below as an example.


Geospatial Plots


Geospatial data is data with a location variable that allows it to be plotted on a map. There are two general types of map charts you may see or use: choropleths and cartograms.

Choropleths use color to encode a value associated with a location on the map, such as population, population density, or GDP. A choropleth is basically a heat map that uses geographic boundaries.

Choropleths are useful for seeing how data compares across locations. These maps can be split into regions such as countries, states/provinces/districts, or smaller regions such as counties or census blocks.

Below is an example of a choropleth that shows adult obesity by state in the United States.

Below is the same chart except it is broken down by county:

Cartograms are similar to choropleth maps, but they distort the boundaries of regions to encode a value. Cartograms also typically encode another variable with color.

Cartograms can be effective for emphasizing variables, but it can be difficult for readers to understand how these distorted areas compare to each other.

Let's look at an image of the 2012 US election results on a choropleth map:

If you looked at this map without knowing the results, you'd likely say that "Red" would have won the election by a fair margin. However, as we all know, this wasn't the case.

Barack Obama won 53% of states, about 51% of the popular vote, and 62% of the electoral vote. However, this chart shows that Obama won 44% of US acreage vs. Mitt Romney's 56%. This is precisely how choropleth maps can be misleading.

In this case, one might opt to use a cartogram. Here is a cartogram for the 2012 US election results:

This chart distorts geographic boundaries to represent a different variable, in this case popular vote. In this chart the blue areas add up to 66% of the total area of all regions, which represents the share of votes cast in the states won by Barack Obama.

While cartograms are technically more accurate than choropleths, they have some serious drawbacks. First, many people find them hard to read, or at least not intuitive. Second, there is no single standard process for creating the boundaries. In fact, there are currently 25 popular algorithms for calculating boundaries, which means two people can create cartograms from the exact same data with different algorithms and end up with charts that look different.


Small Multiples


A small multiple is a series of plots with the same scale that makes it easy to compare data across groups. The plots could be anything: lines, bars, scatter plots, even maps. The term small multiples was coined by Edward Tufte in The Visual Display of Quantitative Information; you may also see them called facets or trellis plots.

To understand the benefit of small multiples, look at the line chart below.

At this point you should know enough to recognize that this is a terrible chart. It shows a line for each member of a group, and the lines sit on top of one another, so it is difficult to see how the different items compare. If, however, we plot the same data on a grid of charts, we can easily compare how the values for each item change over time. See the example below.
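As a rough sketch of the data preparation behind a grid like this, the snippet below splits records into one panel per group and computes a single y-range for every panel to share, which is what keeps the panels comparable. The function name, the record layout, and the sales figures are all made up for illustration; a plotting library would normally handle this step for you.

```python
from collections import defaultdict

def small_multiple_panels(records):
    """Split (group, x, y) records into per-group panels that share one y-range."""
    panels = defaultdict(list)
    for group, x, y in records:
        panels[group].append((x, y))
    for points in panels.values():
        points.sort()  # order each panel's points by x
    # Every panel uses the same y-range so the panels can be compared directly.
    all_y = [y for _, _, y in records]
    shared_y_range = (min(all_y), max(all_y))
    return dict(panels), shared_y_range

# Made-up monthly sales for three hypothetical regions.
records = [
    ("East", 1, 40), ("East", 2, 55), ("East", 3, 50),
    ("West", 1, 10), ("West", 2, 12), ("West", 3, 15),
    ("North", 1, 90), ("North", 2, 70), ("North", 3, 95),
]

panels, y_range = small_multiple_panels(records)
for name, points in panels.items():
    print(name, points, "y-range:", y_range)
```

If each panel picked its own y-range instead, the West panel would look as tall as the North panel even though its values are far smaller.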


Bullet Graphs


Bullet graphs were developed by Stephen Few as an extension of bar charts. Bullet graphs layer multiple measures on top of each other for comparison. In the bullet graph below we're comparing 2014 sales (blue bars) with sales from 2013 (gray bars) and total sales from 2013 (black bar).

From this chart we can see every category is selling more year over year except corporate furniture.


Sparklines

Edward Tufte, who also named small multiples, coined the term sparklines (in his book Beautiful Evidence) for small charts that succinctly visualize some quantity changing over time.

Sparklines are simple line charts designed to emphasize the change in a quantity within a small visual area. Sparklines have become very popular in finance, where one common use is to show a stock ticker's price history. For example, below is Yahoo Finance's homepage:

Additionally, Microsoft Excel now has a built-in sparkline creator. You can see a screenshot of that below.
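To get a feel for how little space a sparkline needs, here is a minimal text-only sketch that maps each value onto one of eight Unicode block characters. The function name and the price series are invented for illustration, and this is only one simple way to approximate the idea, not how Yahoo Finance or Excel draw theirs.

```python
# Eight block characters, from lowest to highest.
BLOCKS = "▁▂▃▄▅▆▇█"

def sparkline(values):
    """Return a one-line text sparkline for a sequence of numbers."""
    if not values:
        return ""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A flat series has no change to show, so use a mid-height block.
        return BLOCKS[3] * len(values)
    chars = []
    for v in values:
        # Scale each value to an index between 0 and 7.
        idx = int((v - lo) / span * (len(BLOCKS) - 1))
        chars.append(BLOCKS[idx])
    return "".join(chars)

# Hypothetical closing prices, made up for illustration.
prices = [10, 12, 11, 15, 18, 17, 21]
print(sparkline(prices))
```

Because the series is rescaled to its own minimum and maximum, the sparkline shows the shape of the change rather than absolute values, which matches how sparklines are typically used.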


Bubble Plots


For one-dimensional data you can use a strip chart or a histogram, and for two-dimensional data you can use a scatter plot.

What if you want to plot three- or four-dimensional data? Enter the bubble chart.

Here is a scatter chart showing average SAT score vs. % of students admitted for colleges in the U.S.:

Here is that same chart with a third dimension (color): public vs. private:

Finally, here is the same data with a fourth dimension (size): total enrollment.


Connected Scatter Plots


Connected scatter plots show the same relationship you would expect from an ordinary scatter plot, but a line connects the dots in order of a third dimension, such as time, which gives the visualization extra context.

Below is a connected scatter plot that explores the correlation between the number of pitchers in the MLB and the average number of strikeouts. Each dot represents a year, and consecutive years are connected by a line.


Cycle Plots


Cycle plots group periodic data by position in the cycle, which makes it easier to compare the same period across cycles. In the example below we grouped total sales by year and month.
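The grouping step behind a cycle plot can be sketched in a few lines: collect the values for each month, then order each month's values by year, so that every month becomes its own small time series. The function name and the sales numbers here are hypothetical, chosen only to illustrate the reshaping.

```python
from collections import defaultdict

def cycle_plot_groups(records):
    """Group (year, month, value) records by month, ordering each group by year."""
    by_month = defaultdict(list)
    for year, month, value in records:
        by_month[month].append((year, value))
    # One sub-series per month, each sorted by year.
    return {month: sorted(by_month[month]) for month in sorted(by_month)}

# Made-up total sales, for illustration only.
sales = [
    (2014, 1, 120), (2014, 2, 90), (2014, 3, 150),
    (2013, 1, 100), (2013, 2, 95), (2013, 3, 130),
]

groups = cycle_plot_groups(sales)
for month, series in groups.items():
    mean = sum(value for _, value in series) / len(series)
    print(f"month {month}: {series} mean={mean:.1f}")
```

Each month's sub-series would be drawn as its own short line, often with the month's mean marked, so both the seasonal pattern and the year-over-year trend within each month are visible.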


Visualizing Distributions


One way to perform a "sanity check" of your data is to look at the distribution. Sometimes you will have outliers that strongly affect the mean, and bar charts can hide this issue. Fortunately, we have a few other tools that can help us explore the distribution of our data.

Histograms are bar charts that group data by ranges. The ranges of values are typically called bins, and the process of grouping data into bins is called binning. As a verb, you can say, "I binned the data." Knowing the common terms helps you communicate with other analysts.

Histograms help visualize distributions of continuous variables. However, you need to be aware of bin width and the placement of bin edges because they can drastically affect how the distribution looks.

Bins that are too wide risk hiding subtleties and fine details in the distribution. However, if bins are too narrow, there may be so much noise that the interesting details are lost. Similarly, bin edges affect how the histogram looks, so it may take some trial and error to get the bins right.
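To make the effect of bin width and bin edges concrete, here is a small sketch that bins the same values three ways and prints a text histogram for each. The function, its `origin` parameter (where the first bin edge sits), and the weights are all assumptions made up for this illustration.

```python
import math
from collections import Counter

def histogram(values, bin_width, origin=0):
    """Count values per bin; each bin covers [left_edge, left_edge + bin_width)."""
    counts = Counter()
    for v in values:
        index = math.floor((v - origin) / bin_width)
        counts[origin + index * bin_width] += 1
    return dict(sorted(counts.items()))

# Made-up player weights in pounds, for illustration only.
weights = [182, 188, 195, 201, 208, 215, 224, 231, 245, 298, 305, 310]

# Same data, three binnings: wide bins, narrower bins, and shifted bin edges.
for width, origin in [(50, 0), (20, 0), (20, 10)]:
    print(f"width={width}, origin={origin}")
    for left_edge, count in histogram(weights, width, origin).items():
        print(f"  {left_edge:>3}-{left_edge + width:<3} {'#' * count}")
```

Shifting the edges alone can change the picture: the values 8, 9, 11, and 12 fall into two bins of width 10 when the edges start at 0, but into a single bin when the edges start at 5.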

For example, below is a histogram of the weights of NFL offensive players in 2013:

You can play around with bin sizes to find the right fit for your data. Below is the same data with slightly smaller bins.

This bin size gives the reader more detail without being too noisy.

You might be thinking that creating the smallest bins is the best way to show the most accurate data. Here's what this chart would look like with very small bins:

 

While this chart has the exact same data as the two charts above, it becomes harder to read because it is too busy.

The correct size bin is going to depend on your data, on your audience, and on the question you're trying to answer. In this case, I personally think the second (middle) bin size was the most appropriate.

Box plots are a common way to display the general shape of a distribution using intervals. In this case, an interval (more commonly called a percentile) is a value that is greater than some percentage of the data. As an example:

  • The 50% interval is the value that is greater than 50% of the data (also called the median).

  • The 95% interval is the value greater than 95% of the data.

All box plots use the 25%, 50%, and 75% intervals, typically called quartiles (because they divide the data into quarters). Usually, there will also be whiskers (sometimes called fences) that extend to some wider range (1.5x the interquartile range, or IQR, the distance between the 25% and 75% intervals, is common) or to the minimum and maximum. You'll also often see box plots that show outliers: data points greater or less than the whisker values.
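The numbers behind a box plot can be computed with the Python standard library. The sketch below assumes the common 1.5x IQR rule for whiskers and uses `statistics.quantiles` with its "inclusive" method; plotting tools differ in how they interpolate quartiles, so their values may not match exactly. The function name and the sample data are made up for illustration.

```python
import statistics

def box_plot_stats(values, whisker=1.5):
    """Return quartiles, whisker ends, and outliers using the 1.5x IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range: width of the box
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whiskers end at the most extreme data points inside the fences.
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }

# Made-up data with one obvious outlier.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
for name, value in box_plot_stats(data).items():
    print(f"{name}: {value}")
```

Here the value 100 lies beyond the upper fence, so it would be drawn as a separate outlier point and the upper whisker would stop at 9.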

If you're unfamiliar, this is the general structure of a box plot:

This type of visual encoding gives the reader a sense of the underlying distribution. If the intervals are symmetric around the median, the distribution is likely symmetric, as a normal distribution is. However, if the intervals are scrunched up on one side of the median, it indicates the distribution is skewed.

Here's a histogram and box-plot combination chart, which might help you visualize the similarities and differences. 


Violin Plots


Violin plots display a smoothed distribution of data. The distribution is approximated using a method called kernel density estimation. As with box plots, intervals are typically included, but the shape of the data distribution is shown as well, which can help expose non-normal distributions.

One downside to violin plots is that the smoothing can hide fine details and it often fails for small amounts of data. Below is an example of violin plots:


Strip Charts


Sometimes you may want to plot the data directly with a strip chart. Strip charts display the actual data points for one variable as dots (or another shape). For small data groups, you can just plot the data along a line. For larger datasets, the points will often overlap, so you can randomly scatter the data in the non-value dimension (often called jitter) or use transparency.
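The random scatter described above can be sketched as follows: each value keeps its position on the value axis and gets a small random offset on the other axis. The function name, the `spread` and `seed` parameters, and the weights are hypothetical choices for this example.

```python
import random

def jitter(values, spread=0.2, seed=42):
    """Pair each value with a small random offset in the non-value dimension."""
    rng = random.Random(seed)  # fixed seed so the layout is reproducible
    return [(rng.uniform(-spread, spread), v) for v in values]

# Made-up weights with repeated values that would overlap on a single line.
weights = [210, 210, 211, 210, 305, 305]
for offset, value in jitter(weights):
    print(f"{offset:+.3f}  {value}")
```

The offsets carry no meaning; they only separate points that would otherwise be drawn on top of each other, so `spread` should stay small relative to the gap between neighboring strips.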

One common use for strip charts is in the margins of a two-dimensional scatter plot. The main plot shows the relationship between two variables, while the strip charts on the margins show the marginal distribution of each variable individually.


Kernel Density Estimate


Kernel density estimates (KDEs) are basically just fancy histograms. Histograms were created to help visualize the distribution of data, but the question remained: how wide should the bins be and where should the bin edges fall? KDEs were developed to get around those issues and estimate an underlying distribution. The basic idea is that you replace each data point with a “kernel” (for instance a normal curve, a top hat, or a triangle), then sum them all up.

Below is an example from the KDE Wikipedia page:

"Comparison of the histogram (left) and kernel density estimate (right) constructed using the same data. The 6 individual kernels are the red dashed curves, the kernel density estimate the blue curves. The data points are the rug plot on the horizontal axis." https://en.wikipedia.org/wiki/Kernel_density_estimation

You still have to choose parameters for the kernel, such as the bandwidth in the case of a Gaussian, which means there is still a bit of arbitrariness in this method, but it’s generally a better (if more complicated) way to visualize distributions than a histogram.
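The "replace each point with a kernel and sum" idea fits in a few lines of Python. This is a minimal sketch assuming a Gaussian kernel; the sum is divided by the number of points times the bandwidth so that the estimate integrates to 1. The function names and the sample data are made up for illustration.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used as the kernel shape."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(data, x, bandwidth):
    """Estimate the density at x by summing one kernel per data point."""
    total = sum(gaussian_kernel((x - xi) / bandwidth) for xi in data)
    # Normalize so the whole curve has area 1.
    return total / (len(data) * bandwidth)

# Made-up sample, for illustration only.
data = [1.0, 1.5, 2.0, 4.0, 6.5, 7.0]

# A narrow bandwidth follows individual points; a wide one smooths them out.
for bandwidth in (0.3, 1.5):
    densities = [kde(data, x, bandwidth) for x in range(0, 9)]
    print(f"bandwidth={bandwidth}:", " ".join(f"{d:.3f}" for d in densities))
```

Comparing the two printed rows shows the trade-off in choosing a bandwidth: it plays much the same role that bin width plays for a histogram.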

Visualization is an art

Visualization is somewhat of an art. As a data analyst you will need to think about what question you're trying to answer and which chart is best suited to answer that question.

You'll also have to think about your underlying data. If you have a lot of data, strip charts probably aren't your best choice. On the other hand, if you have a small amount of data you should likely stay away from box plots or violin charts. Choosing and editing charts will involve some trial and error as you get started, but eventually you'll be able to intuit the best chart for any data set or question.
