About two years ago, my starting point in the field of dataviz/information design was my background as a social scientist: I was familiar with the methodological and analytical aspects of working with data (e.g. interpreting data, epistemological considerations, etc.), so the first thing I did was learn to use Python and R. My learning process mainly followed the logic of “how can I get and wrangle data, and then make a bunch of vizzes inspired by what I’ve seen online”, and those two tools are perfect for that, so I don’t regret focusing on becoming proficient with them.
But about a year ago, wanting to get better at creating clear and effective visuals, I started looking at the more design-oriented and “visual communication” considerations, best practices and literature. And this has opened up a whole new dimension of my interest in this field/work, because this is probably the part of information design that I was the least familiar with given my own background.
What’s brilliant – and this applies to this field in general – is that the resources, introductions/explainers and literature addressing it are plentiful and easily accessible (except maybe for some of the books, but there are ways to get them freely or cheaply if you know where to look online). So I’m slowly but surely beginning to improve my design process, paying attention to everything – colors and color palettes, typography, chart selection, etc. – that impacts the visual/cognitive and communicative/interpretative (and interactive, when it’s not a static graphic) legibility of my visualizations.
Needless to say, this is a work in progress, but I wanted to write a brief post in two parts: first I’ll discuss two of my most recent visualizations, and then I’ll take a self-critical look back at some of the graphics I’ve created over the past 2-3 years.
Trying to Create a Nice-Looking Stacked Area Chart, Only to Fall Back to Small Multiples
One of the types of charts/visualizations I’ve wanted to create for a while is a visually appealing stacked area chart like this one Gilbert Fontana shared last year. Yan Holtz’s R Graph Gallery website includes a tutorial to replicate Fontana’s chart, so this is where I started.
I wanted a dataset that’s kinda similar to the one used by Fontana, so I took the World Inequality Database’s data on average incomes per adult from 1820 to 2023. I first replicated the visualization very closely, including Fontana’s peculiar ordering of the stacked areas: the category with the biggest value for the latest year is put at the bottom, and the remaining categories are then stacked above it from lowest to (second) highest. Of course, had I chosen to stop there, I would have reorganized the list of regions, because it doesn’t make sense in my case to keep that ordering: Fontana’s chart has a “Rest of the world” category that is contrasted with the values/stacked areas of individual countries, but that’s not the case with my dataset.
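For what it’s worth, that ordering rule is easy to express in a few lines. This is only a minimal Python sketch of the rule as I described it, not the code from the R Graph Gallery tutorial; the function name and the values are made up for illustration.

```python
def fontana_style_order(latest_values):
    """Return category names from the bottom of the stack to the top.

    The category with the largest latest-year value goes at the bottom;
    the remaining categories follow in ascending order of that value.
    """
    if not latest_values:
        return []
    ascending = sorted(latest_values, key=latest_values.get)
    largest = ascending[-1]
    return [largest] + ascending[:-1]


# Made-up latest-year values, for illustration only (not WID figures).
example = {"Region A": 40, "Region B": 10, "Region C": 25, "Region D": 5}
print(fontana_style_order(example))
```

With these made-up numbers, “Region A” ends up at the bottom and “Region C” (the second highest) at the top of the stack.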
However, it quickly became evident to me that even after this modification, this smoothed-out version of stacked areas wouldn’t work: while it might be visually appealing/nice, as you can see below it significantly distorts the data compared to a regular stacked area chart. But there is a deeper issue, which is highlighted here: from a visual and cognitive standpoint, beyond 3 or 4 categories – and even with that few categories – it’s almost impossible to examine the actual differences between groups, mainly because they don’t share the same baseline, and outside of the bottom slice/category their baseline isn’t flat (i.e. it’s the edge of the previous category’s area shape).
In other words, the only information one can gather from a stacked area chart (leaving aside the numbers on the right, which is an unusual feature inspired by Fontana’s chart) is the evolution of the whole and the relative proportions between categories. But this is seriously undermined by the fact that it’s impossible to examine either the actual evolution of individual categories or the actual numerical data. And in this case, since there are ten different regions stacked on top of each other, even the relative proportions aren’t easily noticed, especially for the thinner tiles/slices.
The regular line chart isn’t a solution here either, because again there are too many categories/groups: if we plot everything as a single line chart (below), the representation is more consistent with the data (with the y axis we’re able to see the actual level of income a given point in a line corresponds to), but it’s confusing and rather ugly. This is the dreaded spaghetti chart, which is perhaps the only type of chart hated as much as, if not more than, the pie chart in dataviz circles.
The solutions to the flaws of both “spaghetti” line charts and stacked area charts are twofold, but follow a similar logic/idea. The static option is to use the small multiples format, which is beloved – for good reasons – by information designers. This concept was introduced/formalized by Edward Tufte in his book Envisioning Information (1990), and I’ll let Juuso Koponen and Jonatan Hildén take it from here:
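The baseline problem can be made concrete with a tiny Python sketch. It assumes the usual way of stacking (each category sits on the running total of the ones below it); the function name and the three made-up series are purely illustrative.

```python
def stack_bounds(series, order):
    """For each category, return (baseline, top) lists, one value per year.

    `series` maps category -> list of values; `order` lists categories
    from the bottom of the stack to the top.
    """
    n_points = len(series[order[0]])
    running = [0.0] * n_points
    bounds = {}
    for name in order:
        baseline = list(running)  # lower edge = sum of everything below
        running = [b + v for b, v in zip(running, series[name])]
        bounds[name] = (baseline, list(running))
    return bounds


# Made-up values for three categories over four years (illustration only).
series = {
    "A": [10, 20, 15, 30],
    "B": [5, 5, 5, 5],
    "C": [8, 6, 9, 7],
}
bounds = stack_bounds(series, ["A", "B", "C"])
for name, (baseline, top) in bounds.items():
    print(name, "baseline:", baseline, "top:", top)
```

Category “B” never changes in this toy data, yet both of its edges move up and down because they inherit the shape of “A”. Only the bottom category has a flat baseline.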
Small multiples are a set of small graphics laid out in a matrix, or table, and intended to be viewed side by side. The matrix is generally arranged in such a way that the images on the same row or column show related data. (…)
With small multiples, the eye moves easily between the visualizations, compared to viewing completely distinct figures. Because comparison is possible within an eyespan, working memory and conscious processing capacity is freed for better use. Furthermore, small multiples are simpler and easier to understand than complex visualizations with all the data items crammed into a single figure. The use of small multiples to convey data has also been shown to be more effective and accurate than presenting the same data as an animation for example.
As comparison is crucial in small multiples, Tufte’s principle of “show data variation, not design variation,” mentioned above, is particularly salient here. The comparison of figures arranged side by side is most successful when there are as few non-data-related differences between them as possible. (p. 100)
The other option is creating an interactive chart for the web, where the lines or areas can be examined separately (e.g. by hovering, clicking on a line/area or on the category one wants to look at) while the rest is toned down visually. Interestingly, at least for the examples below, the static small multiple graphs kinda constitute a “grid of snapshots” representation of what you would see by hovering and clicking on an interactive chart!
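The “grid of snapshots” idea can be sketched independently of any plotting library. The Python snippet below is just an assumed way of describing the panels (it is not my actual R or D3 code): one panel per category, with that category highlighted and all the others kept as muted context. The function name, the dictionary keys and the region names are hypothetical.

```python
import math


def small_multiple_panels(categories, ncols=3):
    """Describe one panel per category on a grid with `ncols` columns.

    Each panel highlights a single category and keeps all the others as
    muted context, so every panel shows the same data on the same scales.
    """
    panels = []
    for index, focus in enumerate(categories):
        panels.append({
            "row": index // ncols,
            "col": index % ncols,
            "highlight": focus,
            "context": [c for c in categories if c != focus],
        })
    nrows = math.ceil(len(categories) / ncols)
    return nrows, panels


# Placeholder names, for illustration only.
regions = ["Region A", "Region B", "Region C", "Region D", "Region E"]
nrows, panels = small_multiple_panels(regions, ncols=3)
for panel in panels:
    print(panel["row"], panel["col"], panel["highlight"])
```

Each panel description could then be handed to whatever plotting tool draws the grid; what matters for the comparison is that only the highlighted category changes from one panel to the next.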
Small multiple line chart(s)
Small multiple area chart(s) with a dark background
My favorite version of all three is this one (below); I think it works best. The small issue with the dark version is that the shaded blue color of the “non-protagonist” area shapes is a bit too visible and distracting.
Small multiple area chart(s) with a light background
Here’s an interactive version of this chart, made with D3 and Svelte. The live page is here!
Why Bar Charts Remain Undefeated; And Median > Mean
Although the above graphics were based on averages, the median is a better statistical indicator (technically, a “measure of central tendency”) and should be favored over the mean whenever possible. While the mean (or “average”) is more intuitive, its key flaw is that it is hugely impacted by extreme values aka “outliers”. This brings me to my second latest visualization, based on TidyTuesday’s October 22, 2024, dataset.
The original data is a bunch of per-country stats that come from the CIA’s World Factbook (by the way, fuck the CIA!); I added region and sub-region categories via this dataset found on GitHub, in order to create charts showing regional stats about internet users. You can check my code and cleaned up dataset here.
Before discussing how the visual representations of the mean and median values for this dataset compare, let’s give one very old-school/basic (some might even say “boring”) but slightly underrated type of dataviz – the bar chart (especially when arranged horizontally) – the flowers it deserves. A lot of chart types that may look more appealing and exciting suffer from big flaws/inefficiencies in terms of visual communication/cognition: we saw above the example of the stacked area chart, and later I’ll show an example of a streamgraph, but this could also be said about circular plots (from the pie chart to the “circular barplots”). The bar chart’s simplicity – which might look “boring” but is very effective for representing data in a way that people understand easily and accurately – lies partly in the fact that optical illusions and distractions are rare compared to many other visualizations.
But the main reason bar charts are so effective in terms of visual perception is that they are based on what is overwhelmingly recognized by experts as the most accurate and powerful method of visual encoding: position. Visual encoding refers to the process of representing data visually; it’s a form of “encoding” in the sense that it attaches meaning to certain visual elements based on the relationships in the data. For instance, using the area of a shape to represent the proportional size or scale of something – which can be compared to respective values of other categories/groups/etc. – is a form of visual encoding. And one of the central concerns of information/dataviz designers is to use the best methods of visual encoding to (re)present the data as clearly as possible.
And what’s helpful is that thanks to research, we know that there is a hierarchy to how clear and effective various methods of visual encoding/representation are. Again, quoting Juuso Koponen and Jonatan Hildén’s (amazing) book:
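To illustrate what “encoding by area” means in practice, here is a small Python sketch for sizing proportional circles. It relies on the basic geometry that a circle’s area grows with the square of its radius, so the radius has to scale with the square root of the value; the function name and the numbers are illustrative assumptions.

```python
import math


def radius_for_value(value, max_value, max_radius):
    """Radius that makes a circle's area proportional to `value`.

    The largest value gets `max_radius`; since area grows with the
    square of the radius, the radius scales with the square root.
    """
    if value < 0 or max_value <= 0:
        raise ValueError("value must be >= 0 and max_value must be > 0")
    return max_radius * math.sqrt(value / max_value)


# A value four times smaller gets half the radius, not a quarter of it.
print(radius_for_value(25, 100, 40))
print(radius_for_value(100, 100, 40))
```

With a bar chart, by contrast, the mapping is linear: a value four times smaller simply gets a bar that ends at a quarter of the distance along the axis.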
Position is the best visual encoding method regardless of whether the data presented is measured on a ratio, interval, ordinal or nominal scale. Our visual system is able to discern positions in two dimensions very precisely, and in ideal conditions, we can notice positional differences of a fraction of a millimeter. No other encoding method comes close to this level of precision. As a visual variable, position is particularly superior in encoding numerical data on ratio and interval scales, in which differences between represented objects can be very small. (p. 59)
The single best way to encode information about data visually is to use position as the primary visual variable (“primary” because there are often several forms of encoding in a single chart), alongside a consistent and clear number scale. Position, length and area – and potentially more, e.g. color or color gradient – are all visual variables used in bar charts, but it’s position that does the bulk of the encoding. It minimizes optical distractions or illusions and maximizes the cognitive precision of the reader/viewer/user’s process of visual perception.
In my bar charts below, I represent the summary statistics (the mean and median values) of the data – which, it is worth highlighting, was specifically measured at the level of each country – for each sub-region. In other words, for each region – e.g. Western Europe or South Asia – this presents statistical tendencies of the data on internet users, which is itself measured for individual countries. I used bars to represent regional (mean and median) values for the number of internet users per square kilometer, and a color gradient for the percent of internet users (in both cases, remember it’s numbers “by country”).
It’s more informative to look at the regional comparisons than merely plot all country-level values. And these aren’t “regions” in the continental sense – “Africa”, “Europe”, etc – but sub-regions which offer a much more meaningful frame of comparison in my view…
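The aggregation step itself is simple. The sketch below shows the general idea in Python with the standard library only; it is not the code linked above, and the field names (`subregion`, `users_per_km2`) as well as the rows are hypothetical placeholders.

```python
from collections import defaultdict
from statistics import mean, median


def summarize_by_subregion(rows):
    """Return {subregion: {"mean": ..., "median": ...}} for users per km²."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["subregion"]].append(row["users_per_km2"])
    return {
        name: {"mean": mean(values), "median": median(values)}
        for name, values in grouped.items()
    }


# Made-up country-level rows, for illustration only.
rows = [
    {"country": "Country 1", "subregion": "Subregion X", "users_per_km2": 10.0},
    {"country": "Country 2", "subregion": "Subregion X", "users_per_km2": 20.0},
    {"country": "Country 3", "subregion": "Subregion X", "users_per_km2": 60.0},
    {"country": "Country 4", "subregion": "Subregion Y", "users_per_km2": 50.0},
    {"country": "Country 5", "subregion": "Subregion Y", "users_per_km2": 70.0},
]
summary = summarize_by_subregion(rows)
for name, stats in summary.items():
    print(name, stats)
```

Each sub-region then becomes one bar, with its length taken from either the mean or the median of its countries.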
I calculated and then plotted both mean and median values, as I thought this would be a good illustration of why median values are generally preferable. As mentioned above, the mean or average might be more intuitive – “sum of values divided by number of observations” – but unless there aren’t any outliers (data points that differ greatly from the bulk of observations), the median is more insightful, because it’s literally the middle value of the given list of data when arranged in order. This means that it’s not pulled up or down by extreme values, contrary to the mean.
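Here’s the “middle value” definition written out by hand in Python, next to the mean, so the difference is visible. The numbers are made up for illustration; for an even number of observations I use the common convention of averaging the two middle values.

```python
def mean_of(values):
    """Sum of values divided by the number of observations."""
    return sum(values) / len(values)


def median_of(values):
    """Middle value of the sorted list (average of the two middle
    values when the number of observations is even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


# Made-up values: five ordinary observations, then the same plus one outlier.
typical = [12, 15, 19, 22, 28]
with_outlier = typical + [4600]

print("mean:", mean_of(typical), "->", mean_of(with_outlier))
print("median:", median_of(typical), "->", median_of(with_outlier))
```

Adding the single extreme value multiplies the mean many times over, while the median only moves from one middle value to the midpoint between two neighbouring ones.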
Chart based on the mean values
These mean values for the numbers of internet users per km² lead to a significantly different ranking order than the median values shown below. Take Southeast Asia, which is ranked first if we take the mean values without checking the data. It turns out that if we remove Singapore – which is the country with the highest number of internet users per km² in the whole dataset, namely 4641.32 – Southeast Asia would drop to 9th (just below South Asia) in terms of mean values. That’s because the mean value with Singapore is 444.30, but without it it is only 24.59. In contrast, the median values with and without it are much closer to each other: 27.59 and 19.05 respectively.
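As a quick sanity check, the figures quoted above are enough to see how much weight Singapore carries. This Python sketch only uses the three numbers from the paragraph; the number of countries in the group is not taken from the dataset but inferred from them, so treat it as an assumption.

```python
# Figures quoted in the text (internet users per km², Southeast Asia).
singapore = 4641.32
mean_with = 444.30
mean_without = 24.59

# If the group has n countries, then
#   mean_with = ((n - 1) * mean_without + singapore) / n
# which rearranges to
#   n = (singapore - mean_without) / (mean_with - mean_without).
implied_n = (singapore - mean_without) / (mean_with - mean_without)
n = round(implied_n)

# Rebuild the mean from the rounded inputs as a consistency check.
regional_sum = (n - 1) * mean_without + singapore
recomputed_mean = regional_sum / n

# Share of the group's summed values that comes from Singapore alone.
singapore_share = singapore / regional_sum

print(f"implied number of countries: {implied_n:.2f} -> {n}")
print(f"recomputed mean with Singapore: {recomputed_mean:.2f}")
print(f"Singapore's share of the regional sum: {singapore_share:.0%}")
```

The quoted means are consistent with a group of 11 countries, and in that case Singapore alone accounts for roughly 95% of the sum that the mean is computed from (up to rounding in the quoted figures).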
This is less of a problem if we take the other measure I used in this chart: percent values are by definition somewhere between 0 and 1 (i.e. between 0% and 100%), which means that there is far less “room” for outliers to skew the average… As you can see when comparing the bar colors between the two charts, the colors for each region are very similar from one chart to the other.
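The “less room” intuition can be put into numbers. When one value joins a group, the mean moves by the distance between that value and the mean of the others, divided by the group size. The Python sketch below applies this to a bounded share; the group size of 11 is an assumption for illustration, and the function name is mine.

```python
def mean_shift_from_adding(value, mean_of_others, n_total):
    """Change in a group's mean when `value` joins n_total - 1 other values.

    Follows from: new_mean = ((n - 1) * m + x) / n = m + (x - m) / n.
    """
    return (value - mean_of_others) / n_total


# Hypothetical worst case for a share between 0 and 1: every other country
# sits at 0 and one country sits at 1, in an assumed group of 11 countries.
worst_case_shift = mean_shift_from_adding(1.0, 0.0, 11)
print(f"largest possible shift in the mean share: {worst_case_shift:.3f}")
```

So for a measure bounded between 0 and 1, a single country can never move the mean of such a group by more than 1/11, i.e. about 9 percentage points, whereas an unbounded measure like users per km² has no such ceiling.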
Chart based on the median values
A Critical Look at Some Previous Visualizations
I’m genuinely thrilled about the progress I’ve made and the gallery of visualizations/graphics I’ve built across the past two years. I could go back to remake some of them with the help of hindsight, but as long as they’re not horrible I’m fine keeping them in their original state, as a testament to my own process of trial and error. I still like the visuals included below, but I’ve noticed a bunch of things I didn’t see/rectify back then because I hadn’t started learning design best practices/considerations. But I’ll also mention some of the things I like about the design choices!
For instance, in this infographic on food waste, which I made with Rawgraphs and Affinity Designer, I think a lot of the design choices make sense and work pretty well, apart from the recurring issue/concern of not putting too much stuff in one image (which is something that probably gets improved with experience/trial and error). However in hindsight there is a glaring issue with the bit of text right below the title: I used “justified” rather than “left-aligned” text formatting, which results in some ugly/distracting spaces in the second line of that phrase/paragraph.
In my infographic on the death toll of the global regime, while I think a lot of it was smartly designed (except perhaps using Agency FB – which is better suited for headers – for labelling the charts), the elephant in the room is the inherent flaws of the streamgraph (at the center). While it is “visually appealing”, it has the same issues I mentioned above concerning stacked area charts. The streamgraph is slightly worse because the stacked shapes aren’t piled vertically based on a flat axis, but arranged on both sides of a horizontal line.
Positives: Nunito as main text font works really nicely, the title and header fonts look pretty good (maybe sticking to one instead of two would have been better – multiplying the number of different fonts is usually to be avoided – but I think it still kinda works), I like how I used proportional bubbles/circles both to show the distribution of drowning deaths (top left) and in the legend of the streamgraph.
A little while after making that infographic, I became aware of the visual effectiveness/superiority of “small multiples” formats, so I created a small multiple line chart based on the same data (see chart below). This is better, but the chart labels and text are too small!
One of the types of points-based charts that I like is the beeswarm plot, which I used in the infographic below. I think the main problem here is obvious: typography! The text sizes are too small, and while yellow-on-green works for the graph/axis labels (and the chart titles would be fine if they were a bit bigger), it’s almost unreadable for the caption at the bottom, as well as for the club names below the players’ names. I’m also not sure about adding pictures of their faces like this; it doesn’t look great and I don’t think it’s necessary.
This infographic on media concentration, based on a transnational study from a decade ago (the only scholarly publication to date on that scale for the 21st century, as far as I’m aware), was one of my very first data-driven and visualization projects. And to be honest it doesn’t look that bad, it’s just a bit too much for a laptop- or mobile phone-sized image! If streamlined and redesigned a little bit, it would work as a big printed poster or maybe as an interactive web-based visualization!
Although I’m sure there could be several improvements made on this radar chart (for example a bit more consistency in the text sizes and colors would be nice, and it’s not ideal to have the metric labels at different angles around the chart), one thing I really like is the fact that each slice not only has the gauge-like proportionally-filled area, but also indicates the actual percentile value, which is crucial because human eyes/brains aren’t that good at estimating and comparing values based on area size only.
What’s kinda funny, looking back through many of these visuals/charts I’ve made in the past two years, is that one of the aspects of information design I’ve recently spent time researching – typography – was partly intuitive to me already in the sense that most of the fonts/typefaces (Nunito, Ovo, Lato, Quattrocento Sans, Merriweather, Agency FB…) I had used do in fact match the design criteria and “best practices” that are recommended in the literature.
That’s all for today!
Note: I didn’t go into some of the color and typography choices I made for my most recent visualizations, but I did spend a significant amount of time on these design considerations!