20 Redundant coding
In Chapter 19, we have seen that color cannot always convey information as effectively as we might wish. If we have many different items we want to identify, doing so by color may not work. It will be difficult to match the colors in the plot to the colors in the legend (Figure 19.1). And even if we only need to distinguish two to three different items, color may fail if the colored items are very small (Figure 19.11) and/or the colors look similar for people suffering from color-vision deficiency (Figures 19.7 and 19.8). The general solution in all these scenarios is to use color to enhance the visual appearance of the figure without relying entirely on color to convey key information. I refer to this design principle as redundant coding, because it prompts us to encode data redundantly, using multiple different aesthetic dimensions.
20.1 Designing legends with redundant coding
Scatter plots of several groups of data are frequently designed such that the points representing different groups differ only in their color. As an example, consider Figure 20.1, which shows the sepal width versus the sepal length of three different Iris species. (Sepals are the outer leafs of flowers in flowering plants.) The points representing the different species differ in their colors, but otherwise all points look exactly the same. Even though this figure contains only three distinct groups of points, it is difficult to read even for people with normal color vision. The problem arises because the data points for the two species Iris virginica and Iris versicolor intermingle, and their two respective colors, green and blue, are not particularly distinct from each other.
Surprisingly, the green and blue points look more distinct for people with red–green color-vision-deficiency (deuteranomaly or protanomaly) than for people with normal color vision (compare Figure 20.2, top row, to Figure 20.1). On the other hand, for people with blue–yellow deficiency (tritanomaly) the blue and green points look very similar (Figure 20.2, bottom left). And if we print out the figure in gray-scale (i.e., we desaturate the figure), we cannot distinguish any of the iris species (Figure 20.2, bottom right).
There are two simple improvements we can make to Figure 20.1 to alleviate these issues. First, we can swap the colors used for Iris setosa and Iris versicolor, so that the blue is no longer directly next to the green (Figure 20.3). Second, we can use three different symbol shapes, so that the points all look different. With these two changes, both the original version of the figure (Figure 20.3) and the versions under color-vision-deficiency and in grayscale (Figure 20.4) become legible.
Changing the point shape is a simple strategy for scatter plots but it doesn’t necessarily work for other types of plots. In line plots, we could change the line type (solid, dashed, dotted, etc., see also Figure 2.1), but using dashed or dotted lines often yields sub-optimal results. In particular, dashed or dotted lines usually don’t look good unless they are perfectly straight or only gently curved, and in either case they create visual noise. Also, it frequently requires significant mental effort to match different types of dash or dot–dash patterns from the plot to the legend. So what do we do with a visualization such as Figure 20.5, which uses lines to show the change in stock price over time for four different major tech companies?
The figure contains four lines representing the stock prices of the four different companies. The lines are color coded using a colorblind-friendly color scale. So it should be relatively straightforward to associate each line with the corresponding company. Yet it is not. The problem here is that the data lines have a clear visual order. The yellow line, representing Facebook, is clearly the highest line, and the black line, representing Apple, is clearly the lowest, with Alphabet and Microsoft in between, in that order. Yet the order of the four companies in the legend is Alphabet, Apple, Facebook, Microsoft (alphabetic order). Thus, the perceived order of the data lines differs from the order of the companies in the legend, and it takes a surprising amount of mental effort to match data lines with company names.
This problem arises commonly with plotting software that autogenerates legends. The plotting software has no concept of the visual order the viewer will perceive. Instead, the software sorts the legend by some other order, most commonly alphabetical. We can fix this problem by manually reordering the entries in the legend so they match the perceived ordering in the data (Figure 20.6). The result is a figure that makes it much easier to match the legend to the data.
If there is a clear visual ordering in your data, make sure to match it in the legend.
Matching the legend order to the data order is always helpful, but the benefits are particularly obvious under color-vision deficiency simulation (Figure 20.7). For example, it helps in the tritanomaly version of the figure, where the blue and the green become difficult to distinguish (Figure 20.7, bottom left). It also helps in the grayscale version (Figure 20.7, bottom right). Even though the two colors for Facebook and Alphabet have virtually the same gray value, we can see that Microsoft and Apple are represented by darker colors and take the bottom two spots. Therefore, we correctly assume that the highest line corresponds to Facebook and the second-highest line to Alphabet.
20.2 Designing figures without legends
Even though legend legibility can be improved by encoding data redundantly, in multiple aesthetics, legends always put an extra mental burden on the reader. In reading a legend, the reader needs to pick up information in one part of the visualization and then transfer it over to a different part. We can typically make our readers’ lives easier if we eliminate the legend altogether. Eliminating the legend does not mean, however, that we simply not provide one and instead write sentences such as “The yellow dots represent Iris versicolor” in the figure caption. Eliminating the legend means that we design the figure in such a way that it is immediately obvious what the various graphical elements represent, even if no explicit legend is present.
The general strategy we can employ is called direct labeling, whereby we place appropriate text labels or other visual elements that serve as guideposts to the rest of the figure. We have previously encountered direct labeling in Chapter 19 (Figure 19.2), as an alternative to drawing a legend with over 50 distinct colors. To apply the direct labeling concept to the stock-price figure, we place the name of each company right next to the end of its respective data line (Figure 20.8).
Whenever possible, design your figures so they don’t need a legend.
We can also apply the direct labeling concept to the iris data from the beginning of this chapter, specifically Figure 20.3. Because it is a scatter plot of many points that separate into three different groups, we need to direct label the groups rather than the individual points. One solution is to draw ellipses that enclose the majority of the points and then label the ellipses (Figure 20.9).
For density plots, we can similarly direct-label the curves rather than providing a color-coded legend (Figure 20.10). In both Figures 20.9 and 20.10, I have colored the text labels in the same colors as the data. Colored labels can greatly enhance the direct labeling effect, but they can also turn out very poorly. If the text labels are printed in a color that is too light, then the labels become difficult to read. And, because text consists of very thin lines, colored text often appears to be lighter than an adjacent filled area of the same color. I generally circumvent these issues by using two different shades of each color, a light one for filled areas and a dark one for lines, outlines, and text. If you carefully inspect Figure 20.9 or 20.10, you will see how each data point or shaded area is filled with a light color and has an outline drawn in a darker color of the same hue. And the text labels are drawn in the same darker colors.
We can also use density plots such as the one in Figure 20.10 as a legend replacement, by placing the density plots into the margins of a scatter plot (Figure 20.11). This allows us to direct-label the marginal density plots rather than the central scatter plot and hence results in a figure that is somewhat less cluttered than Figure 20.9 with directly-labeled ellipses.
And finally, whenever we encode a single variable in multiple aesthetics, we don’t normally want multiple separate legends for the different aesthetics. Instead, there should be only a single legend-like visual element that conveys all mappings at once. In the case where we map the same variable onto a position along a major axis and onto color, this implies that the reference color bar should run along and be integrated into the same axis. Figure 20.12 shows a case where we map temperature to both a position along the x axis and onto color, and where we therefore have integrated the color legend into the x axis.