Why Visualize?

conceptual

visualization

Why visualize

Author

Josef Fruehwald

Published

September 9, 2024

Different Data, Same Stats

Data visualization is fundamental for statistical analysis, because without visualizing your data, you won’t really be able to understand it. A classic example is Anscombe’s Quartet (Anscombe 1973). When plotted, these look like 4 very distinct data series. If we were going to theorize about what kind of processes gave rise to each of these data sets, our theories would necessarily be very different.

Figure 1: Anscombe’s Quartet

But if we decided to only look at statistical summaries of the data, or statistical models of the data, they’d look nearly identical.

series	mean		sd
series	x	y	x	y
1	9.00	7.50	3.32	2.03
2	9.00	7.50	3.32	2.03
3	9.00	7.50	3.32	2.03
4	9.00	7.50	3.32	2.03

Table 1: Mean and Standard Deviations of x and y for each series.

	series	estimate	std.error	statistic	p.value
intercept	1	3.00	1.12	2.67	0.03
	2	3.00	1.13	2.67	0.03
	3	3.00	1.12	2.67	0.03
	4	3.00	1.12	2.67	0.03
slope	1	0.50	0.12	4.24	0.00
	2	0.50	0.12	4.24	0.00
	3	0.50	0.12	4.24	0.00
	4	0.50	0.12	4.24	0.00

Table 2: Linear model coefficients for each series

This was taken to an extreme degree with the Datasaurus (Matejka and Fitzmaurice 2017)

Table 3: Summary stats for the datasaurus dozen

dataset	mean		sd
dataset	x	y	x	y
away	54.27	47.83	16.77	26.94
bullseye	54.27	47.83	16.77	26.94
circle	54.27	47.84	16.76	26.93
dino	54.26	47.83	16.77	26.94
dots	54.26	47.84	16.77	26.93
h_lines	54.26	47.83	16.77	26.94
high_lines	54.27	47.84	16.77	26.94
slant_down	54.27	47.84	16.77	26.94
slant_up	54.27	47.83	16.77	26.94
star	54.27	47.84	16.77	26.93
v_lines	54.27	47.84	16.77	26.94
wide_lines	54.27	47.83	16.77	26.94
x_shape	54.26	47.84	16.77	26.93

Being cautious about our plots

We also need to be careful about our plots. Here’s another set of plots from the same Matejka and Fitzmaurice (2017) paper. This time, I’ve chosen three different (and somewhat common) methods for visualizing the same dataset. Depending on the visualization method, they wind up looking really different!

References

Anscombe, F. J. 1973. “Graphs in Statistical Analysis.” The American Statistician 27 (1): 17–21. https://doi.org/10.1080/00031305.1973.10478966.

Matejka, Justin, and George Fitzmaurice. 2017. “Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics Through Simulated Annealing.” In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 1290–94. ACM. https://doi.org/10.1145/3025453.3025912.

Reuse

CC-BY 4.0

Citation

BibTeX citation:

@online{fruehwald2024,
  author = {Fruehwald, Josef},
  title = {Why {Visualize?}},
  date = {2024-09-09},
  url = {https://lin611-2024.github.io/notes/meetings/2024-09-09_data-viz.html},
  langid = {en}
}

For attribution, please cite this work as:

Fruehwald, Josef. 2024. “Why Visualize?” September 9, 2024. https://lin611-2024.github.io/notes/meetings/2024-09-09_data-viz.html.