1.2 From Data to Shape
Three Experiments, Three Mysteries
At the end of the last section, we promised that random processes create shapes — and that those shapes are the key to everything. Let's find out what that means.
Here are three experiments. Each one involves recording a number many times:
- Experiment A: Roll two standard dice and add the results. Record the sum.
- Experiment B: Measure the time (in seconds) between consecutive cars passing a quiet intersection. Record the wait time.
- Experiment C: Measure the height (in centimeters) of a randomly selected adult. Record the height.
Each experiment produces a number. Each number is unpredictable on any single trial — you can't know in advance whether the dice will sum to 7, or how long you'll wait for the next car, or how tall the next person will be.
But from Section 1.1, you know something important: if you repeat these experiments enough times, patterns emerge. Proportions stabilize. Structure appears.
The question is: does the same pattern appear for all three?
Prediction: Imagine you run each experiment 1,000 times and make a bar chart of the results. Do you think the three bar charts will look similar to each other, or different? If different, how?
Take a moment and commit to a guess. Try to sketch — even mentally — what you think each chart might look like.
Building the Picture
Let's start with Experiment A: rolling two dice and adding.
If you roll two dice, the smallest sum you can get is $1 + 1 = 2$ and the largest is $6 + 6 = 12$. So the possible outcomes are the whole numbers from 2 to 12.
But are all those outcomes equally likely? Let's think about it.
Quick question: How many ways can you roll a sum of 2? How many ways can you roll a sum of 7? Which should show up more often?
There's only one way to get a sum of 2: both dice show 1. But there are six ways to get a sum of 7: (1,6), (2,5), (3,4), (4,3), (5,2), (6,1). So 7 should show up about six times as often as 2.
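If you'd like to check the counting for every sum, not just 2 and 7, here is a minimal Python sketch. It assumes two fair six-sided dice, lists all 36 equally likely pairs, and tallies how many pairs produce each sum. The variable names are just illustrative choices.

```python
from collections import Counter
from itertools import product

# Enumerate all 36 equally likely (first die, second die) pairs.
faces = range(1, 7)
ways = Counter(a + b for a, b in product(faces, repeat=2))

# One row per possible sum, with a bar as long as the number of ways.
for total in range(2, 13):
    print(f"sum {total:2d}: {ways[total]} way(s)  {'#' * ways[total]}")
```

The bars printed by this loop already hint at the shape the histogram should settle into.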
Now let's see this play out.
[Interactive: Dice Sum Histogram Builder. The student clicks "Roll" to roll two dice. Each roll adds the sum to a histogram with bars for values 2 through 12. The x-axis shows the possible sums, the y-axis shows frequency (count). A "Roll 10" and "Roll 100" button accelerate the process. After enough rolls, the histogram forms a clear triangular/peaked shape centered around 7. A toggle switches the y-axis between frequency (count) and proportion (count / total). A counter shows total number of rolls.]
Try this:
- Roll 20 times. The histogram is choppy and uneven. Hard to see any pattern.
- Roll up to 100 times. A shape starts to emerge. The middle values (6, 7, 8) have taller bars. The extremes (2, 12) are short.
- Roll up to 1,000 times. Now the shape is unmistakable: a peak in the center that tapers off symmetrically on both sides.
What do you see? The histogram has a clear peak at 7, with the bars getting shorter as you move toward 2 or 12. It's roughly symmetric — the left side mirrors the right.
Switch the y-axis to proportion. The shape is identical — only the numbers on the y-axis change. This makes sense: proportion is just frequency divided by the total, so every bar gets scaled by the same amount. The shape is preserved.
Remember from Section 1.1: proportion is the more revealing number, because it doesn't depend on how many times you ran the experiment. A histogram showing proportions lets you compare experiments of different sizes on equal footing.
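The roll-and-tally experiment can also be sketched in a few lines of Python. This is a rough stand-in for the interactive, not its actual implementation: the seed, the function name `roll_two_dice`, and the text bars are all assumptions made for illustration.

```python
import random
from collections import Counter

def roll_two_dice(n_rolls, rng):
    """Return a Counter mapping each sum (2..12) to how often it occurred."""
    return Counter(rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n_rolls))

rng = random.Random(1)  # arbitrary fixed seed so the run is repeatable
for n_rolls in (20, 100, 1000):
    counts = roll_two_dice(n_rolls, rng)
    # Proportion is frequency divided by the total number of rolls.
    proportions = {s: counts[s] / n_rolls for s in range(2, 13)}
    print(f"\n{n_rolls} rolls")
    for s in range(2, 13):
        bar = "#" * round(proportions[s] * 100)
        print(f"{s:2d} | count {counts[s]:4d} | proportion {proportions[s]:.3f} {bar}")
```

Because the bar length is based on the proportion, the three printouts are directly comparable even though they come from 20, 100, and 1,000 rolls.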
Now a Different Experiment
Keep that dice histogram in your mind. Now let's look at Experiment B: the time between cars at a quiet intersection.
This is a different kind of measurement. Unlike dice sums (which can only be 2, 3, 4, ..., 12), wait times can be any positive number — 0.3 seconds, 4.7 seconds, 22.1 seconds. There aren't neat categories to count.
To make a histogram, we group the wait times into ranges called bins: 0–5 seconds, 5–10 seconds, 10–15 seconds, and so on. Each bar's height shows how many observations fell into that range.
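Here is one way the binning step might look in Python. The wait times are made up for illustration, and the sketch assumes a value sitting exactly on a boundary (like 5.0) goes into the higher bin. That is a convention we chose, not something the text specifies.

```python
from collections import Counter

def bin_counts(values, bin_width):
    """Count how many values fall in each bin [k*width, (k+1)*width)."""
    counts = Counter(int(v // bin_width) for v in values)
    return {(k * bin_width, (k + 1) * bin_width): counts[k]
            for k in range(max(counts) + 1)}

# Hypothetical wait times in seconds, made up for illustration.
waits = [0.3, 4.7, 22.1, 1.2, 6.5, 3.9, 11.0, 2.4, 5.0, 0.8]

for (low, high), count in bin_counts(waits, 5).items():
    print(f"{low:>2}-{high:<2} s: {'#' * count} ({count})")
```

With these ten values, six waits land in 0–5, two in 5–10, one in 10–15, none in 15–20, and one in 20–25.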
[Interactive: Wait Time Histogram. The simulation generates random wait times drawn from an exponential-like process (most waits are short, a few are long). Each "Observe" click adds one wait time to the histogram. Buttons for "Observe 10" and "Observe 100" accelerate. The histogram bins are 0–5, 5–10, 10–15, etc. After 500+ observations, a clear pattern forms: the tallest bar is on the left (short waits), with bars getting progressively shorter to the right (long waits are rare). A toggle switches between frequency and proportion. A bin-width slider (2s, 5s, 10s) lets the student see how bin choice affects appearance while the overall shape persists.]
Before you look at the result: Do you think this histogram will have the same peaked, symmetric shape as the dice sums? Or something different?
Run the simulation up to 500 observations and look at what forms.
This shape is nothing like the dice histogram. Instead of a symmetric peak in the middle, you see:
- A tall bar on the far left (short waits are very common)
- Bars that drop off rapidly to the right (long waits are rare)
- A long "tail" stretching out: there are almost always a few very long waits
The shape is lopsided. It piles up on one side and trails off on the other.
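To see this lopsided shape outside the interactive, you could try a sketch like the one below. It assumes the gaps between cars follow an exponential model with an average gap of 6 seconds. Both the model and the number are assumptions for illustration, chosen only because they give "mostly short, occasionally long" waits.

```python
import random

rng = random.Random(7)   # arbitrary seed for repeatability
MEAN_WAIT = 6.0          # assumed average gap in seconds (not from the text)
BIN_WIDTH = 5            # 0-5, 5-10, 10-15, ... as in the histogram above

waits = [rng.expovariate(1 / MEAN_WAIT) for _ in range(500)]

# Count the observations in each 5-second bin.
n_bins = int(max(waits) // BIN_WIDTH) + 1
counts = [0] * n_bins
for w in waits:
    counts[int(w // BIN_WIDTH)] += 1

# Text histogram: one '#' per 5 observations.
for k, count in enumerate(counts):
    print(f"{k * BIN_WIDTH:>3}-{(k + 1) * BIN_WIDTH:<3} s | {'#' * (count // 5)} {count}")
```

The first row should be by far the longest, and the last few rows should hold only a handful of observations: the tail.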
And a Third
Now Experiment C: adult heights.
[Interactive: Height Histogram. The simulation generates heights for 1,000 randomly selected adults (drawn from a normal-like distribution with mean around 170 cm and standard deviation around 10 cm). The histogram displays automatically after generation. Bins are 5 cm wide. A "Generate New Sample" button draws a fresh batch. A toggle switches between frequency and proportion.]
This shape is different from the wait times, and similar to but not quite the same as the dice sums. It's symmetric and peaked — but the peak is smoother, the tails extend further, and it has a gentle bell curve shape rather than the triangular look of the dice sums.
The Reveal: Shape Is the Story
Let's put all three side by side.
[Interactive: Three Histograms, Side by Side. Three proportion histograms displayed together: (1) Dice sums — symmetric, triangular peak at 7, (2) Wait times — tall on the left, long tail to the right, (3) Heights — smooth, symmetric bell curve. Each has 1,000 observations. A "Regenerate All" button draws new samples for all three. The shapes stay consistent across regenerations — only small random fluctuations change.]
What changed? What stayed the same?
All three histograms were built the same way: repeat an experiment many times, group the outcomes, and plot how often each group occurred. That process stayed the same.
What changed is the shape. The dice sums make a triangle. The wait times pile up on the left and trail off to the right. The heights make a smooth bell. Three different experiments, three different shapes.
And here's the critical point: if you regenerate the data, the shapes come back. The individual data points are different every time, but the overall shape is stable — just like the proportions in Section 1.1 stabilized. The shape is the pattern.
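A quick way to test the "shapes come back" claim is to draw two independent samples and compare their proportions bar by bar. The sketch below does this for the dice sums. The seeds and the helper name `dice_sum_proportions` are arbitrary choices for illustration.

```python
import random
from collections import Counter

def dice_sum_proportions(n_rolls, rng):
    """Proportion of rolls landing on each sum from 2 to 12."""
    counts = Counter(rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n_rolls))
    return {s: counts[s] / n_rolls for s in range(2, 13)}

# Two independent "regenerations" of 1,000 rolls each.
first = dice_sum_proportions(1000, random.Random(11))
second = dice_sum_proportions(1000, random.Random(22))

largest_gap = max(abs(first[s] - second[s]) for s in range(2, 13))
for s in range(2, 13):
    print(f"{s:2d}: {first[s]:.3f} vs {second[s]:.3f}")
print(f"largest gap between the two samples: {largest_gap:.3f}")
```

The individual rolls in the two samples have nothing to do with each other, yet the two columns of proportions should differ only by a few hundredths.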
Each random process has its own shape. That shape is the process's fingerprint — a visual summary of where the outcomes tend to land and how often.
A Name for the Shape
Let's give this shape a name.
The histogram of a random process — the picture that shows which outcomes happen often, which happen rarely, and the overall pattern — is called the distribution of that process.
When we say "the distribution of dice sums" or "the distribution of wait times," we mean: the shape that emerges when you look at all the outcomes and how frequently each one occurs.
A distribution answers a single question: Where do the outcomes tend to land?
- For the dice sums, the distribution says: "Outcomes cluster near 7 and are rare near 2 and 12."
- For the wait times, the distribution says: "Most waits are short; long waits are possible but uncommon."
- For the heights, the distribution says: "Most people are near average; very short and very tall people are rare."
The word "distribution" literally means how something is distributed — how the outcomes are spread out or allocated across the possible values.
Connection to Section 1.1: In the last section, you discovered that each random process has a "fingerprint" — a set of proportions it naturally converges to. The distribution is that fingerprint made visible. Instead of tracking just one proportion (like "proportion of heads"), the distribution shows all the proportions at once — a complete picture of how the randomness is structured.
A Vocabulary for Shape
Now that we can see distributions, we need words to describe them. Let's build that vocabulary through comparison.
Look at these four shapes:
[Image: Four histograms arranged in a 2×2 grid, each labeled. Top-left: "Symmetric" — a histogram that is roughly mirror-image around its center (like the heights histogram). Top-right: "Skewed Right" — a histogram that piles up on the left with a long tail to the right (like the wait times histogram). Bottom-left: "Skewed Left" — a histogram that piles up on the right with a long tail to the left (like exam scores where most students do well). Bottom-right: "Uniform" — a flat histogram where all bars are approximately the same height (like rolling a single fair die).]
Symmetric vs. Skewed
A distribution is symmetric if its left half is approximately a mirror image of its right half. The dice sums and heights are both symmetric — the peak sits in the center, and the distribution falls off evenly on both sides.
A distribution is skewed if one tail is longer than the other. The direction of the skew is named for the long tail, not the pile-up:
- Skewed right (or "positively skewed"): the long tail stretches to the right. The wait times are skewed right — most values are small, but a few are very large.
- Skewed left (or "negatively skewed"): the long tail stretches to the left. Think of exam scores in a class where most students do well — scores pile up high, with a tail of lower scores stretching left.
Common confusion: Students often name the skew for where the data piles up. It's the opposite — skew is named for the tail, the direction where the rare extreme values live. If the tail points right, it's skewed right, even though most of the data is on the left.
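The "name it for the tail" rule can be turned into a rough check: find the tallest bin, then compare how far the data reaches on each side of it. The sketch below is only a heuristic we made up for illustration, not a standard measure of skew. The factor of 2 and all three data sets are assumptions.

```python
from collections import Counter
from itertools import product

def tail_direction(values, bin_width):
    """Rough heuristic: compare how far the data reaches on each side of the peak bin."""
    bins = Counter(int(v // bin_width) for v in values)   # bin k covers [k*width, (k+1)*width)
    peak_bin = max(bins, key=bins.get)
    peak_center = (peak_bin + 0.5) * bin_width
    left_reach = peak_center - min(values)
    right_reach = max(values) - peak_center
    if right_reach > 2 * left_reach:       # the factor 2 is an arbitrary cutoff
        return "skewed right"
    if left_reach > 2 * right_reach:
        return "skewed left"
    return "roughly symmetric"

# Hypothetical data, made up for illustration.
wait_times = [1, 2, 2, 3, 3, 4, 4, 6, 7, 9, 12, 18, 31]                      # piles up low
exam_scores = [34, 52, 61, 68, 72, 75, 78, 81, 82, 83, 84, 86, 88, 91, 95]   # piles up high
dice_sums = [a + b for a, b in product(range(1, 7), repeat=2)]               # all 36 pairs

print(tail_direction(wait_times, 5))    # skewed right
print(tail_direction(exam_scores, 5))   # skewed left
print(tail_direction(dice_sums, 1))     # roughly symmetric
```

Notice that the wait times pile up on the left yet come out "skewed right": the label follows the long reach, not the pile.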
Peaked vs. Flat
Beyond symmetry, distributions differ in how "peaked" or "flat" they are:
- A peaked (or concentrated) distribution has most of its data clustered tightly around the center. The heights histogram is peaked — most adults are within about 15 cm of the average.
- A flat (or spread out) distribution has data spread more evenly across the range. Rolling a single fair die gives an approximately flat distribution — each face appears equally often.
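How much is "most" in the heights example? If we assume heights follow an exact bell curve with mean 170 cm and standard deviation 10 cm (the approximate values the height simulation uses), Python's `statistics.NormalDist` can compute the share directly. Treat this as a sketch under that assumption, not a statement about real populations.

```python
from statistics import NormalDist

# Assumption: an exact normal curve with mean 170 cm and standard deviation 10 cm.
heights = NormalDist(mu=170, sigma=10)

within_15 = heights.cdf(185) - heights.cdf(155)
within_5 = heights.cdf(175) - heights.cdf(165)
print(f"share within 15 cm of the average: {within_15:.3f}")
print(f"share within  5 cm of the average: {within_5:.3f}")

# A single fair die, by contrast, puts the same share on every face.
die_share = 1 / 6
print(f"share on each face of a fair die:  {die_share:.3f}")
```

Under this assumption roughly 87% of heights fall within 15 cm of the average, which is what "peaked" looks like in numbers. The die has no such concentration.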
Pause and think: Can a distribution be both symmetric AND flat? Both symmetric AND skewed? Both skewed AND peaked?
Symmetric and flat — yes: a single fair die. Symmetric and skewed — no, by definition. Skewed and peaked — yes: the wait times are skewed right and strongly peaked near zero.
These aren't rigid categories. Real distributions exist on a spectrum from perfectly symmetric to heavily skewed, and from very peaked to very flat. The vocabulary gives you a way to describe where on that spectrum a particular distribution falls.
Controlled Contrasts: What Changes the Shape?
Let's isolate what causes these different shapes by doing something careful: we'll change one thing about an experiment and see what happens to the distribution.
[Interactive: Dice Sum Explorer. The student selects how many dice to roll and sum: 1, 2, 3, 5, or 10. For each choice, the simulation rolls 1,000 times and displays a proportion histogram. All histograms can be shown simultaneously for comparison. Key observations: - 1 die: flat (uniform) — each value 1–6 appears equally often - 2 dice: triangular peak centered at 7 - 3 dice: smoother, more bell-shaped peak centered at 10.5 - 5 dice: even smoother bell shape centered at 17.5 - 10 dice: very smooth, very bell-shaped, centered at 35]
Start with 1 die. Then 2 dice. Then 3, 5, and 10.
What changed? What stayed the same?
The experiment is the same type each time — rolling fair dice and adding. The only thing that changed is how many dice we're adding together.
But look at the shape: it goes from flat (1 die) → triangular (2 dice) → smooth bell (3+ dice). The more dice you add, the more the distribution looks like a smooth, symmetric bell curve. This is no accident — you're watching one of the most important phenomena in probability, and you'll meet it formally later in the course.
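The simulation shows this with random rolls, but for fair dice you can also work out the exact proportions. The sketch below builds the distribution one die at a time using exact fractions. The function name is our own, and "center" here is computed as the proportion-weighted average of the sums, which for these symmetric shapes is also the midpoint.

```python
from fractions import Fraction

def dice_sum_distribution(n_dice):
    """Exact proportions for the sum of n fair six-sided dice."""
    dist = {0: Fraction(1)}          # before rolling, the sum is certainly 0
    for _ in range(n_dice):
        new_dist = {}
        for total, p in dist.items():
            for face in range(1, 7): # add one more die, each face with proportion 1/6
                new_dist[total + face] = new_dist.get(total + face, 0) + p * Fraction(1, 6)
        dist = new_dist
    return dist

for n_dice in (1, 2, 3, 5, 10):
    dist = dice_sum_distribution(n_dice)
    center = sum(total * p for total, p in dist.items())
    peak = max(dist.values())
    print(f"{n_dice:2d} dice: sums {min(dist)}-{max(dist)}, "
          f"center {float(center):.1f}, tallest bar {float(peak):.3f}")
```

The centers come out to 3.5, 7, 10.5, 17.5, and 35, matching what the histograms show.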
Now let's try a different kind of contrast.
[Interactive: Fair vs. Loaded Dice. Two side-by-side histograms, each showing the distribution of the sum of two dice over 1,000 rolls. The left histogram uses two fair dice. The right uses two "loaded" dice where 6 appears with proportion 0.5 (five times as likely as each other face). Both accumulate in real time. Observations: - Fair dice: symmetric triangle centered at 7 - Loaded dice: peak shifted to the right (toward 12), and the shape is noticeably skewed left — sums near 12 are much more common, sums near 2 are very rare.]
What changed? What stayed the same?
Same number of dice (two), same process (roll and add). The only change is that the dice are loaded toward 6. The result: the peak shifts right, and the shape becomes skewed. The physical setup of the experiment directly determines the shape of the distribution.
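The same exact-counting idea works for the loaded dice. The sketch below assumes the loading described for the interactive: 6 comes up with proportion 0.5, and each of the other five faces with proportion 0.1. The function and variable names are illustrative.

```python
from fractions import Fraction

def two_dice_sum_distribution(face_probs):
    """Exact distribution of the sum of two independent dice with the given face proportions."""
    dist = {}
    for a, pa in face_probs.items():
        for b, pb in face_probs.items():
            dist[a + b] = dist.get(a + b, 0) + pa * pb
    return dist

fair = {face: Fraction(1, 6) for face in range(1, 7)}
loaded = {face: Fraction(1, 10) for face in range(1, 6)}  # faces 1-5
loaded[6] = Fraction(1, 2)                                # face 6 is five times as likely

fair_dist = two_dice_sum_distribution(fair)
loaded_dist = two_dice_sum_distribution(loaded)

for total in range(2, 13):
    print(f"{total:2d}: fair {float(fair_dist[total]):.3f}   loaded {float(loaded_dist[total]):.3f}")
```

Under this loading the tallest bar moves to 12 (proportion 0.25), the sums from 7 to 11 sit between 0.10 and 0.14, and a sum of 2 drops to 0.01.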
What the Shape Tells You
We've been looking at shapes. But shapes aren't just pretty — they carry information. Let's read a histogram the way you'd read a fingerprint.
Example: Reading the dice sum distribution.
Look back at the proportion histogram for two fair dice. You can read off concrete facts:
- The most common sum is 7 (the tallest bar).
- Sums of 6 and 8 are almost as common as 7.
- A sum of 2 or 12 is rare: those are the two shortest bars, each about one-sixth the height of the bar at 7.
- About what proportion of rolls give a sum between 5 and 9? You can estimate by adding those bars' heights. It's roughly 0.67 — about two-thirds of all rolls land in that range.
The distribution doesn't just describe the data you collected. It tells you what to expect from future data. If the proportion of sums between 5 and 9 is about 0.67, then in your next 300 rolls, you'd expect roughly $0.67 \times 300 = 201$ of them to fall in that range.
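You can check the 0.67 estimate against exact counting. This sketch assumes two fair dice and counts how many of the 36 equally likely pairs give a sum from 5 to 9. It then compares the exact expected count with the one based on the rounded figure.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely sums of two fair dice.
outcomes = [a + b for a, b in product(range(1, 7), repeat=2)]
in_range = sum(1 for total in outcomes if 5 <= total <= 9)
proportion = Fraction(in_range, len(outcomes))

print(f"exact proportion of sums from 5 to 9: {proportion} = {float(proportion):.3f}")
print(f"expected count in 300 rolls (exact):   {float(proportion * 300):.0f}")
print(f"expected count using rounded 0.67:     {0.67 * 300:.0f}")
```

The exact proportion is 24/36 = 2/3, so the exact expectation is 200 rolls. The 201 in the text comes from rounding 2/3 up to 0.67 first. Either way, the prediction is "about 200."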
Connection to Section 1.1: This is the same bridge between past proportions and future expectations that we saw with coins. Run enough trials, and the proportions stabilize. Those stable proportions become predictions.
Example: Reading the wait time distribution.
The wait time histogram — skewed right, piled up on the left — tells a different story:
- Most waits are short (under 5 seconds).
- A wait of 30 seconds is possible but rare.
- The distribution has no hard upper limit — very occasionally, you might wait well over a minute.
If you managed a traffic signal at this intersection, this shape would tell you something concrete: optimizing for short waits covers the majority of cases, but you need to design for the occasional much longer gap.
The big idea: The shape of a distribution isn't just a picture. It's a description of how the randomness works — what's typical, what's unusual, and where the extremes live.
Worked Example: Describing a Distribution
Here's a new histogram.
[Image: A proportion histogram showing the distribution of daily tips (in dollars) earned by a barista. The histogram has bars from \$0 to \$60 in \$5 bins. The tallest bar is at \$10–\$15. Bars decrease gradually to the right, with a few observations in the \$40–\$60 range. The shape is skewed right.]
Let's describe this distribution step by step:
- Shape: Skewed right. Most of the data is on the left (low tips), with a long tail stretching right (occasionally high tips).
- Center (roughly): The peak is around \$10–\$15. This is where tips most commonly fall.
- Spread: Tips range from near \$0 to about \$60, but the bulk is between \$5 and \$25.
- What it says: On a typical day, the barista earns around \$10–\$15 in tips. Days with \$30+ in tips happen but are uncommon. A \$50+ tip day is very rare.
We've described the shape, located the center, described the spread, and translated it into meaning. This four-step pattern — shape, center, spread, meaning — works for any distribution.
Your Turn: Faded Example
Here's another histogram.
[Image: A proportion histogram showing the distribution of scores on a 100-point exam. Most scores are clustered between 70 and 95, with the tallest bars around 80–85. A small tail extends left down to about 30. The shape is skewed left.]
Fill in the description:
- Shape: Skewed ___. Most scores are ___ (high/low), with a tail stretching toward ___ scores.
- Center (roughly): The peak is around ___.
- Spread: Scores range from about ___ to ___, but most are between ___ and ___.
- What it says: ___
Check your answers:
- Shape: Skewed left. Most scores are high, with a tail stretching toward low scores.
- Center: The peak is around 80–85.
- Spread: Scores range from about 30 to 100, but most are between 70 and 95.
- What it says: Most students performed well on this exam. A few students scored much lower, but very low scores (below 40) were rare.
Distributions Are Everywhere
Before we move on, let's stretch this idea beyond histograms of numbers. The concept of a distribution — a description of where outcomes tend to land — appears in situations you might not expect.
Prediction: Think about the number of text messages you send per day. If you tracked this for 100 days, what shape would the histogram take? Symmetric? Skewed? Peaked? Flat? Why?
There's no single right answer — it depends on your habits. But most people's message counts are skewed right: many days with a moderate number of messages, and occasional days with a burst of activity. Very few people have a perfectly symmetric distribution of daily texts.
Here are a few more real-world distributions and their shapes:
| What's measured | Typical shape | Why |
|---|---|---|
| Heights of adults | Symmetric, bell-shaped | Most people are near average; very tall and very short are equally rare |
| Income in a country | Skewed right | Most people earn moderate amounts; a small number earn extremely large amounts |
| Number of goals per soccer game | Skewed right, peaked near 2–3 | Most games have a few goals; blowouts are rare |
| Outcome of a single fair die | Flat (uniform) | All faces are equally likely |
| Daily high temperature in July | Symmetric, peaked | Most days are near the monthly average; extreme heat and cool days are rare |
What changed? What stayed the same? Every row describes a random process with many trials. Every row has a distribution. But the shapes are different because the underlying processes are different. The shape encodes the structure of the randomness.
[Interactive: Distribution Gallery. A set of 6 real-world distributions presented as proportion histograms, each with a brief label. The student's task: for each one, classify it as approximately symmetric, skewed right, skewed left, or uniform, and identify whether it's peaked or flat. After submitting their answers, they see the correct classification with a brief explanation of why the process produces that shape. Distributions include: (1) Waiting time at a bus stop, (2) IQ scores, (3) Household net worth, (4) Roll of a single fair die, (5) Number of siblings a person has, (6) Scores on an easy quiz.]
Practice
Level 1: Concrete
Problem 1. You roll a single fair six-sided die 600 times and make a histogram of the results.
(a) What values will appear on the x-axis?
(b) Approximately how tall will each bar be (in frequency)?
(c) What shape will the histogram be?
Think about it, then check.
(a) The values 1, 2, 3, 4, 5, 6. (b) Each value should appear about $\frac{600}{6} = 100$ times. All bars are approximately the same height. (c) Flat — approximately uniform. Each outcome is equally likely, so no value dominates.
Problem 2. A bakery tracks how many croissants it sells each morning for 200 days. The histogram is skewed right with a peak around 30 croissants.
(a) What does "skewed right" mean in this context?
(b) Is a day with 60 croissant sales likely or rare?
(c) Approximately where do most days fall?
Work through it before checking.
(a) Most days, the bakery sells a moderate number of croissants (around 30), but there are occasional days with unusually high sales — the right tail. Those high-sales days might be weekends, holidays, or events. (b) Rare — 60 is in the long right tail. (c) Most days are near the peak, probably between 20 and 40 croissants.
Level 2: Pattern
Problem 3. For each scenario below, predict the shape of the distribution (symmetric, skewed right, skewed left, or uniform). Explain your reasoning.
| Scenario | Your predicted shape | Your reasoning |
|---|---|---|
| (a) Ages of students in a college class | ||
| (b) Lifetimes of light bulbs | ||
| (c) A random number generator that outputs integers 1 through 10 equally | ||
| (d) Scores on a very difficult exam | ||
Commit to your predictions, then check.
(a) Skewed right. Most students are 18–22, but there are occasionally older students (25, 30, 40+). The right tail is longer. (b) Skewed right. Most bulbs last a moderate amount of time, but some fail early and a few last a very long time. The long right tail represents the lucky long-lasting bulbs. (c) Uniform (flat). By design, all outcomes are equally likely. (d) Skewed right. On a very difficult exam, most students score low, with a tail of higher scores from a few who mastered the material. (Note: if the exam is very easy, the distribution would be skewed left instead — most students score high.)
Problem 4. Two histograms are shown below. Both are based on 500 observations.
[Image: Two histograms side by side. Histogram A is symmetric and bell-shaped, centered around 50, ranging from about 20 to 80. Histogram B is also symmetric and bell-shaped, centered around 50, but narrower — ranging from about 40 to 60.]
(a) What feature is the same in both distributions?
(b) What feature is different?
(c) If these represent test scores from two different classes, what does the difference tell you about the two classes?
Think about it first.
(a) Both are symmetric, both are centered at about 50. (b) The spread is different. Histogram A has much more variability — scores range widely — while Histogram B is more concentrated around the center. (c) In class A, there's a big range of performance — some students did very well, some did poorly. In class B, students performed much more consistently — almost everyone scored near 50. The classes have similar average performance but very different consistency.
Level 3: Structure
Problem 5. Explain why the sum of two fair dice produces a peaked, symmetric distribution rather than a flat one. Your explanation should reference the number of ways to achieve different sums.
Think through the logic before reading.
Each individual die is flat — every face is equally likely. But when you add two dice, the number of combinations that produce each sum is not equal. There are 6 ways to get a sum of 7 (1+6, 2+5, 3+4, 4+3, 5+2, 6+1) but only 1 way to get a sum of 2 (1+1). The middle sums have more combinations feeding into them, so they appear more frequently. The peaked shape is a direct consequence of this unequal counting.
This is a key insight: even when the inputs to a process are flat, the output can be peaked, if some outputs can be reached in more ways than others.
Problem 6. A friend says: "If I collect enough data, any histogram will eventually become bell-shaped." Is this true? Explain.
Careful — this is a common misconception.
No. Collecting more data makes the histogram smoother and more clearly reveals the underlying shape — but it doesn't change the shape itself. The wait-time distribution is skewed right with 100 observations and still skewed right with 100,000 observations. More data → clearer shape, not different shape.
The shape is determined by the underlying process, not by the sample size. Rolling a single fair die always produces a flat distribution, no matter how many times you roll it.
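Here is a small sketch of that claim. It assumes wait times follow an exponential model with an average of 6 seconds (an assumption for illustration, not a value from this section) and compares the first two histogram bars at three sample sizes.

```python
import random

def share_under(waits, threshold):
    """Proportion of wait times shorter than the threshold."""
    return sum(1 for w in waits if w < threshold) / len(waits)

rng = random.Random(3)   # arbitrary seed for repeatability
MEAN_WAIT = 6.0          # assumed average gap in seconds (not from the text)

shares = {}
for n in (100, 10_000, 100_000):
    waits = [rng.expovariate(1 / MEAN_WAIT) for _ in range(n)]
    first_bar = share_under(waits, 5)                 # proportion in 0-5 s
    second_bar = share_under(waits, 10) - first_bar   # proportion in 5-10 s
    shares[n] = (first_bar, second_bar)
    print(f"n = {n:>6}: 0-5 s {first_bar:.3f}, 5-10 s {second_bar:.3f}")
```

At every sample size the first bar should tower over the second. More data steadies the numbers. It does not turn the pile-up on the left into a bell.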
Level 4: Transfer
Problem 7. You're analyzing customer data for an online store. You have the distribution of order values (how much each customer spends per order). The distribution is skewed right, with a peak around \$25 and a long tail extending past \$200.
(a) If you had to choose a single "typical" order value to report to your boss, would you pick a number near the peak (\$25) or somewhere higher? Why?
(b) The company is considering offering free shipping on orders over a certain amount. Using the shape of the distribution, how would you choose the threshold?
(c) A rival company reports that their average order value is \$45. Does that mean their typical customer spends more than yours? What might the shape of their distribution look like?
These are open questions — think about the real-world implications of shape.
(a) The peak (\$25) represents the most common order value — it's where orders "tend to land." A number higher than \$25, like the average, would be pulled up by the few very large orders in the tail. The peak better captures what a typical customer does. (b) You'd want the threshold high enough that only the "tail" customers miss it, but low enough to motivate the cluster of customers near the peak to spend a bit more. The distribution's shape tells you where the bulk of customers are and how far the tail extends — that's exactly the information you need. (c) Not necessarily. A skewed-right distribution can have an average much higher than its peak because the few large values pull the average to the right. Their typical customer might spend the same \$25, but they might have more extreme high-value orders pulling the average up. You'd need to see their distribution's shape, not just the average.
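To see how a few large orders pull the average away from the peak, try a small made-up example. The order values below are hypothetical, chosen only so that most sit near 25 dollars with a short right tail.

```python
from statistics import mean, mode

# Hypothetical order values in dollars, made up for illustration:
# most orders sit near 25, and a few large ones form the right tail.
orders = [15, 20, 20, 25, 25, 25, 25, 30, 30, 35, 40, 60, 120, 210]

typical = mode(orders)   # the most common value, i.e. the peak
average = mean(orders)   # pulled upward by the few large orders

print(f"peak (most common order): ${typical}")
print(f"average order:            ${average:.2f}")
```

In this made-up data the peak is 25 dollars but the average is close to 49 dollars, even though only three of the fourteen orders are above 49. An average of 45 dollars by itself tells you little about what a typical customer spends.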
Debug Challenge
Problem 8. Here's a classmate's reasoning. Find the flaw:
"The distribution of wait times at this bus stop is skewed right. That means long waits are more common than short waits."
What's wrong?
Read carefully and think about what a right-skewed shape actually looks like.
It's backwards. In a right-skewed distribution, the data piles up on the left — short waits are the most common. The tail extends to the right, meaning long waits are possible but rare. The classmate confused the direction of the skew (named for the tail) with the direction of the pile-up.
Reflection
In one sentence, what's a distribution? Try to articulate it in your own words before reading on.
Here's one way to put it: A distribution is a description of where the outcomes of a random process tend to land — which outcomes are common, which are rare, and the overall shape of the pattern.
Confidence check: On a scale of 1 to 5, how confident are you that you could look at a histogram and describe its shape using the vocabulary from this section (symmetric, skewed left/right, peaked, flat)?
What's one thing you found surprising? Maybe it's that adding fair dice creates a non-flat shape. Maybe it's that radically different real-world processes (income, heights, light bulbs) each have their own characteristic shape. Whatever surprised you, sit with it — surprise is where learning happens.
Creation
Distribution Scavenger Hunt. Think of three real-world quantities you encounter in your daily life — things that vary and that you could (in principle) measure many times. For each one:
- What is the quantity? (e.g., "minutes I spend on my phone each day")
- What shape do you think its distribution takes? (symmetric, skewed right, skewed left, uniform?)
- Where do you think the peak is?
- How spread out do you think it is?
- Is there a natural boundary? (e.g., can't be negative, can't exceed 24 hours)
Natural boundaries are one of the biggest clues to shape. If a quantity can't go below zero but can be very large, that often produces a right-skewed distribution — the values pile up near the boundary and trail off in the other direction.
Looking Ahead
You now have the core idea: every random process creates a distribution — a shape that describes where outcomes tend to land. And you have a vocabulary for those shapes: symmetric, skewed, peaked, flat.
But right now, our descriptions are qualitative. We say "peaked" or "spread out" without any way to measure how much. Two distributions can both be "symmetric and peaked" yet look very different.
In Section 1.3, we'll put two distributions side by side and ask: What exactly makes them different? That question will push us toward needing precise, numerical descriptions — and that's where the real mathematics begins.