Sampling: Good and Bad (Lesson 4.2)
Chapter 4 - Day 3
Describe how convenience sampling can lead to bias.
Describe how voluntary response sampling can lead to bias.
Explain how random sampling can help to avoid bias.
For many years, we used the Gettysburg Address Activity to teach this lesson. This activity brilliantly reveals the need for a random sample when making inferences for a population. But the context never got the buy-in from students that we wanted. Except for a few history nerds, there were few students who cared to find out the true author of the Gettysburg Address.
Enter Beyonce. When her hit “Crazy in Love” came out, people started questioning whether or not she had written the lyrics. In a Vanity Fair article, Beyonce came back at them:
“‘Crazy in Love’ was really hard to write because there was so much going on … I mean, I had written — what? —seven, eight number one songs with Destiny’s Child, in a row.”
So how do we use statistics to determine if Beyonce wrote the lyrics to “Crazy in Love”?
Different authors tend to use different styles and word choices. It turns out that the average word length is fairly consistent for each author and can be used to distinguish one author from another. Since we know for sure that Beyonce wrote the lyrics for all of the Destiny’s Child songs (average word length 3.64), we should be able to assess her possible authorship of “Crazy in Love” by finding the song’s average word length and comparing it to 3.64.
But we don’t have time to find the actual word length for all of the words in “Crazy in Love” (population). Instead, we will pick 5 words (sample). The first sampling method will be the “quickly circle 5 words” method (convenience sample), which will produce terrible overestimates, leading us to want to use a better sampling method (random sample!).
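If you want to show students what "population average" and "sample average" mean computationally, here is a minimal sketch. It does not use the song lyrics; the opening sentence of the Gettysburg Address stands in as the population, and the helper names (split_words, average_word_length) and the seed are our own choices for illustration.

```python
import random
import re

# Stand-in text (NOT the song lyrics): the opening sentence of the
# Gettysburg Address, used only so the example has a population of words.
TEXT = (
    "Four score and seven years ago our fathers brought forth on this "
    "continent, a new nation, conceived in Liberty, and dedicated to the "
    "proposition that all men are created equal."
)


def split_words(text):
    """Return the words in text, dropping punctuation and spaces."""
    return re.findall(r"[A-Za-z']+", text)


def average_word_length(words):
    """Mean number of characters per word."""
    return sum(len(w) for w in words) / len(words)


words = split_words(TEXT)                      # the population
population_mean = average_word_length(words)   # the "truth" we normally can't afford to compute

rng = random.Random(42)                        # arbitrary seed, chosen for repeatability
sample = rng.sample(words, k=5)                # simple random sample of 5 words, no repeats
sample_mean = average_word_length(sample)      # the estimate from one sample

print(f"Population: {len(words)} words, mean length {population_mean:.2f}")
print(f"Sample: {sample}, mean length {sample_mean:.2f}")
```

Changing the seed plays the role of a different student drawing a different sample of 5 words.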
Rush students into their first sample of 5 words. The more you rush them, the worse they will mess it up.
Play the song on your speakers while students are finding their samples.
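If students are going to use randInt on their calculators to find the simple random sample, be sure that they seed their calculators first. Calculators that have never been seeded (or were recently reset) can all produce the same sequence of “random” numbers.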
Students can use their phones to get random numbers (ask Siri or ask Google) or they can use the website www.random.org.
Use sticker dots on a poster board so that you can save the results. You can refer back to this poster when you get to Lesson 4.4 and throughout the remainder of the year.
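For classes that prefer a computer to a calculator or phone, the random-number step in the tips above can be sketched as follows. This assumes the words have been numbered 1 through some total; the total of 300 and the seeds are hypothetical, and pick_positions is a name we made up for this example.

```python
import random


def pick_positions(num_words, sample_size, seed):
    """Pick distinct 1-based word positions for a simple random sample."""
    rng = random.Random(seed)  # the seed fixes which "random" numbers come out
    # sample() never repeats a position, so every pick is a different word.
    return sorted(rng.sample(range(1, num_words + 1), k=sample_size))


# Hypothetical setup: a text with 300 numbered words.
student_a = pick_positions(num_words=300, sample_size=5, seed=1234)
student_b = pick_positions(num_words=300, sample_size=5, seed=5678)
same_seed = pick_positions(num_words=300, sample_size=5, seed=1234)

print("Student A:", student_a)
print("Student B:", student_b)
print("Same seed as A:", same_seed)  # identical to Student A's positions
```

The last line illustrates why seeding matters: two devices starting from the same seed produce the same positions, so each student needs a different starting point.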
The true average word length for “Crazy in Love” is 3.53.
This is the first time students encounter a sampling distribution, which will be an important concept when we get to inference. Point at one of the dots and ask “What does this dot represent?” The answer is “A sample of 5 words, and an average calculated from that sample.” Then point at a different dot and ask “What does this dot represent?” The answer is “A different sample of 5 words, and an average calculated from that sample.” The dotplot represents many, many samples and an average calculated for each of those samples (an approximation of the sampling distribution of a sample mean).
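The "each dot is one sample" idea can also be simulated. The sketch below is an illustration under assumptions, not the class data: the population is the list of word lengths from the opening sentence of the Gettysburg Address (a stand-in for the song), and the function name, seed, and number of samples are our own choices.

```python
import random
from collections import Counter

# Stand-in population: word lengths from the opening sentence of the
# Gettysburg Address (NOT the song lyrics).
WORD_LENGTHS = [4, 5, 3, 5, 5, 3, 3, 7, 7, 5, 2, 4, 9, 1, 3,
                6, 9, 2, 7, 3, 9, 2, 3, 11, 4, 3, 3, 3, 7, 5]


def simulate_sample_means(lengths, sample_size, num_samples, rng):
    """Draw many simple random samples and return the mean of each one."""
    return [
        sum(rng.sample(lengths, k=sample_size)) / sample_size
        for _ in range(num_samples)
    ]


rng = random.Random(2024)  # arbitrary seed for repeatability
means = simulate_sample_means(WORD_LENGTHS, sample_size=5, num_samples=100, rng=rng)

# Text dotplot: one dot per sample, stacked at that sample's mean.
counts = Counter(round(m, 1) for m in means)
for value in sorted(counts):
    print(f"{value:4.1f} | " + "." * counts[value])
```

Each printed dot corresponds to one sample of 5 words and the average calculated from it, just like a sticker on the poster.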
This is the perfect context to discuss how poor sampling methods can lead to bias. The convenience sampling method has high bias because most of its estimates are overestimates of the truth (most dots are above 3.53), while random sampling has low bias because about half of its estimates are overestimates and about half are underestimates.
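A small simulation can make the contrast concrete. This is only a sketch under loud assumptions: the population is again the stand-in list of Gettysburg Address word lengths, and "rushed circling" is modeled as picking words with probability proportional to their length (with repeats allowed for simplicity). That model is our guess at why hurried samples run long, not something measured in class. Bias is quantified here as the average of the estimates minus the true mean, one common way to put a number on it.

```python
import random

# Stand-in population: word lengths from the opening sentence of the
# Gettysburg Address (NOT the song lyrics).
WORD_LENGTHS = [4, 5, 3, 5, 5, 3, 3, 7, 7, 5, 2, 4, 9, 1, 3,
                6, 9, 2, 7, 3, 9, 2, 3, 11, 4, 3, 3, 3, 7, 5]


def srs_estimate(lengths, n, rng):
    """Sample mean from a simple random sample (every word equally likely)."""
    return sum(rng.sample(lengths, k=n)) / n


def size_biased_estimate(lengths, n, rng):
    """Sample mean when longer words are more likely to be picked.

    ASSUMED model of rushed circling: selection probability is
    proportional to word length, and repeats are allowed.
    """
    picks = rng.choices(lengths, weights=lengths, k=n)
    return sum(picks) / n


def estimated_bias(estimator, lengths, n, reps, rng):
    """Average of many estimates minus the true population mean."""
    true_mean = sum(lengths) / len(lengths)
    estimates = [estimator(lengths, n, rng) for _ in range(reps)]
    return sum(estimates) / reps - true_mean


rng = random.Random(7)  # arbitrary seed for repeatability
srs_bias = estimated_bias(srs_estimate, WORD_LENGTHS, 5, 5000, rng)
convenience_bias = estimated_bias(size_biased_estimate, WORD_LENGTHS, 5, 5000, rng)

print(f"Random sampling bias:      {srs_bias:+.2f}")
print(f"Size-biased sampling bias: {convenience_bias:+.2f}")
```

Under these assumptions the random-sampling bias lands near zero while the size-biased method overestimates by a wide margin, which mirrors what the two dotplots on the poster show.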
The true average word length for “Crazy in Love” is 3.53, which is not very far from the known Beyonce average word length of 3.64. This means that we do not have convincing evidence that Beyonce did not write the lyrics. This is not the same as saying “Beyonce did write the lyrics”. The latter statement is equivalent to accepting the null hypothesis (a big no-no in the statistics world).