Updated: Mar 4, 2018
Identify the population and sample in a statistical study.
Identify voluntary response sampling and convenience sampling and explain how these sampling methods can lead to bias.
Describe how to select a simple random sample with technology or a table of random digits.
Activity: Does Beyoncé write her own lyrics?
For 10 years, I have used the Gettysburg Address Activity to teach this lesson. This activity brilliantly reveals the need for a simple random sample when making inferences for a population. But the context never got the buy-in from students that I wanted. Except for a few history nerds, there were few students who cared to find out the true author of the Gettysburg Address.
Enter Beyoncé. When her hit “Crazy in Love” came out, people started questioning whether or not she had written the lyrics. In a Vanity Fair article, Beyoncé came back at them:
‘Crazy in Love’ was really hard to write because there was so much going on … I mean, I had written — what? — seven, eight number one songs with Destiny’s Child, in a row.
So how do we use statistics to determine if Beyoncé wrote the lyrics to “Crazy in Love”? It is well known that different authors use different styles and word choice. It turns out that the average word length is fairly consistent for each author and can be used as a way to distinguish one author from another. Since we know for sure that Beyoncé wrote the lyrics for all of the Destiny’s Child songs (average word length 3.64), we should be able to determine her possible authorship of “Crazy in Love” by finding the average word length.
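For teachers who want to check an average word length with technology, here is a minimal Python sketch. The function name average_word_length and the placeholder sentence are made up for illustration, and the counting rule (letters only, apostrophes not counted) is an assumption; the activity itself does not say how contractions were counted.

```python
import re


def average_word_length(text: str) -> float:
    """Return the mean number of letters per word in `text`.

    ASSUMPTION: a word is a run of letters, optionally joined by
    apostrophes (so "don't" is one word), and only letters are counted.
    """
    words = re.findall(r"[^\W\d_]+(?:['’][^\W\d_]+)*", text)
    if not words:
        raise ValueError("text contains no words")
    # Count letters only, so the apostrophe in "don't" adds nothing.
    lengths = [sum(ch.isalpha() for ch in word) for word in words]
    return sum(lengths) / len(lengths)


# Placeholder text, NOT song lyrics: 9 words with 35 letters in total.
placeholder = "the quick brown fox jumps over the lazy dog"
print(round(average_word_length(placeholder), 2))  # 3.89
```

Pasting in a full set of lyrics in place of the placeholder would give the population mean for that song under this counting rule.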
But we don’t have time to find the actual word length for all of the words in “Crazy in Love” (population). Instead, we will pick 5 words (sample). The first sampling method will be the “quickly circle 5 words” method (convenience sample), which will produce terrible overestimates, leading us to want to use a better sampling method (simple random sample!).
Simple Random Sample:
Rush students into their first sample of 5 words. The more you rush them, the worse they will mess it up.
Play the song on your speakers while students are finding their samples.
If students are going to use randInt on their calculators to find the simple random sample, be sure that they seed their calculators first. Better yet, have students use their iPhones by asking Siri to “Give me a random number between 1 and 297”.
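If a computer is handy, the same selection can be sketched in Python. This assumes the words of the song have been numbered 1 to 297 (the range in the Siri prompt); the seed value and variable names are arbitrary choices for illustration.

```python
import random

POPULATION_SIZE = 297  # words numbered 1..297, as in the Siri prompt
SAMPLE_SIZE = 5

# Seeding makes the run repeatable; change or drop the seed so that
# each student gets a different sample.
rng = random.Random(2018)

# random.sample draws WITHOUT replacement, so no word number repeats.
chosen = sorted(rng.sample(range(1, POPULATION_SIZE + 1), SAMPLE_SIZE))
print(chosen)
```

Students then find the words with those five numbers and average their lengths. Drawing without replacement avoids having to throw out repeated numbers, which can happen when random integers are generated one at a time.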
Use sticker dots on a poster board so that you can save the results. You can refer back to these posters throughout the remainder of the year.
The true average word length for “Crazy in Love” is 3.53.
This might be the first time that students are experiencing a sampling distribution, which will be an important concept when we get to inference. Point at one of the dots and ask “What does this dot represent?” The answer is “A sample of 5 words, and an average calculated from that sample.” Then point at a different dot and ask “What does this dot represent?” The answer is “A different sample of 5 words, and an average calculated from that sample.” So the dotplot represents many, many samples and an average calculated for each of those samples (the sampling distribution of a sample mean).
This is the perfect context to discuss how poor sampling methods can lead to bias. The convenience sample has high bias because most of the estimates are overestimates of the truth (most dots are above 3.53) and the simple random sample has low bias because about half of the estimates are overestimates and half are underestimates.
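The contrast between the two dotplots can be imitated with a rough simulation. Everything specific here is an assumption: the population is 297 made-up word lengths (not the real lyrics), and the convenience sample is modeled as "longer words are more likely to get circled", which is only one plausible way the rushed method could go wrong.

```python
import random
from statistics import mean

# HYPOTHETICAL population: 297 made-up word lengths, NOT the real lyrics.
word_lengths = ([1] * 15 + [2] * 60 + [3] * 80 + [4] * 75
                + [5] * 35 + [6] * 20 + [7] * 8 + [8] * 4)
true_mean = mean(word_lengths)

rng = random.Random(1)  # arbitrary seed so the run is repeatable
N_SAMPLES, SAMPLE_SIZE = 1000, 5


def srs_mean():
    # Simple random sample: every set of 5 words is equally likely.
    return mean(rng.sample(word_lengths, SAMPLE_SIZE))


def convenience_mean():
    # ASSUMPTION: the chance of circling a word is proportional to its
    # length. rng.choices draws with replacement, a simplification.
    picks = rng.choices(word_lengths, weights=word_lengths, k=SAMPLE_SIZE)
    return mean(picks)


def share_over(means):
    # Fraction of sample means that overestimate the population mean.
    return sum(m > true_mean for m in means) / len(means)


srs_means = [srs_mean() for _ in range(N_SAMPLES)]
convenience_means = [convenience_mean() for _ in range(N_SAMPLES)]

print("population mean:", round(true_mean, 2))
print("SRS: mean of sample means", round(mean(srs_means), 2),
      "share of overestimates", round(share_over(srs_means), 2))
print("convenience: mean of sample means", round(mean(convenience_means), 2),
      "share of overestimates", round(share_over(convenience_means), 2))
```

Each value in srs_means or convenience_means plays the role of one sticker dot. Under these assumptions the simple random sample means center near the population mean, while the convenience sample means sit noticeably above it.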
The true average word length for “Crazy in Love” is 3.53, which is not very far from the known Beyoncé average word length of 3.64. This means that we do not have convincing evidence that Beyoncé did not write the lyrics. This is not the same as saying “Beyoncé did write the lyrics”. The latter statement is equivalent to accepting the null hypothesis.
One possible extension for this activity is to have students take random samples of size 10 and create a third dotplot. This will show students that increasing the sample size decreases the variability of the sampling distribution.
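This extension can be previewed with a short simulation before trying it in class. The population below is a made-up list of 297 word lengths (an assumption, not the real lyrics), and the helper name sample_means is hypothetical.

```python
import random
from statistics import mean, stdev

# HYPOTHETICAL population: 297 made-up word lengths, NOT the real lyrics.
word_lengths = ([1] * 15 + [2] * 60 + [3] * 80 + [4] * 75
                + [5] * 35 + [6] * 20 + [7] * 8 + [8] * 4)

rng = random.Random(10)  # arbitrary seed so the run is repeatable


def sample_means(sample_size, n_samples=1000):
    # Take many simple random samples and record the mean of each one.
    return [mean(rng.sample(word_lengths, sample_size))
            for _ in range(n_samples)]


means_n5 = sample_means(5)
means_n10 = sample_means(10)

# The standard deviation of the sample means measures how spread out
# the dots in each dotplot would be.
sd_n5 = stdev(means_n5)
sd_n10 = stdev(means_n10)
print("spread of sample means, n = 5: ", round(sd_n5, 2))
print("spread of sample means, n = 10:", round(sd_n10, 2))
```

Both sets of sample means center on the same population mean, but the ones from samples of size 10 cluster more tightly around it.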