Stats Medic
- Feb 10
- 9 min read

Would This Get Credit? 2023 AP Statistics Exam #4

Chris Viste is a teacher of AP Statistics, AP Calculus BC, and Algebra 2 and is the math team advisor at New Berlin West High School in New Berlin, Wisconsin. Chris has been an AP Statistics Exam Reader, Table Leader and Early Table Leader. She can be reached at chris.viste@nbexcellence.org.

The Question - 2023 #4 (and the rubric)

A medical researcher completed a study comparing an omega-3 fatty acids supplement to a placebo in the treatment of irritability in patients with a certain medical condition. Nineteen patients with the medical condition volunteered to participate in the study. The study was conducted using the following weekly schedule.

Week 1: Each patient took a randomly assigned treatment, omega-3 or placebo.
Week 2: The patients did not take either the omega-3 supplement or the placebo. This was necessary to reduce the possibility of any carryover effect from the assigned treatment taken during week 1.
Week 3: Each patient took the treatment, omega-3 or placebo, that they did not take during week 1.

At the end of week 1 and week 3, each patient’s irritability was given a score on a scale of 0 to 10, with 0 representing no irritability and 10 representing the highest level of irritability.

For each patient, the two irritability scores and the difference in their scores (placebo minus omega-3) were recorded. The results are summarized in the table and boxplots.

The researcher claims the omega-3 supplement will decrease the mean irritability score of all patients with the medical condition similar to the volunteers who participated in the study. Is there convincing statistical evidence to support the researcher’s claim at a significance level of α = 0.05? Complete the appropriate inference procedure to support your answer.

This question was scored in three sections. In the first part, there were four components. Students needed to identify the appropriate procedure for a population mean difference with name or formula (with variables or numbers), state correct hypotheses (with one mean, correct equality in the null hypothesis, and the right direction in the alternative), and include context.

WOULD THIS GET CREDIT (Identify the procedure)?

Students struggled to recognize this as a matched-pairs t-test. In Responses 1 – 3, students confused a matched-pairs test with a 2-sample test. We saw a lot of responses like this. Response 4 uses a formula to name the procedure, and even though a formula with symbols or numbers was acceptable, this formula was incorrect. Students had to make clear in their identification of the procedure that they were working with a single sample. Responses 5 – 9 are all correct ways to identify the procedure.

WOULD THIS GET CREDIT (Hypotheses)?

Hypotheses for a matched-pairs test should be written in terms of a single mean and the null hypothesis should indicate that the population mean difference = 0. Some students named a matched pairs t-test but then went on to use two means in their hypotheses, demonstrating that they weren’t really sure what type of test they were doing. Responses 1 and 2 are incorrect because they include two means, one for each set of scores, not the mean of the differences of the scores for each individual. Response 3 earns partial credit because a single mean was used in the hypotheses, but they referred to a sample mean instead of a population mean. Response 4 is incorrect because the claim is stated as the null hypothesis, instead of the alternative.

WOULD THIS GET CREDIT (Context for parameter)?

Response 1:

μ₁ = the mean irritability score of patients taking the placebo

μ₂ = the mean irritability score of patients taking omega-3

Response 2:

μ = the mean difference of irritability scores between the placebo and omega-3

Response 3:

μ = the difference in mean irritability scores of all patients similar to the volunteers who participated in the study (placebo-omega 3).

Response 4:

μ = the mean difference of irritability scores of all patients who took the placebo and omega-3.

Response 5:

μ = the mean difference of irritability scores of all patients similar to the volunteers who participated in the study (placebo-omega 3).

Response 6:

No parameter was identified but hypotheses said µdiff

The context required for full credit included reference to the population mean difference (and by using µ students met the population element), the sampling units (patients), and response variable (irritability score). Context could have been referred to anywhere in the response, from identifying the parameter to the conditions or the conclusion.

Response 1 is incorrect because it is missing the key element of “mean difference.” Students often had this in their response if they were doing a 2-sample t-test.

Response 2 is incorrect because it didn’t mention the sampling units of patients. If people of any variety were mentioned elsewhere in this response the student could pick up credit for it there.

Response 3 is incorrect because it refers to the difference in mean irritability scores which indicates two means, instead of a single mean difference. Wording of a “mean difference” was important. The response needed to be clear that only one mean was being tested.

Response 4 is incorrect because it uses past tense by saying “patients who took,” referring to the sample and not the population.

Response 5 is correct because it refers to all necessary elements of context.

Response 6 may be correct if patients and irritability score were mentioned somewhere else in the response. We accepted µdiff as referring to “mean difference.” If the student merely said µ though, that was not enough. That student would need to define µd.

Teaching Tips:

Make sure students understand the difference between when to do a matched pairs test and when to do a two-sample test. Give lots of examples of each and help them understand what to look for and how to carry out each test.

When writing hypotheses and choosing subscripts, students should remember that a mean difference is not the same thing as a difference of means. Students struggled to write the parameter correctly and to distinguish a “mean difference” from a “difference of means.”
Responses should always be written in the context of the problem.
The claim should be stated in the alternative hypothesis, not in the null hypothesis.
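
To make the “mean difference” versus “difference of means” distinction concrete, here is a minimal sketch. The six pairs of scores are HYPOTHETICAL and are not the exam data; they are made up only to show that the two procedures share the same numerator but use different standard errors, so they produce different test statistics.

```python
import math
from statistics import mean, stdev

# HYPOTHETICAL irritability scores for six patients (NOT the exam data).
placebo = [6, 7, 5, 8, 4, 6]
omega3 = [5, 5, 4, 7, 3, 5]

# Matched pairs: reduce to ONE list of within-patient differences
# (placebo minus omega-3) and run a one-sample t procedure on that list.
diffs = [p - o for p, o in zip(placebo, omega3)]
n = len(diffs)
t_paired = mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Two-sample (unpooled): treats the two columns as independent groups,
# which ignores that each patient contributed both scores.
se_two_sample = math.sqrt(
    stdev(placebo) ** 2 / len(placebo) + stdev(omega3) ** 2 / len(omega3)
)
t_two_sample = (mean(placebo) - mean(omega3)) / se_two_sample

print(f"mean of differences = {mean(diffs):.4f}")
print(f"difference of means = {mean(placebo) - mean(omega3):.4f}")
print(f"matched-pairs t     = {t_paired:.3f}")
print(f"two-sample t        = {t_two_sample:.3f}")
```

In this made-up data set the point estimates agree, but the matched-pairs standard error is much smaller because it is built from the variability of the differences alone. That is why choosing the wrong procedure changes the test statistic and the p-value.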

WOULD THIS GET CREDIT (Conditions, test statistic and p-value)?

Response 1:

Conditions: Random: Stated; Normal: the boxplots are approximately Normal.

t = .72; p-value = .2404

Response 2:

Conditions: Random: treatments are randomly assigned; Normal: the boxplot of the differences shows no outliers.

t = 2.256; p-value = .0159

Response 3:

Conditions: It was a completely randomized experiment; the boxplots show no strong skew, so they are approximately Normal.

t = 3.138; p-value = .0028

Response 4:

Conditions: Random: treatments were randomly assigned; Normal: the boxplot of differences shows no strong skew or outliers.

p-value = .0028

In order to get full credit, the response had to indicate that random assignment was done, state that the boxplot of differences showed no major skew, and give the proper test statistic and p-value for a matched-pairs t-test.

Response 1 is incorrect because saying “random” isn’t enough to demonstrate understanding that the treatments were randomly assigned, as opposed to the patients being an SRS. Also, the Normal condition is verified by referring to multiple boxplots, when only the boxplot of differences matters. The test statistic is incorrect because it is calculated without dividing by the square root of 19. The response earns credit for the p-value, though, since it is calculated correctly from the stated test statistic, using the upper tail of a t-distribution with 18 degrees of freedom.

Response 2 earns partial credit because while checking conditions, the random assignment condition is stated and verified correctly, but the Normal condition does not describe the shape of the boxplot, only that it didn’t have outliers. Furthermore, the normal condition was being verified for a matched-pairs test, but the test statistic was calculated for a 2-sample test. When this happens, the response is scored as partially correct.

Response 3 earns partial credit because the response correctly identifies the random assignment process in “completely randomized experiment,” but says the boxplots (students need to describe a singular boxplot) appear approximately Normal. Because Normality can’t be established from a boxplot, the response loses credit for saying that. The response correctly identifies the test-statistic and p-value for a matched-pairs test. Again, because the Normal condition checked was for a different test than the test-statistic, only partial credit could be earned.

Response 4 earns full credit because it states and verifies that the random assignment and normal conditions are met for a matched-pairs test, and states the correct test statistic and p-value for that test.
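
The arithmetic behind Response 1 can be checked with a short sketch. It uses only the numbers quoted in the responses (t = 3.138, t = .72, n = 19, 18 degrees of freedom); the function names `t_pdf` and `upper_tail` are made up for this illustration, and the tail area is approximated numerically with Simpson's rule instead of coming from a calculator's t-cdf command.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail(t, df, steps=2000):
    """Approximate P(T >= t) for t >= 0 using Simpson's rule on [0, t]."""
    if t < 0:
        raise ValueError("this sketch only handles t >= 0")
    h = t / steps  # steps is even, as Simpson's rule requires
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    return 0.5 - area_0_to_t  # the t density is symmetric about 0

n = 19
df = n - 1
t_correct = 3.138                               # statistic quoted in Responses 3 and 4
t_without_root_n = t_correct / math.sqrt(n)     # what is left if sqrt(19) is skipped

print(f"t without sqrt(19): {t_without_root_n:.2f}")
print(f"upper tail for t = 0.72:  {upper_tail(0.72, df):.4f}")
print(f"upper tail for t = 3.138: {upper_tail(t_correct, df):.4f}")
```

The first line should reproduce the .72 in Response 1, and the two tail areas should land close to the .2404 and .0028 quoted above. This is why Response 1 still earned the p-value credit: its p-value follows correctly from its own (incorrect) test statistic.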

Teaching Tips:

For full credit in this inference procedure, students need to make sure they understand the difference between random sampling and random assignment and be sure to verify the necessary condition. Based on the CED, the condition that “random assignment” satisfies is independence.
Students should understand how to describe the shape of a distribution displayed in a boxplot and that Normality can’t be determined from a boxplot. Also, students should understand that the Normal condition for a matched-pairs test is satisfied when the sampling distribution of the differences can be assumed to be approximately Normal, not when the distributions of the individual irritability scores or the population distributions are approximately Normal.
Students did not have to show work to earn credit for the test statistic. Students should use their calculators whenever possible to calculate the test statistic and p-value. Indicating the degrees of freedom used was not required, but allowed us to determine if the stated p-value follows correctly from the test statistic.
If the 10% condition was mentioned, this was overlooked and students were not penalized for stating it.

WOULD THIS GET CREDIT (Conclusion)?

Response 1:

With a p-value of .0028 and an alpha level of .05, I reject H₀. I have convincing evidence that omega-3 will decrease the irritability score of all patients with the medical condition similar to the volunteers who participated in the study.

Response 2:

Since a p-value of .25 is greater than α = .05, I fail to reject the H₀. I have convincing evidence that omega-3 decreased the mean irritability score of the patients.

Response 3:

With a p-value of .0028 and an alpha level of .05, I reject H₀. This proves that omega-3 will decrease the irritability score of all patients with the medical condition similar to the volunteers who participated in the study.

Response 4:

P-value of .0159 < α = .05. I have convincing evidence that omega-3 will decrease the mean irritability score of all patients with the medical condition similar to the volunteers who participated in the study.

Response 5:

Because my p-value is less than alpha, we can reject H₀. I have convincing evidence that the true mean difference in irritability scores for all patients similar to the volunteers in the study is greater than 0.

Response 6:

Because the p-value of .9972 is greater than .05, the researchers should fail to reject the null hypothesis. They don’t have convincing evidence that omega-3 will decrease the mean irritability score of all patients.

Response 7:

Because the p-value of .0028 is less than the alpha level of .05, I reject H₀. I have convincing evidence that omega-3 will decrease the mean irritability score of all patients with the medical condition similar to the volunteers who participated in the study.

In order to earn full credit for the conclusion, the response needed to correctly compare an identified p-value to the significance level, provide a correct decision about the null or alternative hypothesis, and state a correct conclusion in context in terms of the alternative hypothesis using non-deterministic language.

Response 1 is incorrect because there is no comparison between the p-value and the significance level. The response should include “less than” in the comparison here. Also, in the statement of the conclusion in terms of the alternative, the word “mean” is omitted from the response variable “mean irritability score,” so context is not complete.

Response 2 is incorrect because even though the response correctly compares an incorrect p-value to the significance level, and makes the correct decision regarding the null hypothesis, an inconsistent conclusion was stated regarding the alternative hypothesis. The conclusion also incorrectly refers to the sample (“of the patients”), instead of the population of all patients similar to those in the study.

Response 3 is incorrect because, although it indicates a correct decision about the null hypothesis, it does not compare the p-value to the significance level, and its conclusion, while consistent with the alternative hypothesis, uses deterministic language (“proves”).

Response 4 earns full credit because a p-value is correctly compared to the significance level, a correct decision regarding the alternative is implied, and a correct conclusion consistent with the alternative hypothesis, in context, is stated, using the wording from the stem of the problem. The response did not have to explicitly state that we should reject the null hypothesis.

Response 5 earns full credit, assuming the p-value was already identified in the response, because it correctly compares the p-value to the significance level, indicates a correct decision about the null hypothesis, and states a correct conclusion, in context, in terms of the alternative hypothesis.

Response 6 earns full credit, even though an incorrect p-value is used, because it correctly compares a p-value to the significance level, indicates a correct decision about the null hypothesis, for the stated p-value, and states a correct conclusion, for the stated p-value, in context, consistent with the alternative hypothesis, using the wording from the stem of the problem.

Response 7 earns full credit because a p-value is correctly compared to the significance level, a correct decision regarding the null hypothesis is stated, and a correct conclusion consistent with the alternative hypothesis, in context, is stated, using the wording from the stem of the problem.
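
The comparison step itself is a one-line rule, and writing it out shows why Responses 2 and 6 made correct decisions for the p-values they stated. This is only an illustrative sketch: the helper name `decide` is made up here, and the p-values are the ones quoted in the responses above.

```python
def decide(p_value, alpha=0.05):
    """Return the decision about H0 implied by comparing p_value with alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

# p-values as stated in the sample responses (some are incorrect values,
# but the decision is judged against the p-value the response stated).
stated_p_values = {
    "Response 7": 0.0028,
    "Response 4": 0.0159,
    "Response 2": 0.25,
    "Response 6": 0.9972,
}

decisions = {name: decide(p) for name, p in stated_p_values.items()}
for name, decision in decisions.items():
    print(f"{name}: p = {stated_p_values[name]} -> {decision}")
```

The decision is only one part of the conclusion. Response 2 made the right decision for its p-value but then stated a conclusion that was inconsistent with that decision, which is where it lost credit.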

Teaching Tips:

Students should not interpret the p-value as part of their conclusion to a significance test unless asked to do so. If asked to do so, be sure to include the phrase “equal to or more extreme than the sample results.”
Students should be sure to compare the p-value to the significance level. Just stating each number is not enough.
Students should always state conclusions in the context of the alternative hypothesis.
Encourage students to use the wording stated in the stem of the problem in their conclusion.
Students should understand that a conclusion to a significance test is about the population and not the sample. The wording of a response should include present or future tense. Past tense indicates reference to the sample used in the experiment.
