Make peace with the p-value.
A self-paced, plain-language tour of the ideas you need to read SLP research with confidence — sampling, p-values, confidence intervals, effect size, the common tests, and single-case analysis. Concepts and clinical meaning come first; formulas stay optional. Every chart and number below comes from simulated teaching data — not real clients.
Simulated data: nothing here is real client data. Examples are built to teach interpretation, not to show that any treatment works.
A p-value measures surprise — not proof.
By the end of this lesson you’ll read a results section and know exactly what it claims, what it doesn’t, and whether it matters for a client.
Foundations & mindset
Before any test or p-value: what statistics are for, the difference between summarizing and inferring, and why the type of data you collect quietly decides everything that follows.
Part 1 · Foundations
1.1 Welcome: statistics as careful reasoning
If statistics has ever felt like a wall of symbols, you are in good company — and you are in the right place. This lesson is about reasoning, not arithmetic. The goal is for you to read a results section in a journal article and understand what it is really claiming (and what it is not).
Key idea
Researchers almost never get to measure everyone. They study a sample and use inferential statistics to make cautious, uncertainty-aware claims about the larger group or process they care about. Statistics support clinical reasoning; they never replace your judgment about an individual client.
Take-home
You do not need to memorize formulas to begin understanding results. Start with the question and the picture; the notation will make sense later.
1.2 Descriptive vs. inferential statistics
Two jobs, two different questions. Getting them straight makes everything else easier.
“What happened in the data I have?”
Descriptive statistics summarize the sample in front of you — the mean, median, standard deviation, a histogram. They make no claim beyond these specific people.
Example: in a class project, 30 children average 72% consonants correct. That number describes those 30 children, full stop.
“What might this say about the bigger picture?”
Inferential statistics use the sample to reason — with explicit uncertainty — about a larger population or process you could not measure directly.
Example: could a result like this plausibly hold for the broader group of children with speech sound disorder, or is it likely just the luck of who we sampled?
Can I do this in Excel?
Yes — descriptive statistics are Excel’s home turf:
=AVERAGE(range), =MEDIAN(range),
=STDEV.S(range), =COUNT(range),
=MIN(range), =MAX(range).
Interpretation reminder: these describe your sample. Reasoning beyond it is the inferential step that comes next.
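For readers who prefer R, the same summaries can be sketched as follows. The vector `pcc` is a made-up set of percent-consonants-correct scores, invented only for illustration; it is not part of the lesson's simulated datasets.

```r
# Hypothetical percent-consonants-correct scores for 10 children (made up)
pcc <- c(58, 64, 66, 70, 72, 73, 75, 78, 80, 84)

desc <- c(
  n      = length(pcc),   # like =COUNT(range)
  mean   = mean(pcc),     # like =AVERAGE(range)
  median = median(pcc),   # like =MEDIAN(range)
  sd     = sd(pcc),       # like =STDEV.S(range): sample SD, divides by n - 1
  min    = min(pcc),      # like =MIN(range)
  max    = max(pcc)       # like =MAX(range)
)
round(desc, 1)
```

As with the Excel functions, every number here describes only these ten scores.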
Key idea
Describe the sample first; infer second. Inferential statistics always carry uncertainty — that uncertainty is the honest part, not a flaw.
1.3 Variables, measurement & data types
Here is the secret that makes “choosing a test” far less scary: the test you can use is mostly decided by what kind of variables you have. Learn the types and you are halfway there.
| Type | What it is | SLP example |
|---|---|---|
| Continuous | Numbers on a scale; in-between values make sense | Naming score, intelligibility rating, vocabulary score |
| Categorical | Named groups, no order | Treatment group, diagnosis category, responder / non-responder |
| Ordinal | Ordered categories, uneven gaps | Severity rating (mild / moderate / severe) |
| Count | Whole-number tallies | Number of communicative initiations in a session |
| Percentage | Bounded 0–100%, often skewed near the edges | Percent consonants correct, percent syllables stuttered |
How these map to the four levels of measurement
If you learned the classic NOIR hierarchy — Nominal, Ordinal, Interval, Ratio — the everyday data types above sit right on top of it. The big split for choosing a test is categorical (nominal/ordinal) vs. numeric (interval/ratio); the level just tells you what math is fair to do.
| Level | What it adds | Data type(s) here | Simulated SLP example |
|---|---|---|---|
| Nominal | Named groups, no order | Categorical | Treatment group; responder / non-responder |
| Ordinal | Order, but unequal gaps | Ordinal | Severity rating (mild / moderate / severe) |
| Interval | Equal gaps, but no true zero | Continuous scores without a true zero | A standardized test standard score (mean = 100) |
| Ratio | Equal gaps and a true zero | Most continuous, plus counts & percentages | Naming score, # initiations, % consonants correct |
Two practical reminders: don’t treat ordinal ratings as if the gaps were equal (mild→moderate may not equal moderate→severe), and only ratio scales let you say “twice as much” (a true zero means “0 = none”). Counts and percentages have a true zero, so they are usually treated as ratio-level, but they are discrete or bounded, which can matter when values sit near 0 or near 100%.
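In R, the measurement level can be made explicit when the data are stored, which helps guard against arithmetic the scale does not support. This is a minimal sketch with made-up values; all variable names are hypothetical.

```r
# Nominal: unordered categories
group <- factor(c("Treatment", "Control", "Treatment", "Control"))

# Ordinal: ordered categories; R knows mild < moderate < severe,
# but does not assume the gaps between levels are equal
severity <- factor(c("mild", "severe", "moderate", "mild"),
                   levels = c("mild", "moderate", "severe"),
                   ordered = TRUE)

# Ratio: a count with a true zero, so "twice as many" is meaningful
initiations <- c(2L, 4L, 0L, 8L)

severity[2] > severity[3]         # TRUE: severe ranks above moderate
table(severity)                   # counts per category are a fair summary
initiations[2] / initiations[1]   # 2: ratio statements make sense here
```

Storing severity as an ordered factor rather than as the numbers 1 to 3 makes it harder to average the ratings by accident.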
Key idea
Three questions drive almost every test choice: What type is my outcome? How many groups or measurements? Are the same people measured more than once? Keep these in your back pocket.
From sample to population
The single idea at the heart of inference: a sample is a blurry snapshot of something bigger. Once you feel how samples wobble, p-values and confidence intervals stop being mysterious.
Part 2 · Sample to population
2.1 Population and sample
Four words that get used loosely in conversation but precisely in research.
Population
Everyone you care about. All preschoolers receiving speech-sound services in a district.
Sample
The few you actually measure. 30 children enrolled in a study.
Parameter
The true value in the population. Usually unknown.
Statistic
The value from your sample. Our best estimate of the parameter.
SLP translation
When a study reports a sample mean, it is offering an estimate of a parameter you cannot see. Two honest studies of the same population can land on different numbers — neither is “wrong.”
2.2 Sampling variability and uncertainty
If we repeated the same study many times, the sample means would form their own distribution, called the sampling distribution. Its width (the standard error) tells us how much to trust a single estimate.
Why this matters
Inferential statistics ask: is the difference we observed bigger than the wobble we’d expect from sampling alone? Everything from here builds on that one question.
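The "wobble" can be seen directly by simulating many imaginary studies. The sketch below assumes a hypothetical population (mean 70, SD 10) and a sample size of 30; these values and the seed are chosen only for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical population of scores (values chosen for illustration)
pop_mean <- 70
pop_sd   <- 10
n        <- 30    # children per simulated study

# Run 5,000 imaginary studies and keep each study's sample mean
sample_means <- replicate(5000, mean(rnorm(n, pop_mean, pop_sd)))

mean(sample_means)   # centers close to the population mean of 70
sd(sample_means)     # observed wobble of the sample means
pop_sd / sqrt(n)     # standard error predicted by theory, about 1.83
```

The spread of the simulated means should land close to the theoretical standard error, and it shrinks as `n` grows.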
2.3 Distributions: shape matters
A distribution shows how values are spread out. Three features — center, spread, and shape — tell you most of what you need, and they warn you when a “mean” might be misleading.
Roughly normal
Aphasia naming scores often spread fairly symmetrically around a center — the mean is a good summary.
Right-skewed
Percent syllables stuttered piles up near zero with a long right tail — the mean gets pulled upward, so the median may describe a “typical” client better.
Ceiling effect
After successful articulation treatment, many children score near 100% — the test can’t capture further gains, which can hide real differences.
Can I do this in Excel?
Yes. Make a quick histogram with Insert → Chart → Histogram (newer Excel) or the Analysis ToolPak → Histogram. Interpretation reminder: always look at the shape before trusting a mean — a skew or ceiling can change which summary is honest.
Key idea
Not all SLP outcomes are normally distributed. Percentages and counts are often bounded and skewed, and outliers may be real clinical variability — or a data-entry slip worth checking.
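A quick way to feel the effect of skew is to compare the mean and median on a small right-skewed set. The values below are hypothetical percent-syllables-stuttered scores, made up for this sketch.

```r
# Hypothetical percent-syllables-stuttered values for 11 clients (made up)
pss <- c(0.5, 1, 1, 1.5, 2, 2, 2.5, 3, 4, 9, 17)

mean(pss)     # pulled upward by the two high values
median(pss)   # 2: closer to what a "typical" client looks like
```

Two clients account for most of the gap between the two summaries, which is why the shape deserves a look before the mean is reported.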
The logic of inference
Hypothesis testing, p-values, confidence intervals, errors, and effect size — the reasoning toolkit, taught as ideas you can explain out loud, not symbols to memorize.
Part 3 · Logic of inference
3.1 Hypothesis testing basics
Hypothesis testing is a structured way of asking, “could the wobble of random sampling alone explain what we saw?”
The core question
If the null hypothesis were a reasonable description of the world, how surprising would our sample result be? A result that would be very unusual under the null gives us evidence against it.
Null hypothesis (H₀)
The “nothing special” baseline. A naming treatment produces no average change in naming score.
Alternative hypothesis (H₁)
The “something is going on” claim. The treatment is associated with a change in naming score.
Take-home
A test statistic just measures “how far out” the result sits; the p-value turns that distance into a probability of seeing something at least this extreme under the null. That is all a p-value is.
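The step from test statistic to p-value can be sketched in one line of R. The numbers are borrowed from the simulated paired example in section 4.2, t(19) = 5.84.

```r
# Turn a t statistic into a two-sided p-value
t_stat <- 5.84   # reported t from section 4.2
df     <- 19     # degrees of freedom (n - 1 for 20 pairs)

# Probability of a result at least this far from zero, in either direction,
# if the null hypothesis were true
p_value <- 2 * pt(-abs(t_stat), df)
p_value < .001   # TRUE, consistent with the reported "p < .001"
```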
3.2 p-values, confidence intervals & errors
This is where careful clinicians separate themselves from careless ones. Most statistical mistakes in practice are misreadings of these three ideas.
What a p-value is — and is not
Common mistakes
A p-value is not the probability that the null hypothesis is true.
p < .05 is not proof that a treatment works.
p > .05 is not proof that there is no effect; it may just mean the study was small or noisy.
A more honest sentence: “A p-value describes how unusual our data (or more extreme data) would be if a specified no-effect model were true.”
Confidence intervals show precision, not just yes/no
A 95% confidence interval gives a range of plausible values for the real effect. A narrow interval says “we’ve pinned this down”; a wide one says “we’re still quite unsure.” Two results can both be “significant” while one is far more precise than the other.
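To see where such an interval comes from, here is a small sketch that builds a 95% CI for a mean from summary statistics. The helper `ci_mean` is a hypothetical name; the inputs (mean change 8.7, SD 6.6, n = 20) come from the simulated paired example in section 4.2, and the n = 80 comparison is an invented what-if.

```r
# 95% CI for a mean from summary statistics (t-based)
ci_mean <- function(m, s, n, level = 0.95) {
  se     <- s / sqrt(n)                          # standard error of the mean
  t_crit <- qt(1 - (1 - level) / 2, df = n - 1)  # critical t for the chosen level
  c(lower = m - t_crit * se, upper = m + t_crit * se)
}

round(ci_mean(8.7, 6.6, 20), 1)   # 5.6 to 11.8, as reported in section 4.2

# Same mean and SD with a hypothetical n = 80: the interval narrows
round(ci_mean(8.7, 6.6, 80), 1)
```

The estimate is the same in both calls; only the precision changes.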
False positive
Concluding a fluency intervention helps when the apparent effect is mostly sampling noise. You act on something that isn’t real.
False negative
Missing a genuine AAC intervention effect because the study was too small or too noisy to detect it. You overlook something that is real.
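The false-negative risk can be estimated before a study is run. The sketch below uses base R's `power.t.test` under an assumed scenario that is not from the lesson: a true medium effect (d = 0.5), two independent groups, and a two-sided test. The false-positive rate is the `sig.level` chosen in advance (.05 here).

```r
# How often would a study detect a real effect of d = 0.5?
small <- power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.05)
large <- power.t.test(n = 64, delta = 0.5, sd = 1, sig.level = 0.05)

small$power       # chance of detecting the effect with 20 per group
1 - small$power   # chance of a false negative with 20 per group
large$power       # with 64 per group, power is close to .80
```

Under these assumptions the small study would miss the real effect more often than it finds it.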
Can I do this in Excel?
Excel functions (e.g., T.TEST) will hand you a p-value, but a number is not
an interpretation. Interpretation reminder: always pair a p-value with the effect size, the
confidence interval, and clinical judgment.
3.3 Effect size & clinical significance
Three questions that sound alike but are not: Is there evidence of an effect? (significance) How big is it? (effect size) Does it matter for this client? (clinical significance)
The effect size you’ll see for each test
| Analysis | Effect size | Reads as |
|---|---|---|
| Paired t-test | Cohen’s dz (standardized mean change) | Change relative to its variability |
| Independent t-test | Cohen’s d | Group gap in standard-deviation units |
| ANOVA | eta-squared (η²) | Share of variance explained by group |
| Correlation | r and r² | Strength of association; variance shared |
| Regression | slope and R² | Change per unit; variance explained |
| Chi-square | Cramér’s V | Strength of categorical association |
| Single-case | Tau-U / nonoverlap | Supplement to visual analysis |
Take-home
Statistical significance, effect size, and clinical importance answer different questions. A tiny p-value can sit on a trivial effect; a meaningful change can occur in a single client where no group p-value exists at all.
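A short calculation shows how a tiny p-value can sit on a trivial effect. The numbers are hypothetical: two groups of 20,000 that differ by only d = 0.05 standard deviations.

```r
# A tiny p-value on a trivial effect (hypothetical numbers)
d           <- 0.05
n_per_group <- 20000

t_stat <- d * sqrt(n_per_group / 2)   # t for two equal-sized groups
p      <- 2 * pt(-abs(t_stat), df = 2 * n_per_group - 2)

t_stat   # 5
p        # far below .001, yet the gap is only 1/20 of a standard deviation
```

The sample size, not the size of the effect, is doing the work here.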
Choosing & reading tests
A friendly decision guide, then a tour of the six tests you’ll meet most often — each with a simulated SLP example, a picture, a results sentence, and what it does (and doesn’t) tell you. We finish with assumptions and how to decode a results paragraph.
Part 4 · Choosing & reading tests
4.1 Choosing a common statistical test
You rarely need to invent anything. Match your question, your variable types, and your design to a familiar test.
| Research situation | Common test | Simulated SLP example |
|---|---|---|
| Same participants measured twice | Paired t-test | Aphasia naming before vs. after treatment |
| Two independent groups | Independent t-test | Two articulation treatment approaches |
| Three or more independent groups | One-way ANOVA | Low / moderate / high language-treatment dosage |
| Two continuous variables | Correlation | Therapy attendance and vocabulary gain |
| Predict a continuous outcome | Regression | Predict vocabulary gain from therapy hours + baseline |
| Two categorical variables | Chi-square | Service model and responder category |
| Repeated measures in one / a few cases | Single-case visual analysis (+ optional Tau-U) | AAC initiations across baseline & intervention |
Pattern to remember
Outcome type + number of groups + repeated or not → test. The six sections below all follow the same shape so the pattern becomes second nature.
Part 4 · The six tests
4.2 Paired t-test
Question: did the same people change? Population: adults with aphasia. Design: one group, measured before and after naming treatment.
Why this test: each person contributes a pair of scores (pre and post), so we analyze each person’s change. Pairing removes a lot of person-to-person noise.
Simulated result
In 20 simulated adults with aphasia, mean naming score rose from 49.6 before treatment to 58.3 after — an average gain of 8.7 points (SD of change = 6.6). A paired-samples t-test indicated a change larger than sampling noise alone would readily produce, t(19) = 5.84, p < .001, 95% CI for the mean change [5.6, 11.8], Cohen’s dz = 1.31.
Plain-language interpretation
On average, scores improved by about 9 points, and the confidence interval (5.6–11.8) stays well above zero, so every plausible value for the underlying average gain is positive. The effect is large for these data (dz ≈ 1.3).
Clinical caution
A gain on a naming test is not the same as better everyday communication, and this is simulated data — it says nothing about whether this treatment “works.” Also, a pre-post design alone can’t rule out practice effects or natural recovery.
Optional: how this is done in R
# Simulated aphasia naming data (n = 20); seed for reproducibility
set.seed(680)
pre <- rnorm(20, mean = 48, sd = 9)
post <- pre + rnorm(20, mean = 10, sd = 7)
# Paired-samples t-test (gives t, df, p, and the 95% CI)
t.test(post, pre, paired = TRUE)
# Effect size: standardized mean change (Cohen's dz)
change <- post - pre
mean(change) / sd(change)
Method shown in R; the teaching values above come from this simulated dataset.
Can I do this in Excel?
Yes: =T.TEST(pre_range, post_range, 2, 1) (the 2 = two-tailed,
1 = paired). Interpretation reminder: Excel returns only the p-value — you
still need the mean change, its confidence interval, and an effect size to interpret the result.
Quick check
The CI for the mean change is [5.6, 11.8]. What does that tell a clinician?
A plausible-value answer: across this interval the average improvement is positive, and even the low end (≈5.6 points) is a meaningful-sized gain on this scale — though “meaningful for the client” still depends on the measure and the person.
4.3 Independent t-test
Question: do two separate groups differ on average? Population: children with speech sound disorder. Design: two independent groups — traditional vs. motor-based articulation treatment.
Why this test: different children are in each group (no pairing), so we compare two group means and ask whether the gap is bigger than sampling variability would casually produce.
Simulated result
Post-treatment accuracy averaged 72.2% (traditional, n = 20) vs. 77.3% (motor-based, n = 20) — a difference of 5.1 points. A Welch independent-samples t-test gave t(37.5) = 2.01, p = .052, 95% CI for the difference [−0.0, 10.3], Cohen’s d = 0.64.
The teaching moment: p = .052
This p-value sits just above the usual .05 line. A careless reading says “no difference.” But look closer: the effect is medium (d = 0.64), and the confidence interval runs from essentially zero up to about 10 points. The honest summary is “inconclusive: possibly a meaningful difference, but this study can’t pin it down.” The .05 cutoff is a convention, not a wall, and p > .05 is not proof of no effect.
Effect size & clinical note
A medium effect that didn’t reach significance may suggest the study was underpowered or too imprecise. A larger sample might clarify it, but interpretation still depends on the CI, design, and measurement quality. Clinically, a ~5-point accuracy edge may or may not matter — that depends on the child, the goals, and the cost of each approach.
Optional: how this is done in R
# Simulated articulation data: two independent groups of 20
set.seed(680)
trad <- rnorm(20, 70, 9)
motor <- rnorm(20, 78, 9)
score <- c(trad, motor)
group <- factor(rep(c("Traditional", "Motor"), each = 20))
# Welch's t-test (unequal variances) is R's default — a safe beginner choice
t.test(score ~ group)
Can I do this in Excel?
=T.TEST(group1_range, group2_range, 2, 3) (the final 3 requests an
unequal-variance / Welch-style two-sample test). Interpretation reminder: a borderline p-value is a
cue to weigh the effect size and CI, not to declare “nothing here.”
Quick check
A colleague says “p = .052, so the treatments are equally effective.” What’s wrong with that?
It treats “not significant” as “no difference.” The medium effect (d = 0.64) and a CI reaching to +10 points mean a real, useful difference is still quite plausible — the study just wasn’t precise enough to confirm it.
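Neither `t.test` nor Excel's T.TEST reports Cohen's d, so it has to be computed separately. The sketch below uses the pooled-SD definition; `cohens_d` is a hypothetical helper, and the two short vectors are made-up demo values rather than the lesson's dataset.

```r
# Cohen's d for two independent groups, using the pooled standard deviation
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(y) - mean(x)) / pooled_sd
}

# Tiny made-up example: a 5-point gap with a pooled SD of about 3.16
trad_demo  <- c(70, 72, 74, 76, 78)
motor_demo <- c(75, 77, 79, 81, 83)
cohens_d(trad_demo, motor_demo)   # about 1.58

# With equal group sizes, d can also be recovered from a reported t:
d_from_t <- 2.01 * sqrt(2 / 20)   # about 0.64, in line with the reported d
d_from_t
```

The shortcut from t applies only when the two groups are the same size, as they are in the simulated example (20 and 20).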
4.4 One-way ANOVA
Question: do three or more groups differ? Population: children in pediatric language intervention. Design: three independent dosage groups — low, moderate, high.
Why this test: with three groups, running several t-tests inflates the chance of a false positive. ANOVA asks one combined question first: is there any difference among the group means?
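The inflation is easy to put a number on. This rough sketch treats the three pairwise tests as independent, which is a simplifying assumption (in practice they share data), so read the result as an approximation.

```r
# Chance of at least one false positive across k tests at alpha = .05,
# assuming (for simplicity) that the tests are independent
alpha <- 0.05
k     <- choose(3, 2)              # three pairwise comparisons among three groups

familywise <- 1 - (1 - alpha)^k
familywise                         # about 0.14, nearly triple the nominal .05
```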
Simulated result
Mean improvement was 5.5 (low), 11.7 (moderate), and 16.5 (high), 15 children per group. A one-way ANOVA found differences unlikely under a no-difference model, F(2, 42) = 23.18, p < .001, η² = 0.52.
Common mistake
A significant ANOVA tells you at least one group differs — not which ones. To compare specific pairs (e.g., moderate vs. high), you need planned contrasts or post-hoc tests (like Tukey’s HSD), which control for multiple comparisons.
Effect size & clinical note
η² = 0.52 means about half the variability in improvement tracks with dosage group — a large effect here. Clinically, “more is better” has limits: higher dosage costs time and money and can hit ceilings or fatigue. Effect size informs the trade-off; it doesn’t settle it.
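When an article gives only F and its degrees of freedom, η² for a one-way ANOVA can be recovered from them. The helper name below is hypothetical; the inputs are the simulated values reported above, F(2, 42) = 23.18.

```r
# Eta-squared from a reported F statistic (one-way ANOVA only)
eta_sq_from_f <- function(f, df_between, df_within) {
  (f * df_between) / (f * df_between + df_within)
}

eta_sq_from_f(23.18, 2, 42)   # about 0.52, matching the reported value
```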
Optional: how this is done in R
# Simulated dosage data: 15 children per group
set.seed(680)
improve <- c(rnorm(15, 8, 4.5), rnorm(15, 12, 4.5), rnorm(15, 16, 4.5))
dose <- factor(rep(c("Low","Moderate","High"), each = 15),
levels = c("Low","Moderate","High"))
fit <- aov(improve ~ dose)
summary(fit) # F, df, p
TukeyHSD(fit) # which groups differ (post-hoc)
Can I do this in Excel?
Data → Data Analysis → Anova: Single Factor (enable the Analysis ToolPak first). Interpretation reminder: the ToolPak gives the overall F and p, but won’t do post-hoc pairwise comparisons for you — and ANOVA never tells you whether a difference is clinically important.
Quick check
The ANOVA is significant. Can you conclude “high dosage beats moderate”?
Not yet. ANOVA only says the groups aren’t all equal. You’d need a post-hoc comparison (e.g., Tukey) to claim a specific high-vs-moderate difference.
4.5 Correlation
Question: do two continuous things move together? Population: children in a vocabulary program. Variables: therapy attendance (%) and vocabulary gain.
Why this test: both variables are continuous and we’re asking about association (direction and strength), not group differences. The correlation coefficient r ranges from −1 to +1.
Simulated result
Across 40 simulated children, attendance and vocabulary gain were moderately and positively associated, r = 0.47, p = .002, 95% CI [0.19, 0.68], r² = 0.22.
Correlation is not causation
Children who attend more might also have more home support, milder profiles, or higher motivation — any of which could drive vocabulary gains. A correlation flags a pattern worth studying; it does not show that attendance causes the gain. Always view the scatterplot: one outlier or a curved pattern can masquerade as (or hide) a linear r.
Effect size note
Here r itself is the effect size. r² = 0.22 means about 22% of the variation in gain is shared with attendance — leaving ~78% to everything else. Moderate, not destiny.
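The p-value and confidence interval for r can be rebuilt from just r and n. The sketch below uses the usual t conversion and the Fisher z transformation, the approach `cor.test` takes for Pearson correlations; the inputs are the simulated values r = 0.47 and n = 40.

```r
r <- 0.47
n <- 40

# p-value: convert r to a t statistic with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p      <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% CI: transform to Fisher's z, build the interval, transform back
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

p              # about .002
round(ci, 2)   # 0.19 0.68
r^2            # about 0.22 of the variance shared
```

The interval is not symmetric around 0.47 because r is bounded at ±1.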
Optional: how this is done in R
# Simulated attendance & vocabulary-gain data (n = 40)
set.seed(680)
attendance <- rnorm(40, 70, 14)
vocab_gain <- 0.35 * attendance + rnorm(40, 0, 7)
cor.test(attendance, vocab_gain) # r, p, and 95% CI for r
plot(attendance, vocab_gain) # always look at the scatterplot
Can I do this in Excel?
=CORREL(x_range, y_range) (or =PEARSON(...)) gives r.
Interpretation reminder: pair it with a scatterplot, and never read causation into r.
Quick check
r = 0.47. A parent asks, “So therapy attendance causes vocabulary growth?”
Careful answer: attendance and gain tend to rise together in these data, but a correlation can’t establish cause — other factors may explain both. We’d need a controlled design to speak to causation.
4.6 Regression
Question: can we predict an outcome, and how much does a predictor matter? Population: children in a vocabulary program. Outcome: vocabulary gain, predicted from therapy hours (and baseline language).
Why this test: regression draws the best-fit line through the data. Its slope says how much the outcome changes per unit of the predictor; R² says how much of the outcome’s variation the model explains.
Simulated result
Simple model: gain = 11.72 + 0.41 × hours. Each additional therapy hour predicts about +0.41 points of gain, p < .001, slope 95% CI [0.22, 0.59], R² = 0.29. Adding baseline language as a second predictor: each hour +0.34 (p < .001), each baseline point +0.25 (p = .002), R² = 0.43.
Prediction ≠ proof of causation
A significant slope means the predictor helps forecast the outcome in these data. It does not prove that adding hours causes gains — unmeasured factors (severity, support) may drive both. And R² = 0.43 still leaves most of the variation unexplained.
So where does “cause” come from? The design — not the test
This trips up almost everyone, so let’s say it plainly: no statistical test proves causation on its own. A t-test, ANOVA, and regression are all just tools for detecting differences or associations in whatever data you hand them. What can support a causal interpretation is how the study was run — especially whether participants were randomly assigned to conditions (a true experiment). Random assignment is a major reason a design can support a causal interpretation because it helps make groups comparable, assuming the study is otherwise well implemented.
That’s why the same regression can mean different things. Run it on a randomized treatment and the slope estimates a causal effect; run it on observational predictors people weren’t randomized to — therapy hours they chose, baseline severity they came in with — and it only describes association, because confounders may drive both. So the question to ask isn’t “which test?” but “were participants randomly assigned?” (The same caution applies to t-tests and ANOVA run on groups that formed naturally rather than by randomization.)
Optional: how this is done in R
# Simulated data (n = 50)
set.seed(680)
hours <- runif(50, 5, 40)
baseline <- rnorm(50, 60, 12)
gain <- 2 + 0.45*hours + 0.15*baseline + rnorm(50, 0, 6)
summary(lm(gain ~ hours)) # slope, p, R-squared
summary(lm(gain ~ hours + baseline)) # multiple regression
confint(lm(gain ~ hours)) # 95% CI for the slope
Can I do this in Excel?
Data → Data Analysis → Regression, or formulas =SLOPE(y,x),
=INTERCEPT(y,x), =RSQ(y,x),
=LINEST(y,x,TRUE,TRUE). Interpretation reminder: focus on the slope, its
CI and p-value, and R² — and remember: causation comes from the study design and how well it was implemented, not from the test.
Quick check
The slope for hours is 0.41. What does that mean in words?
For every extra therapy hour, the model predicts about 0.41 more points of vocabulary gain, on average, within the observed range — a prediction, not a guarantee for any one child.
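The slope is easiest to read by plugging numbers into the reported equation, gain = 11.72 + 0.41 × hours. The function name is hypothetical, and the 20-hour input is an arbitrary value inside the 5 to 40 hour range used in the lesson's simulation.

```r
# Reported simple model from this section: gain = 11.72 + 0.41 * hours
predict_gain <- function(hours, intercept = 11.72, slope = 0.41) {
  intercept + slope * hours
}

predict_gain(20)                      # 19.92 predicted points of gain
predict_gain(21) - predict_gain(20)   # 0.41: one more hour, one slope's worth
```

Predictions far outside the observed range (say, 200 hours) would be extrapolation, which the model gives no basis for.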
4.7 Chi-square test
Question: are two categorical variables associated? Population: clients across service models. Variables: service model (individual / group / hybrid) and response category (responder / non-responder).
Why this test: both variables are categories, and we’re comparing counts. Chi-square asks whether the observed counts differ from what we’d expect if the two variables were unrelated.
| Service model | Responder | Non-responder | Total |
|---|---|---|---|
| Individual | 31 | 9 | 40 |
| Group | 13 | 27 | 40 |
| Hybrid | 24 | 16 | 40 |
| Total | 68 | 52 | 120 |
Expected count for a cell = row total × column total ÷ grand total.
Simulated result
Response category was associated with service model, χ²(2) = 16.76, p < .001, Cramér’s V = 0.37 (n = 120). Responder rates were 78% (individual), 33% (group), and 60% (hybrid).
What chi-square does and doesn’t say
It tests whether observed counts differ from the counts expected if the two categorical variables were independent — it is not a test of means, and it doesn’t tell you why or whether the difference is clinically important. Self-selection (who ends up in each model) could explain a lot. Cramér’s V (0.37) gives the strength of the association.
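The arithmetic behind the result can be followed step by step from the observed table. This sketch recomputes the expected counts, χ², and Cramér's V by hand; the variable names are illustrative.

```r
# Observed counts from the table above (rows = service model)
obs <- matrix(c(31, 9, 13, 27, 24, 16), nrow = 3, byrow = TRUE,
              dimnames = list(c("Individual", "Group", "Hybrid"),
                              c("Responder", "Non-responder")))

# Expected count = row total x column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Chi-square: squared gaps between observed and expected, scaled by expected
chi_sq <- sum((obs - expected)^2 / expected)
df     <- (nrow(obs) - 1) * (ncol(obs) - 1)
p      <- pchisq(chi_sq, df, lower.tail = FALSE)

# Cramer's V: chi-square rescaled by sample size and table dimensions
cramers_v <- sqrt(chi_sq / (sum(obs) * (min(dim(obs)) - 1)))

round(expected, 1)                       # 22.7 and 17.3 in every row
round(c(chi_sq = chi_sq, V = cramers_v), 2)   # 16.76 and 0.37
```

Every row has the same expected counts because every service model has the same total (40).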
Optional: how this is done in R
# Observed counts as a 3 x 2 table (rows = model, cols = response)
tab <- matrix(c(31, 9, 13, 27, 24, 16), nrow = 3, byrow = TRUE,
dimnames = list(c("Individual","Group","Hybrid"),
c("Responder","Non-responder")))
chisq.test(tab) # chi-square, df, p
chisq.test(tab)$expected # expected counts
Can I do this in Excel?
=CHISQ.TEST(observed_range, expected_range) returns the p-value — but you must
build the expected table first (row × column ÷ grand total). Interpretation reminder:
chi-square counts patterns; it doesn’t measure clinical importance.
Quick check
Why can’t we conclude “individual therapy is the best model” from this test?
Chi-square shows the variables are associated, not that the model caused better response. Clients weren’t randomly assigned, so differences in who chose each model could drive the pattern.
Part 4 · Choosing & reading tests
4.8 Assumptions, in plain language
Reframe
Assumptions are not magic rules that make a test “legal” or “illegal.” They are conditions that affect how much we can trust the result. When they’re badly violated, the p-value and CI can mislead.
Independence
Observations don’t influence each other. Testing 10 kids from the same classroom twice is not 20 independent data points.
Approximate normality
For t-tests/ANOVA, the data (or residuals) are roughly bell-shaped. Matters most with small samples.
Similar variability
Groups have roughly comparable spread. Welch’s t-test relaxes this for two groups.
Linearity
For correlation/regression, the relationship is roughly a straight line — check the scatterplot.
Outliers
A few extreme points can swing means, r, and slopes. Investigate; don’t just delete.
Expected counts
Chi-square gets shaky when expected cell counts are very small (a common rule of thumb is < 5).
What if an assumption isn’t met?
First, look — a plot or a quick check often shows the violation isn’t serious. If it is, these beginner-friendly alternatives lean on ranks instead of raw values, so they don’t need normality and they resist outliers:
| Instead of… | Try… | Helpful when… |
|---|---|---|
| Paired t-test | Wilcoxon signed-rank test | pre/post differences are skewed or have outliers |
| Independent t-test | Mann–Whitney U (Wilcoxon rank-sum) | two groups, non-normal or ordinal outcome |
| One-way ANOVA | Kruskal–Wallis test | three+ groups, non-normal outcome |
| Pearson correlation | Spearman’s rank correlation (ρ) | the relationship is monotonic but not straight, or has outliers |
| Chi-square | Fisher’s exact test | small expected cell counts (rule of thumb < 5) |
Beyond this lesson: when the outcome isn’t continuous-and-normal at all — counts, yes/no responses, rates — generalized linear models (e.g., logistic or Poisson regression) are the principled tool. And switching tests isn’t automatic: each alternative carries its own assumptions and answers a slightly different question (rank-based tests, for instance, trade away the tidy mean difference), so always look at the data first.
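Each alternative in the table has a base-R function. The sketch below only shows the calls; all scores are made up and chosen to have no ties, so each function runs without tie warnings.

```r
# Made-up scores (no ties), for illustration only
pre  <- c(41, 52, 47, 60, 38, 55, 49, 44)
post <- c(50, 58, 46, 71, 45, 68, 53, 59)
g1 <- c(12, 15, 11, 18, 14)
g2 <- c(21, 17, 25, 19, 23)
g3 <- c(30, 27, 33, 29, 35)

w_paired <- wilcox.test(post, pre, paired = TRUE)        # Wilcoxon signed-rank
mw       <- wilcox.test(g1, g2)                          # Mann-Whitney U / rank-sum
kw       <- kruskal.test(list(g1, g2, g3))               # Kruskal-Wallis
sp       <- cor.test(pre, post, method = "spearman")     # Spearman's rho
fe       <- fisher.test(matrix(c(3, 1, 1, 4), nrow = 2)) # Fisher's exact, 2 x 2 counts

# Collect the p-values side by side
p_values <- sapply(list(signed_rank = w_paired, rank_sum = mw,
                        kruskal = kw, spearman = sp, fisher = fe),
                   function(res) res$p.value)
round(p_values, 3)
```

These tests answer questions about ranks or exact count patterns, so their results are not interchangeable with a mean difference from a t-test.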
Take-home
You don’t need to memorize every assumption. Ask: were the observations independent? does the picture look reasonable? are there wild outliers? Those three questions catch most real problems.
4.9 Reading a results paragraph
Journal results sections are dense by design. Once you can name each symbol, they become readable.
Reading strategy
Read results in this order: (1) what was compared, (2) the effect size and confidence interval (how big, how precise), (3) the p-value, and (4) whether the authors’ clinical claim is bigger than their data support.
Single-case research
Much of SLP evidence comes from carefully studying one client at a time. The logic is different from group statistics — and the graph is the star of the show.
Part 5 · Single-case research
5.1 Single-case design, visual analysis & Tau-U
Single-case experimental designs are not just “one person, before and after.” They rely on repeated measurement over time, distinct baseline and intervention phases, and replication (across behaviors, people, or settings) to build a convincing case.
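The six things visual analysis looks for
Level, trend, variability, immediacy of effect, overlap between phases, and consistency of data patterns across similar phases.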
Where Tau-U fits
Once the graph is convincing, a quantitative index can summarize the change. Tau-U (Parker et al., 2011) rolls two things you can already see on the graph into one number: how little the intervention points overlap with baseline, and the trend across the phases. Some versions also adjust for a trend that was already present during baseline.
Why a flat baseline matters
If a behavior is already climbing during the baseline — before treatment even starts — then part of the later rise might have happened anyway, so we can’t fully credit the intervention. That “rising baseline” is a classic threat to single-case studies. It’s why strong designs aim for a stable baseline first, and why careful readers (and trend-aware indices like Tau-U) try not to reward a treatment for a climb that was already underway. You don’t need the math behind that adjustment — just the habit of asking, “was the baseline flat before things changed?”
Simulated single-case indices
Baseline initiations averaged 1.8; intervention averaged 6.2. Nonoverlap was high — NAP = 0.99, PND = 92% — and the trend-aware Tau-U = 0.78.
Reporting note
Tau-U here is a teaching value; in real reporting, specify the software/function and whether baseline-trend correction was used.
One number never replaces the graph
Quantitative indices such as Tau-U can supplement visual analysis, but they do not replace strong design, repeated measurement, replication, and careful visual interpretation. Notice that near-perfect nonoverlap (NAP = 0.99) and the trend-aware Tau-U (0.78) tell slightly different stories — that’s why we look at several signals, and the picture, together.
Other indices you may see
Percentage of Nonoverlapping Data (PND), Percentage of Data Exceeding the Median (PEM), Nonoverlap of All Pairs (NAP), Tau / Tau-U, randomization tests (when the design allows), and standardized-mean-difference approaches. Each has strengths and blind spots.
Can I do this in Excel?
Excel is great for plotting session-by-session data (a line chart with a phase divider). But Tau-U is not a built-in Excel function — use R or a dedicated calculator and treat the value as a teaching figure. Interpretation reminder: graph first, index second.
Optional: plotting & indices in R
# Simulated AAC initiations: 6 baseline + 12 intervention sessions
session <- 1:18
phase <- rep(c("Baseline","Intervention"), c(6, 12))
y <- c(2,1,2,2,3,1, 4,3,5,4,4,5,7,7,9,7,8,11)
plot(session, y, type = "b") # line graph; add a phase divider
abline(v = 6.5, lty = 2) # phase-change line
# Tau-U: use a verified package/calculator (e.g., the 'SingleCaseES' package)
Indices above were computed for this simulated dataset; Tau-U values can differ slightly across software variants.
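The nonoverlap indices are simple enough to compute by hand in base R, which makes it clearer what each one counts. The sketch below uses the simulated session data from the snippet above. The Tau-U line is one variant only (A-versus-B nonoverlap plus intervention-phase trend, minus baseline trend, divided by all pairs); treating it as the variant behind the reported 0.78 is an assumption, and a verified package should be used for real reporting.

```r
# Simulated AAC initiations, split by phase
baseline     <- c(2, 1, 2, 2, 3, 1)
intervention <- c(4, 3, 5, 4, 4, 5, 7, 7, 9, 7, 8, 11)

# PND: percent of intervention points above the highest baseline point
pnd <- 100 * mean(intervention > max(baseline))

# NAP: compare every intervention point with every baseline point;
# an improvement counts 1, a tie counts 0.5
diffs <- outer(intervention, baseline, "-")
nap   <- (sum(diffs > 0) + 0.5 * sum(diffs == 0)) / length(diffs)

# Kendall-style S within one phase: signs of later-minus-earlier differences
trend_s <- function(x) {
  d <- outer(x, x, "-")        # d[i, j] = x[i] - x[j]
  sum(sign(d[lower.tri(d)]))   # lower triangle: i is the later session
}
pairs_within <- function(x) choose(length(x), 2)

# One Tau-U variant (assumed, see the note above)
tau_u <- (sum(sign(diffs)) + trend_s(intervention) - trend_s(baseline)) /
  (length(diffs) + pairs_within(intervention) + pairs_within(baseline))

round(c(pnd = pnd, nap = nap, tau_u = tau_u), 2)   # 91.67, 0.99, 0.78
```

NAP looks only at overlap between phases, while this Tau-U variant also folds in within-phase trend, which is one reason the two numbers differ.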
Integration & review
Pull it together: the reminders that keep you honest, the red flags to watch for, and a low-stakes quiz to check your reasoning.
Part 6 · Integration & review
6.1 Bringing it all together
Six interpretation reminders
1 · Question first
Outcome type + groups + repeated? decides the test — not the other way around.
2 · p is not proof
It’s “surprise under the null,” nothing more. Small ≠ important; large ≠ no effect.
3 · Size & precision
Always read the effect size and the confidence interval, not just the p-value.
4 · Statistical ≠ clinical
Ask whether the change matters for communication, participation, safety, or quality of life.
5 · Association ≠ cause
Correlation and regression describe patterns; design earns causal claims.
6 · Picture beats index
Especially single-case: the graph leads, Tau-U supports.
Red-flag phrases to question
“The p-value proves the treatment works.” · “p > .05, so it doesn’t work.” · “A significant result means a clinically important result.” · “The correlation shows X causes Y.” · “Tau-U = 0.8, so we don’t need the graph.” · “The effect is large, so it must matter for every client.”
One-page cheat sheet
| If you want to… | Use | Effect size | Excel |
|---|---|---|---|
| Compare the same people twice | Paired t-test | Cohen’s dz | T.TEST(…,2,1) |
| Compare two separate groups | Independent t-test | Cohen’s d | T.TEST(…,2,3) |
| Compare 3+ groups | One-way ANOVA | η² | ToolPak › ANOVA |
| Relate two continuous variables | Correlation | r, r² | CORREL |
| Predict an outcome | Regression | slope, R² | ToolPak › Regression |
| Relate two categories | Chi-square | Cramér’s V | CHISQ.TEST |
| Track one client over time | Visual analysis (+ Tau-U) | Tau-U / nonoverlap | Plot only |
You made it
You don’t need to compute these by hand to be a sharp consumer of research. If you can name the question, read the picture, and weigh size, precision, and clinical meaning, you can interpret most SLP studies with confidence. Revisit any section any time — your progress isn’t stored, so explore freely.