Inferential Statistics for SLP Research — A Beginner-Friendly Interactive Overview (SLP 680)

Part 1 of 6

Foundations & mindset

Before any test or p-value: what statistics are for, the difference between summarizing and inferring, and why the type of data you collect quietly decides everything that follows.

Part 1 · Foundations

1.1 Welcome: statistics as careful reasoning

If statistics has ever felt like a wall of symbols, you are in good company — and you are in the right place. This lesson is about reasoning, not arithmetic. The goal is for you to read a results section in a journal article and understand what it is really claiming (and what it is not).

Key idea

Researchers almost never get to measure everyone. They study a sample and use inferential statistics to make cautious, uncertainty-aware claims about the larger group or process they care about. Statistics support clinical reasoning; they never replace your judgment about an individual client.

Take-home

You do not need to memorize formulas to begin understanding results. Start with the question and the picture; the notation will make sense later.

1.2 Descriptive vs. inferential statistics

Two jobs, two different questions. Getting them straight makes everything else easier.

Describe

“What happened in the data I have?”

Descriptive statistics summarize the sample in front of you — the mean, median, standard deviation, a histogram. They make no claim beyond these specific people.

Example: in a class project, 30 children average 72% consonants correct. That number describes those 30 children, full stop.

Infer

“What might this say about the bigger picture?”

Inferential statistics use the sample to reason — with explicit uncertainty — about a larger population or process you could not measure directly.

Example: could a result like this plausibly hold for the broader group of children with speech sound disorder, or is it likely just the luck of who we sampled?

Can I do this in Excel?

Yes — descriptive statistics are Excel’s home turf: =AVERAGE(range), =MEDIAN(range), =STDEV.S(range), =COUNT(range), =MIN(range), =MAX(range). Interpretation reminder: these describe your sample. Reasoning beyond it is the inferential step that comes next.
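
For readers who prefer R to Excel, the same sample summaries can be sketched as below. The seven scores are hypothetical values invented for illustration, not data from this lesson.

```r
# Hypothetical percent-consonants-correct scores for seven children
# (made-up values, for illustration only)
pcc <- c(65, 70, 72, 80, 68, 75, 74)

n_children <- length(pcc)   # like =COUNT(range)
pcc_mean   <- mean(pcc)     # like =AVERAGE(range)
pcc_median <- median(pcc)   # like =MEDIAN(range)
pcc_sd     <- sd(pcc)       # sample SD, like =STDEV.S(range)
pcc_range  <- range(pcc)    # c(min, max), like =MIN(range) and =MAX(range)

cat("n =", n_children, " mean =", pcc_mean, " median =", pcc_median,
    " SD =", round(pcc_sd, 2), "\n")
```

As with the Excel functions, every number here describes these seven children only.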

Key idea

Describe the sample first; infer second. Inferential statistics always carry uncertainty — that uncertainty is the honest part, not a flaw.

1.3 Variables, measurement & data types

Here is the secret that makes “choosing a test” far less scary: the test you can use is mostly decided by what kind of variables you have. Learn the types and you are halfway there.

Simulated SLP examples of each variable type.
Type	What it is	SLP example
Continuous	Numbers on a scale; in-between values make sense	Naming score, intelligibility rating, vocabulary score
Categorical	Named groups, no order	Treatment group, diagnosis category, responder / non-responder
Ordinal	Ordered categories, uneven gaps	Severity rating (mild / moderate / severe)
Count	Whole-number tallies	Number of communicative initiations in a session
Percentage	Bounded 0–100%, often skewed near the edges	Percent consonants correct, percent syllables stuttered

How these map to the four levels of measurement

If you learned the classic NOIR hierarchy — Nominal, Ordinal, Interval, Ratio — the everyday data types above sit right on top of it. The big split for choosing a test is categorical (nominal/ordinal) vs. numeric (interval/ratio); the level just tells you what math is fair to do.

The four levels of measurement (NOIR) and the data types they cover.
Level	What it adds	Data type(s) here	Simulated SLP example
Nominal	Named groups, no order	Categorical	Treatment group; responder / non-responder
Ordinal	Order, but unequal gaps	Ordinal	Severity rating (mild / moderate / severe)
Interval	Equal gaps, but no true zero	Continuous scores without a true zero	A standardized test standard score (mean = 100)
Ratio	Equal gaps and a true zero	Most continuous, plus counts & percentages	Naming score, # initiations, % consonants correct

Two practical reminders: don’t treat ordinal ratings as if the gaps were equal (mild→moderate may not equal moderate→severe), and only ratio scales let you say “twice as much” (a true zero means “0 = none”). Counts are discrete and percentages are bounded at 0 and 100, but both have a true zero, so they are usually treated as ratio-level; keep in mind that they are often skewed.
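
One way to make these levels concrete is to see how statistical software stores them. The sketch below uses R with hypothetical values; the variable names are illustrative and not part of the lesson's datasets.

```r
# Hypothetical values, for illustration only

# Nominal: named groups with no order
group <- factor(c("treatment", "control", "treatment"))

# Ordinal: ordered categories (the gaps are NOT assumed equal)
severity <- factor(c("mild", "severe", "moderate"),
                   levels = c("mild", "moderate", "severe"),
                   ordered = TRUE)

# Count: whole-number tallies
initiations <- c(3L, 0L, 7L)

# Continuous: in-between values make sense
naming_score <- c(41.5, 58.0, 49.2)

table(group)                 # counting is the fair summary for nominal data
severity[2] > severity[3]    # order comparisons are allowed for ordinal data
mean(naming_score)           # means are fair for numeric data
```

Storing a severity rating as an ordered factor rather than as 1/2/3 is one guard against accidentally averaging it as if the gaps were equal.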

Key idea

Three questions drive almost every test choice: What type is my outcome? How many groups or measurements? Are the same people measured more than once? Keep these in your back pocket.

Part 2 of 6

From sample to population

The single idea at the heart of inference: a sample is a blurry snapshot of something bigger. Once you feel how samples wobble, p-values and confidence intervals stop being mysterious.

Part 2 · Sample to population

2.1 Population and sample

Four words that get used loosely in conversation but precisely in research.

Population

Everyone you care about. All preschoolers receiving speech-sound services in a district.

Sample

The few you actually measure. 30 children enrolled in a study.

Parameter

The true value in the population. Usually unknown.

Statistic

The value from your sample. Our best estimate of the parameter.

Try it · Draw a sample

Each draw is a new study. Watch the sample mean dance.

Simulated population of preschool speech-sound accuracy, with the sample mean displayed for each draw (simulated figure)

What to notice

The population never changed, yet each sample mean is a little different. That wobble is sampling variability — and it is exactly what inference has to account for. The true population mean here is about 70.
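
The same wobble can be reproduced in a few lines of R. This is a minimal sketch: the population mean of 70 comes from the text above, while the population SD of 12, the population size, and the sample size of 30 are assumptions chosen for illustration.

```r
set.seed(680)

# Simulated population (mean 70 from the text; SD = 12 is an assumption)
population <- rnorm(10000, mean = 70, sd = 12)

# One "study" = one random sample of n children and its mean
draw_sample_mean <- function(n = 30) {
  mean(sample(population, size = n))
}

# Five studies of the same unchanged population
sample_means <- replicate(5, draw_sample_mean())

round(sample_means, 1)       # five different estimates
round(mean(population), 1)   # the parameter they are all estimating
```

Each run of `draw_sample_mean()` plays the role of one new study; the spread among the five values is sampling variability.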

SLP translation

When a study reports a sample mean, it is offering an estimate of a parameter you cannot see. Two honest studies of the same population can land on different numbers — neither is “wrong.”

2.2 Sampling variability and uncertainty

If we repeated the same study many times, the sample means would form their own distribution. Its width tells us how much to trust a single estimate.

Try it · Sample-size slider

Bigger samples, steadier estimates

Distribution of 600 simulated sample means, shown at sample size n = 10 (simulated figure)

What to notice

Larger samples usually produce less variable estimates (the spread of sample means — the standard error — shrinks). But no single sample is ever perfect, and a big sample of a biased process is still biased.
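
A short simulation can show the standard error shrinking as n grows. This sketch assumes a normal population with mean 70 and SD 12 (the SD is an assumption) and repeats each "study" 600 times, as in the figure.

```r
set.seed(680)

# Means of `reps` simulated studies, each with n participants
sim_sample_means <- function(n, reps = 600, mu = 70, sigma = 12) {
  replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))
}

sizes <- c(10, 40, 160)

# Observed spread of the sample means at each sample size
observed_se <- sapply(sizes, function(n) sd(sim_sample_means(n)))

# Textbook standard error of the mean: sigma / sqrt(n)
theoretical_se <- 12 / sqrt(sizes)

data.frame(n = sizes,
           observed_se = round(observed_se, 2),
           theoretical_se = round(theoretical_se, 2))
```

Quadrupling the sample size roughly halves the standard error, which is why precision gets expensive quickly. The simulation says nothing about bias: a biased sampling process stays biased at any n.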

Why this matters

Inferential statistics ask: is the difference we observed bigger than the wobble we’d expect from sampling alone? Everything from here builds on that one question.

2.3 Distributions: shape matters

A distribution shows how values are spread out. Three features — center, spread, and shape — tell you most of what you need, and they warn you when a “mean” might be misleading.

Roughly normal

Aphasia naming scores often spread fairly symmetrically around a center — the mean is a good summary.

Right-skewed

Percent syllables stuttered piles up near zero with a long right tail — the mean gets pulled upward, so the median may describe a “typical” client better.

Ceiling effect

After successful articulation treatment, many children score near 100% — the test can’t capture further gains, which can hide real differences.
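
To see how a right skew pulls the mean away from the median, here is a small R sketch. The ten values are hypothetical percent-syllables-stuttered scores invented for illustration.

```r
# Hypothetical percent-syllables-stuttered values (made up, right-skewed)
pss <- c(0.5, 1, 1, 1.5, 2, 2, 3, 4, 9, 16)

pss_mean   <- mean(pss)     # pulled upward by the two high values
pss_median <- median(pss)   # closer to a "typical" client

cat("mean =", pss_mean, " median =", pss_median, "\n")

# Quick text-based look at the shape (no graphics window needed)
stem(pss)
```

When the mean sits well above the median like this, reporting the median (and showing the picture) is usually the more honest summary.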

Can I do this in Excel?

Yes. Make a quick histogram with Insert → Chart → Histogram (newer Excel) or the Analysis ToolPak → Histogram. Interpretation reminder: always look at the shape before trusting a mean — a skew or ceiling can change which summary is honest.

Key idea

Not all SLP outcomes are normally distributed. Percentages and counts are often bounded and skewed, and outliers may be real clinical variability — or a data-entry slip worth checking.

Part 3 of 6

The logic of inference

Hypothesis testing, p-values, confidence intervals, errors, and effect size — the reasoning toolkit, taught as ideas you can explain out loud, not symbols to memorize.

Part 3 · Logic of inference

3.1 Hypothesis testing basics

Hypothesis testing is a structured way of asking, “could the wobble of random sampling alone explain what we saw?”

The core question

If the null hypothesis were a reasonable description of the world, how surprising would our sample result be? A result that would be very unusual under the null gives us evidence against it.

Null hypothesis (H₀)

The “nothing special” baseline. A naming treatment produces no average change in naming score.

Alternative hypothesis (H₁)

The “something is going on” claim. The treatment is associated with a change in naming score.

Try it · Predict, then reveal

How surprising is this result under the null?

The curve shows what sample results look like if the null (no effect) were true. The marker shows what one study actually observed.

“No-effect” world vs. the observed result (conceptual figure)

Your call: would a result this far out be surprising if the treatment truly did nothing?

Take-home

A test statistic just measures “how far out” the result sits; the p-value turns that distance into a probability of seeing something at least this extreme under the null. That is all a p-value is.
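
The step from "how far out" to a probability can be sketched in R. The first t value and its degrees of freedom are the ones this lesson reports for its simulated paired t-test; the second t value is a hypothetical comparison.

```r
# Two-sided p-value: probability of a t at least this extreme (in either
# direction) if the null hypothesis were true
p_from_t <- function(t_stat, df) {
  2 * pt(-abs(t_stat), df = df)
}

# t(19) = 5.84, as reported for the simulated paired t-test in this lesson
p_far_out <- p_from_t(5.84, df = 19)

# A hypothetical, more modest result with the same degrees of freedom
p_modest <- p_from_t(2.00, df = 19)

p_far_out < 0.001   # TRUE: very unusual under the null
p_modest            # a little above .05
```

The function does nothing more than measure the tail area beyond the observed test statistic, which is all the take-home message claims.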

3.2 p-values, confidence intervals & errors

This is where careful clinicians separate themselves from careless ones. Most statistical mistakes in practice are misreadings of these three ideas.

What a p-value is — and is not

Common mistakes

A p-value is not the probability that the null hypothesis is true. p < .05 is not proof that a treatment works. p > .05 is not proof that there is no effect — it may just mean the study was small or noisy.

A more honest sentence: “A p-value describes how unusual our data (or more extreme data) would be if a specified no-effect model were true.”

Confidence intervals show precision, not just yes/no

A 95% confidence interval gives a range of plausible values for the real effect. A narrow interval says “we’ve pinned this down”; a wide one says “we’re still quite unsure.” Two results can both be “significant” while one is far more precise than the other.

Type I error

False positive

Concluding a fluency intervention helps when the apparent effect is mostly sampling noise. You act on something that isn’t real.

Type II error

False negative

Missing a genuine AAC intervention effect because the study was too small or too noisy to detect it. You overlook something that is real.
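
Both error types can be illustrated with a simulation. This is a sketch under stated assumptions: standardized scores, 20 per group for the no-effect scenario, and a true effect of 0.5 SD with only 10 per group for the small-study scenario. None of these settings come from the lesson's datasets.

```r
set.seed(680)
n_sims <- 2000

# Scenario 1: the treatment truly does nothing (both groups share one population)
p_null <- replicate(n_sims, t.test(rnorm(20), rnorm(20))$p.value)
type1_rate <- mean(p_null < 0.05)    # how often we "find" an effect that is not real

# Scenario 2: a real 0.5 SD effect, but a small study (10 per group)
p_alt <- replicate(n_sims, t.test(rnorm(10, mean = 0.5), rnorm(10))$p.value)
type2_rate <- mean(p_alt >= 0.05)    # how often we miss an effect that is real

round(c(type1 = type1_rate, type2 = type2_rate), 3)
```

The false-positive rate should land near the .05 threshold by design, while the miss rate in the small study is typically far higher, which is why "not significant" in a small study is weak evidence of no effect.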

Can I do this in Excel?

Excel functions (e.g., T.TEST) will hand you a p-value, but a number is not an interpretation. Interpretation reminder: always pair a p-value with the effect size, the confidence interval, and clinical judgment.

3.3 Effect size & clinical significance

Three questions that sound alike but are not:

Is there evidence of an effect? (significance)
How big is it? (effect size)
Does it matter for this client? (clinical significance)

The effect size you’ll see for each test

Common effect-size measures by analysis (rough conventions; context always wins).
Analysis	Effect size	Reads as
Paired t-test	Cohen’s d_z (standardized mean change)	Change relative to its variability
Independent t-test	Cohen’s d	Group gap in standard-deviation units
ANOVA	eta-squared (η²)	Share of variance explained by group
Correlation	r and r²	Strength of association; variance shared
Regression	slope and R²	Change per unit; variance explained
Chi-square	Cramér’s V	Strength of categorical association
Single-case	Tau-U / nonoverlap	Supplement to visual analysis

Take-home

Statistical significance, effect size, and clinical importance answer different questions. A tiny p-value can sit on a trivial effect; a meaningful change can occur in a single client where no group p-value exists at all.

Part 4 of 6

Choosing & reading tests

A friendly decision guide, then a tour of the six tests you’ll meet most often — each with a simulated SLP example, a picture, a results sentence, and what it does (and doesn’t) tell you. We finish with assumptions and how to decode a results paragraph.

Part 4 · Choosing & reading tests

4.1 Choosing a common statistical test

You rarely need to invent anything. Match your question, your variable types, and your design to a familiar test.

A beginner-friendly starting point — not a rigid rulebook.
Research situation	Common test	Simulated SLP example
Same participants measured twice	Paired t-test	Aphasia naming before vs. after treatment
Two independent groups	Independent t-test	Two articulation treatment approaches
Three or more independent groups	One-way ANOVA	Low / moderate / high language-treatment dosage
Two continuous variables	Correlation	Therapy attendance and vocabulary gain
Predict a continuous outcome	Regression	Predict vocabulary gain from therapy hours + baseline
Two categorical variables	Chi-square	Service model and responder category
Repeated measures in one / a few cases	Single-case visual analysis (+ optional Tau-U)	AAC initiations across baseline & intervention

Pattern to remember

Outcome type + number of groups + repeated or not → test. The six sections below all follow the same shape so the pattern becomes second nature.

Part 4 · The six tests

4.2 Paired t-test

Question: did the same people change? Population: adults with aphasia. Design: one group, measured before and after naming treatment.

Why this test: each person contributes a pair of scores (pre and post), so we analyze each person’s change. Pairing removes a lot of person-to-person noise.

Naming score: before → after (simulated figure)

Each line is one simulated participant. Most slope upward — but not all.

Distribution of change scores (simulated figure)

Post − pre for each person. Centered well above zero (the dashed line).

Simulated result

In 20 simulated adults with aphasia, mean naming score rose from 49.6 before treatment to 58.3 after — an average gain of 8.7 points (SD of change = 6.6). A paired-samples t-test indicated a change larger than sampling noise alone would readily produce, t(19) = 5.84, p < .001, 95% CI for the mean change [5.6, 11.8], Cohen’s d_z = 1.31.

M change +8.7 · 95% CI 5.6 to 11.8 · t(19) = 5.84 · p < .001 · d_z = 1.31 (large)

Plain-language interpretation

On average, scores improved by about 9 points, and the confidence interval (5.6–11.8) stays well above zero, so every plausible value for the true average gain is positive. The effect is large for these data (d_z ≈ 1.3).

Clinical caution

A gain on a naming test is not the same as better everyday communication, and this is simulated data — it says nothing about whether this treatment “works.” Also, a pre-post design alone can’t rule out practice effects or natural recovery.

Optional: how this is done in R

# Simulated aphasia naming data (n = 20); seed for reproducibility
set.seed(680)
pre  <- rnorm(20, mean = 48, sd = 9)
post <- pre + rnorm(20, mean = 10, sd = 7)

# Paired-samples t-test (gives t, df, p, and the 95% CI)
t.test(post, pre, paired = TRUE)

# Effect size: standardized mean change (Cohen's dz)
change <- post - pre
mean(change) / sd(change)

Method shown in R; the teaching values above come from this simulated dataset.
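
To see where the confidence interval, t, and d_z come from, the arithmetic can be rebuilt from the reported summary values alone (n = 20, mean change 8.7, SD of change 6.6). This is a sketch from rounded inputs, so t and d_z come out slightly different from the reported 5.84 and 1.31.

```r
# Reported summary values for the simulated paired design
n           <- 20
mean_change <- 8.7
sd_change   <- 6.6

# Standard error of the mean change
se_change <- sd_change / sqrt(n)

# 95% CI: mean change plus/minus the critical t times the standard error
t_crit <- qt(0.975, df = n - 1)
ci_95  <- mean_change + c(-1, 1) * t_crit * se_change

# Test statistic and standardized mean change (Cohen's d_z)
t_stat <- mean_change / se_change
d_z    <- mean_change / sd_change

round(ci_95, 1)    # matches the reported interval after rounding
round(t_stat, 2)   # close to, not identical to, the reported t (rounded inputs)
round(d_z, 2)      # close to the reported d_z
```

Note that t and d_z use the same numerator: t divides the mean change by its standard error, d_z by the SD of the change scores.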

Can I do this in Excel?

Yes: =T.TEST(pre_range, post_range, 2, 1) (the 2 = two-tailed, 1 = paired). Interpretation reminder: Excel returns only the p-value — you still need the mean change, its confidence interval, and an effect size to interpret the result.

Quick check

The CI for the mean change is [5.6, 11.8]. What does that tell a clinician?

A plausible-value answer: across this interval the average improvement is positive, and even the low end (≈5.6 points) is a meaningful-sized gain on this scale — though “meaningful for the client” still depends on the measure and the person.

4.3 Independent t-test

Question: do two separate groups differ on average? Population: children with speech sound disorder. Design: two independent groups — traditional vs. motor-based articulation treatment.

Why this test: different children are in each group (no pairing), so we compare two group means and ask whether the gap is bigger than sampling variability alone would readily produce.

Post-treatment accuracy by group (simulated figure)

Dots are individual children; the heavy marker is each group mean with its 95% CI.

Simulated result

Post-treatment accuracy averaged 72.2% (traditional, n = 20) vs. 77.3% (motor-based, n = 20) — a difference of 5.1 points. A Welch independent-samples t-test gave t(37.5) = 2.01, p = .052, 95% CI for the difference [−0.0, 10.3], Cohen’s d = 0.64.

Difference 5.1 pts · 95% CI −0.0 to 10.3 · t(37.5) = 2.01 · p = .052 · d = 0.64 (medium)

The teaching moment: p = .052

This p-value sits just above the usual .05 line. A careless reading says “no difference.” But look closer: the effect is medium (d = 0.64), and the confidence interval runs from essentially zero up to about 10 points. The honest summary is “inconclusive — possibly a meaningful difference, but this study can’t pin it down.” .05 is a convention, not a wall, and p > .05 is not proof of no effect.

Effect size & clinical note

A medium effect that didn’t reach significance may suggest the study was underpowered or too imprecise. A larger sample might clarify it, but interpretation still depends on the CI, design, and measurement quality. Clinically, a ~5-point accuracy edge may or may not matter — that depends on the child, the goals, and the cost of each approach.

Optional: how this is done in R

# Simulated articulation data: two independent groups of 20
set.seed(680)
trad  <- rnorm(20, 70, 9)
motor <- rnorm(20, 78, 9)
score <- c(trad, motor)
group <- factor(rep(c("Traditional", "Motor"), each = 20))

# Welch's t-test (unequal variances) is R's default — a safe beginner choice
t.test(score ~ group)
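
`t.test()` does not report Cohen's d, so here is one way it might be computed. The helper `cohens_d()` is a hypothetical function written for this sketch (pooled-SD version), and the two small groups are made-up values. The shortcut from t relies on the groups being equal in size, where the Welch and pooled-variance t statistics coincide.

```r
# Cohen's d with a pooled standard deviation (hypothetical helper)
cohens_d <- function(x, y) {
  n_x <- length(x)
  n_y <- length(y)
  pooled_sd <- sqrt(((n_x - 1) * var(x) + (n_y - 1) * var(y)) /
                      (n_x + n_y - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Made-up accuracy scores for two small independent groups
motor_example <- c(78, 85, 70, 81, 76)
trad_example  <- c(72, 75, 66, 79, 68)
cohens_d(motor_example, trad_example)

# Shortcut from the reported result: t = 2.01 with 20 children per group
d_from_t <- 2.01 * sqrt(1 / 20 + 1 / 20)
round(d_from_t, 2)   # about 0.64, in line with the reported d
```

Reading d as "the group gap in standard-deviation units" makes it comparable across outcome measures with different scales.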

Can I do this in Excel?

=T.TEST(group1_range, group2_range, 2, 3) (the final 3 requests an unequal-variance / Welch-style two-sample test). Interpretation reminder: a borderline p-value is a cue to weigh the effect size and CI, not to declare “nothing here.”

Quick check

A colleague says “p = .052, so the treatments are equally effective.” What’s wrong with that?

It treats “not significant” as “no difference.” The medium effect (d = 0.64) and a CI reaching to +10 points mean a real, useful difference is still quite plausible — the study just wasn’t precise enough to confirm it.

4.4 One-way ANOVA

Question: do three or more groups differ? Population: children in pediatric language intervention. Design: three independent dosage groups — low, moderate, high.

Why this test: with three groups, running several t-tests inflates the chance of a false positive. ANOVA asks one combined question first: is there any difference among the group means?

Language improvement by treatment dosage (simulated figure)

Dots are individual children; markers show each group mean with its 95% CI. Note the upward step from low → high.

Simulated result

Mean improvement was 5.5 (low), 11.7 (moderate), and 16.5 (high), 15 children per group. A one-way ANOVA found differences unlikely under a no-difference model, F(2, 42) = 23.18, p < .001, η² = 0.52.

Means 5.5 / 11.7 / 16.5 · F(2, 42) = 23.18 · p < .001 · η² = 0.52 (large)

Common mistake

A significant ANOVA tells you at least one group differs — not which ones. To compare specific pairs (e.g., moderate vs. high), you need planned contrasts or post-hoc tests (like Tukey’s HSD), which control for multiple comparisons.

Effect size & clinical note

η² = 0.52 means about half the variability in improvement tracks with dosage group — a large effect here. Clinically, “more is better” has limits: higher dosage costs time and money and can hit ceilings or fatigue. Effect size informs the trade-off; it doesn’t settle it.

Optional: how this is done in R

# Simulated dosage data: 15 children per group
set.seed(680)
improve <- c(rnorm(15, 8, 4.5), rnorm(15, 12, 4.5), rnorm(15, 16, 4.5))
dose    <- factor(rep(c("Low","Moderate","High"), each = 15),
                  levels = c("Low","Moderate","High"))

fit <- aov(improve ~ dose)
summary(fit)            # F, df, p
TukeyHSD(fit)           # which groups differ (post-hoc)
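
`summary(fit)` does not print η², but it can be derived from the ANOVA table. The sketch below uses a tiny made-up dataset (five children per group) so the arithmetic is easy to follow, then recovers η² from the F and degrees of freedom reported above. The helper `eta_sq_from_f()` is a hypothetical convenience function.

```r
# Made-up improvement scores, five children per dosage group
low      <- c(4, 6, 5, 7, 3)
moderate <- c(10, 12, 11, 13, 9)
high     <- c(15, 17, 16, 18, 14)

improve_example <- c(low, moderate, high)
dose_example <- factor(rep(c("Low", "Moderate", "High"), each = 5),
                       levels = c("Low", "Moderate", "High"))

# Eta-squared = between-group sum of squares / total sum of squares
anova_table <- summary(aov(improve_example ~ dose_example))[[1]]
ss          <- anova_table[["Sum Sq"]]   # c(between groups, residual)
eta_sq      <- ss[1] / sum(ss)

# The same quantity from a reported F and its degrees of freedom
eta_sq_from_f <- function(f, df1, df2) (f * df1) / (f * df1 + df2)
eta_sq_reported <- eta_sq_from_f(23.18, df1 = 2, df2 = 42)

round(eta_sq, 2)            # for the made-up data
round(eta_sq_reported, 2)   # about 0.52, in line with the reported value
```

The second route is handy when reading articles that report F but omit the effect size.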

Can I do this in Excel?

Data → Data Analysis → Anova: Single Factor (enable the Analysis ToolPak first). Interpretation reminder: the ToolPak gives the overall F and p, but won’t do post-hoc pairwise comparisons for you — and ANOVA never tells you whether a difference is clinically important.

Quick check

The ANOVA is significant. Can you conclude “high dosage beats moderate”?

Not yet. ANOVA only says the groups aren’t all equal. You’d need a post-hoc comparison (e.g., Tukey) to claim a specific high-vs-moderate difference.

4.5 Correlation

Question: do two continuous things move together? Population: children in a vocabulary program. Variables: therapy attendance (%) and vocabulary gain.

Why this test: both variables are continuous and we’re asking about association (direction and strength), not group differences. The correlation coefficient r ranges from −1 to +1.

Attendance vs. vocabulary gain (simulated figure)

Each dot is a child. The line summarizes the trend; the scatter around it shows the relationship is far from perfect.

Simulated result

Across 40 simulated children, attendance and vocabulary gain were moderately and positively associated, r = 0.47, p = .002, 95% CI [0.19, 0.68], r² = 0.22.

r = 0.47 (moderate) · 95% CI 0.19 to 0.68 · p = .002 · r² = 0.22

Correlation is not causation

Children who attend more might also have more home support, milder profiles, or higher motivation — any of which could drive vocabulary gains. A correlation flags a pattern worth studying; it does not show that attendance causes the gain. Always view the scatterplot: one outlier or a curved pattern can masquerade as (or hide) a linear r.

Effect size note

Here r itself is the effect size. r² = 0.22 means about 22% of the variation in gain is shared with attendance — leaving ~78% to everything else. Moderate, not destiny.

Optional: how this is done in R

# Simulated attendance & vocabulary-gain data (n = 40)
set.seed(680)
attendance <- rnorm(40, 70, 14)
vocab_gain <- 0.35 * attendance + rnorm(40, 0, 7)

cor.test(attendance, vocab_gain)   # r, p, and 95% CI for r
plot(attendance, vocab_gain)       # always look at the scatterplot
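
The confidence interval that `cor.test()` reports for a Pearson r is based on the Fisher z transformation. As a sketch, the same arithmetic can be applied to the reported r = 0.47 and n = 40 to check the published interval by hand.

```r
r <- 0.47   # reported correlation
n <- 40     # reported sample size

# Fisher z: transform r, build a normal-theory interval, transform back
z     <- atanh(r)
se_z  <- 1 / sqrt(n - 3)
ci_95 <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)

# t statistic and two-sided p-value for testing r = 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_val  <- 2 * pt(-abs(t_stat), df = n - 2)

round(ci_95, 2)   # in line with the reported [0.19, 0.68]
round(p_val, 3)   # in line with the reported p = .002
round(r^2, 2)     # shared variance, about 0.22
```

The interval is not symmetric around r because correlations are bounded at ±1; the transformation handles that.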

Can I do this in Excel?

=CORREL(x_range, y_range) (or =PEARSON(...)) gives r. Interpretation reminder: pair it with a scatterplot, and never read causation into r.

Quick check

r = 0.47. A parent asks, “So therapy attendance causes vocabulary growth?”

Careful answer: attendance and gain tend to rise together in these data, but a correlation can’t establish cause — other factors may explain both. We’d need a controlled design to speak to causation.

4.6 Regression

Question: can we predict an outcome, and how much does a predictor matter? Population: children in a vocabulary program. Outcome: vocabulary gain, predicted from therapy hours (and baseline language).

Why this test: regression draws the best-fit line through the data. Its slope says how much the outcome changes per unit of the predictor; R² says how much of the outcome’s variation the model explains.

Simulated result

Simple model: gain = 11.72 + 0.41 × hours. Each additional therapy hour predicts about +0.41 points of gain, p < .001, slope 95% CI [0.22, 0.59], R² = 0.29. Adding baseline language as a second predictor: each hour +0.34 (p < .001), each baseline point +0.25 (p = .002), R² = 0.43.

Slope (hours) 0.41 · 95% CI 0.22 to 0.59 · p < .001 · R² = 0.29 → 0.43 w/ baseline

Prediction ≠ proof of causation

A significant slope means the predictor helps forecast the outcome in these data. It does not prove that adding hours causes gains — unmeasured factors (severity, support) may drive both. And R² = 0.43 still leaves most of the variation unexplained.

So where does “cause” come from? The design — not the test

This trips up almost everyone, so let’s say it plainly: no statistical test proves causation on its own. A t-test, ANOVA, and regression are all just tools for detecting differences or associations in whatever data you hand them. What can support a causal interpretation is how the study was run — especially whether participants were randomly assigned to conditions (a true experiment). Random assignment is a major reason a design can support a causal interpretation because it helps make groups comparable, assuming the study is otherwise well implemented.

That’s why the same regression can mean different things. Run it on a randomized treatment and the slope estimates a causal effect; run it on observational predictors people weren’t randomized to — therapy hours they chose, baseline severity they came in with — and it only describes association, because confounders may drive both. So the question to ask isn’t “which test?” but “were participants randomly assigned?” (The same caution applies to t-tests and ANOVA run on groups that formed naturally rather than by randomization.)

Optional: how this is done in R

# Simulated data (n = 50)
set.seed(680)
hours    <- runif(50, 5, 40)
baseline <- rnorm(50, 60, 12)
gain     <- 2 + 0.45*hours + 0.15*baseline + rnorm(50, 0, 6)

summary(lm(gain ~ hours))               # slope, p, R-squared
summary(lm(gain ~ hours + baseline))    # multiple regression
confint(lm(gain ~ hours))               # 95% CI for the slope

Can I do this in Excel?

Data → Data Analysis → Regression, or formulas =SLOPE(y,x), =INTERCEPT(y,x), =RSQ(y,x), =LINEST(y,x,TRUE,TRUE). Interpretation reminder: focus on the slope, its CI and p-value, and R² — and remember: causation comes from the study design and how well it was implemented, not from the test.

Quick check

The slope for hours is 0.41. What does that mean in words?

For every extra therapy hour, the model predicts about 0.41 more points of vocabulary gain, on average, within the observed range — a prediction, not a guarantee for any one child.
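
The slope interpretation can be checked by plugging numbers into the reported simple model, gain = 11.72 + 0.41 × hours. The function name `predict_gain()` is hypothetical, and the example of 20 hours is an arbitrary value inside the 5-to-40-hour range used in the lesson's simulation code.

```r
# Reported simple regression equation, wrapped as a function
predict_gain <- function(hours, intercept = 11.72, slope = 0.41) {
  intercept + slope * hours
}

gain_at_20 <- predict_gain(20)   # predicted average gain at 20 therapy hours
gain_at_21 <- predict_gain(21)

gain_at_20                # 11.72 + 0.41 * 20
gain_at_21 - gain_at_20   # one extra hour adds the slope
```

These are predicted averages. Individual children will scatter around the line, and predictions outside the observed range of hours are not supported by the data.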

4.7 Chi-square test

Question: are two categorical variables associated? Population: clients across service models. Variables: service model (individual / group / hybrid) and response category (responder / non-responder).

Why this test: both variables are categories, and we’re comparing counts. Chi-square asks whether the observed counts differ from what we’d expect if the two variables were unrelated.

Responder rate by service model (simulated figure)

Bars show the % of responders within each model.

Observed counts (expected counts ≈ 22.7 responders / 17.3 non-responders per row).
Service model	Responder	Non-responder	Total
Individual	31	9	40
Group	13	27	40
Hybrid	24	16	40
Total	68	52	120

Expected count for a cell = row total × column total ÷ grand total.
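
The expected-count formula, and the chi-square statistic built from it, can be worked through by hand in R. This sketch uses the observed counts from the table above; the object names are illustrative.

```r
# Observed counts from the table (rows = service model, cols = response)
observed <- matrix(c(31,  9,
                     13, 27,
                     24, 16),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("Individual", "Group", "Hybrid"),
                                   c("Responder", "Non-responder")))

# Expected count = row total x column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square = sum over cells of (observed - expected)^2 / expected
chi_sq  <- sum((observed - expected)^2 / expected)
df_chi  <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = df_chi, lower.tail = FALSE)

round(expected, 1)   # about 22.7 and 17.3 in every row
round(chi_sq, 2)     # in line with the reported chi-square
```

Every row has the same expected counts here only because all three service models have 40 clients.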

Simulated result

Response category was associated with service model, χ²(2) = 16.76, p < .001, Cramér’s V = 0.37 (n = 120). Responder rates were 78% (individual), 33% (group), and 60% (hybrid).

χ²(2) = 16.76 · p < .001 · Cramér’s V = 0.37 (moderate) · n = 120

What chi-square does and doesn’t say

It tests whether observed counts differ from the counts expected if the two categorical variables were independent — it is not a test of means, and it doesn’t tell you why or whether the difference is clinically important. Self-selection (who ends up in each model) could explain a lot. Cramér’s V (0.37) gives the strength of the association.

Optional: how this is done in R

# Observed counts as a 3 x 2 table (rows = model, cols = response)
tab <- matrix(c(31, 9, 13, 27, 24, 16), nrow = 3, byrow = TRUE,
              dimnames = list(c("Individual","Group","Hybrid"),
                              c("Responder","Non-responder")))
chisq.test(tab)                 # chi-square, df, p
chisq.test(tab)$expected        # expected counts
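
`chisq.test()` does not return Cramér’s V or the responder rates, so a short follow-up sketch is given here. It rebuilds the same table so that it runs on its own; the formula used is V = sqrt(χ² / (n × (k − 1))), where k is the smaller of the number of rows and columns.

```r
# Same observed counts as in the table above
tab_v <- matrix(c(31, 9, 13, 27, 24, 16), nrow = 3, byrow = TRUE,
                dimnames = list(c("Individual", "Group", "Hybrid"),
                                c("Responder", "Non-responder")))

chi_sq_v <- unname(chisq.test(tab_v)$statistic)
n_total  <- sum(tab_v)
k_min    <- min(nrow(tab_v), ncol(tab_v))

# Cramér's V: chi-square rescaled to run from 0 (none) to 1 (perfect association)
cramers_v <- sqrt(chi_sq_v / (n_total * (k_min - 1)))

# Responder rate within each service model (row proportions)
responder_rate <- prop.table(tab_v, margin = 1)[, "Responder"]

round(cramers_v, 2)        # in line with the reported V
round(responder_rate, 3)   # row-wise responder proportions
```

Row proportions are usually the easiest way to describe what a significant chi-square actually looks like.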

Can I do this in Excel?

=CHISQ.TEST(observed_range, expected_range) returns the p-value — but you must build the expected table first (row × column ÷ grand total). Interpretation reminder: chi-square counts patterns; it doesn’t measure clinical importance.

Quick check

Why can’t we conclude “individual therapy is the best model” from this test?

Chi-square shows the variables are associated, not that the model caused better response. Clients weren’t randomly assigned, so differences in who chose each model could drive the pattern.

4.8 Assumptions, in plain language

Reframe

Assumptions are not magic rules that make a test “legal” or “illegal.” They are conditions that affect how much we can trust the result. When they’re badly violated, the p-value and CI can mislead.

Independence

Observations don’t influence each other. Testing 10 kids from the same classroom twice is not 20 independent data points.

Approximate normality

For t-tests/ANOVA, the data (or residuals) are roughly bell-shaped. Matters most with small samples.

Similar variability

Groups have roughly comparable spread. Welch’s t-test relaxes this for two groups.

Linearity

For correlation/regression, the relationship is roughly a straight line — check the scatterplot.

Outliers

A few extreme points can swing means, r, and slopes. Investigate; don’t just delete.

Expected counts

Chi-square gets shaky when expected cell counts are very small (a common rule of thumb is < 5).

What if an assumption isn’t met?

First, look — a plot or a quick check often shows the violation isn’t serious. If it is, these beginner-friendly alternatives lean on ranks instead of raw values, so they don’t need normality and they resist outliers:

Common nonparametric (rank-based) alternatives.
Instead of…	Try…	Helpful when…
Paired t-test	Wilcoxon signed-rank test	pre/post differences are skewed or have outliers
Independent t-test	Mann–Whitney U (Wilcoxon rank-sum)	two groups, non-normal or ordinal outcome
One-way ANOVA	Kruskal–Wallis test	three+ groups, non-normal outcome
Pearson correlation	Spearman’s rank correlation (ρ)	the relationship is monotonic but not straight, or has outliers
Chi-square	Fisher’s exact test	small expected cell counts (rule of thumb < 5)
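
Each alternative in the table has a base-R counterpart. The sketch below shows the calls on small made-up datasets (chosen to have no tied values, which keeps the rank tests exact); none of these numbers come from the lesson's simulations.

```r
# Made-up data with no ties, for illustration only
pre_np  <- c(42, 55, 48, 60, 39, 51, 47, 58)
post_np <- c(50, 57, 59, 64, 38, 63, 52, 71)

# Instead of a paired t-test: Wilcoxon signed-rank test
paired_p <- wilcox.test(post_np, pre_np, paired = TRUE)$p.value

group_a <- c(61, 68, 72, 75, 70, 66)
group_b <- c(74, 79, 83, 77, 81, 86)
group_c <- c(88, 90, 85, 92, 95, 89)

# Instead of an independent t-test: Mann-Whitney U (Wilcoxon rank-sum)
two_group_p <- wilcox.test(group_a, group_b)$p.value

# Instead of one-way ANOVA: Kruskal-Wallis test
three_group_p <- kruskal.test(list(group_a, group_b, group_c))$p.value

attendance_np <- c(55, 62, 70, 74, 81, 88, 93, 97)
gain_np       <- c(12, 15, 14, 20, 19, 25, 24, 30)

# Instead of Pearson's r: Spearman's rank correlation
rho <- unname(cor.test(attendance_np, gain_np, method = "spearman")$estimate)

# Instead of chi-square with small expected counts: Fisher's exact test
small_table <- matrix(c(8, 2, 3, 7), nrow = 2, byrow = TRUE)
fisher_p <- fisher.test(small_table)$p.value

round(c(paired = paired_p, two_group = two_group_p,
        three_group = three_group_p, fisher = fisher_p), 3)
round(rho, 2)
```

With real clinical data, ties are common; R then falls back to an approximate p-value and says so in a warning, which is expected behavior rather than an error.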

Beyond this lesson: when the outcome isn’t continuous-and-normal at all — counts, yes/no responses, rates — generalized linear models (e.g., logistic or Poisson regression) are the principled tool. And switching tests isn’t automatic: each alternative carries its own assumptions and answers a slightly different question (rank-based tests, for instance, trade away the tidy mean difference), so always look at the data first.

Take-home

You don’t need to memorize every assumption. Ask: were the observations independent? does the picture look reasonable? are there wild outliers? Those three questions catch most real problems.

4.9 Reading a results paragraph

Journal results sections are dense by design. Once you can name each symbol, they become readable. Click any highlighted term below.

Try it · Click to decode

Two simulated results sentences

“Naming improved from () to (); the gain was significant, , , , .”

“Dosage groups differed, , , . Attendance related to gain, , and therapy hours explained of the variance.”

Tap a highlighted term to see what it means, what it tells you, and what it does not tell you on its own.

Reading strategy

Read results in this order: (1) what was compared, (2) the effect size and confidence interval (how big, how precise), (3) the p-value, and (4) whether the authors’ clinical claim is bigger than their data support.

Part 5 of 6

Single-case research

Much of SLP evidence comes from carefully studying one client at a time. The logic is different from group statistics — and the graph is the star of the show.

Part 5 · Single-case research

5.1 Single-case design, visual analysis & Tau-U

Single-case experimental designs are not just “one person, before and after.” They rely on repeated measurement over time, distinct baseline and intervention phases, and replication (across behaviors, people, or settings) to build a convincing case.

Figure: AAC communicative initiations across sessions (simulated)

A vertical line separates the baseline phase from the intervention phase. Read the picture before any number.

The six things visual analysis looks for

Try it · Visual-analysis checklist

Inspect the graph above, then check what you see

Level: did the average shift between phases? (Compare baseline vs. intervention height.)
Trend: is there a slope within a phase? (Flat in baseline, rising in intervention?)
Variability: how bouncy are the points around the trend?
Immediacy: how fast did behavior change right after the phase line?
Overlap: do intervention points overlap with baseline values? (Less overlap = stronger effect.)
Consistency: would the pattern repeat across similar phases or cases?
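Several of these features have simple numeric companions. The sketch below uses the simulated AAC data from this lesson's R example to compute level (phase means), variability (phase standard deviations), trend (a least-squares slope within each phase), and overlap (the share of intervention points above the highest baseline point). The helper name `phase_slope` is made up for this example, and the least-squares slope is only one of several reasonable ways to summarize trend.

```r
# Simulated AAC initiations: 6 baseline + 12 intervention sessions
y     <- c(2,1,2,2,3,1, 4,3,5,4,4,5,7,7,9,7,8,11)
phase <- rep(c("Baseline", "Intervention"), times = c(6, 12))
by_phase <- split(y, phase)

# Level: mean of each phase
level <- sapply(by_phase, mean)

# Variability: standard deviation of each phase
variability <- sapply(by_phase, sd)

# Trend: least-squares slope (change per session) within one phase
phase_slope <- function(values) {
  unname(coef(lm(values ~ seq_along(values)))[2])
}
trend <- sapply(by_phase, phase_slope)

# Overlap: share of intervention points above the highest baseline point
above_max <- mean(by_phase$Intervention > max(by_phase$Baseline))

round(rbind(level, variability, trend), 2)
round(above_max, 2)
```

Immediacy and consistency are left out on purpose: they depend on what happens right at the phase line and across replications, which is easier to judge from the graph than from a single number.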

Where Tau-U fits

Once the graph is convincing, a quantitative index can summarize the change. Tau-U (Parker et al., 2011) rolls two things you can already see on the graph into one number: how little the intervention points overlap with baseline, and the trend within the intervention phase. Some versions also adjust for a trend that was already present during baseline.

Why a flat baseline matters

If a behavior is already climbing during the baseline — before treatment even starts — then part of the later rise might have happened anyway, so we can’t fully credit the intervention. That “rising baseline” is a classic threat to single-case studies. It’s why strong designs aim for a stable baseline first, and why careful readers (and trend-aware indices like Tau-U) try not to reward a treatment for a climb that was already underway. You don’t need the math behind that adjustment — just the habit of asking, “was the baseline flat before things changed?”
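One way to put a number on “was the baseline flat?” is Kendall's tau between session order and score within the baseline phase. This is a minimal sketch, not a required step: the first series is the baseline from this lesson's simulated AAC data, the second (`rising_baseline`) is invented purely for contrast, and the helper name `baseline_tau` is hypothetical. With only six points the value is a rough guide at best.

```r
# Baseline phase from the lesson's simulated AAC data (6 sessions)
flat_baseline   <- c(2, 1, 2, 2, 3, 1)
# Hypothetical contrast case: a baseline that is already climbing
rising_baseline <- c(1, 2, 2, 3, 4, 5)

# Kendall's tau between session order and score:
# values near 0 suggest a flat baseline, values near 1 a steady climb
baseline_tau <- function(values) {
  cor(seq_along(values), values, method = "kendall")
}

round(c(flat   = baseline_tau(flat_baseline),
        rising = baseline_tau(rising_baseline)), 2)
```

A value near zero for the first series fits what the graph shows: the baseline bounces but does not climb. The second series would raise exactly the concern described above.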

Simulated single-case indices

Baseline initiations averaged 1.8; intervention averaged 6.2. Nonoverlap was high — NAP = 0.99, PND = 92% — and the trend-aware Tau-U = 0.78.

M baseline 1.8 · M intervention 6.2 · NAP 0.99 · PND 92% · Tau-U 0.78

Reporting note

Tau-U here is a teaching value; in real reporting, specify the software/function and whether baseline-trend correction was used.

One number never replaces the graph

Quantitative indices such as Tau-U can supplement visual analysis, but they do not replace strong design, repeated measurement, replication, and careful visual interpretation. Notice that near-perfect nonoverlap (NAP = 0.99) and the trend-aware Tau-U (0.78) tell slightly different stories. Part of the gap is scale (NAP treats 0.5 as chance-level overlap, while Tau-type indices treat 0 as no change), and part is that Tau-U also folds in trend. That’s why we look at several signals, and the picture, together.

Other indices you may see

Percentage of Nonoverlapping Data (PND), Percentage of Data Exceeding the Median (PEM), Nonoverlap of All Pairs (NAP), Tau / Tau-U, randomization tests (when the design allows), and standardized-mean-difference approaches. Each has strengths and blind spots.

Can I do this in Excel?

Excel is great for plotting session-by-session data (a line chart with a phase divider). But Tau-U is not a built-in Excel function — use R or a dedicated calculator and treat the value as a teaching figure. Interpretation reminder: graph first, index second.

Optional: plotting & indices in R

# Simulated AAC initiations: 6 baseline + 12 intervention sessions
session <- 1:18
phase   <- rep(c("Baseline", "Intervention"), times = c(6, 12))
y       <- c(2,1,2,2,3,1, 4,3,5,4,4,5,7,7,9,7,8,11)

# Line graph of the session-by-session data
plot(session, y, type = "b",
     xlab = "Session", ylab = "Communicative initiations")
abline(v = 6.5, lty = 2)             # phase-change line between sessions 6 and 7
tapply(y, phase, mean)               # phase means (level): about 1.8 and 6.2
# Tau-U: use a verified package/calculator (e.g., the 'SingleCaseES' package)

Indices above were computed for this simulated dataset; Tau-U values can differ slightly across software variants.
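If you are curious where those numbers come from, the indices can be reproduced in a few lines of base R. Treat this as a teaching sketch, not a verified implementation. The helper names `trend_score` and `between_score` are made up, and the Tau-U variant shown (nonoverlap plus intervention trend minus baseline trend, divided by the number of pairs among all sessions) is an assumption, chosen because it reproduces the 0.78 quoted above. Other software uses different denominators or corrections and can return a different value for the same data.

```r
# Simulated AAC initiations: 6 baseline + 12 intervention sessions
y <- c(2,1,2,2,3,1, 4,3,5,4,4,5,7,7,9,7,8,11)
a <- y[1:6]     # baseline phase
b <- y[7:18]    # intervention phase

# Within-phase trend score: each earlier/later pair counts +1 (up), -1 (down), 0 (tie)
trend_score <- function(x) {
  d <- sign(outer(x, x, function(earlier, later) later - earlier))
  sum(d[upper.tri(d)])
}

# Between-phase score: every baseline/intervention pair, +1 if intervention is higher
between_score <- function(a, b) sum(sign(outer(b, a, "-")))

# NAP: share of pairs where intervention exceeds baseline (ties count half)
diffs <- outer(b, a, "-")
nap   <- (sum(diffs > 0) + 0.5 * sum(diffs == 0)) / (length(a) * length(b))

# PND: percentage of intervention points above the highest baseline point
pnd <- 100 * mean(b > max(a))

# Tau-U (assumed variant): nonoverlap + intervention trend - baseline trend,
# divided by the number of pairs among all sessions
n     <- length(y)
tau_u <- (between_score(a, b) + trend_score(b) - trend_score(a)) / (n * (n - 1) / 2)

round(c(NAP = nap, PND = pnd, Tau_U = tau_u), 2)
```

For this dataset the three values come out at about 0.99, 91.67, and 0.78, which matches the reported figures once PND is rounded to 92%. The baseline trend score is only 1 here, so the baseline correction barely moves the result. The intervention trend and the choice of denominator account for most of the distance between Tau-U and NAP.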

Part 6 of 6

Integration & review

Pull it together: the reminders that keep you honest, the red flags to watch for, and a low-stakes quiz to check your reasoning.

Part 6 · Integration & review

6.1 Bringing it all together

Six interpretation reminders

1 · Question first

Outcome type, number of groups, and whether measurements are repeated decide the test, not the other way around.

2 · p is not proof

It’s “surprise under the null,” nothing more. A small p ≠ an important effect; a large p ≠ no effect.

3 · Size & precision

Always read the effect size and the confidence interval, not just the p-value.

4 · Statistical ≠ clinical

Ask whether the change matters for communication, participation, safety, or quality of life.

5 · Association ≠ cause

Correlation and regression describe patterns; design earns causal claims.

6 · Picture beats index

Especially single-case: the graph leads, Tau-U supports.

Red-flag phrases to question

“The p-value proves the treatment works.” · “p > .05, so it doesn’t work.” · “A significant result means a clinically important result.” · “The correlation shows X causes Y.” · “Tau-U = 0.8, so we don’t need the graph.” · “The effect is large, so it must matter for every client.”

One-page cheat sheet

Quick reference. All examples on this page use simulated data.
If you want to…	Use	Effect size	Excel
Compare the same people twice	Paired t-test	Cohen’s d_z	`T.TEST(…,2,1)`
Compare two separate groups	Independent t-test	Cohen’s d	`T.TEST(…,2,3)`
Compare 3+ groups	One-way ANOVA	η²	ToolPak › ANOVA
Relate two continuous variables	Correlation	r, r²	`CORREL`
Predict an outcome	Regression	slope, R²	ToolPak › Regression
Relate two categories	Chi-square	Cramér’s V	`CHISQ.TEST`
Track one client over time	Visual analysis (+ Tau-U)	Tau-U / nonoverlap	Plot only
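For readers working in R rather than Excel, three of the effect sizes in the table can be computed by hand in a few lines. This is a sketch on made-up numbers (`g1`, `g2`, `g3`, and the 2 × 2 table `tab` are hypothetical). The formulas used are the common textbook ones: Cohen's d with a pooled standard deviation, η² as between-group sum of squares over total sum of squares, and Cramér's V from the chi-square statistic.

```r
# Hypothetical data, made up for illustration only
g1 <- c(10, 12, 11, 14, 13)
g2 <- c(13, 15, 14, 17, 16)
g3 <- c(16, 18, 17, 20, 19)

# Two separate groups: Welch t-test (R's default) plus Cohen's d with pooled SD
welch     <- t.test(g1, g2)
pooled_sd <- sqrt(((length(g1) - 1) * var(g1) + (length(g2) - 1) * var(g2)) /
                  (length(g1) + length(g2) - 2))
cohens_d  <- (mean(g2) - mean(g1)) / pooled_sd

# Three groups: one-way ANOVA plus eta squared (SS between / SS total)
scores <- c(g1, g2, g3)
group  <- factor(rep(c("g1", "g2", "g3"), each = 5))
ss     <- summary(aov(scores ~ group))[[1]][["Sum Sq"]]
eta_sq <- ss[1] / sum(ss)

# Two categories: chi-square test plus Cramer's V
tab       <- matrix(c(20, 10, 12, 18), nrow = 2)
chi       <- chisq.test(tab, correct = FALSE)
cramers_v <- sqrt(unname(chi$statistic) / (sum(tab) * (min(dim(tab)) - 1)))

round(c(d = cohens_d, eta_sq = eta_sq, V = cramers_v), 2)
```

The groups here were built to differ by exactly 3 points with identical spread, so the effect sizes are unrealistically tidy. Real data will look messier.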

You made it

You don’t need to compute these by hand to be a sharp consumer of research. If you can name the question, read the picture, and weigh size, precision, and clinical meaning, you can interpret most SLP studies with confidence. Revisit any section any time — your progress isn’t stored, so explore freely.
