Principles of Assessment — SLP 680: Research Methods

Part One · Recap & foundation

Reliability & Validity

Two different promises a test makes — and the surprising fact that a test can keep one while completely breaking the other.

1.1 · Why scrutinize a score?

A test is a sample, so error is built in

We can never directly observe “language ability.” A test only samples a behavior to estimate the construct underneath. Because it’s a sample, some measurement error is unavoidable — which is exactly why we judge every test on two questions.

Key idea

A score is an estimate of a true ability, not the ability itself. Test developers run studies to shrink the error — and they report two things so you can judge the test: how consistent the scores are, and whether the intended interpretation and use are supported.

Why it matters clinically

Eligibility, diagnosis, and progress decisions ride on these numbers. Knowing how to read them — and where they mislead — is a core clinical skill, not just a stats exercise.

1.2 · The core distinction

Supported interpretation is not the same as consistency

Validity = supported interpretation

Are we justified in using the score this way?

Does the evidence support interpreting this score as the construct — and using it for the decision we care about?

Reliability = consistency

Does it give the same result?

Repeat it — same child, a different rater, another week — do we land near the same value?

The dartboard picture

Neither
scattered & off-target

Reliable, not valid
tight cluster, wrong spot

Reliable & valid
tight cluster, on target

Consistency clusters the shots; validity asks whether the cluster supports the target interpretation. You can have consistency without a defensible interpretation — that middle target is today’s whole point.

1.3 · Vocabulary you’ll meet in manuals

The flavors of each

Tap a card to flip it. You don’t need to memorize every term today — just recognize that reliability words describe consistency, and validity words describe evidence for the intended interpretation and use of a score.

Reliability — consistency

Validity — interpretation

One-line translation

Reliability is a property of the scores; validity is about the interpretation you give them. A reliability number alone never proves you picked the right tool.

1.4 · The punch line · live demo

A perfectly reliable test of the wrong thing

Goal: measure competency as an English-speaking SLP. Our instrument: a 100-item Mandarin Chinese vocabulary test. Our examinees don’t read Chinese. We’ll check two different kinds of reliability — and both come back excellent.

Try it

中文词汇测验 · 100-item Mandarin Vocabulary Test · the 10 examinees don’t read Chinese

1.语言= ? 2.沟通= ? 3.吞咽= ? 4.嗓音= ? ⋯ 100.治疗= ?

Ten English-speaking SLPs take it twice (a week apart); two raters score every paper. Step through the evidence — you can move back and forward:

1 · Test–retest reliability

SLP	Time 1	Time 2
1	1	1
2	0	0
3	2	2
4	0	1
5	3	3
6	1	1
7	0	0
8	2	2
9	1	1
10	4	3

Each SLP scores about the same a week later: every score is near zero, and the dots hug the diagonal. Test–retest r = .95 (very consistent over time).

2 · Inter-rater reliability

SLP	Rater A	Rater B
1	1	1
2	0	0
3	2	2
4	0	1
5	3	3
6	1	1
7	0	0
8	2	2
9	1	1
10	4	4

Two raters give identical scores on 9 of 10 papers; the one disagreement (SLP 4) is a single point. Inter-rater agreement = 90% (r = .97)
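The coefficients in this demo can be recomputed from the two tables. What follows is a minimal sketch in Python, assuming the reported r values are Pearson correlations (the lesson does not name the coefficient); the function names are illustrative only.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

def percent_agreement(x, y):
    """Percent of papers on which two raters gave the same score."""
    return 100 * sum(a == b for a, b in zip(x, y)) / len(x)

# Scores copied from the two tables in this demo.
time_1  = [1, 0, 2, 0, 3, 1, 0, 2, 1, 4]
time_2  = [1, 0, 2, 1, 3, 1, 0, 2, 1, 3]
rater_a = [1, 0, 2, 0, 3, 1, 0, 2, 1, 4]
rater_b = [1, 0, 2, 1, 3, 1, 0, 2, 1, 4]

test_retest_r = pearson_r(time_1, time_2)
inter_rater_r = pearson_r(rater_a, rater_b)
agreement = percent_agreement(rater_a, rater_b)
print(round(test_retest_r, 2), round(inter_rater_r, 2), agreement)
```

Nothing in this calculation looks at what the items measure, which is why both checks can pass on a test of the wrong construct.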

3 · Pause & predict

Both reliability checks pass with flying colors. So — is this a valid measure of someone’s competence as an English-speaking SLP?

4 · Verdict

Two different kinds of reliability — both excellent — yet the test measures Chinese vocabulary, not English-SLP competence. Its validity for that purpose is essentially zero. Reliable, but not valid.

Key idea

Reliability comes in several forms (test–retest, inter-rater, internal consistency), and a test can ace all of them while still being invalid, that is, consistently measuring the wrong thing. Treat reliability as necessary but never sufficient, and always ask the validity question first: what interpretation and use are supported?

1.5 · Check yourself

Two quick questions

Part Two · New material

Diagnostic Accuracy

When a test decides “impaired” vs. “typical,” how often is it right? Two numbers carry the weight — and we’ll build their formulas by hand.

2.1 · Two questions, one table

Sensitivity catches; specificity clears

Sensitivity

Of the children who truly have the disorder, what fraction does the test catch? (impaired → impaired)

Specificity

Of the children who are truly typical, what fraction does the test correctly clear? (typical → typical)

Both compare the test against a reference standard — the best available basis for deciding who really has the disorder. In SLP, that standard is rarely perfect, so always check how the comparison group was defined. Every result lands in one of four cells:

	Has LI	Typical
Test +	82 · TP · true positive ✔	30 · FP · false alarm ✘
Test −	18 · FN · miss ✘	70 · TN · true negative ✔

Teaching scenario: 100 children truly have language impairment (LI), 100 are typically developing (TD).
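As a quick check on the table, here is a small Python sketch that computes both values from the four cells. The variable names are illustrative; the counts are the ones in the teaching scenario.

```python
# Counts from the 2x2 teaching table (100 children with LI, 100 TD).
tp, fp = 82, 30   # test positive: true positives, false alarms
fn, tn = 18, 70   # test negative: misses, true negatives

sensitivity = tp / (tp + fn)   # stays inside the "Has LI" column
specificity = tn / (tn + fp)   # stays inside the "Typical" column
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
```

Each ratio divides within a single column of the table, so neither value depends on how many children are in the other column.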

2.2 · Your turn · drag & drop

Build the formulas yourself

A new test, with new numbers. Drag each count from the 2×2 table into the right slot — a number can be used more than once (dragging copies it). Click a filled slot to clear it.

Drag the numbers from the table

	Has LI	Typical
Test +
Test −

Teaching scenario: 100 children truly have LI, 100 are typically developing.

Sensitivity = ___ / (___ + ___) = ?

Specificity = ___ / (___ + ___) = ?

The trick to remember

The column you’re working in is the column you divide within. Sensitivity lives entirely in the “Has LI” column (TP, FN); specificity lives in the “Typical” column (TN, FP).

2.3 · Standards & trade-offs · drag the line

Move the cut-off, move both numbers

Line the children up by screening score, low to high. Children with language impairment (LI) tend to score lower; typically developing (TD) children tend to score higher — but they overlap in the middle. The cut-off flags everyone at or below it as “at risk.” Drag it and watch both numbers change.

Figure: 10 LI children and 10 TD children lined up by screening score. Legend: LI child, TD child, flagged “at risk.” Slider: “flag fewer” to “flag more.”

Everyone at or to the left of the dashed line is flagged. Raise the line and you flag more children — of both kinds.

How good is good enough? (Plante & Vance)

● ≥ 90%	good / preferred
● 80–89%	fair / acceptable
● < 80%	unacceptable

The trade-off

Raise the cut-off → catch more true cases (↑ sensitivity) but flag more TD children (↓ specificity). Lower it → correctly clear more TD children (↑ specificity) but miss some true cases (↓ sensitivity). Sensitivity and specificity belong to a particular cut-off in a particular validation sample, and these values are often reported in research articles rather than in the test manual.
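The trade-off can be sketched in a few lines of Python. The screening scores below are hypothetical, invented for illustration rather than taken from the lesson's figure; the rating bands follow the Plante & Vance benchmarks listed above.

```python
# Hypothetical screening scores (NOT the values behind the lesson's figure).
li_scores = [3, 4, 5, 5, 6, 7, 8, 9, 10, 12]          # language impairment
td_scores = [7, 9, 10, 11, 12, 13, 14, 15, 16, 18]    # typically developing

def accuracy_at(cutoff):
    """Flag every child scoring at or below the cut-off as 'at risk'."""
    sensitivity = sum(s <= cutoff for s in li_scores) / len(li_scores)
    specificity = sum(s > cutoff for s in td_scores) / len(td_scores)
    return sensitivity, specificity

def rating(value):
    """Benchmarks as listed in the lesson (Plante & Vance)."""
    if value >= 0.90:
        return "good"
    if value >= 0.80:
        return "fair"
    return "unacceptable"

for cutoff in (6, 8, 10, 12):
    sens, spec = accuracy_at(cutoff)
    print(f"cut-off {cutoff:>2}: sensitivity {sens:.0%} ({rating(sens)}), "
          f"specificity {spec:.0%} ({rating(spec)})")
```

With these made-up scores, no single cut-off reaches 90% on both measures, which is the usual situation when the two groups overlap.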

2.4 · Same table, different field

In AI papers, the words change

AI & NLP work on communication disorders — at venues like ACL, EMNLP, Interspeech — describes the same 2×2 with different names. If you read that literature, you’ll meet these:

=

Recall = TP / (TP + FN) — identical to sensitivity. “Of the real cases, how many did we find?”

≠

Precision = TP / (TP + FP) — a new question. “Of the cases we flagged, how many were truly positive?”

~

F1 = harmonic mean of precision & recall — one balanced number.

The intuition

A model can have high recall but low precision by flagging almost everyone — exactly like a screener that is sensitive but not specific.
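To connect the two vocabularies, this Python sketch reuses the counts from the 2×2 teaching table in section 2.1 and computes the three AI-style metrics. Variable names are illustrative.

```python
# Same four cells as the teaching table in section 2.1.
tp, fp, fn, tn = 82, 30, 18, 70

recall = tp / (tp + fn)        # identical to sensitivity
precision = tp / (tp + fp)     # of the flagged children, how many truly have LI
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(f"recall = {recall:.2f}, precision = {precision:.2f}, F1 = {f1:.2f}")
```

Precision depends on how many typical children were tested, because false positives sit in its denominator; recall does not. The true-negative count never enters any of the three formulas.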

2.5 · Check yourself

Can you compute on the fly?

Part Three · The foundation

Basic Statistics

Before we can read a test score, we need a few statistics: where a group of scores centers, how spread out they are, where one score sits, and the shape they make — which leads straight to the normal curve.

3.1 · Statistics warm-up

Measures of center: one typical score

You previewed these — here’s the quick recap. A set of scores is hard to hold in your head; a measure of central tendency gives you one number that best represents the group. Tap each card to flip it.

One outlier moves the mean

Add a single very low score and the mean drops, while the median barely moves. For skewed data or outliers, the median is the more honest “typical” value.
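A small Python sketch makes the point concrete. The scores are hypothetical, chosen only so the arithmetic is easy to follow.

```python
from statistics import mean, median

# Hypothetical scores for illustration.
scores = [48, 49, 50, 51, 52]
with_outlier = scores + [10]   # add one very low score

print(mean(scores), median(scores))                        # 50 and 50
print(round(mean(with_outlier), 1), median(with_outlier))  # 43.3 and 49.5
```

One extra score pulls the mean down by almost 7 points, while the median shifts by only half a point.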

3.2 · Statistics warm-up

Spread: the standard deviation

The center isn’t the whole story. Two groups can share the same mean yet look completely different — one bunched tight, one widely scattered. The standard deviation (SD) is the single number that captures that spread: the typical distance of a score from the mean.

Same mean (50), different spread

Both groups average 50 — but group A hugs the mean while group B is far more spread. Let’s compute each SD by hand and see the difference.

The recipe

① subtract the mean from each score → ② square each result → ③ average those squares (that’s the variance) → ④ take the square root. For this teaching example, we treat the five scores as the whole group, so we divide by N = 5. Run it on both groups:

Group A · clustered · mean = 50

Score	Deviation (score − 50)	Squared
47	−3	9
49	−1	1
50	0	0
51	+1	1
53	+3	9

sum of squares = 20 → variance = 20 ÷ 5 = 4 → SD = √4 = 2

Group B · spread · mean = 50

Score	Deviation (score − 50)	Squared
44	−6	36
48	−2	4
50	0	0
52	+2	4
56	+6	36

sum of squares = 80 → variance = 80 ÷ 5 = 16 → SD = √16 = 4

Population vs. sample SD

Because we are treating these five scores as the complete group, the example divides by N. In research software, you will usually see the sample SD, which divides by n − 1. The concept is the same, but the exact number changes.
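The difference is easy to see with Python's standard library, where `pstdev` divides by N and `stdev` divides by n − 1. This sketch runs both on the two groups above.

```python
from statistics import pstdev, stdev

group_a = [47, 49, 50, 51, 53]
group_b = [44, 48, 50, 52, 56]

for name, group in (("A", group_a), ("B", group_b)):
    # pstdev: population SD (divide by N); stdev: sample SD (divide by n - 1)
    print(f"Group {name}: population SD = {pstdev(group):.2f}, "
          f"sample SD = {stdev(group):.2f}")
```

The population values match the hand calculation (2 and 4); the sample values come out slightly larger (about 2.24 and 4.47) because the same sum of squares is divided by 4 instead of 5.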

What the SD tells you

Same mean (50), but SD = 2 vs SD = 4 — group B’s scores sit, on average, twice as far from the mean. That single number is the whole point of the SD. In practice software computes it for you — but now you know exactly what it’s doing, and why a bigger SD means more spread.

Why it matters next

Standardized scores are built directly from the SD. “One SD below the mean” becomes a z-score of −1 — the standardized-score sections rest on this one number.

3.3 · Statistics warm-up

Position: percentile rank & quartiles

Besides spread, we often want a score’s position — where it sits among the rest. Good news: this builds straight on the median you already know.

Start from the median

The median splits the data in half — 50% of scores fall below it, 50% above. That is exactly the 50th percentile. Percentile rank and quartiles simply extend that one idea to any cut-point.

Quartiles split the ordered scores into four equal groups

Q1, the median (Q2), and Q3 cut the data at the 25th, 50th, and 75th percentiles — three “medians” in disguise.

Percentile rank

The percent of peers who scored below a child. The median is the 50th percentile; the 75th percentile means the child scored higher than 75% of peers.

Quartiles

Three cut-points that split the data into four equal parts: Q1 (25th), Q2 = the median (50th), Q3 (75th).

Optional · how to compute percentile rank & quartiles

Percentile rank = (number of peers who scored below ÷ total) × 100.

Example — in a class of 20, 15 children scored below Maria → 15 ÷ 20 × 100 = 75th percentile.

Quartiles — order the scores, then take three medians:

Q2 = the median (the middle of all the scores)
Q1 = the median of the lower half
Q3 = the median of the upper half

Example — 2, 4, 5, 7 │ 8, 10, 12, 15 → Q2 = (7 + 8) / 2 = 7.5; lower half {2, 4, 5, 7} → Q1 = (4 + 5) / 2 = 4.5; upper half {8, 10, 12, 15} → Q3 = (10 + 12) / 2 = 11.
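Both recipes above translate directly into Python. This is a sketch of the "three medians" method exactly as described here; for an odd number of scores it leaves the middle score out of both halves, which is an assumption, and statistical software often uses other conventions that give slightly different quartiles.

```python
from statistics import median

def percentile_rank(n_below, total):
    """Percent of peers who scored below the child."""
    return 100 * n_below / total

def quartiles(scores):
    """Q1, Q2, Q3 by the median-of-halves method (needs at least 2 scores)."""
    ordered = sorted(scores)
    half = len(ordered) // 2
    lower = ordered[:half]    # lower half
    upper = ordered[-half:]   # upper half (middle score excluded when n is odd)
    return median(lower), median(ordered), median(upper)

print(percentile_rank(15, 20))                   # Maria: 75.0
print(quartiles([2, 4, 5, 7, 8, 10, 12, 15]))    # (4.5, 7.5, 11.0)
```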

3.4 · Statistics warm-up

Pictures of data: five workhorse graphs

You previewed these five. Tap each card to flip from the graph’s job to a quick example of what it looks like.

Where we’re headed

One of these — the histogram — is the bridge to everything that follows. It shows the shape of a single variable, and that shape is the key to standardized scores.

3.5 · Statistics warm-up

From a histogram to the normal curve

A histogram groups one variable into bins and shows how many scores land in each. Measure a few children and it looks ragged; measure more and more, and the histogram becomes a more stable picture of the underlying distribution. If the underlying scores are approximately normal, the bars begin to resemble a smooth, symmetric bell.

Figure: histogram of a language score as the sample grows, from few children to many children. Drag to measure more children; overlay the curve to see the bell-shaped pattern when the underlying scores are approximately normal.

68 – 95 – 99.7

In a normal distribution, about 68% of scores fall within 1 SD of the mean, 95% within 2 SD, and 99.7% within 3 SD. That regularity makes normal-curve interpretations convenient. Z-scores can be computed for many distributions, but mapping them to normal-curve percentiles assumes scores are approximately normal.
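The three percentages can be recovered from the standard normal curve itself. A short Python sketch using the standard library's `NormalDist`; the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, SD 1

def share_within(k):
    """Proportion of a normal distribution within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
```

The printed values are 68.3%, 95.4%, and 99.7%, which the rule rounds to 68, 95, and 99.7.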

Part Four · The everyday skill

Standardized Scores

A raw score of “22 correct” means nothing on its own. Now we build the shared ruler — the normal curve, z-scores, standard scores, and percentiles — to say where a child stands, how to read the number, and how it gets misused.

4.1 · The shared ruler

Standardized, norm-referenced tests

Many major clinical language tests are both standardized and norm-referenced — and together those two ideas are what turn a raw number into a meaningful score when the child is represented by the test’s norming and validation evidence.

Standardized

Given and scored by fixed, uniform rules, so scores are less dependent on the individual clinician or testing session.

Norm-referenced

The score is interpreted against a normative sample of peers (matched on age — sometimes grade or region). It answers: where does this child stand relative to peers? When a child differs substantially from the norm sample in language, dialect, culture, or other key characteristics, treat the standardized score as one piece of evidence, not the whole answer.

Because the score is read against peers, a raw “22 correct” only becomes meaningful once we convert it. Z-scores and standard scores are built from the norm group’s mean and SD. Percentile ranks come from the child’s position in the ordered norm distribution, often reported directly in the manual. The next sections build each one.

4.2 · How far from the mean?

The z-score

A z-score answers one question: how many standard deviations is this score from the mean? On the z-scale the mean is 0 and one SD is 1.

The formula

z = (score − mean) / SD

Walk through one

A child scores 85 on a test with mean 100 and SD 15. Predict: is that above or below average, and by how much?

Step 1 — subtract the mean. 85 − 100 = −15. The score is 15 points below the mean.

Step 2 — divide by the SD. −15 ÷ 15 = −1.0. So the child is exactly one SD below the mean.

Read it

z = −1.0 means “one standard deviation below average.” On the normal curve that’s about the 16th percentile (50% − 34%). Depending on the test manual, that may be described as low-average, borderline, or mildly below average. It is not, by itself, a diagnosis.
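The walk-through can be reproduced in Python. This sketch assumes an ideal normal curve for the percentile step; a real manual's percentile table may differ. The function name is illustrative.

```python
from statistics import NormalDist

def z_score(score, mean, sd):
    """How many SDs the score sits above (+) or below (-) the mean."""
    return (score - mean) / sd

z = z_score(85, mean=100, sd=15)
percentile = 100 * NormalDist().cdf(z)   # ideal normal curve only
print(f"z = {z:.1f}, about the {percentile:.0f}th percentile")
```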

4.3 · Friendlier numbers

The standard-score scale

z-scores have decimals and negatives, which are awkward to report. Many language-test composite or index standard scores use a familiar scale with mean 100, SD 15, the same metric used by many IQ composites. Some subtests use other scales, such as scaled scores with mean 10 and SD 3, so always check the manual.

Convert between them

z = (SS − 100) / 15 • SS = 100 + z × 15

Check the scale

The mean 100, SD 15 metric is common for composite/index scores on major language tests (e.g., CELF-5, PLS-5). A standard score of 115 means “+1 SD” only when that test uses the 100/15 scale.
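Both conversion formulas generalize to any scale once the mean and SD are known. A minimal Python sketch follows; the function names are illustrative, and the default 100/15 scale is an assumption that should be checked against the manual for any given test.

```python
def scale_to_z(score, mean=100, sd=15):
    """Convert a standard or scaled score to a z-score."""
    return (score - mean) / sd

def z_to_scale(z, mean=100, sd=15):
    """Convert a z-score to a score on the given scale."""
    return mean + z * sd

print(scale_to_z(115))                 # 1.0 on the 100/15 scale
print(z_to_scale(-1.0))                # 85.0
print(z_to_scale(-1.0, mean=10, sd=3)) # 7.0 on a 10/3 scaled-score metric
```

The same z of −1 is an 85 on one scale and a 7 on the other, which is why the scale has to be named before a number can be interpreted.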

4.4 · The same spot, three names

One curve, three scales

When the norm distribution is approximately normal, z-score, standard score, and percentile can name the same place on the normal curve. Reveal each scale to see the idealized alignment.

Reveal the scales to see how they align on an ideal normal curve

On the ideal normal curve, a z of +1, a standard score of 115, and the 84th percentile all point to the same place. Manuals may use empirical percentile tables, so trust the manual when it differs from the idealized curve.

Watch the spacing

Percentiles are not evenly spaced in score points. Near the middle, a few points move you many percentiles; out in the tails, big score gaps barely change the percentile. That’s the bell curve at work.
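To see the uneven spacing in numbers, this Python sketch compares the same 5-point gain near the middle and out in the tail. It assumes an ideal normal curve on the 100/15 scale, not any particular manual's table.

```python
from statistics import NormalDist

ideal_curve = NormalDist(mu=100, sigma=15)   # idealized 100/15 scale

def normal_percentile(standard_score):
    """Percentile under the ideal normal curve (not a manual's table)."""
    return 100 * ideal_curve.cdf(standard_score)

middle_gain = normal_percentile(105) - normal_percentile(100)
tail_gain = normal_percentile(135) - normal_percentile(130)
print(f"100 to 105: +{middle_gain:.1f} percentile points")
print(f"130 to 135: +{tail_gain:.1f} percentile points")
```

The same 5 score points are worth roughly 13 percentile points near the mean and just over 1 in the upper tail.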

4.5 · Practice · drag & drop

Convert it yourself

Three short builds. Every number is a chip — drag the right one into each slot. Some chips are distractors you won’t need, and the chips don’t say what they are: press Check to reveal each one’s role.

Exercise C · raw → standard score → z

Johnny’s raw score is 22 on the SPELT-P2. First read his standard score from the norm table, then drag it into the z formula.

Raw score	Standard score
20	89
21	92
22	94
23	96

z = (___ − ___) / ___ = ?

4.6 · The honest part

Every score is an estimate

Measurement error is unavoidable — think how your own typing speed varies test to test. So a score is really a band, not a point. A confidence interval is built from the observed score and the standard error of measurement (SEM): observed score ± a confidence-level multiplier × SEM. Good manuals usually print the completed band for you.

SPELT-P2 manual — raw → standard score, 90% confidence interval & percentile (ages 4-0 to 4-5)

Raw	Std score	90% CI	%ile
25	101	91–110	52
24	98	88–108	46
23	96	86–106	41
22	94	84–104	38
21	92	82–102	30
20	89	79–99	25
19	87	77–97	21

Johnny’s raw 22 → standard score 94, with a 90% confidence interval of 84–104. The printed interval already includes the confidence-level multiplier; his score estimate is a band, not exactly 94.
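The band formula can be sketched in Python. The SEM of 6.0 below is a hypothetical value chosen so the arithmetic lands near the printed band; it is not taken from the SPELT-P2 manual, and manuals may build their intervals differently.

```python
from statistics import NormalDist

def confidence_band(observed, sem, level=0.90):
    """observed score ± (confidence-level multiplier × SEM)."""
    multiplier = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.645 for 90%
    half_width = multiplier * sem
    return observed - half_width, observed + half_width

# Hypothetical SEM, for illustration only.
low, high = confidence_band(observed=94, sem=6.0, level=0.90)
print(f"90% band: {low:.0f} to {high:.0f}")
```

Raising the confidence level widens the band, because the multiplier grows while the SEM stays the same.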

Why the band matters

Now suppose Johnny is tested before and after treatment on the same test. Did he really improve? (Both readings come straight from the manual.)

Time 1

SS 78

90% CI: 68–88

Time 2

SS 87

90% CI: 77–97

The bands overlap

68–88 and 77–97 share a lot of ground (77–88), so the 9-point rise could be real improvement or just measurement noise. These two scores alone can’t tell us which. To actually demonstrate progress, don’t rely on a re-given standardized test alone; supplement it with criterion-referenced probes on the child’s goals, language-sample measures (e.g., MLU, % of targets correct in conversation), or a single-case design with repeated measurement over time.
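The overlap check is simple enough to write down in Python. Treat it as the rough screening heuristic used in this lesson, not a formal test of change; the function name is illustrative.

```python
def band_overlap(band_a, band_b):
    """Return the shared range of two (low, high) bands, or None if disjoint."""
    low = max(band_a[0], band_b[0])
    high = min(band_a[1], band_b[1])
    return (low, high) if low <= high else None

time_1 = (68, 88)   # 90% CI around SS 78
time_2 = (77, 97)   # 90% CI around SS 87
print(band_overlap(time_1, time_2))   # (77, 88)
```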

4.7 · Put it together

Read a full profile: Johnny, age 4;1

SPELT-P2 score report

raw = 22 · SS = 94 · %ile = 38

z = (94 − 100) / 15 = −0.40 — just under half an SD below the mean.

Percentile 38 = higher than ~38% of same-age peers → within the average range for this manual, and not diagnostic by itself.

Honesty note

A pure normal model gives about the 34th percentile for z = −0.40, while the manual lists 38th. Real test norms aren’t perfectly normal — when they differ, trust the manual’s table over the idealized math.
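The comparison in this note can be checked with a few lines of Python. The normal-curve percentile is the idealized value; the 38 is the manual's listed percentile from the table above.

```python
from statistics import NormalDist

standard_score, manual_percentile = 94, 38

z = (standard_score - 100) / 15
model_percentile = 100 * NormalDist().cdf(z)   # ideal normal curve
print(f"z = {z:.2f}; normal model: {model_percentile:.0f}th; "
      f"manual: {manual_percentile}th")
```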

4.8 · Interpret & communicate

Write it in a clinical report

Numbers are only useful if you can explain them to families and other professionals. Build a clear, defensible sentence by dragging Johnny’s values into the report. Then compare good vs. risky phrasing.

Build the sentence · drag

On the SPELT-P2, Johnny earned a standard score of ___ (___th percentile), placing his expressive language within the average range for his age on this manual. The 90% confidence interval (___–___) indicates his score estimate is best interpreted as a band rather than a single exact point.

✓ Say this

“Johnny’s score is within the average range for his age. Scores are estimates, so we report the confidence interval rather than treating one number as exact.”

4.9 · The intuitive score

Age equivalents (AE)

An age-equivalent (AE) answers a different — and easily misunderstood — question. It compares a child’s raw score to the average performance of different age groups, telling you which age group’s typical raw score matches the child’s, rather than comparing the child to peers of their own exact age.

SPELT-P2 manual · Age Equivalency Ranges

Raw score	Age-equivalent range
0–18	below 3-0
19–21	3-0 through 3-5
22–24	3-6 through 3-11
25–27	4-0 through 4-5

Johnny again — read plainly

Johnny (4;1), raw 22 → AE = 3;6–3;11. His raw score matches the average raw-score range for children around 3;6–3;11. This does not mean Johnny “functions like” a younger child.

That same score is a standard score of 94 (38th percentile) — within the average range for his own age on this manual. The two just answer different questions: “which age band had a similar average raw score?” vs. “how does he compare with his own-age peers?”
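Reading an age equivalent is a plain table lookup, which this Python sketch mirrors. It covers only the four rows excerpted above, so raw scores outside 0–27 return `None` rather than a guess; the names are illustrative.

```python
# Rows copied from the age-equivalency excerpt above: (low, high, AE range).
AE_BANDS = [
    (0, 18, "below 3-0"),
    (19, 21, "3-0 through 3-5"),
    (22, 24, "3-6 through 3-11"),
    (25, 27, "4-0 through 4-5"),
]

def age_equivalent(raw_score):
    """Look up the AE range for a raw score; None if outside the excerpt."""
    for low, high, label in AE_BANDS:
        if low <= raw_score <= high:
            return label
    return None

print(age_equivalent(22))   # 3-6 through 3-11
```

The lookup uses only the raw score and never the child's own age, which is why it cannot answer the own-age-peers question.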

When AE helps

AE can feel concrete, but it is easy to misinterpret. Use it only as a carefully framed supplement, alongside the standard score and percentile — never as the lead score.

Just keep in mind

AE intervals aren’t equal, and because only a few raw-score points separate one age band from the next, a small score gap can look like a bigger age gap. Do not use AE for diagnosis, eligibility, or placement decisions; pair it with the standard score and percentile rather than leaning on it alone. More on that in the misuse cases below.

4.10 · One more distinction

Norm-referenced vs. criterion-referenced

Everything so far has been norm-referenced — comparing a child to a sample of peers. There’s a second, equally useful approach that compares the child to a fixed standard of mastery instead.

	Norm-referenced	Criterion-referenced
Compared to…	a normative sample of peers	a fixed criterion / mastery level
Question it answers	“Where does this child stand relative to peers?”	“Has this child mastered this skill?”
Typical scores	standard score, percentile, z-score	% correct, pass / fail, mastery level
Often useful for…	describing performance relative to peers; one piece of eligibility or diagnostic evidence	identifying specific skills, planning treatment, and monitoring progress

Don’t confuse the rulers

90% correct (criterion-referenced — how much of a skill the child has) is not the 90th percentile (norm-referenced — better than 90% of peers). Same digits, completely different meaning.

4.11 · Learn from real mistakes

How standardized scores get misused

Four classic errors. Step through them one case at a time — each with what happened, what went wrong, and what to do instead.

Wrap-up · the whole lesson

Three ideas to keep

Reliability & validity

Consistent ≠ correct

A test can be perfectly reliable and still measure the wrong thing. Always ask the validity question: what interpretation and use are supported?

Diagnostic accuracy

Catch vs. clear

Sensitivity catches true cases; specificity clears true negatives; the cut-off trades them. In AI/NLP, sensitivity is recall; precision asks how many flagged cases were truly positive.

Standardized scores

A score is a band

SS and z use the mean and SD; percentiles show position in the norm distribution. Report a confidence interval; never let one number diagnose.

The whole lesson in a sentence

Trust the number — but only after you know what kind of number it is.