Principles of Assessment
Can you trust the number? A test score is only useful once you know what kind of number it is. This lesson builds the reasoning — and the arithmetic — behind reliability & validity, sensitivity & specificity, and standardized scores, with hands-on practice at every step.
Reliability & Validity
Two different promises a test makes — and the surprising fact that a test can keep one while completely breaking the other.
A test is a sample, so error is built in
We can never directly observe “language ability.” A test only samples a behavior to estimate the construct underneath. Because it’s a sample, some measurement error is unavoidable — which is exactly why we judge every test on two questions.
Key idea
A score is an estimate of a true ability, not the ability itself. Test developers run studies to shrink the error — and they report two things so you can judge the test: how consistent the scores are, and whether the intended interpretation and use are supported.
Why it matters clinically
Eligibility, diagnosis, and progress decisions ride on these numbers. Knowing how to read them — and where they mislead — is a core clinical skill, not just a stats exercise.
Supported interpretation is not the same as consistency
Validity: are we justified in using the score this way? Does the evidence support interpreting this score as the construct, and using it for the decision we care about?
Reliability: does it give the same result? Repeat it (same child, a different rater, another week): do we land near the same value?
Neither: scattered & off-target
Reliable, not valid: tight cluster, wrong spot
Reliable & valid: tight cluster, on target
The flavors of each
Tap a card to flip it. You don’t need to memorize every term today — just recognize that reliability words describe consistency, and validity words describe evidence for the intended interpretation and use of a score.
Reliability — consistency
Validity — interpretation
One-line translation
Reliability is a property of the scores; validity is about the interpretation you give them. A reliability number alone never proves you picked the right tool.
A perfectly reliable test of the wrong thing
Goal: measure competency as an English-speaking SLP. Our instrument: a 100-item Mandarin Chinese vocabulary test. Our examinees don’t read Chinese. We’ll check two different kinds of reliability — and both come back excellent.
Key idea
Reliability comes in several forms (test–retest, inter-rater, internal consistency), and a test can ace all of them while still being invalid, consistently measuring the wrong thing. Treat reliability as necessary but never sufficient, and always ask the validity question first: what interpretation and use are supported?
Two quick questions
Diagnostic Accuracy
When a test decides “impaired” vs. “typical,” how often is it right? Two numbers carry the weight — and we’ll build their formulas by hand.
Sensitivity catches; specificity clears
Sensitivity
Of the children who truly have the disorder, what fraction does the test catch? (impaired → impaired)
Specificity
Of the children who are truly typical, what fraction does the test correctly clear? (typical → typical)
Both compare the test against a reference standard — the best available basis for deciding who really has the disorder. In SLP, that standard is rarely perfect, so always check how the comparison group was defined. Every result lands in one of four cells:
| | Has LI | Typical |
|---|---|---|
| Test + | 82 TP · true positive ✔ | 30 FP · false alarm ✘ |
| Test − | 18 FN · miss ✘ | 70 TN · true negative ✔ |
Teaching scenario: 100 children truly have language impairment (LI), 100 are typically developing (TD).
Build the formulas yourself
A new test, with new numbers. Drag each count from the 2×2 table into the right slot — a number can be used more than once (dragging copies it). Click a filled slot to clear it.
The trick to remember
The column you’re working in is the column you divide within. Sensitivity lives entirely in the “Has LI” column (TP, FN); specificity lives in the “Typical” column (TN, FP).
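As a minimal sketch, the two formulas can be written as small functions and run on the counts from the teaching scenario above. The function names are illustrative, not taken from any test manual or library.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of truly impaired children the test catches: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of truly typical children the test clears: TN / (TN + FP)."""
    return tn / (tn + fp)


# Counts from the teaching scenario (100 children with LI, 100 TD).
tp, fn = 82, 18   # both live in the "Has LI" column
fp, tn = 30, 70   # both live in the "Typical" column

print(f"sensitivity = {sensitivity(tp, fn):.2f}")  # 0.82
print(f"specificity = {specificity(tn, fp):.2f}")  # 0.70
```

Each function only ever touches one column of the table, which is the same trick described above.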
Move the cut-off, move both numbers
Line the children up by screening score, low to high. Children with language impairment (LI) tend to score lower; typically developing (TD) children tend to score higher — but they overlap in the middle. The cut-off flags everyone at or below it as “at risk.” Drag it and watch both numbers change.
| Sensitivity or specificity | Rating |
|---|---|
| ≥ 90% | good / preferred |
| 80–89% | fair / acceptable |
| < 80% | unacceptable |
The trade-off
Raise the cut-off → catch more true cases (↑ sensitivity) but flag more TD children (↓ specificity). Lower it → correctly clear more TD children (↑ specificity) but miss some true cases (↓ sensitivity). Sensitivity and specificity belong to a particular cut-off in a particular validation sample, and they are often reported in research studies rather than in the test manual.
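The trade-off can be sketched with a handful of made-up screening scores. Every score below is invented for illustration, and `rates_at` is a hypothetical helper, so treat this as a toy model, not data from a real screener.

```python
# Hypothetical screening scores, invented for illustration only.
li_scores = [60, 64, 68, 71, 74, 77, 80, 83, 86, 89]      # children with LI
td_scores = [76, 79, 82, 85, 88, 91, 94, 97, 100, 103]    # typically developing


def rates_at(cutoff):
    """Flag everyone scoring at or below the cut-off as 'at risk'."""
    tp = sum(s <= cutoff for s in li_scores)   # LI children correctly flagged
    tn = sum(s > cutoff for s in td_scores)    # TD children correctly cleared
    return tp / len(li_scores), tn / len(td_scores)


for cutoff in (70, 78, 86):
    sens, spec = rates_at(cutoff)
    print(f"cut-off {cutoff}: sensitivity {sens:.2f}, specificity {spec:.2f}")
# cut-off 70: sensitivity 0.30, specificity 1.00
# cut-off 78: sensitivity 0.60, specificity 0.90
# cut-off 86: sensitivity 0.90, specificity 0.60
```

Because the two groups overlap in the middle, no cut-off in this toy data reaches 1.00 on both numbers at once.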
In AI papers, the words change
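AI & NLP work on communication disorders (at venues like ACL, EMNLP, and Interspeech) describes the same 2×2 with different names. If you read that literature, you’ll meet these: recall, which is the same quantity as sensitivity, and precision, which asks how many flagged cases were truly positive.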
The intuition
A model can have high recall but low precision by flagging almost everyone — exactly like a screener that is sensitive but not specific.
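A short sketch makes the intuition concrete. It reuses the teaching-scenario counts (82 TP, 18 FN, 30 FP) and then imagines a screener that flags all 200 children. The function names are illustrative.

```python
def recall(tp: int, fn: int) -> float:
    """Same quantity as sensitivity: TP / (TP + FN)."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Of the cases the test flagged, the fraction that are truly positive."""
    return tp / (tp + fp)


# Teaching-scenario counts from the 2x2 table.
print(round(recall(82, 18), 2), round(precision(82, 30), 2))   # 0.82 0.73

# A screener that flags everyone: 100 LI caught, 100 TD falsely flagged.
print(recall(100, 0), precision(100, 100))                     # 1.0 0.5
```

Note that precision, unlike sensitivity and specificity, mixes both columns of the table, so it shifts when the proportion of true cases in the sample changes.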
Can you compute on the fly?
Basic Statistics
Before we can read a test score, we need a few statistics: where a group of scores centers, how spread out they are, where one score sits, and the shape they make — which leads straight to the normal curve.
Measures of center: one typical score
You previewed these — here’s the quick recap. A set of scores is hard to hold in your head; a measure of central tendency gives you one number that best represents the group. Tap each card to flip it.
One outlier moves the mean
Add a single very low score and the mean drops, while the median barely moves. For skewed data or outliers, the median is the more honest “typical” value.
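Here is a quick check of that claim using Python's standard `statistics` module. The scores are hypothetical, chosen only to make the effect easy to see.

```python
from statistics import mean, median

scores = [48, 49, 50, 51, 52]      # hypothetical scores
with_outlier = scores + [14]       # add a single very low score

print(f"mean {mean(scores):.1f}, median {median(scores):.1f}")
# mean 50.0, median 50.0
print(f"mean {mean(with_outlier):.1f}, median {median(with_outlier):.1f}")
# mean 44.0, median 49.5
```

One low score pulls the mean down by six points while the median shifts by half a point.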
Spread: the standard deviation
The center isn’t the whole story. Two groups can share the same mean yet look completely different — one bunched tight, one widely scattered. The standard deviation (SD) is the single number that captures that spread: the typical distance of a score from the mean.
The recipe
① subtract the mean from each score → ② square each result → ③ average those squares (that’s the variance) → ④ take the square root. For this teaching example, we treat the five scores as the whole group, so we divide by N = 5. Run it on both groups:
Group A: the squares sum to 20, so variance = 20 / 5 = 4 and SD = 2.
| Score | Deviation (score − 50) | Squared |
|---|---|---|
| 47 | −3 | 9 |
| 49 | −1 | 1 |
| 50 | 0 | 0 |
| 51 | +1 | 1 |
| 53 | +3 | 9 |
Group B: the squares sum to 80, so variance = 80 / 5 = 16 and SD = 4.
| Score | Deviation (score − 50) | Squared |
|---|---|---|
| 44 | −6 | 36 |
| 48 | −2 | 4 |
| 50 | 0 | 0 |
| 52 | +2 | 4 |
| 56 | +6 | 36 |
Population vs. sample SD
Because we are treating these five scores as the complete group, the example divides by N. In research software, you will usually see the sample SD, which divides by n − 1. The concept is the same, but the exact number changes.
What the SD tells you
Same mean (50), but SD = 2 vs SD = 4 — group B’s scores sit, on average, twice as far from the mean. That single number is the whole point of the SD. In practice software computes it for you — but now you know exactly what it’s doing, and why a bigger SD means more spread.
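The four-step recipe can be sketched in a few lines and checked against the standard library. `population_sd` is an illustrative name. `pstdev` and `stdev` are the real `statistics` functions for the population (divide by N) and sample (divide by n − 1) versions.

```python
from math import sqrt
from statistics import pstdev, stdev


def population_sd(scores):
    """Follow the recipe: deviations, squares, average, square root."""
    m = sum(scores) / len(scores)
    squared = [(s - m) ** 2 for s in scores]   # steps 1 and 2
    variance = sum(squared) / len(scores)      # step 3: divide by N
    return sqrt(variance)                      # step 4


group_a = [47, 49, 50, 51, 53]
group_b = [44, 48, 50, 52, 56]

print(population_sd(group_a), population_sd(group_b))        # 2.0 4.0
print(pstdev(group_a), pstdev(group_b))                      # 2.0 4.0
print(round(stdev(group_a), 2), round(stdev(group_b), 2))    # 2.24 4.47
```

The last line shows the sample SD: same idea, slightly larger numbers, because the sum of squares is divided by 4 instead of 5.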
Why it matters next
Standardized scores are built directly from the SD. “One SD below the mean” becomes a z-score of −1 — the standardized-score sections rest on this one number.
Position: percentile rank & quartiles
Besides spread, we often want a score’s position — where it sits among the rest. Good news: this builds straight on the median you already know.
Start from the median
The median splits the data in half — 50% of scores fall below it, 50% above. That is exactly the 50th percentile. Percentile rank and quartiles simply extend that one idea to any cut-point.
Percentile rank
The percent of peers who scored below a child. The median is the 50th percentile; the 75th percentile means the child scored higher than 75% of peers.
Quartiles
Three cut-points that split the data into four equal parts: Q1 (25th), Q2 = the median (50th), Q3 (75th).
Optional · how to compute percentile rank & quartiles
Percentile rank = (number of peers who scored below ÷ total) × 100.
Example — in a class of 20, 15 children scored below Maria → 15 ÷ 20 × 100 = 75th percentile.
Quartiles — order the scores, then take three medians:
- Q2 = the median (the middle of all the scores)
- Q1 = the median of the lower half
- Q3 = the median of the upper half
Example — 2, 4, 5, 7 │ 8, 10, 12, 15 → Q2 = (7 + 8) / 2 = 7.5; lower half {2, 4, 5, 7} → Q1 = (4 + 5) / 2 = 4.5; upper half {8, 10, 12, 15} → Q3 = (10 + 12) / 2 = 11.
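Both calculations are sketched below using the same numbers as the examples above. The helper names are illustrative. For an odd number of scores this sketch leaves the middle score out of both halves, which is one common convention and an assumption here, since software packages differ on that detail.

```python
from statistics import median


def percentile_rank(n_below: int, total: int) -> float:
    """Percent of peers who scored below the child."""
    return n_below / total * 100


def quartiles(scores):
    """Median-of-halves method; assumes at least two scores."""
    ordered = sorted(scores)
    half = len(ordered) // 2
    lower = ordered[:half]     # lower half
    upper = ordered[-half:]    # upper half (middle score excluded if odd count)
    return median(lower), median(ordered), median(upper)


print(percentile_rank(15, 20))                    # 75.0
print(quartiles([2, 4, 5, 7, 8, 10, 12, 15]))     # (4.5, 7.5, 11.0)
```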
Pictures of data: five workhorse graphs
You previewed these five. Tap each card to flip from the graph’s job to a quick example of what it looks like.
Where we’re headed
One of these — the histogram — is the bridge to everything that follows. It shows the shape of a single variable, and that shape is the key to standardized scores.
From a histogram to the normal curve
A histogram groups one variable into bins and shows how many scores land in each. Measure a few children and it looks ragged; measure more and more, and the histogram becomes a more stable picture of the underlying distribution. If the underlying scores are approximately normal, the bars begin to resemble a smooth, symmetric bell.
68 – 95 – 99.7
In a normal distribution, about 68% of scores fall within 1 SD of the mean, 95% within 2 SD, and 99.7% within 3 SD. That regularity makes normal-curve interpretations convenient. Z-scores can be computed for many distributions, but mapping them to normal-curve percentiles assumes scores are approximately normal.
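The 68 – 95 – 99.7 figures can be checked with `NormalDist` from Python's standard `statistics` module. This is a check under the idealized normal model only, and real norm samples will deviate from it.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

for k in (1, 2, 3):
    # Area under the curve between -k SD and +k SD.
    share = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} SD: {share:.1%}")
# roughly 68.3%, 95.4%, and 99.7%
```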
Standardized Scores
A raw score of “22 correct” means nothing on its own. Now we build the shared ruler — the normal curve, z-scores, standard scores, and percentiles — to say where a child stands, how to read the number, and how it gets misused.
Standardized, norm-referenced tests
Many major clinical language tests are both standardized and norm-referenced. Together, those two ideas turn a raw number into a meaningful score, provided the child is represented by the test’s norming and validation evidence.
Standardized
Given and scored by fixed, uniform rules, so scores are less dependent on the individual clinician or testing session.
Norm-referenced
The score is interpreted against a normative sample of peers (matched on age — sometimes grade or region). It answers: where does this child stand relative to peers? When a child differs substantially from the norm sample in language, dialect, culture, or other key characteristics, treat the standardized score as one piece of evidence, not the whole answer.
Because the score is read against peers, a raw “22 correct” only becomes meaningful once we convert it. Z-scores and standard scores are built from the norm group’s mean and SD. Percentile ranks come from the child’s position in the ordered norm distribution, often reported directly in the manual. The next sections build each one.
The z-score
A z-score answers one question: how many standard deviations is this score from the mean? On the z-scale the mean is 0 and one SD is 1.
The formula
z = (score − mean) / SD
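As a small sketch, the formula is one line of code. The norm-group mean of 24 and SD of 5 below are hypothetical values, not taken from any real manual.

```python
def z_score(score: float, mean: float, sd: float) -> float:
    """How many SDs the score sits above (+) or below (-) the mean."""
    return (score - mean) / sd


# Hypothetical norm group: mean raw score 24, SD 5.
print(z_score(19, mean=24, sd=5))   # -1.0, one SD below the mean
print(z_score(24, mean=24, sd=5))   # 0.0, exactly at the mean
print(z_score(34, mean=24, sd=5))   # 2.0, two SDs above the mean
```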
The standard-score scale
z-scores have decimals and negatives, which are awkward to report. Many language-test composite or index standard scores use a familiar scale with mean 100, SD 15, the same metric used by many IQ composites. Some subtests use other scales, such as scaled scores with mean 10 and SD 3, so always check the manual.
Convert between them
z = (SS − 100) / 15 • SS = 100 + z × 15
Check the scale
The mean 100, SD 15 metric is common for composite/index scores on major language tests (e.g., CELF-5, PLS-5). A standard score of 115 means “+1 SD” only when that test uses the 100/15 scale.
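The sketch below converts between scales by passing through z. The function names are illustrative. The 100/15 and 10/3 scales are the ones mentioned above, and any other test's scale would need to come from its manual.

```python
def z_from_scale(score: float, mean: float, sd: float) -> float:
    """Convert a score on any mean/SD scale to a z-score."""
    return (score - mean) / sd


def scale_from_z(z: float, mean: float, sd: float) -> float:
    """Convert a z-score onto a scale with the given mean and SD."""
    return mean + z * sd


z = z_from_scale(115, mean=100, sd=15)
print(z)                                          # 1.0
print(scale_from_z(z, mean=10, sd=3))             # 13.0 on a 10/3 scaled-score metric
print(round(scale_from_z(-0.40, 100, 15), 1))     # 94.0
```

A standard score of 115 and a scaled score of 13 both sit one SD above the mean, just on different rulers.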
One curve, three scales
When the norm distribution is approximately normal, z-score, standard score, and percentile can name the same place on the normal curve. Reveal each scale to see the idealized alignment.
Watch the spacing
Percentiles are not evenly spaced in score points. Near the middle, a few points move you many percentiles; out in the tails, big score gaps barely change the percentile. That’s the bell curve at work.
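To see the uneven spacing, compare the same 5-point gap in the middle and in the tail. This sketch assumes an idealized normal curve on the 100/15 scale, and `percentile` is an illustrative helper.

```python
from statistics import NormalDist

norm = NormalDist(mu=100, sigma=15)   # idealized standard-score distribution


def percentile(standard_score: float) -> float:
    """Percent of the idealized norm group scoring below this standard score."""
    return norm.cdf(standard_score) * 100


middle_gap = percentile(105) - percentile(100)   # about 13 percentile points
tail_gap = percentile(135) - percentile(130)     # about 1 percentile point

print(f"100 -> 105: {middle_gap:.1f} percentile points")
print(f"130 -> 135: {tail_gap:.1f} percentile points")
```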
Convert it yourself
Three short builds. Every number is a chip — drag the right one into each slot. Some chips are distractors you won’t need, and the chips don’t say what they are: press Check to reveal each one’s role.
Every score is an estimate
Measurement error is unavoidable — think how your own typing speed varies test to test. So a score is really a band, not a point. A confidence interval is built from the observed score and the standard error of measurement (SEM): observed score ± a confidence-level multiplier × SEM. Good manuals usually print the completed band for you.
| Raw | Std score | 90% CI | %ile |
|---|---|---|---|
| 25 | 101 | 91–110 | 52 |
| 24 | 98 | 88–108 | 46 |
| 23 | 96 | 86–106 | 41 |
| 22 | 94 | 84–104 | 38 |
| 21 | 92 | 82–102 | 30 |
| 20 | 89 | 79–99 | 25 |
| 19 | 87 | 77–97 | 21 |
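The band formula can be sketched as follows. The SEM of 6.0 is a hypothetical value chosen so the result lands near the table row for a standard score of 94. It is not taken from a real manual, and some manuals build their bands differently (for example, around an estimated true score), so printed bands need not be symmetric.

```python
from statistics import NormalDist


def confidence_band(observed: float, sem: float, level: float = 0.90):
    """observed score +/- (confidence-level multiplier x SEM)."""
    multiplier = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.645 for 90%
    margin = multiplier * sem
    return observed - margin, observed + margin


low, high = confidence_band(94, sem=6.0)   # SEM = 6.0 is hypothetical
print(round(low), round(high))             # 84 104
```

A wider confidence level (say 95%) uses a larger multiplier and so gives a wider band for the same SEM.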
Read a full profile: Johnny, age 4;1
z = (94 − 100) / 15 = −0.40 — just under half an SD below the mean.
Percentile 38 = higher than ~38% of same-age peers → within the average range for this manual, and not diagnostic by itself.
Honesty note
A pure normal model gives about the 34th percentile for z = −0.40, while the manual lists 38th. Real test norms aren’t perfectly normal — when they differ, trust the manual’s table over the idealized math.
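The idealized figure can be reproduced with the standard library's `NormalDist`. The manual's value of 38 is copied from the table above, and everything else is computed under the pure normal model.

```python
from statistics import NormalDist

z = (94 - 100) / 15                              # -0.40
model_percentile = NormalDist().cdf(z) * 100     # area below z on a normal curve
manual_percentile = 38                           # from the manual's table

print(round(z, 2), round(model_percentile))      # -0.4 34
print(manual_percentile - round(model_percentile))   # 4
```

The four-point gap is the difference between the idealized curve and the actual norm sample, which is why the manual's table takes priority.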
Write it in a clinical report
Numbers are only useful if you can explain them to families and other professionals. Build a clear, defensible sentence by dragging Johnny’s values into the report. Then compare good vs. risky phrasing.
✓ Say this
“Johnny’s score is within the average range for his age. Scores are estimates, so we report the confidence interval rather than treating one number as exact.”
Age equivalents (AE)
An age-equivalent (AE) answers a different — and easily misunderstood — question. It compares a child’s raw score to the average performance of different age groups, telling you which age group’s typical raw score matches the child’s, rather than comparing the child to peers of their own exact age.
| Raw score | Age-equivalent range |
|---|---|
| 0–18 | below 3;0 |
| 19–21 | 3;0 through 3;5 |
| 22–24 | 3;6 through 3;11 |
| 25–27 | 4;0 through 4;5 |
Johnny (4;1), raw 22 → AE = 3;6–3;11. His raw score matches the average raw-score range for children around 3;6–3;11. This does not mean Johnny “functions like” a younger child.
That same score is a standard score of 94 (38th percentile) — within the average range for his own age on this manual. The two just answer different questions: “which age band had a similar average raw score?” vs. “how does he compare with his own-age peers?”
When AE helps
AE can feel concrete, but it is easy to misinterpret. Use it only as a carefully framed supplement, alongside the standard score and percentile — never as the lead score.
Just keep in mind
AE intervals aren’t equal, and because raw scores climb steeply with age, a small score gap can look like a bigger age gap. Do not use AE for diagnosis, eligibility, or placement decisions; pair it with the standard score and percentile rather than leaning on it alone — more on that in the misuse cases below.
Norm-referenced vs. criterion-referenced
Everything so far has been norm-referenced — comparing a child to a sample of peers. There’s a second, equally useful approach that compares the child to a fixed standard of mastery instead.
| | Norm-referenced | Criterion-referenced |
|---|---|---|
| Compared to… | a normative sample of peers | a fixed criterion / mastery level |
| Question it answers | “Where does this child stand relative to peers?” | “Has this child mastered this skill?” |
| Typical scores | standard score, percentile, z-score | % correct, pass / fail, mastery level |
| Often useful for… | describing performance relative to peers; one piece of eligibility or diagnostic evidence | identifying specific skills, planning treatment, and monitoring progress |
Don’t confuse the rulers
90% correct (criterion-referenced — how much of a skill the child has) is not the 90th percentile (norm-referenced — better than 90% of peers). Same digits, completely different meaning.
How standardized scores get misused
Four classic errors. Step through them one case at a time — each with what happened, what went wrong, and what to do instead.
Three ideas to keep
Consistent ≠ correct
A test can be perfectly reliable and still measure the wrong thing. Always ask the validity question: what interpretation and use are supported?
Catch vs. clear
Sensitivity catches true cases; specificity clears true negatives; the cut-off trades them. In AI/NLP, sensitivity is recall; precision asks how many flagged cases were truly positive.
A score is a band
SS and z use the mean and SD; percentiles show position in the norm distribution. Report a confidence interval; never let one number diagnose.
The whole lesson in a sentence
Trust the number — but only after you know what kind of number it is.