Assessment: Validity & reliability

Validity flows from human judgment about the persuasiveness of a particular validity argument and the evidence on which that argument has been fashioned.

W. James Popham

Procedures and guiding principles to make more consistent, reliable, & valid assessment decisions


This page includes suggestions and procedures to logically evaluate the validity of inferences and claims made about and with assessments.

It also gives procedures to follow to create, implement, evaluate, and report assessments with more validity and reliability.

It ends with examples that offer some initial ideas for using high-stakes testing for teacher retention and portfolios for preservice retention and certification.

Suggestions and procedures to evaluate assessment claims

Inspired by Greg Thompson, David Rutkowski, & Leslie Rutkowski

Every claim made about assessment results needs arguments both for and against it before a valid interpretation can be made. It is the responsibility of those making a claim to provide evidence for arguments that support their claim and for arguments that do not. The best-supported claims are the most valid. One way to do this is to adapt Stephen Toulmin's position analysis heuristic.

Position Analysis Heuristic


Position analysis by Toulmin

Model to evaluate the validity of claims about assessment data

Altered to create a model to evaluate a claim and its opposite claim.

Validity model

Let's review some vocabulary:

Validity - An assessment that is valid measures what it claims to measure.

Reliability - An assessment that is reliable provides consistent results across time and users.

An assessment must be reliable to be valid, but it doesn't have to be valid to be reliable. A test may give the same consistent range of results over time and across different groups (reliable), but it may not measure what it is claimed to measure (not valid).
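
The difference between the two terms can be illustrated with a small numerical sketch. The bathroom scale, the true weight, and the readings below are hypothetical values invented for illustration. They are not assessment data from this page. A small spread stands in for reliability, and a large bias stands in for a lack of validity.

```python
from statistics import mean, pstdev

# Hypothetical data: one scale weighs the same 150 lb person five times.
TRUE_WEIGHT = 150.0
readings = [155.1, 154.9, 155.0, 155.2, 154.8]

# Small spread: the scale gives consistent results (reliable).
spread = pstdev(readings)

# Large bias: the scale does not measure what it claims to measure (not valid).
bias = mean(readings) - TRUE_WEIGHT

print(f"spread = {spread:.2f} lb, bias = {bias:+.2f} lb")
```

The readings agree with each other to within a fraction of a pound, yet every one of them is about 5 lb too high. Consistency alone says nothing about whether the right thing is being measured.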

Determining validity requires constructing and evaluating arguments for and against the intended interpretation of assessment scores and their relevance to the proposed use.

If there is no compelling evidence (or argument) that undermines or falsifies an inference, then that inference is reasonably valid.

Let's review a procedure to use with the model to determine validity.

Procedure to use the model

The model can be used to verify or falsify an inference or intended use of data with a procedure such as:

  1. Outline the claim to gather enough information to clearly understand and communicate it.
  2. Outline the opposite (falsifying) claim to gather enough information to clearly understand and communicate it.
  3. Find the evidence - use multiple sources to collect supporting and falsifying information.
  4. Consider the stakes - the higher the stakes, the stronger the quality and scale of evidence that needs to be gathered for both claims and their inferences.
  5. Decide on the most valid interpretation - is the claim or the opposite claim best supported by the collected evidence?
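
The five steps can be sketched as a simple record-keeping structure. This is only one possible sketch. The class names, the 1 to 3 strength scale, and the margin thresholds are all assumptions made for illustration and are not part of the model. Each strength rating is still a human judgment, and the tally is a bookkeeping aid, not a substitute for weighing the arguments.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    source: str     # where the evidence came from (step 3)
    supports: bool  # True: supports the claim; False: supports the opposite claim
    strength: int   # judged quality, 1 (weak) to 3 (strong); hypothetical scale


@dataclass
class ValidityArgument:
    claim: str            # step 1
    opposite_claim: str   # step 2
    high_stakes: bool     # step 4
    evidence: list[Evidence] = field(default_factory=list)

    def weigh(self) -> str:
        """Step 5: report which claim the recorded evidence favors."""
        support = sum(e.strength for e in self.evidence if e.supports)
        against = sum(e.strength for e in self.evidence if not e.supports)
        # Higher stakes demand a wider margin of evidence (assumed thresholds).
        required_margin = 4 if self.high_stakes else 2
        if support - against >= required_margin:
            return "claim"
        if against - support >= required_margin:
            return "opposite claim"
        return "gather more evidence"


# Usage with the end-of-year test example; the strength ratings are hypothetical.
eoy = ValidityArgument(
    claim="EOY tests can be used for teacher assessment and retention.",
    opposite_claim="EOY tests can NOT be used for teacher assessment and retention.",
    high_stakes=True,
)
eoy.evidence.append(Evidence("technical report supports reliability of the test", True, 2))
eoy.evidence.append(Evidence("technical report has no data on teacher effectiveness", False, 3))
print(eoy.weigh())
```

With only these two pieces of evidence and high stakes, neither claim reaches the assumed margin, so the sketch returns "gather more evidence". That matches the idea that information should be revisited until there is sufficient trust in the decision.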


All information should be revisited until there is sufficient trust in how the claims and opposite claims will be used or rejected, and until that use or non-use can be communicated in easily understandable ways to those affected by any decisions.


Procedures to follow to create and use valid and reliable assessment

Inspired by Barbara S. Plake - Buros Center for Testing.

Procedure to create assessments

A diverse team met once a week for two hours during a school year to create learning outcomes, concepts, and assessment items. The express purpose was to use them to create assessments that measure student achievement and progress on the school and state standards. The team included a curriculum director, an external consultant, classroom teachers from the grade for which the assessments were being developed, and classroom teachers from the grades above and below that level.

The team considered the following principles when making decisions, in order to maximize the validity of each item and instrument created. The principles are organized by task:

  • Development,
  • Implementation,
  • Scoring and evaluation, and
  • Reporting.

Development of assessment instruments

  • The construct (what is being assessed) is understood and can be written as concepts and/or skills (how the information is expressed mentally) and outcomes (artifacts created by the students to demonstrate their level of understanding).
  • The assessment item is appropriate for the construct being measured.
  • The assessment item is appropriate for the developmental level of the students.
  • Reasons the student might do well on the item, not related to the construct, have been searched for and ruled out.
  • Reasons the student might do poorly on the item, not related to the construct, have been searched for and ruled out.
  • Students’ learning of the related information will be facilitated in a similar manner (materials used, kinds of questions asked or tasks given, types of answers expected, time allowance, …) as it will be assessed.
  • All necessary and sufficient information needed by the students to successfully complete each item is or will be taught.
  • All students have sufficient opportunity to learn the necessary and sufficient information needed to be successful on each item.
  • The number of items in each area is proportional to what is emphasized during instruction.
  • Cognitive demands of items match the intended interpretations of the assessments.
  • Tasks similar to the selected tasks would give similar results.
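
The principle that item counts should be proportional to instructional emphasis can be checked with simple arithmetic. The sketch below assumes emphasis is measured in hours of instruction per area, which is one possible measure among several. The function name, the area names, and the hours are hypothetical.

```python
def allocate_items(emphasis_hours: dict[str, float], total_items: int) -> dict[str, int]:
    """Split total_items across areas in proportion to instructional time.

    Uses largest-remainder rounding so the counts always sum to total_items.
    """
    total_hours = sum(emphasis_hours.values())
    exact = {area: total_items * hours / total_hours
             for area, hours in emphasis_hours.items()}
    counts = {area: int(share) for area, share in exact.items()}
    leftover = total_items - sum(counts.values())
    # Give the remaining items to the areas with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda area: exact[area] - counts[area], reverse=True)
    for area in by_remainder[:leftover]:
        counts[area] += 1
    return counts


# Hypothetical unit: 20 hours of instruction, 15 items on the assessment.
hours = {"fractions": 10, "geometry": 7, "measurement": 3}
print(allocate_items(hours, 15))
# {'fractions': 8, 'geometry': 5, 'measurement': 2}
```

Comparing an existing assessment's item counts against such an allocation is a quick way to spot areas that are over- or under-represented relative to what was taught.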

Implementation of assessment instrument

  • Students have been informed of the assessment and feel confident they are ready for the challenge.
  • Students have been given opportunities to ask questions about the test.
  • Students are motivated to take the test.
  • Instructions have been clearly explained to all students in a similar manner.

Scoring and evaluation of assessment instrument

  • The scoring key has been validated.
  • The scoring process is clear.
  • The scoring rubric is clearly understood by all evaluators.
  • Performance level descriptors are meaningful.
  • Performance level indicators are developmentally appropriate.
  • The rubric fits the construct.
  • The rubric is congruent with the instructional emphasis.
  • The scores reflect the students’ abilities and skills and are not due to scorer bias.
  • Scoring is replicable and different raters are confident that similar abilities and skills are scored similarly.
  • Students’ scores indicate that the instructions were clear.
  • It is evident that students knew what they needed to do to be successful on the task.
  • Performance category and cutscore decisions are sound and defensible.
  • Results are consistent with teacher expectations.
  • There are previous results to suggest that these results are probably accurate.
  • Students’ individual scores on one particular assessment relate to their average scores on other similar assessments. Any inconsistent performance is explained by factors not related to the construct.
  • Students’ performance on a particular assessment will be helpful to make instructional decisions to facilitate students’ learning.
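
The principle that scoring is replicable across raters can be checked by having two raters score the same work independently. The sketch below computes percent agreement and Cohen's kappa, a standard statistic that corrects agreement for chance. It is one common way to quantify rater consistency, not a method prescribed by this page, and the rubric scores are hypothetical.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Fraction of papers on which both raters gave the same rubric level."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement corrected for chance: (observed - expected) / (1 - expected).

    Undefined (division by zero) if both raters use a single identical level
    for every paper, because chance agreement is then 1.
    """
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same level at random,
    # given how often each rater used each level.
    expected = sum(counts_a[level] * counts_b[level] for level in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical rubric levels (1 to 4) given by two raters to the same ten papers.
rater_a = [3, 2, 4, 1, 3, 2, 4, 3, 2, 1]
rater_b = [3, 2, 4, 2, 3, 2, 3, 3, 2, 1]

print(f"agreement = {percent_agreement(rater_a, rater_b):.2f}")
print(f"kappa = {cohens_kappa(rater_a, rater_b):.2f}")
```

Here the raters agree on 8 of 10 papers, and kappa is about 0.72 once chance agreement is removed. What counts as acceptable agreement depends on the stakes of the decision.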

Reporting of assessment results

  • Students understand their score and its relationship to levels of performance.
  • Students understand how the scores will be used.
  • Score reports are clear.
  • Reporting of growth is comparable across time.
  • The precision reported or suggested in the reports is consistent with the actual precision of the assessment.
  • There will be an increase in teacher collaboration until it becomes a significant element of the school culture.
  • There will be an increased reporting of students’ ability to learn until it becomes a significant element of the school culture.
  • Reporting of decreased behavioral interactions until they become minimal.
  • Reporting of decreased referrals for alternative and remedial programs until they become supportive mainstream programs.
  • Reporting of a decrease in students’ non-completion of tasks until it is virtually nonexistent.
  • Reporting of teacher satisfaction in program development and curriculum decisions.
  • Reporting of positive teacher and student morale.
  • Reporting of student support for the curriculum.

Assessment evaluation for validity examples with initial ideas

High-stakes testing for teacher retention

Claim - End of Year tests (EOY) can be used for teacher assessment and retention.

Opposite (falsifying) claim - End of Year tests (EOY) can NOT be used for teacher assessment and retention.

Inference - The EOY test directly reflects teacher quality.

Assumptions & data:

  • Teachers are fully responsible for student outputs and no other outside factors contribute to student scores.
  • The assessment results are an accurate and comprehensive measure of the implemented curriculum.
  • All children have had an opportunity to learn the curriculum.
  • All students are motivated and do their best on the assessment.
  • The test is reliable.
  • Technical reports are available to assure the reliability of the test.
  • The technical report has data only on the test takers' achievement. It includes no reference to the test's reliability as a measure of how effectively teachers taught the material being tested.
  • Therefore, the EOY assessment is not a reliable measure of teacher instruction.

Portfolios for retention and certification

Claim - Preservice teachers' portfolio artifacts can be used for their retention and certification.

Opposite (falsifying) claim - Preservice teachers' portfolio artifacts can NOT be used for their retention and certification.

Inference - The portfolio artifacts directly reflect teacher qualifications for certification.

Assumptions & data:

  • Teachers are fully responsible for the artifacts they collect.
  • The artifacts represent an accurate and comprehensive measure of their performance.
  • All preservice teachers have had an opportunity to learn the curriculum.
  • The artifact scores are reliable.
  • Multiple sources create a better sample of a teacher's work than a single sample.