Construct validity refers to the degree to which inferences can legitimately be made from the operationalizations in your study to the theoretical constructs on which those operationalizations were based. We might think of construct validity as a “labeling” issue. When you implement a program that you call a “Head Start” program, is your label an accurate one? When you measure what you term “self esteem” is that what you were really measuring?
We said yesterday that medication adherence is a well-defined concept; hence we do not need to check that everyone else means the same thing by it as we do. And since a PHR is something we can see, touch, or point at, it raises no issue of construct validity.
Do we need to check the questionnaire for construct validity? (It makes no sense to me to do so, since both our variables are well defined.)
In face validity, you look at the operationalization and see whether “on its face” it seems like a good translation of the construct. This is probably the weakest way to try to demonstrate construct validity. For instance, you might look at a measure of math ability, read through the questions, and decide that yep, it seems like this is a good measure of math ability (i.e., the label “math ability” seems appropriate for this measure). Of course, if this is all you do to assess face validity, it would clearly be weak evidence because it is essentially a subjective judgment call. We can improve the quality of face validity assessment considerably by making it more systematic. For instance, if you are trying to assess the face validity of a math ability measure, it would be more convincing if you sent the test to a carefully selected sample of experts on math ability testing and they all reported back with the judgment that your measure appears to be a good measure of math ability.
Content validity refers to how accurately an assessment or measurement tool taps into the various aspects of the specific construct in question. In other words, do the questions really assess the construct in question, or are the responses by the person answering the questions influenced by other factors? This approach assumes that you have a good detailed description of the content domain, something that’s not always true.
In predictive validity, we assess the operationalization’s ability to predict something it should theoretically be able to predict. For instance, we might theorize that a measure of math ability should be able to predict how well a person will do in an engineering-based profession. We could give our measure to experienced engineers and see if there is a high correlation between scores on the measure and their salaries as engineers. A high correlation would provide evidence for predictive validity — it would show that our measure can correctly predict something that we theoretically think it should be able to predict.
In concurrent validity, we assess the operationalization’s ability to distinguish between groups that it should theoretically be able to distinguish between. For example, if we come up with a way of assessing manic-depression, our measure should be able to distinguish between people who are diagnosed with manic-depression and those diagnosed with paranoid schizophrenia. If we want to assess the concurrent validity of a new measure of empowerment, we might give the measure to both migrant farm workers and to the farm owners, theorizing that our measure should show that the farm owners are higher in empowerment. As in any discriminating test, the results are more powerful if you are able to show that you can discriminate between two groups that are very similar.
In convergent validity, we examine the degree to which the operationalization is similar to (converges on) other operationalizations that it theoretically should be similar to. For instance, to show the convergent validity of a Head Start program, we might gather evidence that shows that the program is similar to other Head Start programs. Or, to show the convergent validity of a test of arithmetic skills, we might correlate the scores on our test with scores on other tests that purport to measure basic math ability, where high correlations would be evidence of convergent validity.
In discriminant validity, we examine the degree to which the operationalization is not similar to (diverges from) other operationalizations that it theoretically should not be similar to. For instance, to show the discriminant validity of a Head Start program, we might gather evidence that shows that the program is not similar to other early childhood programs that don’t label themselves as Head Start programs. Or, to show the discriminant validity of a test of arithmetic skills, we might correlate the scores on our test with scores on tests of verbal ability, where low correlations would be evidence of discriminant validity.
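To make the arithmetic-test example for convergent and discriminant validity concrete, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: the variable names, the sample size, the noise levels, and the choice to model math and verbal ability as independent. It only shows the pattern we would hope to see, a high correlation with another math test and a low one with a verbal test.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)
n_people = 500

# Hypothetical latent abilities, drawn independently of each other (assumption).
math_ability = [rng.gauss(0, 1) for _ in range(n_people)]
verbal_ability = [rng.gauss(0, 1) for _ in range(n_people)]

# Each test score = the ability it targets + that test's own random noise.
our_arithmetic_test = [a + rng.gauss(0, 0.5) for a in math_ability]
other_math_test = [a + rng.gauss(0, 0.5) for a in math_ability]
verbal_test = [v + rng.gauss(0, 0.5) for v in verbal_ability]

# Convergent evidence: our test vs. another test of the same construct.
convergent_r = pearson(our_arithmetic_test, other_math_test)
# Discriminant evidence: our test vs. a test of a different construct.
discriminant_r = pearson(our_arithmetic_test, verbal_test)

print(f"convergent r:   {convergent_r:.2f}")
print(f"discriminant r: {discriminant_r:.2f}")
```

With real data the two abilities would probably be somewhat correlated, so the discriminant correlation would not sit near zero; what matters is that it is clearly lower than the convergent one.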
True Score Theory – random error
The variability of your measure is the sum of the variability due to true score and the variability due to random error, which is sometimes considered noise.
True score theory matters for three reasons. First, it reminds us that most measurement has an error component. Second, true score theory is the foundation of reliability theory. A measure that has no random error (i.e., is all true score) is perfectly reliable; a measure that has no true score (i.e., is all random error) has zero reliability. Third, true score theory can be used in computer simulations as the basis for generating “observed” scores with certain known properties.
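The simulation use mentioned above can be sketched roughly as follows. The means, standard deviations, and sample size are made-up values chosen only for illustration. The sketch generates "observed" scores as true score plus random error and checks that the observed variance comes out close to the sum of the two component variances.

```python
import random
import statistics

rng = random.Random(42)
n_people = 10_000

# Assumed properties: true scores ~ N(50, 10), random error ~ N(0, 5).
true_scores = [rng.gauss(50, 10) for _ in range(n_people)]
random_errors = [rng.gauss(0, 5) for _ in range(n_people)]

# True score theory: observed score = true score + random error.
observed = [t + e for t, e in zip(true_scores, random_errors)]

var_true = statistics.pvariance(true_scores)
var_error = statistics.pvariance(random_errors)
var_observed = statistics.pvariance(observed)

print(f"var(true) + var(error) = {var_true + var_error:.1f}")
print(f"var(observed)          = {var_observed:.1f}")
```

The two printed numbers will not match exactly in a finite sample, because the simulated true scores and errors are never perfectly uncorrelated; the gap shrinks as the sample grows.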
Measurement Error – systematic error
Systematic error is sometimes considered to be bias in measurement.
Reducing Measurement Error
- Pilot test your instruments, getting feedback from your respondents regarding how easy or hard the measure was and information about how the testing environment affected their performance.
- If you are gathering measures using people to collect the data (as interviewers or observers) you should make sure you train them thoroughly so that they aren’t inadvertently introducing error.
- When you collect the data for your study you should double-check the data thoroughly. All data entry for computer analysis should be “double-punched” and verified. This means that you enter the data twice, the second time having your data entry machine check that you are typing the exact same data you did the first time.
- You can use statistical procedures to adjust for measurement error. These range from rather simple formulas you can apply directly to your data to very complex modeling procedures for modeling the error and its effects.
- One of the best things you can do to deal with measurement errors, especially systematic errors, is to use multiple measures of the same construct. Especially if the different measures don’t share the same systematic errors, you will be able to triangulate across the multiple measures and get a more accurate sense of what’s going on.
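The double-entry check from the list above can be sketched in a few lines. This is only an illustration under assumed conventions: the function name `find_entry_mismatches`, the record layout (one dict per respondent), and the sample values are all hypothetical, not part of any particular data entry tool.

```python
def find_entry_mismatches(first_pass, second_pass):
    """Return (row, field, first_value, second_value) for every disagreement."""
    if len(first_pass) != len(second_pass):
        raise ValueError("the two passes must contain the same number of records")
    mismatches = []
    for row, (a, b) in enumerate(zip(first_pass, second_pass)):
        # Compare every field that appears in either version of the record.
        for field in sorted(set(a) | set(b)):
            if a.get(field) != b.get(field):
                mismatches.append((row, field, a.get(field), b.get(field)))
    return mismatches

# Hypothetical data typed in twice; the second pass transposes two digits.
first_pass = [
    {"id": 1, "score": 17},
    {"id": 2, "score": 23},
    {"id": 3, "score": 9},
]
second_pass = [
    {"id": 1, "score": 17},
    {"id": 2, "score": 32},
    {"id": 3, "score": 9},
]

mismatches = find_entry_mismatches(first_pass, second_pass)
print(mismatches)  # [(1, 'score', 23, 32)]
```

Each flagged record would then be checked against the original paper form; the comparison itself cannot tell which of the two entries is the correct one.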
Theory of Reliability
In research, the term reliability means “repeatability” or “consistency”. A measure is considered reliable if it would give us the same result over and over again (assuming that what we are measuring isn’t changing!).
If our measure, X, is reliable, then measuring or observing it twice on the same persons should give scores that are pretty much the same.
The error score is assumed to be random.
Reliability is a ratio or fraction: the variance of the true scores divided by the variance of the observed scores. We can’t compute it directly because we can’t calculate the variance of the true scores.
We can, however, estimate the reliability as the correlation between two observations of the same measure.
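A small simulation can show why the correlation between two observations works as an estimate. In a simulation, unlike in real research, the true scores are known, so the ratio of true-score variance to observed-score variance can be computed and compared with the correlation. All numbers below (sample size, means, standard deviations) are assumptions chosen for illustration.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(7)
n_people = 5_000

# The same true score underlies both observations of each person.
true_scores = [rng.gauss(50, 10) for _ in range(n_people)]
# Each observation adds its own independent random error.
first_observation = [t + rng.gauss(0, 5) for t in true_scores]
second_observation = [t + rng.gauss(0, 5) for t in true_scores]

# Only computable here because the simulation knows the true scores.
true_ratio = statistics.pvariance(true_scores) / statistics.pvariance(first_observation)
# What a researcher could actually compute from two observations.
estimated_reliability = pearson(first_observation, second_observation)

print(f"var(true) / var(observed): {true_ratio:.3f}")
print(f"correlation of X1 and X2:  {estimated_reliability:.3f}")
```

The estimate relies on the assumption stated in these notes that the error is random, so the errors in the two observations are unrelated to each other and to the true score.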
Types of Reliability
- Inter-Rater or Inter-Observer Reliability
Used to assess the degree to which different raters/observers give consistent estimates of the same phenomenon.
- Test-Retest Reliability
Used to assess the consistency of a measure from one time to another.
- Parallel-Forms Reliability
Used to assess the consistency of the results of two tests constructed in the same way from the same content domain.
- Internal Consistency Reliability
Used to assess the consistency of results across items within a test.
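As an illustration of internal consistency, here is a sketch of Cronbach's alpha, one commonly used estimator. The notes above do not name a specific statistic, so the choice of alpha is an assumption on my part, as are the function name, the simulated five-item test, and the noise levels.

```python
import random
import statistics

def cronbach_alpha(item_scores):
    """Cronbach's alpha.

    item_scores: list of items, each a list with one score per respondent.
    """
    k = len(item_scores)
    if k < 2:
        raise ValueError("need at least two items")
    item_variances = [statistics.variance(item) for item in item_scores]
    # Total test score per respondent = sum across that respondent's items.
    totals = [sum(person) for person in zip(*item_scores)]
    total_variance = statistics.variance(totals)
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

rng = random.Random(3)
n_people = 1_000
n_items = 5

# Hypothetical test: every item reflects the same trait plus its own noise.
trait = [rng.gauss(0, 1) for _ in range(n_people)]
items = [[t + rng.gauss(0, 1) for t in trait] for _ in range(n_items)]

alpha = cronbach_alpha(items)
print(f"Cronbach's alpha: {alpha:.2f}")
```

Items that track the same underlying trait push alpha toward 1, while items that are mostly noise pull it toward 0.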