Messick observes that it is important to collect evidence of both the positive and negative consequences of performance assessments. If the promised benefits to teaching and learning occur, this is evidence in support of the validity of performance assessments. If such benefits do not occur, or if negative consequences arise, that too is important to document. When negative consequences result, it is important to determine their causes. If some examinees receive low scores because something is missing from the assessment, this is evidence of construct under-representation. For example, if a writing assessment consists only of questions about how to revise text and does not allow an examinee to demonstrate an ability to produce text, then the construct as defined by the Hayes and Flower model of writing is underrepresented. Conversely, low scores should not occur because the assessment contains irrelevant questions; in writing assessment, for example, an essay prompt on a topic that examinees are unlikely to be familiar with could unfairly affect performance.
Unfortunately, evidence on the consequences of new forms of assessment is rarely assembled. In the health professions, for example, Swanson, Norman, and Linn could find only two examples of systematic research on the impact of changes in examinations. One reason for the failure to conduct research on consequences is that it is difficult. It may require a number of years for a new assessment to produce observable changes in the behaviors of students or teachers, or the changes may be so gradual that they are not easily detected.
Comparability. In order to make comparisons of assessments from year to year or from administration to administration, the assessments must mean the same thing on different occasions. This means that they must be of comparable content and comparable difficulty. In writing skill assessment, comparability is a particularly troublesome problem because individual free-response tasks are quite often not comparable. Comparability problems are alleviated to some extent through the use of multiple tasks, as in NAEP and some statewide assessments, or through the combination of free-response tasks with multiple-choice items, as is done for SAT II, Advanced Placement, the GED, and Praxis.
Ensuring comparable content requires careful attention to test specifications. With traditional multiple-choice tests, test specifications are made comparable across testing occasions by balancing the number of items of each type. With a large number of items, balancing test specifications is not difficult. In writing skill assessment using free responses, however, the number of tasks is usually quite small, and each task may require from 20 minutes to an hour of testing time. As a result, comparability of content may be difficult to maintain. In addition, it is essential to control the exposure of free-response tasks so that their content does not become known prior to a test administration. The scoring of free-response tasks must also be carefully controlled across administrations through the use of the same scoring rubrics and consistent reader training from year to year and from administration to administration.