|
Volume 10 |
April 2008 |
Issue#1 |
|
Examining Panelist Data from a Bilingual Standard
Setting Study.
Elaine M. Rodeck,
Tzu-Yun Chin, Susan L. Davis,
Barbara S. Plake,
Buros Center for Testing,
University of Nebraska-Lincoln
Abstract
This study examined
the relationships between the evaluations obtained from standard
setting panelists and changes in ratings between different
rounds of a standard setting study that involved setting
standards on different language versions of an exam. We
investigated panelists’ evaluations to determine if their
perceptions of the standard setting were related to adjustments
they made in their recommended cut scores across rounds of the
process. The standard setting was conducted for a high school
mathematics test composed of multiple-choice and constructed
response items. The test was designed for a population of
students who speak and receive primary instruction in either
English or French. Results indicated panelists’ ratings of their
ratings and their comfort with the process were related to how
their ratings changed across sequential rounds of the process.
Differences in the degree to which
the evaluations influenced the standard setting judgments were
observed across the English and French panelists, with the
French group reporting increasing comfort across rounds in
contrast to the English group that had relatively higher comfort
at the beginning of the process.
The results illustrate how standard
setting evaluation data can provide insight into factors that
affect panelists’ ratings.
|
|
Volume 9 |
April 2008 |
Issue#3 |
|
|
|
Volume 8 |
February 2007 |
Issue#1 |
|
Does Quantity Equal Quality? The Relationship Between Length of
Response and Scores on the SAT Essay. Jennifer L. Kobrin,
Hui Deng, and Emily J. Shaw, The College Board
Abstract
This study was
designed to address two frequent criticisms of the SAT essay --
that essay length is the best predictor of scores, and that
there is an advantage in using more "sophisticated" examples as
opposed to personal experience. The study was based on 2,820
essays from the first three administrations of the new SAT. Each
essay was coded for number of words, number of paragraphs,
whether or not the response included first-person, and whether
or not the response went to the second page. Analyses included
descriptive statistics and group comparisons on the essay
response features, correlations between essay length and scores,
and hierarchical multiple regression to examine the contribution
of each essay feature variable to the prediction of essay
scores. The number of words in the essay explained 39% of the
variance of essay scores. Whether or not the essay reached the
second page explained an additional 1.5%, and whether or not the
essay was written in first person explained an additional 1.1%.
An examination of these features potentially affecting SAT essay
scores is essential to maintain that the SAT writing section
promotes valid interpretations of students’ writing skills. The
research described in this paper may benefit other testing
programs that include essay assessments. The careful analysis of
response features and the identification of potential
construct-irrelevant features in essay assessments are important
for evaluating the content and construct validity of writing
assessments. |
|
Volume 7 |
July 2005 |
Issue#3 |
|
Evaluating Computer Automated Scoring: Issues, Methods, and an Empirical
Illustration.
Yongwei Yang, The Gallup
Organization, Chad W. Buckendahl, Buros Center for Testing, University
of Nebraska-Lincoln, Piotr J. Juszkiewicz, The Gallup Organization,
Dennison S. Bhola, James Madison University, July 2005
Abstract
With the continual progress
of computer technologies, computer automated scoring (CAS) has become a
popular tool for evaluating writing assessments. Research on
applications of these methodologies to new types of performance
assessments is still emerging. While research has generally shown a high
agreement of CAS system generated scores with those produced by human
raters, concerns and questions have been raised about appropriate
analyses and validity of decisions/interpretations based on those
scores. In this paper we expand the emerging discussions on validation
strategies on CAS by illustrating several analyses that can be accomplished
with available data. These analyses compare the degree to which two CAS
systems accurately score data from a structured interview using the
original scores provided by human raters as the criterion. Results
suggest key differences across the two systems as well as differences in
the statistical procedures used to evaluate them. The use of several
statistical and qualitative analyses is recommended for evaluating
contemporary CAS systems.
|
|
Volume 7 |
April 2005 |
Issue#2 |
|
Some
Useful Cost-Benefit Criteria for Evaluating Computer-based Test Delivery
Models and Systems. Richard
M. Luecht, University of North Carolina at Greensboro, April, 2005
Abstract
Computer-based testing (CBT)
is typically implemented using one of three general test delivery
models: (1) multiple fixed testing (MFT); (2) computer-adaptive testing
(CAT); or (3) multistage testing (MST). This article reviews some of
the real cost drivers associated with CBT implementation—focusing on
item production costs, the costs associated with administering the
tests, and system development costs—and elaborates three classes of
cost-benefit-related factors useful for evaluating CBT models: (1) real
measurement efficiency; (2) testing system performance; and (3)
provision for data quality control/assurance.
|
|
Volume 7 |
April 2005 |
Issue#1 |
|
Strategies to
Assess the Core Academic Knowledge of English Language Learners.
Stanley Rabinowitz, Sri Ananda, &
Andrew Bell, WestEd
Abstract
This paper focuses on this
assessment issue: How do you increase the validity of assessments
of ELL student performance on core academic content? We
begin by exploring NCLB expectations for ELL assessments and an
increasingly popular approach to meeting these requirements proposed by
some states—translation of assessments into students’ native languages.
Then, we present key research findings on attempts to increase access to
and validity of assessment for ELLs. We conclude by proposing a
comprehensive strategy for the assessment of ELL students’ performance
in core academic content.
|
|
Volume 6 |
May 2004 |
Issue#1 |
|
Creating
Better Tests for Everyone Through Universally Designed Assessments,
Sandra Thompson and Martha Thurlow, University of Minnesota,
David B. Malouf, U.S. Department of Education, May, 2004
Abstract
Universally designed assessments are designed and developed to allow
participation of the widest possible range of students, in a way that
results in valid inferences about performance on grade-level standards for
all students who participate in the assessment. This paper explores the
development of universal design and considers its application to large-scale
assessments. Building on universal design principles presented by the Center
for Universal Design (Center for Universal Design, 1997), seven elements of
universally designed assessments are identified and described. These
elements were derived from a review of literature on universal design,
assessment and instructional design, and research on topics such as
assessment accommodations (Thompson, Johnstone, & Thurlow, 2002). The seven
elements are:
- Inclusive assessment population
- Precisely defined constructs
- Accessible, non-biased items
- Amenable to accommodations
- Simple, clear, and intuitive instructions and procedures
- Maximum readability and comprehensibility
- Maximum legibility
Each of the elements is explored in this paper. Numerous
resources relevant to each of the elements are identified, with specific
suggestions for ways in which assessments can be designed to meet the needs
of the widest range of students possible. Challenges and opportunities
arising from the application of universally designed assessments are
identified.
|
|
Volume 5 |
July 2003 |
Issue#1 |
|
The Ideal Role of Large-Scale
Testing in a Comprehensive Assessment System, Charles A. DePascale, National Center for the Improvement of Educational Assessment,
July 2003
Abstract
The role of large-scale
assessment in public education has grown tremendously since the mid-1980s
and unquestionably will continue to grow with the implementation of the
assessment and accountability requirements of the No Child Left Behind Act.
In the rush to meet the demand to measure validly and reliably the
performance of all students, however, it must not be forgotten that
large-scale assessment is only one component of a comprehensive assessment
system. The factors that led to the predominance of large-scale assessment
are reviewed and the appropriate role of large-scale assessment in a
comprehensive assessment system is discussed.
|
|
Volume 4 |
July 2002 |
Issue#1 |
|
Ensuring Fair Testing Practices:
The Responsibilities of Test Sponsors, Test Developers, Test
Administrators, and Test Takers in Ensuring Fair Testing Practices,
Barbara S. Plake, Buros Center for Testing, University of Nebraska-Lincoln, Patrick Jones, Excelsior College, July, 2002
Abstract
The focus of tests
today oftentimes centers on ways to provide good quality tests to
test takers in a cost-effective manner. Test sponsors are
concerned about the policy issues related to test use; test developers must
prepare a test that meets both the purpose and specifications articulated by
the test sponsor and the technical standards for quality tests. Test
administrators are responsible for test delivery in ways that protect the
integrity of the test scores and the security of the test product. Test
takers often have limited options in when, how, or why they are taking the
test, and may feel victimized in the process. The purpose of this paper is
to focus on the
test taker and to consider how all parties in the test process
(test sponsor, test developer, test administrator, and test taker) have a
role to play in ensuring fair testing practices and valid test results.
|
|
Volume 3 |
January 2001 |
Issue#1 |
|
Megatrends in Personnel Testing: A Practitioner’s
Perspective,
John W. Jones, Ph.D., Kelly
D. Higgins, M.A., NCS Pearson, January 2001
Abstract
This paper briefly reviewed current personnel testing trends as documented in
the literature. Yet research literature often misses fast-moving megatrends that
will ultimately change the face of personnel testing practices. Therefore, the
primary purpose of this paper was to list and describe, from a practitioner’s
perspective, 10 dominant megatrends that are impacting the personnel testing
industry. Five megatrends were classified as being "technocentric,"
and five were classified as being "content-specific." Technocentric
themes were related to virtual career centers, integrated assessment platforms,
a number of Internet-age access concerns, media-rich assessments, and data
warehousing and mining. Content-specific trends were related to certification
testing, 21st century test constructs, human resource lifecycle
assessments, technology-friendly tests, and bottom-line impact and return on
investment (ROI) studies. A review of these 10 megatrends suggests that the
personnel testing industry is keeping pace with rapid technological innovations.
|
|
Volume 2 |
September 2000 |
Issue#1 |
|
Promoting Stakeholder Acceptance of
CBT, J. Patrick Jones,
Professional Examination Service, September 2000
Abstract
This article describes the major elements
of a communication plan for the implementation of a computer-based testing
(CBT) program. The major benefits and potential drawbacks of a CBT program
are reviewed, and the information needs of various stakeholder groups
are identified. The article concludes with an overview of communication
strategies and evaluation techniques that can facilitate the transition
to CBT.
|
|
Volume 1 |
August 1999 |
Issue#1 |
|
Increasing the Validity
of Adapted Tests: Myths to be Avoided and Guidelines for Improving Test
Adaptation Practices, Ronald K. Hambleton and
Liane Patsula,
University of Massachusetts at Amherst, August 1999
Abstract
Adapting or translating achievement, aptitude, and personality tests and
questionnaires from one language and culture to others has been done for a long
time. Unfortunately, there is substantial evidence to suggest that often these
adapted tests are problematic because of a failure to do the test adaptation
work correctly. The purposes of this paper are to describe five myths about test
adaptation that need to be discarded and to offer a set of steps to follow in
test adaptation projects. The International Test Commission guidelines for
adapting tests are also presented in the paper.
|
|