Development of Listening Prochievement Tests for Third-Year Japanese Junior High School Students Studying English as a Foreign Language (Part I)



Hidetoshi SAITO*, Takashi SAITO**, and Itsumi KINO***

(Received November 30, 2010)

Abstract

   This study reports the process of developing a listening prochievement test for third-year Japanese junior high school students. The report describes the overall plan and the results of four administrations of the two forms (Forms A and B) of the listening test. The main purpose of this test was to detect the test takers’ learning gains during a year-long English course at junior high school.

Introduction

   This report describes the process of developing a “prochievement” (proficiency and achievement) test of listening for junior high school students. The original purpose of this test was to measure student achievement in listening after a year of study in the third-year English courses at a junior high school. The teachers in this school had started introducing an increasingly popular drill activity called “shadowing,” in which the student repeats sentences in a low voice while listening to them. Students practiced shadowing in class, and their shadowing skills were tested eight times throughout the year. While this assessment could capture the development of their shadowing skills, the teachers also wanted to assess the development of students’ overall listening skills because of the purported benefits of shadowing for improving listening skills.

   The term prochievement originated at a language testing conference in 1982, in a definition given by Clark (1989). He defined a prochievement test as a dual-purpose test that can provide both ‘instruction-oriented feedback’ and ‘real-life decision making.’ Such a test of ‘speaking’ should assess, for example, the test takers’ general communicative use of the target language (i.e., proficiency) in terms of the restricted range of vocabulary, structures, and content specific to the language course (i.e., achievement) (Clark, 1989). The purpose of the present test differs slightly from this original description. Our test is intended to assess students’ gains in the course, and it comprises listening tasks that are relevant to the instruction of the course both directly (achievement-oriented) and indirectly (proficiency-oriented). It is fair to say that the restricted range of forms (such as the choice of vocabulary and grammar) in the test tasks reflects an achievement aspect of the test, while the variety of topics and content reflects a proficiency aspect.

* Dept. of English, College of Education, Ibaraki University, Mito 310-8512 Japan
** Graduate School of Education, Ibaraki University, Mito 310-8512 Japan
*** Tsuchiura Daisan Junior High School, Tsuchiura 300-0843 Japan

    The present report first explains the background of the test based on Downing’s (2006) steps for effective test development.

Overall Plan

    The primary purpose of this study was to measure junior high school students’ achievement in listening after a year of instruction in the course. The test also needed to allow the teacher to discriminate among students’ abilities. Thus the direction was necessarily towards both proficiency and achievement. Since approximately 140 third-year students were taught by one instructor, practicality in scoring was another priority.

   The target construct for the present test is listening ability of third year junior high school students in English as a foreign language. Because measuring learning gains was the original focus, two equivalent forms of the test were needed.

    When interpreting the results, both norm-referenced test (NRT) interpretation and criterion-referenced test (CRT) interpretation are necessary. From the NRT perspective, test items should cover a wide range of the test takers’ abilities for maximum discrimination. From the CRT perspective, strong discrimination is not necessarily sought because student success on the test suggests that the students have all learned the target content and acquired the target skills. These two seemingly conflicting perspectives need to be reconciled in many decisions based on the analysis of the test data.

Content Definition

    The target content for achievement was two textbooks (New Horizon 3 and Total English 3) approved for the national curriculum called the Course of Study. In the course, Total English 3 was used as the main textbook and New Horizon 3 as a supplementary textbook. Topics in these two textbooks included, for example, wind power, Stevie Wonder, gestures around the world (Total English 3), learning braille, Nepal, and use of cell phones (New Horizon 3). The target content for proficiency was not clearly determined. However, the general target level was set at the Pre-2nd Grade of the STEP test, which is comparable to the A2 level of the CEFR (Common European Framework of Reference for Languages) according to the Society for Testing English Proficiency (STEP), Inc. (2009). The STEP test is an English proficiency test that is widely recognized and used for high-stakes decisions, such as employment, promotion, and college admissions in Japan.

Test Format

    An all-multiple-choice format was chosen mainly for ease of scoring and to limit the construct under investigation to listening skills, since any written response requires writing skills. Test takers listened to the audio text only once and responded to three- or four-option multiple-choice questions. The length of the audio text varied depending on the question. The reading speed of the audio text was necessarily slower than natural speed.

Item Development

    The original questions were all taken from three different sources. In practice, it was very difficult for the practitioner to prepare recordings of listening test texts, given limited time and budget. Thus, we decided to use available resources that were relevant to instruction. These included the supplemental materials for the two nationally approved textbooks, New Horizon English Course 3: Teacher’s Manual (Tokyo Shoseki, 2006) and Total English 3 (Oyo-hatten ban waakushiito syuu) (2006). The first version of the present test contained questions only from the latter, while the second version included questions from both supplemental materials. Questions from a Pre-2nd Grade STEP test preparation audio book (Eiken Test in Practical English Proficiency Pre-2nd Grade, 2008) were also included in Versions 2 and 3.

    When the items were revised, the original recordings of the texts could not be changed because of practical constraints. Only the prompts, such as illustrations, stems, and options, were revised when necessary.

Item Selection and Analysis

   Rasch analysis played a critical role in item selection and analysis. Rasch analysis has numerous advantages over classical test theory. One of them is the use of linking to equate a number of parallel test forms. With linking, one can interpret the results of different administrations or different versions on an equal footing because the overlapping items serve as an anchor point. In the present study, simultaneous linking was used for Version 3. The results of Version 3 were interpreted by linking Form A of Version 3 with Version 2. The fifth administration of the test, which has not yet taken place, is planned to have both Forms A and B linked across Versions.
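To illustrate how overlapping items can act as an anchor, here is a minimal sketch of a mean-shift linking constant. Note that this is a separate-calibration technique swapped in for illustration; it is not the simultaneous linking used in the study, in which all data are calibrated in a single run. The item names, person names, and logit values are hypothetical.

```python
def linking_constant(reference_anchor, new_anchor):
    """Mean difference in anchor-item difficulty (reference minus new), in logits.

    Both arguments map item IDs to difficulty estimates from two separate
    Rasch calibrations. Only items present in both are used.
    """
    shared = sorted(set(reference_anchor) & set(new_anchor))
    if not shared:
        raise ValueError("no overlapping items to anchor the two scales")
    return sum(reference_anchor[i] - new_anchor[i] for i in shared) / len(shared)


def to_reference_scale(measures, constant):
    """Shift measures from the new calibration onto the reference scale."""
    return {name: value + constant for name, value in measures.items()}


# Hypothetical anchor difficulties from two calibrations (made-up values).
reference = {"anchor1": 0.5, "anchor2": -0.5, "only_in_reference": 1.2}
new = {"anchor1": 0.0, "anchor2": -1.0, "only_in_new": 0.3}

shift = linking_constant(reference, new)
linked = to_reference_scale({"person_x": 1.0}, shift)
```

Once shifted by the constant, person and item measures from the new calibration can be read against the reference calibration.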

   Another advantage of Rasch measurement is the use of fit statistics. These statistics flag breaches of unidimensionality (that is, the assumption that the questions on the test measure a single construct) by generating aberrant values. Common upper cut-off values for these statistics are 1.4 for infit and outfit mean squares and 2.00 for standardized zs (see Wright & Linacre, 1994; Fisher, 2007). Infit mean squares indicate how far a person’s item response pattern deviates from the expected response pattern on items near the person’s ability. Outfit mean squares indicate how far a person’s item response pattern deviates from the expected pattern at the upper and lower ends of the test items, i.e., in the more ‘difficult’ and ‘easy’ ranges. Standardized zs are mean squares that are standardized in order to compare the mean squares of different items on the same test. These statistics, along with point-biserial correlation coefficients as a conventional index of item discrimination, were the main basis for decisions on item revision and deletion.
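The mean-square statistics described above can be sketched as follows. This is a minimal illustration based on the conventional definitions for the dichotomous Rasch model, not the output of the software used in the study. The function names and the example values are assumptions.

```python
import math


def rasch_probability(ability, difficulty):
    """Model probability of a correct answer (both arguments in logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def fit_mean_squares(responses, expected):
    """Return (infit, outfit) mean squares for a string of 0/1 responses.

    responses: observed scores (0 or 1)
    expected:  model probabilities of success for the same person-item pairs
    """
    variances = [p * (1.0 - p) for p in expected]
    squared_residuals = [(x - p) ** 2 for x, p in zip(responses, expected)]
    # Outfit: plain average of squared standardized residuals, so it is
    # dominated by surprises on items far from the person's ability.
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(responses)
    # Infit: information-weighted, so responses to well-targeted items count most.
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit


def exceeds_cutoff(mean_square, cutoff=1.4):
    """Apply the upper cut-off of 1.4 mentioned in the text."""
    return mean_square > cutoff


# Hypothetical person at 0 logits who misses the easiest item and
# answers the hardest one correctly.
difficulties = [-3.0, -1.0, 0.0, 1.0, 3.0]
answers = [0, 1, 1, 0, 1]
probabilities = [rasch_probability(0.0, d) for d in difficulties]
infit, outfit = fit_mean_squares(answers, probabilities)
```

The same function works for an item's fit if the responses and probabilities are collected across persons rather than across items.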

   Experts’ judgments on surface level difficulty of items, item content, question types, and length were also used in the item selection and revision procedure. In addition, as stated above, both CRT and NRT perspectives played a key role in item selection.

Participants

   All test takers were third-year junior high school students studying English as a foreign language in Japan. None of the students had extensive experience of staying overseas or being in contact with native speakers of English. Versions 1 and 2 were administered to two cohort groups in a junior high school annexed to a national university. The school has a reputation for strong academic rigor. Version 3 was administered to students in a municipal junior high school. Based on the results of prefectural proficiency tests and anecdotal evidence, it was expected that the tests would be much easier for the first group than for the second group.

Table 1. Summary Information about Versions of the Listening Test

Notes. Each test contains Forms A and B. A test taker took only one of the two forms on each administration. The fifth administration is currently scheduled.

Results of the First and Second Administrations

   Forms A and B used for the two administrations contained 41 items (see Table 1 for a summary of all Versions). The Forms were counterbalanced across classes so that each student saw each Form only once. That is, two classes (Group 1) took Form A in May and Form B in January, and the other two classes (Group 2) took Form B in May and Form A in January.

   Data from the two administrations were pooled and analyzed separately for each Form. That is, the data for Form A contained the May data of Group 1 and the January data of Group 2, while the data for Form B contained the May data of Group 2 and the January data of Group 1. This makes the results of the two separate administrations of the test comparable.

   Figures 1 and 2 show the locations of test takers’ ability measures on the left of the graph and those of item difficulty measures on the right, along the same logit (or log odds) scale. As seen in both Figures, the peaks of the items are apart from the peaks of the test takers. In Figure 2, for example, the peak of test taker ability is around 3 logits, whereas the peak of the items is around –1 logit. This asymmetric picture indicates that the test takers’ ability far exceeds the difficulty of the test items. The same is true for Figure 1. From the CRT perspective, this is fine since it means that most students succeed on the majority of items, indicating the acquisition of the skills. From the NRT perspective, on the other hand, the tests can be seen as non-discriminating.
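To make the size of this gap concrete, the following minimal sketch converts a logit difference into a success probability under the dichotomous Rasch model. The function names are assumptions; the two input values are the approximate peaks read from the description of Figure 2.

```python
import math


def success_probability(ability, difficulty):
    """Rasch probability of a correct answer; inputs are logit measures."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def expected_raw_score(ability, item_difficulties):
    """Expected number of correct answers for one person on a set of items."""
    return sum(success_probability(ability, d) for d in item_difficulties)


# Ability peak near 3 logits, item peak near -1 logit (Form B description).
p = success_probability(3.0, -1.0)
print(round(p, 3))  # about 0.982
```

A test taker at the ability peak is thus expected to answer a typical item correctly about 98 times out of 100, which leaves little room for the items to separate test takers.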

    Actual statistics confirmed this quick observation. Tables 2 and 3 show that reliabilities of test takers are moderately low (.50 for Form A and .69 for Form B) and separation is close to 1.00, meaning that the test takers cannot be divided into more than two strata by this test.
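The link between reliability, separation, and strata can be sketched with the conventional Rasch formulas. It is an assumption that the software used in the study computes its indices this way, and the function names are illustrative.

```python
import math


def separation_index(reliability):
    """Separation G implied by a reliability R: G = sqrt(R / (1 - R))."""
    return math.sqrt(reliability / (1.0 - reliability))


def strata(separation):
    """Number of statistically distinct levels: H = (4G + 1) / 3."""
    return (4.0 * separation + 1.0) / 3.0


# Test-taker reliabilities reported for Form A (.50) and Form B (.69).
for form, reliability in (("A", 0.50), ("B", 0.69)):
    g = separation_index(reliability)
    print(form, round(g, 2), round(strata(g), 2))
```

For a reliability of .50 the separation is exactly 1.00, which corresponds to fewer than two distinguishable strata.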

Figures 1 (Left) and 2 (Right). Figure 1 shows the results of the first and second administrations of Form A, and Figure 2 shows those of Form B. In both Figures, a # indicates two students. Numbers are question IDs: the first number is the subsection ID, and the second is the serial number within the subsection.

   Concerning item fit statistics, none of the items was clearly misfitting. The analysis identified only two misfitting students in each of Forms A and B, one of whom misfitted across both Forms. Point-biserial correlations indicate that a number of items do not contribute to the overall discrimination of the test. Careful scrutiny of one negatively correlated item (r = -.36, item 14-2 of Form A) suggested that the illustration used in this item was confusing. In fact, ten items did not discriminate well between test takers; their point-biserial correlations were below .20 (Carey, 2001, p. 318). Some of these items were revised or simply discarded. However, from the CRT perspective, easy non-discriminating items indicate student success, suggesting that students have learned the target skill. Some easy non-discriminating items were therefore retained in the test.
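The .20 screening rule can be sketched as below. The paper does not say whether the item's own score was excluded from the total, so this minimal version correlates each item with the uncorrected total score. The item IDs and scores are made up for illustration.

```python
import math


def point_biserial(item_scores, total_scores):
    """Pearson correlation between 0/1 item scores and total test scores."""
    n = len(item_scores)
    mean_x = sum(item_scores) / n
    mean_y = sum(total_scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(item_scores, total_scores))
    var_x = sum((x - mean_x) ** 2 for x in item_scores)
    var_y = sum((y - mean_y) ** 2 for y in total_scores)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when everyone has the same score
    return cov / math.sqrt(var_x * var_y)


def flag_low_discrimination(items, total_scores, cutoff=0.20):
    """Return IDs of items whose correlation is below the cutoff or undefined."""
    flagged = []
    for item_id, scores in items.items():
        r = point_biserial(scores, total_scores)
        if not (r >= cutoff):  # also catches NaN
            flagged.append(item_id)
    return flagged


# Hypothetical data: four test takers, three items, plus their total scores.
items = {
    "1-1": [0, 0, 1, 1],  # stronger students succeed
    "1-2": [1, 1, 0, 0],  # weaker students succeed (negative correlation)
    "1-3": [1, 1, 1, 1],  # everyone succeeds (no variance)
}
totals = [1, 2, 3, 4]
flagged = flag_low_discrimination(items, totals)
```

Flagged items are only candidates for review. As the text notes, an easy item that everyone answers correctly may still be worth keeping for achievement purposes.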

    Table 4 shows raw gain scores from the two test administrations for each Form. Unfortunately, there were no overlapping items between the two Forms, so it was not possible to link the Forms directly. Therefore, we followed Brown’s (2001) procedure for evaluating the sensitivity of the test in capturing actual learning gains. First, we made sure that students did not see the same items twice by counterbalancing the pre- and post-test administrations. Second, we calculated the gain using Hedges’ g as an estimator of effect size. Calculating changes in raw scores is somewhat spurious because each Form was administered to different test takers. The Rasch analysis linked the ability measures of all test takers, however. Thus, one could say that the Rasch measures provide a more valid estimate of effect size than raw scores do. As seen in Table 4, both Forms are able to capture learning gains, with Form A generating slightly larger effect sizes.
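For reference, Hedges’ g for two independent groups can be sketched as follows. The function name and scores are hypothetical, and the small-sample correction shown is the common approximation; the paper does not state which variant was used.

```python
import math
import statistics


def hedges_g(pre_scores, post_scores):
    """Standardized mean gain between two independent groups, bias-corrected."""
    n1, n2 = len(pre_scores), len(post_scores)
    mean_gain = statistics.mean(post_scores) - statistics.mean(pre_scores)
    # Pooled standard deviation from the two sample variances.
    pooled_variance = (
        (n1 - 1) * statistics.variance(pre_scores)
        + (n2 - 1) * statistics.variance(post_scores)
    ) / (n1 + n2 - 2)
    cohens_d = mean_gain / math.sqrt(pooled_variance)
    # Approximate correction for small-sample bias.
    correction = 1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0)
    return cohens_d * correction


# Made-up measures for a pre-test group and a post-test group on one Form.
pre = [1.0, 2.0, 3.0]
post = [3.0, 4.0, 5.0]
g = hedges_g(pre, post)
```

The same function applies whether the inputs are raw scores or Rasch ability measures in logits; only the interpretation of the units changes.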

  Table 2. Summary of Test Takers and Item Statistics of the First and Second Administrations of Form A

 Table 3. Summary of Test Takers and Item Statistics of the First and Second Administrations of Form B

 Table 4. Listening Test Gains of Version 1

Notes. The maximum raw score is 44. Effect size is Hedges’ g. n = 61 (Form A), 71 (Form B).


Results of the Third Administration

    Seventeen items remained in Version 2 of each Form. In addition, 8 questions from Total English 3 (Oyo-hatten ban waakushiito syuu) (2006) and 30 questions from STEP test papers (Eiken Test in Practical English Proficiency Pre-2nd Grade, 2008) were added, for a total of 55 questions. The remaining 17 questions were all revised to increase their difficulty by omitting illustrations, changing pictorial options into verbal ones, translating options written in Japanese into English, and revising the wording of options and stems.

    Although the original purpose of the test was to assess course gains of junior high school students, the third administration was done without a posttest due to various practical constraints. This administration of the test thus served merely as a temporary prochievement indicator in the course. However, the data itself could provide invaluable input for refining the test.

Figures 3 (Left) and 4 (Right). Figure 3 shows the results of the third administration of Form A, and Figure 4 shows the results of the third administration of Form B. In both Figures, numbers indicate test taker IDs. Ach are achievement questions. LL are proficiency questions with multiple-choice options heard in the audio text. LR are proficiency questions with multiple-choice options appearing on the test paper.


    Figures 3 and 4 clearly show an improvement in the targeting of the items. The distributions of the persons and items look more symmetric, suggesting that the difficulty levels of this group of items cover the range of ability in this group of test takers. Compared to Figures 1 and 2, both the items and the persons now approximate a bell-shaped normal distribution. From the NRT perspective, this is good news. There are still a number of non-contributing (i.e., too easy) items that fall below the test takers’ ability range. However, because this is a prochievement test, the decision was made to retain several easy items in the next version.

    For both Forms A and B, no clearly misfitting persons or items were identified. From the NRT perspective, when the cut-off value of .20 is applied to the point-biserial correlations, three items of Form A and two items of Form B can be regarded as non-discriminating. These items were deleted or revised. Tables 5 and 6 show the test statistics from the Rasch analysis. As can be seen, test-taker reliabilities on the third administration improved remarkably compared to the first and second administrations.

 Table 5. Summary of Test Takers and Item Statistics of the Third Administration of Form A

  Table 6. Summary of Test Takers and Item Statistics of the Third Administration of Form B


Results of the Fourth Administration

    The purpose of the fourth and fifth administrations was to detect learning gains between pre- and post-tests. Version 3 of the test was administered to students in a municipal junior high school. At the time of this writing, they are scheduled to complete Form B of Version 3 as a posttest in February 2011.

Figure 5. The person-and-item map of the third and fourth administrations of the test. The numbers on the left reflect test takers’ ability measures on the third (3) and fourth (4) administrations. The item names on the right reflect the questions from the achievement (ac) and proficiency portions (LR for questions with multiple-choice options appearing on the test paper and LL for questions with options in the audio text).


    Version 3 was constructed based on the results of the third administration. While items that discriminated poorly and tended to misfit were either revised or deleted, several non-discriminating yet non-misfitting items remained in the test for achievement purposes. The final version contained 13 achievement items and 30 proficiency items. Each Form contained four overlapping items in order to link the two Forms. Here, we report the results of Form A, since Form B has yet to be administered. Note also that the data from the third and fourth administrations were combined for analysis, so that the results can be interpreted in reference to the previous administration. For fair interpretation, the data for item ach12, which was revised for the fourth administration, were eliminated from the data of the third administration.

   Figure 5 shows the locations of item difficulty measures on the right of the map and test takers on the left, along the same logit scale. The number 3 represents students from the third administration and 4 represents students from the fourth administration. Concerning the targeting of the test, the items cover a wide range of test taker abilities, although some individuals’ abilities still exceed the most difficult items. The approximately symmetric picture indicates a good match between item difficulty and test taker ability. It is also notable that the achievement questions are clustered at the lower end of the scale. As for differences among test takers, it is clear that students on the third administration had higher ability measures than students on the fourth administration.

   As seen in Table 7, the test statistics suggest a relative improvement in item reliability but not in test taker reliability. There could be two reasons for this. First, the total number of items was reduced, which might have lowered test taker reliability. Second, the number of test takers increased, which might have raised item reliability. Concerning fit statistics, no items were clearly misfitting, although three items had low point-biserial correlations. No clearly misfitting persons were identified either.

Table 7. Summary of Test Takers and Item Statistics of Form A on the Fourth Administration

Notes. The results here came from the combined data of both the third and fourth administrations.

Conclusions

This paper described the process of developing a junior high school listening test by examining the results of the four administrations of the test. Although Version 3 has not yet been evaluated for its capability of detecting learning gains, the results of the fourth administration suggest that the current version of Form A is reliable and appropriate for measuring prochievement. This version has fewer items, yet it maintains reliabilities comparable to those of Version 2. The results also suggest that the test can be used in two schools with different levels of learner proficiency.

Authors’ note: The current version of the listening test is available from the authors free of charge (but a small shipping and handling cost is requested).

Acknowledgement

This report is supported by the Grant-in-aid for Scientific Research awarded to the first author (task no. 22520550).

References

Brown, J. D. 2001. "Developing and revising criterion-referenced achievement tests for a text book series." A focus on language test development, T. Hudson and J. D. Brown, eds., Second Language Teaching & Curriculum Center University of Hawai'i at Manoa, Honolulu, HI, 205-228.

Clark, J. L. D. 1989. "Multiple language tests: Is a conceptual and operational synthesis possible?" Georgetown University Round Table on Languages and Linguistics 1989, J. E. Alatis, ed., Georgetown University, Washington, D.C., 206-215.

Downing, S. M. 2006. "Twelve steps for effective test development." Handbook of Test Development, S. M. Downing and T. M. Haladyna, eds., Lawrence Erlbaum, Mahwah, NJ, 3-25.

Eiken Test in Practical English Proficiency Pre-2nd Grade (Eiken jyunnikyuu zenmondaisyuu CD) 2008. [CD]. Obunsha, Tokyo.

Society for Testing English Proficiency (STEP). 2009. "Eiken Grades." Retrieved November 4, 2010, from http://www.stepeiken.org/grades

Total English 3 (Oyo-hatten ban waakushiito syuu). 2006. Gakkotosho, Tokyo.

Tokyo Shoseki. 2006. New Horizon English Course 3: Teacher’s Manual (waakushiito hen 1), Tokyo Shoseki, Tokyo.
