Kindai University Academic Information Repository (近畿大学学術情報リポジトリ)


Some Initial Considerations of Unified Oral Testing In a Japanese University Context

Carlos Ramirez

Abstract: In many oral English programs at Japanese universities, there is still a tendency to test oral proficiency through methods other than an oral speaking test. This paper argues that there are insurmountable validity issues to testing spoken language through other means than a speaking test and offers a sound alternative, the unified oral test. A unified oral test is a program-wide test conducted with teacher exchange in order to satisfy concerns related to both objectivity and validity. Teacher exchange ensures objectivity since the evaluator of the test is not the regular classroom teacher. This author also acknowledges and explores the issues and challenges related to oral testing in general. The issues and challenges raised in this paper are those related to validity, testing formats, evaluation rubrics and rater reliability.

Introduction

The goal of any oral English program must be the improvement of the oral skills of its students. This means all elements of the program, including the testing format, should work towards this goal. Oral language testing, when designed competently, plays an important role in assessing not only student progress but also the overall success or failure of the program. This may appear to be an enormous responsibility, but the rationale for language testing cannot only be the measurement and evaluation of student progress. Testing should also provide information and feedback on the quality of teaching methods and the program content as a whole (Bachman, 1990; Bachman and Palmer, 1996; Gong, 2008). In addition, especially in the Japanese university context, if oral tests are properly integrated into the overarching goal of general language improvement, they should also be, in some way or another, a catalyst towards student language learning motivation.
In other words, the main objective in designing a test must be to support the needs of the English language program, and not the other way around. The testing format is therefore a crucial tool to promote the qualities it is intended to measure (Bachman and Palmer, 1996).

In many large oral English programs at Japanese universities, testing and evaluation are done either through written assignments and written tests, or through an oral examination conducted independently by individual instructors. The problems associated with using a written test to assess students' oral progress are obvious and do not warrant further comment here. Alternatively, oral testing completed in individual classes without program-wide co-ordination presents its own concerns. They relate to the teaching of a standardized curriculum, objective testing content and the usage of standardized assessment criteria across the entire program. As a result of these inherent contradictions, written tests or individual class oral evaluations cannot fulfill the responsibility of the language test as promoter of program excellence.

Therefore, the main purpose of this paper is twofold. First, it makes the case that a unified oral testing system should indeed be an integral part of any oral English program. The second objective is to outline some of the salient issues and challenges faced by teachers and administrators who choose to introduce a speaking test to their program. The paper is divided into five sections. The first section defines the term unified oral testing. In the second part of the paper, the author argues the case for implementing oral unified testing in a large university English program. The third section explains the necessity of curriculum and syllabus standardization as a prerequisite to implementing unified oral testing. The fourth part of the paper discusses some of the main issues of an oral testing format, specifically the construct and content validity of a test, testing tasks and testing format, and assessment criteria and evaluation rubrics. The final section summarizes the main challenges of a speaking test.

What is Unified Oral Testing?

A unified oral test in this paper is defined as a common test that all students in the same year take either at the end of the program or at intervals during the program semester. All students are subject to the same test content and evaluation criteria. To enhance objectivity, teachers other than their regular classroom instructor should evaluate students. In practice, in a large Japanese university English program, given the limited financial resources, this would essentially imply the use of class exchange, i.e., the exchange of students and classes among instructors as opposed to the hiring of independent testers.

In order for unified oral testing to be successful, one other important element is necessary and requires some discussion: classroom standardization of the curriculum and syllabus. In the next section, this paper discusses in more detail the relationship between curriculum, standardization and testing. For now, it suffices to say that classroom standardization should mean that all instructors in the program are teaching the same fundamental language skills, functions and concepts. In many Japanese universities, standardization only exists on paper. It is often the case that individual teachers implement their own visions of an "ideal" program. The main reason for classroom standardization in a program that includes unified testing is clear: if teachers diverge from a standardized curriculum and syllabus, this would put their students at a disadvantage at testing time. Students would be unable to answer questions related to content not covered by the teacher. Furthermore, curriculum and testing should have a symbiotic relationship, as succinctly summarized by Brindley (1998, p. 52) when discussing the advantages to standardized language programs and testing:

    Since there is a direct link between attainment targets, course objectives and learning activities, assessment is closely integrated with instruction: what is taught is directly related to what is assessed and (in theory at least) what is assessed is, in turn, linked to the outcomes that are reported.

The Case For Oral Unified Testing

There are three main arguments to be made in favour of unified oral testing. They are: 1) problems related to test validity of other testing formats; 2) issues concerning testing and assessment objectivity, and; 3) matters pertaining to student motivation. I will examine each argument in turn.

One of the main deficiencies in oral English programs at Japanese universities is the lack of a valid testing schema. For the most part, teachers do not have a general framework, the appropriate criteria and/or valid scales to measure language ability against the established criteria. At best, individual teachers must develop their own untested oral examination format which may be compromised by validity issues. This situation is not limited to Japan, as noted in English classrooms in Taiwan by Gong (2010, p. 386):

    For most of us college ELT instructors in Taiwan, it is incredibly challenging to conduct a face-to-face oral test one by one at each mid-term and final exam for both English majored and those non English majored students. In most cases, an oral English test is only carried out for English majored students by the instructor who teaches the same class. The instructor is usually so busy to cope with the "interview", or oral test scores of students at the mid-term or final that he/she can hardly co-operate with other oral instructors in working out the test content and a reliable rating scale in advance, not to mention other considerations such as rater training, inter-rater reliability, interlocutor effect, etc.

It is often the case in Japan that oral English programs may not require an oral test. In these programs, the standardized test is a written, listening, multiple-choice or a combination of all three formats. While these formats can accurately measure proficiency in some language skills and can also satisfy the criterion of reliability, these tests are completely invalid markers for spoken production of the language. These testing formats lack what is generally referred to as construct validity. That is, they do not test what they are supposed to test, namely the student's English speaking ability. However, the most common case at Japanese universities is that each instructor in the oral English program conducts their own oral test using a format developed independently of other teachers. This is, of course, "better than nothing" but would hardly inspire confidence in a program and in the general development of spoken English language proficiency in Japan.

In devising a valid testing format for an oral program, the assessment must be deemed objective and be able to control for teacher bias. A unified oral test with common content and assessment criteria while conducting class exchange satisfies these two issues. Class exchange during final tests partially eliminates the teacher-student emotional attachment and any existing teacher preconception, as the students are unknown to the teacher. Instructors are thus forced to rely on objective assessment scales for scoring rather than any personal bias in favour or against the students under examination. By also providing program-wide established assessment criteria and content, a unified oral testing program mitigates concerns related to objectivity vis-à-vis peers in same or other classes, making test results easily comparable. This is valuable information as it can inform students of their progress and educate administrators on the merits and deficiencies of their program. However, for testing to be standardized, the syllabus and curriculum must also undergo some classroom standardization as described above.

Finally, student motivation is also a factor and works in three ways. First, an oral test, as opposed to a written test, will encourage students to master the required verbal skills in order to pass the test and the course. Second, with prior knowledge that teacher exchange will take place during testing, students will realize that when facing an unfamiliar teacher, they will need to rely solely on their language competency, thereby giving them impetus to improve it. Third, perceived objectivity of the testing format (more on which will be said later) by the students will enhance the sense of accomplishment. These motivational factors should not be underestimated and can engender a positive learning environment in the classroom and across the entire program. In sum, as Saylor-Loof and Cayman (2004) explain in regards to oral testing: "It encourages them to pay better attention in class, study harder, and in general, take their learning endeavor more seriously" (p. 1181).

Curriculum, Syllabus and Testing

Syllabus and curriculum standardization may appear to be common practice at most Japanese universities. However, in reality, especially within large English oral programs, teachers are often given the flexibility to interpret the goals and purposes of the program. The end result is students from the same program completing it with competency in different skills and at varied skill levels. This should not be interpreted as a condemnation of teacher autonomy in the classroom. Indeed, teacher autonomy should be encouraged especially in those programs where teaching objectives and context lend themselves to it. In fact, the conclusions of one study argue emphatically in favour of such an approach. Fleming (1998, p. 30) notes that "instructors must be able to make curriculum implementation decisions with a fair degree of latitude, especially when the programs in which they work emphasize needs assessment and the multitude of options inherent in the communicative approach." The preferred methodology in oral English programs at an overwhelming number of Japanese universities is the communicative approach and therefore, teacher control over sequencing, timing and day-to-day in-class activities is essential to achieve student progress.

Notwithstanding the importance of teacher autonomy, the main goals and objectives, including the basic language skills and functions to be taught, as well as the assessment criteria, should be clearly outlined in the curriculum and syllabus. Furthermore, the responsibility of teachers to adhere to the curriculum is imperative for the overall success of the program. The advantages of standardized curriculum with specific performance targets and standards are many as outlined by Brindley (1998). First, because student results are evaluated in performance terms, language learning is transformed into a tool for education and personal improvement, and is not viewed as an end in itself. Thus, both students and instructors can obtain clear feedback on their progress, as measured against the assessment criteria, and information on the students' future learning needs; second, clear objectives and standards facilitate communication among all stakeholders of the program and; finally, a standardized curriculum provides an objective basis for determining the needs and allocation of resources within the program (1998, p. 52).

Furthermore, as argued by Brindley (2001), governments, corporations and society at large are now demanding accountability from English language programs in the form of standardized curriculums and testing in order to demonstrate measurable progress. In the political and public debate surrounding general educational standards, especially in a "testing-intensive" society such as Japan, there is a pervasive bias towards benchmarks and outcomes-based assessment. In sum, standardization is driven by societal demands that cannot be ignored by educators. As Lynn (quoted in Brindley, 2001) explains,

    In a political climate where educational standards are high on the public agenda, it can be expected that the introduction of outcomes-based assessment and reporting, with its focus on 'measurable improvement', will remain an attractive strategy for policy makers who wish to demonstrate that they are making a difference. (2001, p. 398)

Brindley (1998; 2001) also notes, however, that there are many pitfalls in the setting of a standardized curriculum including construct and content validity. Validity issues are indeed difficult to address and will be discussed in further detail below in the context of testing. In terms of curriculum and syllabus, let it suffice to say that on the one hand, there must be respect for teacher autonomy in the classroom. On the other hand, there should be recognition of the advantages of a standardized curriculum, syllabus and testing. The principal advantage is the ability to measure progress through testing of skills and proficiency levels that are clearly outlined in a common framework of goals and objectives. Student attainment of these goals, in turn, can then be easily assessed against a commonly agreed upon set of criteria. In addition, educational institutions need to address calls for accountability from the public. Program standardization is the most obvious and plausible response to these calls.

Another important issue that warrants some discussion regarding the unified testing format as it relates to curriculum is the "teaching-to-test" conundrum. In an already test obsessed society such as Japan, there is an immense risk for instructors to teach to the test. Thus the content of both curriculum and test is of utmost importance to countering any trend towards a rigid form of teaching. The dangers of the test affecting the curriculum and syllabus both formally and informally, as well as affecting classroom teaching and classroom content, are real and have been documented extensively elsewhere (see Bailey, 1999 and Evans, 2003). It should suffice to say that class content, therefore, should affect test content and not the other way around. If the content of the curriculum, syllabus and test as well as the assessment criteria are well designed so as to measure general capability and proficiency, then instructors will have an incentive to teach in a communicative mold. Furthermore, as Andrews and Fullilove point out when discussing examination content in their study of Hong Kong testing formats, this "washback" effect does not necessarily need to be negative. Indeed, "washback can act as an agent of positive change and innovation if the examinations themselves are designed to encourage that kind of change" (Andrews and Fullilove, 1994, p. 60).

Saylor-Loof and Cayman (2004) have described the positive washback their oral testing has had on their program as follows: "The main reasons for having a final oral test are that it motivates and focuses the students throughout the course, provides a framework for both teaching and learning, and gives students a clear direction and goal to work towards" (p. 1181). In justifying testing in general, Gewirtz (1977) also notes there are few methods better than testing to achieve student competency of in-class material. Therefore, to ignore the impact of testing on student motivation is to neglect a valuable learning tool. In sum, oral testing and course curriculum must reinforce each other with the end goal being motivated students and, ultimately, general language proficiency.

Issues in the Implementation of Unified Oral Testing

This paper has argued that a unified oral test is the most appropriate testing format for a large oral English program. Let there be no mistake. There are formidable issues of a theoretical and practical nature facing those administrators who opt for a standardized oral test. However, the following caveat should be kept in mind when pondering these issues. A test should be a means to an end and not an end in itself. The validity or fairness of the test is less crucial than the overall objective of motivating students and improving language proficiency. Absolute objectivity and fairness on a test should not really matter as long as the test is viewed as just one of many tools to achieve the general goals of the programs. As explained by Gewirtz (1977, p. 240), "what should matter is that a test should seem to the student to be fair. Provided that a test is administered with sufficient ritualistic paraphernalia the ordinary student will accept his score as an objective assessment."
This is not to argue that imperfection of a testing system is acceptable but to acknowledge that some issues and challenges may well not be resolved to the satisfaction of all stakeholders. An imperfect test that approximates to an ideal yet plays an important role in accomplishing program objectives is better than no test at all or a "fair" test that achieves little in terms of program priorities.

From a theoretical and practical perspective, there are three central issues related to the implementation of a standardized oral test. They are construct and content validity; the testing of content as it relates to the tasks; and, testing criteria and assessment scales. We shall look at each issue in turn but the larger picture of the test as a tool to accomplish program objectives should not be lost.

i. Construct and Content Validity

As mentioned earlier, writing, listening and multiple choice tests lack construct validity as a testing format for oral skills. However, depending on the construct or concept of the skills to be tested, an oral test that does not include appropriate content, assessment scales and criteria may also be considered lacking construct validity. Thus, in the most basic sense, the construct definition of a test should encompass what constitutes the ability to speak (Fulcher, 2003: 18). This general debate of "what is speaking" has been much discussed and reviewed elsewhere (see, for example, Hymes, 2001 and Llurda, 2000). For the purposes of this paper we need only to point out that administrators of the test need to engage in their own debate regarding a definition of 'speaking' and this discussion should consider at a minimum these two questions: 1) Is speaking mainly dependent on the context and interaction of discourse between two or more people as espoused by Interactional Competence Theory (see Young, 2000)?; or 2) Does speech exist independently of interaction, fixed before a conversation even begins (see Lado, 1961)?

These two questions relate to the perennial debate within applied linguistics regarding competence vs. performance. This complex debate is rather esoteric and outside the scope of this paper. In reality, however, when speaking is put into practice there is much overlap and interaction between the two. Bachman, Fulcher and others have bridged the differing views of this debate. As Fulcher (2003: 20) writes: "The strategies that I use to interact are simultaneously internal and external to myself.
I am recognizable as myself when I speak in a variety of contexts, and yet in context my speech is always contextually bound." When Fulcher compares internal to external interface, he is referring to the performance vs. competence debate.

Bachman (1990) also sees a clear overlap between the two concepts and connects them through a theory called Communicative Language Ability which is further developed under the rubric of Interactional Competence Theory (see Young 2000; Fulcher 2003). Bachman best describes this synthetic approach as having knowledge of, or competence in, the language and the ability to execute, or produce, that knowledge within a variety of contexts (1990, 84). There are a number of variations on the latter theory with those emphasizing the capacity to communicate as mainly dependent on the individual to adapt to a context and others who espouse a more "dynamic" understanding of ability where a whole range of factors can affect the quality of performance (McNamara cited in Fulcher, 2003: 45).

Whether test creators opt for a definition of speaking as competence, performance or a Bachman hybrid version, the final definition of the "construct" has tremendous implications for the content of the test and the tasks to be performed. In other words, the construct of the test weighs heavily on the validity of the content on the test. Put more simply, the content of the test must match the construct. The latter is what is referred to as content validity.

When discussing content validity from the perspective of a speaking test at a university program, there are two main interrelated issues: 1) the test approach (or methodology) and; 2) the choice between standardized achievement testing vs. classroom achievement testing. First, in reference to methodologies, the test designers must decide on whether they should take a "real world" approach or an interactional/ability approach when developing the content of the test. A real world approach consists of testing materials that mirror the reality of non-test language use. Test performance should be indicative of the ability of the test taker to use the language in real life circumstances (Bachman, 1990: 301). An example of real world testing materials would be content related to very specific functions or contexts such as the language used by a flight attendant or a hotel clerk.

An interactional/ability method is a much broader approach integrating materials that measure competence in the language (such as grammatical structures, vocabulary and syntax) as well as the interaction between the language user, the context and the discourse (Bachman, 1990).
In many ways, the contrast between the "real world" approach and its other extreme, the "ability" aspect of the interactional/ability approach, resembles the competency versus performance dichotomy described earlier. The interactional/ability approach, like Interactional Competence Theory when defining the term "speaking", tries to bridge the gap between the two opposite approaches. However, there is one important point that makes the interactional/ability approach more difficult to use for test developers. Test developers who adhere to this approach readily agree that either the testing method must somehow conform to a real life situation (such as following the test taker around for 24 hours with a voice recorder, which is impractical if not impossible) or recognize that testing language is different from real life language and therefore testers must adjust accordingly when scoring. In Bachman's words, "This approach is more demanding in that it requires both a theoretical framework with which to begin, and a program of construct validation research" (1990, p. 357).

For test creators in general and for university test developers specifically, the process for creating test content would probably have to lie somewhere in between the spectrum of the interactional/ability and the real life approach since on the one hand, the real life approach offers a more practical methodology for testing and, on the other hand, the interactional/ability approach is conscious of the importance of interaction, context and syntax.

The second interrelated issue of content validity, the choice between standardized achievement testing and classroom achievement testing, also requires substantial discussion among test administrators. In theory, classroom achievement tests measure the extent to which students have mastered in-class material but not necessarily comprehensive progress in language acquisition. In contrast, standardized achievement tests measure general language proficiency. Ideally, the content of a university speaking test should reflect both approaches and therefore the content of the test and what it is measuring should be one and the same. Since most university English curricula purport to ensure graduating students have the necessary skills and abilities to perform in society, the primary goal of any language course must be the overall improvement and acquisition of the target language. That being the case, the content of the test as a reflection of the syllabus should be measuring general proficiency.
However, reality is not as simple as this. In fact, as Kitao and Kitao (1996) argue, one of the most important language. testing.. achievement. and appropriate. Communicative. test. It measures. candidates. Communicative. approaches. language context. testing. to testing. is communicative. is essentially. rich discourse. using. tasks. a classroom to score test. language testing lends itself well to classroom achievement. tests because tests in an oral English course usually expect students. to perform tasks. within a specific context that is based upon the material taught during the course. An example of such a task would be the ability to communicate immigration. and passport. effectively to pass through. control because this task was included in the syllabus. In. — 317 —.

(12) ft* • sum, if the communicative appropriate. approach. approach,. as espoused. to language teaching. by Kitao and Kitao, is the most. and testing, then classroom achievement. tests should also be the choice of testing schemes at the university level. Similarly, Gong (2010) also proposes a "task" approach whereby students. apply. their knowledge of the language to a particular task. For example, a student adopts a specific role in a role play and must defend his or her position against other opinions or views. Gong, however, achievement. does raise an important. tests at universities.. tests only ask students. caveat with the usage of classroom. In many cases, university. to recite carefully prepared. classroom achievement. passages or conversations. about a. certain topic, such as the weather or a first time meeting, even though the syllabus may be designed to encourage spontaneous productive language. Notwithstanding it should be recognized. that task based classroom achievement. the latter,. tests as suggested. by. both Kitao and Kitao as well as Gong do offer a practical. approach. students'. could entail the use of. progress.. One possible avenue of future. research. to evaluating. classroom achievement tests for students at lower levels and standardized tests at more advanced. levels. At lower levels, students. functions and grammatical. need to master. the basic. skills of the language and thus a classroom achievement test. may be more appropriate. As students more open ended standardized. ii. Testing. achievement. acquired better competency in the language, a. testing format may be suitable.. Tasks and Testing Format. The selection of tasks and formats for a test is one of the most important stages of the entire testing process. These tasks and formats must be the outgrowth of the construct. upon which the curriculum. which students. will be evaluated.. is based and, as such, are the medium within. Test administrators. have available a multitude. of. 
speaking-test formats, and possible content to match those formats, to choose from (Pan and Pan, 2011; Evans, 2003). This section will be limited to discussing methods in which the assessor and test taker are present. direct testing. and interacting. with each. other. Indirect methods, such as oral testing through a recorder, will not be considered as they offer a completely validity arguments. different. set of problems. not the least of which include. (Bachman, 1990; Evans, 2003).. In general, there are two basic formats for testing: the teacher-student and pair conversations.. There are also hybrid formats. — 318 —. that include aspects. interview of both.

formats. For example, a teacher may interview two students at a time while asking each student to remark on the other's replies, thus creating a conversation between the students. Alternatively, in a pair conversation, the teacher may interject as an "interlocutor" to prompt the students to stay on task and to elicit language appropriate to the task. The oral portion of the Cambridge ESOL exams incorporates this kind of assessor intervention in its testing formats.

There are both theoretical and practical advantages and disadvantages in the usage of either format. In the teacher-student interview, the teacher can get an overall sense of the communicative ability of the student. More concretely, it allows for assessment of construct features ranging from responding to questions and accuracy to appropriacy of vocabulary and expressions. However, the unequal relationship between the teacher and student presents a problem by adding stress to an already stressful testing situation for the student. The dominant role of the assessor makes it unlikely for test takers to take the initiative and creates language patterns specific to "interview-talk" that are uncharacteristic of normal conversational speech (Fulcher, 2003). In addition to these theoretical caveats, there are important practical problems with the interview format in the university context, as explained by Moodie (2008, para. 4):

Assuming a two hour period for exams, a class of 20 students would mean each student only has six minutes of time for testing. This includes the time needed to enter the room/office and adjust to the setting. With such a time constraint it becomes doubtful that the student and instructor can have any kind of normal real-world conversation. Also, considering the weight of the exam (assuming that it is between 20-40% of the final score), it is not a lot of time to elicit and test for speaking ability or listening comprehension. Six minutes for 30 or 40 percent of the student's grade puts a lot of pressure on the students to perform in a very limited amount of time.

Given criticism of teacher-student interviews, many test administrators have turned to the pair conversation formats for their testing. In pairs, students form a more equal relationship, the conversation resembles a more natural and authentic situation, and, finally, this format allows students to participate more actively without inhibitions (Saylor-Loof and Calman, 2004). It also gives the assessors more quality time to
evaluate the test takers as they can focus their attention on the performance of the task without having to participate in the task itself. Pair conversations are also more relevant and arguably more valid, especially for oral conversation courses, since these types of activities are an integral part of the syllabus and consume a major portion of in-class time (Moodie, 2008).

However, pair conversations are not without their own problems. One partner can have a considerable influence on the individual assessment and evaluation of another. Issues such as personality difference (introvert vs. extrovert) and ability levels (high language proficiency vs. low) can also impact on the performance and final scoring of each individual.

Hybrid formats have their own set of issues, including the role of the interlocutor, the personality of the interlocutor (i.e. perceived friendliness or not), the amount of time the interlocutor is involved and the amount and quality of time spent with each partner. Any sign of unfairness or bias in favour of one partner over the other, whether conscious or not, would taint the final score. In practice, hybrid formats that require an assessor and an interlocutor (e.g. Cambridge tests) would be close to an impossibility in the Japanese university classroom given the limited financial and human resources in an oral English program.

Finally, certain content will be more conducive towards a particular testing format. For example, story telling about pictures is more likely to be used in an interview format, whereas a role-play would be more appropriate for a pair conversation format. In the end, test content is valid only as long as there is "relevance and complete coverage of the course syllabus" (Gong, 2010, p.7) and the test content can replicate "real world" or authentic conversations. These conversation tasks should be congruent with the overall program construct. This will ultimately lead to the desired "kind of speaking" as described in a course curriculum.

iii. Evaluation Scales, Assessment Criteria and Rubrics

Much research has been conducted on evaluation scales, assessment criteria and rubrics to measure students' progress (see Chuang, 2009). Indeed, teachers have a plethora of major international scales to draw upon. Those interested might look at the Canadian Language Benchmarks, the Common European Framework of Reference for Language and the Interagency American Council for Teaching of Foreign Language assessment scales. Many scales for oral testing are adapted, or borrow generously, from
these scales. Criteria and scales, in the end, should reflect the language learning framework (i.e. curriculum), methods (i.e. syllabus), purposes, goals and the practical limitations of a specific program. Conversely, criteria and scales cannot and should not determine these factors. In addition, the choice of scales is of practical importance, as the gradation on the scale will ultimately define the ability and progress of the student.

There are two basic kinds of scoring rubrics or scales: holistic and analytic rubrics. The holistic scoring system assigns a single score based on the overall performance of a student speech sample guided by a rating scale (see Appendix: Table 1). It is an overall impression of the task performed by the student. Each level or "band" in a holistic scale evaluates spoken English by defining or describing the ability of the test taker in terms of general competency including multiple criteria or components such as fluency, pronunciation and accuracy. As noted by the National Capital Language Resource Center (NCLRC, n.d., p. 3), "a holistic approach to scoring is primarily used for large-scale assessment when a relatively quick yet consistent approach to scoring is necessary." Holistic scoring methods can generally be applied across any number of tasks. That is to say, there is no specificity to any particular task or context. This is problematic according to Fulcher (2003) because the scoring does not take into account any theory or construct of speaking: speaking just becomes speaking without any theoretical basis.

An analytic rubric identifies and assesses each criterion or component of speech included in the rubric, such as fluency, pronunciation and accuracy in the case of a spoken test (see Appendix: Tables 2A, B and C). Analytic rubrics have the advantage of providing more detailed information on the test taker's ability in the different components of the language. They also allow the test creator to give different weighting to each of the components within the rubric (NCLRC, n.d.). The weighting of the scales and the selection of criteria/components will depend in large measure on the underlying construct of the program and the main goals of the curriculum. Scale weighting and criteria selection require in-depth discussion by the test administrators to determine which language components should be tested and which components deserve emphasis over others in terms of weighting. While the analytic scoring method offers a more comprehensive prognosis of test performance, it is difficult to extrapolate an overall level or ability. In this kind of rubric the parts do not necessarily add up to a whole, as noted by NCLRC (n.d.). Indeed, analytic rubrics should be applied to task specific tests
such as a role-play between two people asking and giving directions to each other or for very specific purposes such as medical English for nurses.

Fulcher (2003) outlines, in addition to these two more common scales, two other scoring methods that merit attention: primary and multiple trait scoring systems. Primary trait rubrics resemble the holistic scoring method in that each band in a primary trait rubric describes overall performance. The band would include such terms as fluency and accuracy. Primary trait scoring, unlike the holistic method, however, is task specific and identifies a main "trait" or a single dimension that is of importance to the task (see Appendix: Table 3). An example would be a test on giving instructions in which the primary trait is clarity of explanation. The advantage of this type of rubric is that it allows both the assessor and the test taker to focus on one main aspect of the performance and it provides a detailed measurement of the ability of the test taker to perform a certain task. The obvious drawback is that the field of measurement is too narrow and does not evaluate the overall competence of the test taker.

Multiple trait scoring identifies a number of dimensions important to the test task (see Appendix: Tables 4A, B and C). Furthermore, this scoring system has similarities to the analytic rubric in that the criteria in the rubric include a number of dimensions or traits important to completing the task, as well as possibly some components of the language (such as fluency and accuracy). As opposed to analytic scoring, however, multiple trait rubrics have more predictive power of overall proficiency. They can be generalized across tasks because scores relate to multiple dimensions/traits that refer to an underlying construct (Fulcher, 2003: 90). These rubrics provide rich feedback and information to the test taker regarding language ability. The main deficiency of these rubrics is their practicality. The rater may not be able to give three or four accurate grades for one speech sample, especially if the sample is produced under very limited time constraints.

In a Japanese university context, there are multiple variables that will influence the decision on the type of appropriate rubric. These variables include the main goals of the oral English program, test content, test format, class sizes, time constraints and teacher training issues. They interact with each other and are dependent upon the human and financial resources available to the English program. For instance, a holistic rubric may be more appropriate for an interview format in a class with significant student numbers and limited testing time. However, analytic or multiple trait scoring
may be appropriate for measuring general conversational skills in a pair conversation format or in an interlocutor hybrid format in which the test takers are performing one or more specific tasks.

Challenges Facing A Unified Testing Format

The main challenges facing unified testing are teacher training and limited resources. In a university context, teacher training is closely associated with the issues of intra and inter rater reliability. Intra-rater reliability, in turn, is mostly related to sequencing. According to Bachman (1990), sequencing occurs when the evaluator applies the grading criteria inconsistently. This can happen under one of two circumstances. First, the evaluator may not initially be assessing a component of speech such as accuracy. However, after reviewing numerous samples from different test takers and observing that accuracy is a serious problem, the evaluator may deem that accuracy should weigh more heavily in the final score and begin to assess accuracy more strictly on later tests even though accuracy was viewed more benignly on earlier tests. Secondly, if the evaluator is not given sufficient time to assess the test taker, different components of speech may be graded equally, although they were performed unevenly. This phenomenon is called a "halo effect" as the evaluator begins to generalize the score of one component to scores on other components. A further example of the halo effect takes place when a test taker scores extremely well in one component of the test and that score "overflows" through inflated scores into other components.

Sequencing can also occur when a mediocre test taker follows a number of much weaker ones and garners a very high score simply because the test taker was relatively better than the previous candidates. Alternatively, sequencing is also an issue after long periods of testing without a break. Evaluators may tire and relax their application of the criteria and score test takers based on their own physical/emotional status as one test fades into the next.

Inter-rater reliability can also become a problem as a result of sequencing and other issues. In the case of sequencing, inevitably the skewed results of one evaluator will affect the results of other evaluators as the median or average score of all test takers will shift accordingly. Also, the skewed results of one evaluator, if in the extreme, may force an adjustment or bell curve on scores from other evaluators. In
addition, lack of assessor training will have immense repercussions on inter-rater reliability. Evaluators should be trained on the application of the criteria and scoring scales so that there is uniform agreement on their meanings and usage among all of the assessors.

Teacher training is imperative in order to remedy many of these problems. According to Barnwell, evaluators need to be made aware of the challenges through a process of "testing socialization." In his research, Barnwell found that higher reliability coefficients were attached to raters with training and constant operation of testing scales as they began to see "speaking" in terms of these scales (as quoted in Fulcher, 2003: 142). He described this familiarization with the scales as socialization. Unfortunately, in a Japanese university context, large oral English programs mainly employ part time staff as instructors. Limited budgets and financial resources mean that teacher training is of a low priority. Therefore, at a Japanese university, it is difficult to implement testing socialization even though it is desirable.

Limited financial resources at Japanese universities have other consequences such as student numbers. Insufficient teaching staff results in class sizes above thirty students per class and thus causes severe time constraints during testing periods. Under these difficult circumstances, teachers become more susceptible to sequencing as fatigue or time problems affect scoring. A typical examination period is sixty or ninety minutes, which translates into a maximum of three minutes per student evaluation in a teacher-student interview format. This is hardly sufficient time to adequately assess student proficiency. In the end, without sufficient attention to these challenges of reliability, teacher training and socialization, the predictive validity of student scores can also become an issue.

Concluding Considerations

Japanese universities will need to increase the quality of oral English programs as English proficiency becomes more important to Japanese society. Unified testing as argued in this paper can play an important role in accomplishing this goal. A unified oral test offers objectivity when implemented with teacher exchange. Students are evaluated on the basis of test performance without any preconception of student ability on the part of the teacher. This in turn acts as an important motivational agent for
students to perform well on the test. Furthermore, evaluators and test takers can garner valuable feedback from test results. These washback effects demonstrate the positive impact unified testing should have on an oral English program.

In the end, unified testing does not just pertain to testing and evaluation. The unified test should be integrated into, and reflect, the entire oral English program including the curriculum, the syllabus and the construct underlying both of these. With regard to the curriculum and syllabus, it was argued that their standardization is a prerequisite in order for students to perform satisfactorily during the test regardless of the test assessor.

For unified oral testing to be successful, a number of issues of general concern to oral testing itself must be examined. These issues are, first and foremost, the construct and content validity of the test: does the test measure what it is supposed to? Second, in light of the test construct, test designers need to create relevant tasks. In addition to tasks, test designers need to choose a format that is appropriate to the task. Third, test designers need to develop a general rubric for scoring that includes appropriate criteria, scales and weighting. Finally, there are a number of challenges that face test administrators, especially those pertaining to inter and intra rater reliability. Many of these challenges could be solved through training and increased financial resources. If handled successfully, the solution to these challenges would enhance the overall validity of unified oral testing. More importantly, their validation would assist in achieving the ultimate objective of unified testing: student language proficiency. For now, further research on programs currently implementing unified oral testing is necessary in order to gather statistical and qualitative data on the program. Particularly, data relating to rater reliability, the scoring of students and the predictive validity of these scores would be helpful in assessing the real effectiveness of the test.
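
As a minimal sketch of the kind of rater-reliability data called for above, the following Python snippet compares the band scores that two raters give to the same test takers. The score lists are hypothetical, and the two measures shown (exact agreement rate and Pearson correlation) are assumptions chosen for illustration; the paper does not prescribe a particular statistic.

```python
from math import sqrt


def exact_agreement(scores_a, scores_b):
    """Share of test takers who received the same band from both raters."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("both raters must score the same, non-empty set of test takers")
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)


def pearson(scores_a, scores_b):
    """Pearson correlation between two raters' scores (one common reliability coefficient)."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need at least two paired scores")
    mean_a = sum(scores_a) / len(scores_a)
    mean_b = sum(scores_b) / len(scores_b)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(scores_a, scores_b))
    var_a = sum((a - mean_a) ** 2 for a in scores_a)
    var_b = sum((b - mean_b) ** 2 for b in scores_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when a rater gives identical scores to everyone")
    return cov / sqrt(var_a * var_b)


# Hypothetical band scores (1 = Poor ... 4 = Very Good) for eight test takers.
rater_a = [4, 3, 3, 2, 4, 1, 2, 3]
rater_b = [4, 3, 2, 2, 3, 1, 2, 3]

print("exact agreement:", exact_agreement(rater_a, rater_b))
print("pearson r:", round(pearson(rater_a, rater_b), 3))
```

The same comparison could be run on one rater's early and late scores for re-rated samples as a rough check on intra-rater consistency, again as an illustrative choice rather than a method taken from the paper.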

References

Andrews, S. and Fullilove, J. (1994). Assessing Spoken English in Public Examinations: Why and How? In J. Boyle and P. Falvey (Eds.), English Language Testing in Hong Kong (pp. 57-86). Hong Kong: The Chinese University Press.

Bachman, L. F. (1990). Fundamental Considerations in Language Testing. Oxford: Oxford University Press.

Bachman, L. F. and Palmer, D. (1996). Language Testing in Practice. Oxford: Oxford University Press.

Bailey, K. M. (1999). Washback in Language Testing. TOEFL Monograph Series. New Jersey: Educational Testing Service.

Brindley, G. (1998). Outcomes-based Assessment and Reporting in Language Programs: A Review of the Issues. Language Testing, 15, 45-85.

Brindley, G. (2001). Outcomes-based Assessment in Practice: Some Examples and Emerging Insights. Language Testing, 18 (4), 393-407.

Chuang, Y.Y. (2009). Foreign Language Speaking Assessment: Chinese Taiwanese College English Teachers' Scoring Performance in the Holistic and Analytic Rating Methods. Asian EFL Journal, 8 (1). [On-Line]. Retrieved July 22, 2011, from http://www.asian-efl-journal.com/March_09_yyc.php

Evans, D. (2003). Why Do Oral Testing? ETJ Journal, 4 (2), 1-3.

Fleming, D. (1998). Autonomy and Agency in Curriculum Decision Making: A Study of Instructors in a Canadian Adult Settlement ESL Program. TESL Canada Journal, 16 (1), 19-35.

Fulcher, G. (2003). Testing Second Language Speaking. Harlow: Pearson Longman.

Gewirtz, A. (1997). Some Observations on Testing and Motivation. ELT Journal, 31 (3), 240-244.

Gong, B. (2008). A Comparative Study of Construct Validity of Graduation English Proficiency Tests between Universities in Taiwan and Mainland China. Paper presented at The 34th IAEA Annual Conference, Cambridge, United Kingdom.

Gong, B. (2010). Considerations of Conducting Spoken English Tests For Advanced Students. Paper presented at The 36th IAEA Annual Conference, Bangkok, Thailand.

Hymes, D. (2001). On Communicative Competence. In A. Duranti (Ed.), Linguistic Anthropology: A Reader (pp. 53-73). Malden, Massachusetts: Blackwell Publisher, Ltd.

Kitao, S. K. & Kitao, K. (1996). Testing Communicative Competence. The Internet TESL Journal, 2 (5). [On-Line]. Retrieved June 17, 2011, from http://iteslj.org/Articles/Kitao-Testing.html

Lado, R. (1961). Language Testing: The Construction and Use of Foreign Language Tests. London: Pearson Longman.

Llurda, E. (2000). On Competence, Proficiency, and Communicative Language Ability. International Journal of Applied Linguistics, 10 (1), 85-95.

Moodie, I. (2008). Using Pair Work Exams for Testing in ESL/EFL Conversation Classes. The Internet TESL Journal, 14 (8). [On-Line]. Retrieved July 20, 2011, from http://iteslj.org/Techniques/Moodie-PairWorkTesting.html

National Capital Language Resource Center (NCLRC). (n.d.). The Essentials of Language Teaching. Retrieved July 25, 2011, from http://www.nclrc.org/essentials/assessing/alternative.htm

Pan, Y.C. and Pan, Y.C. (2011). Conducting Speaking Tests for Learners of English as a Foreign Language. The International Journal of Education and Psychological Assessment, 6 (2), 83-100.

Saylor-Loof, C. and Calman, R. (2004). Oral Testing in the Communicative Classroom. JALT 2004 Conference Proceedings (pp. 1178-1186). Nara, Japan.

Young, R. (2000). Interactional Competence: Challenges For Validity. Paper presented at the annual meeting of the American Association for Applied Linguistics and the Language Testing Research Colloquium, Vancouver, Canada.

Acknowledgements

The author would like to thank James Jensen for his valuable insight and extensive comments on the topic. The author also thanks his colleagues in the Faculty of Applied Sociology for their commitment and work in implementing a unified testing format.

Appendix

Table 1: Holistic Rating Scale

Rating: Description
1. Very Good: High fluency and accuracy. Speaks without hesitations and few grammatical errors.
2. Good: Speaks with some hesitations and grammatical errors but spoken language is easily intelligible.
3. Satisfactory: Average fluency but speech is punctuated with long pauses and numerous grammatical inaccuracies.
4. Poor: Very difficult to understand speech. Speaks with much hesitancy and long pauses. Grammatical errors impair meaning of speech.

Analytical Rubric

Table 2A: Fluency Rating Scale

Rating: Fluency (Weighting - 40%)
4. Very Good: Speaks without hesitations. Speech is smooth and resembles native speaker speech.
3. Good: Few hesitations and punctuated with some pauses but generally smooth.
2. Satisfactory: Many hesitations and pauses that hinder speech but conversation still intelligible.
1. Poor: Speech is unintelligible as a result of too many long pauses and hesitations.

Table 2B: Grammar and Accuracy Rating Scale

Rating: Grammar and Accuracy (Weighting - 40%)
4. Very Good: Few or almost no grammatical errors, on par with native language accuracy.
3. Good: Some but not many grammatical errors. Has good command of language accuracy.
2. Satisfactory: Many grammatical errors but conversation is intelligible and understandable.
1. Poor: Many grammatical errors that impair meaning of speech. Speech is unintelligible.

Table 2C: Pronunciation Rating Scale

Rating: Pronunciation (Weighting - 20%)
4. Very Good: High quality of pronunciation and intonation approaching native speaker level.
3. Good: Pronunciation is affected by inability to pronounce segmentals such as Japanese sounds of "r" for "l" sounds.
2. Satisfactory: Heavy "katakana" accent but does not inhibit intelligibility of speech.
1. Poor: Complete "katakana" pronunciation of English. Speech is unintelligible.

Total Score For Analytical Rubric: Fluency: 40%; Grammar and Accuracy: 40%; Pronunciation: 20%; Total: 100%

Table 3: Primary Trait Rating Scale: Giving Instructions - Clarity of Explanation

Rating: Giving Instructions - Clarity of Explanation
1. Very Clear: Speech is smooth and without hesitation. Listener immediately and completely understands explanation.
2. Clear: Speech is fairly smooth but with some hesitations. Listener understands general instructions but needs to ask for further clarification.
3. Somewhat Clear: Speech has many hesitations but explanation is intelligible. Listener needs to ask questions.
4. Not Clear: Speech has long pauses and hesitations and explanation is not clear. Listener needs to ask numerous questions for clarification.
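
The weighting in Tables 2A to 2C can be combined into a single percentage score. The sketch below assumes that each band (1 to 4) is scaled linearly by its criterion weight; the appendix gives the weights but not the conversion formula, so the formula, the dictionary keys and the function name `weighted_percentage` are illustrative assumptions.

```python
# Weights taken from Tables 2A-2C (percent of the total score).
WEIGHTS = {"fluency": 40, "grammar_accuracy": 40, "pronunciation": 20}
MAX_BAND = 4  # highest band ("Very Good") in the analytic rubric


def weighted_percentage(bands):
    """Combine per-criterion bands (1-4) into one percentage score.

    Assumption: bands scale linearly, so band 4 earns the full weight
    of its criterion and band 1 earns a quarter of it.
    """
    if set(bands) != set(WEIGHTS):
        raise ValueError("bands must cover exactly: " + ", ".join(WEIGHTS))
    for criterion, band in bands.items():
        if band not in range(1, MAX_BAND + 1):
            raise ValueError(f"{criterion}: band must be an integer from 1 to {MAX_BAND}")
    return sum(WEIGHTS[c] * bands[c] / MAX_BAND for c in WEIGHTS)


# Hypothetical test taker: Good fluency, Satisfactory grammar, Very Good pronunciation.
example = {"fluency": 3, "grammar_accuracy": 2, "pronunciation": 4}
print(weighted_percentage(example))  # 70.0
```

Under this assumed scaling the lowest possible result is 25%, not 0%, because band 1 still earns a quarter of each weight; a program wanting a zero floor would need a different conversion.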

Multi-trait Rubric For Evaluating Presentation Skills

Table 4A: Fluency

Rating: Fluency
4. Very Good: Speaks without hesitations. Speech is smooth and resembles native speaker speech.
3. Good: Few hesitations and punctuated with some pauses but generally smooth.
2. Satisfactory: Many hesitations and pauses that hinder speech but conversation still intelligible.
1. Poor: Speech is unintelligible as a result of too many long pauses and hesitations.

Table 4B: Organization of Presentation

Rating: Organization of Presentation
1. Well Organized: Followed precisely the basic format of presentations: introduction, body and conclusion. Reasoning and logic of argument is clear and sound.
2. Satisfactory: Followed the basic format of presentations but logic of argument is unclear in places.
3. Poorly Organized: Did not follow the basic format of presentations and logic of argument is convoluted.

Table 4C: Content of Presentation

Rating: Content of Presentation
1. Very Good: Well-researched and engaging content backed by detail and statistics.
2. Good: Good research is evident with some interesting details and content.
3. Satisfactory: Some superficial research about the main topic, with occasional details but lacking interesting content.
4. Poor: Very little research about the main topic. Very little detail and no engaging information or content.
