
Reliability and Validity of a Task-based Writing Performance Assessment for Japanese Learners of English

Yoshihito SUGITA
Yamanashi Prefectural University

Abstract

This article examines the main data of a task-based writing performance test in which five junior high school teachers participated as novice raters. The purpose of this research is to implement a task-based writing test (TBWT) developed on the basis of a construct-based processing approach to testing, and to examine the degree of reliability and validity of the assessment tasks and rating scales. Accuracy and communicability were defined as constructs, and the test development proceeded through three stages: designing and characterizing writing tasks, reviewing existing scoring procedures, and drafting rating scales. Each of the forty scripts collected from twenty undergraduate students was scored by five new raters, and the analyses were done using FACETS. The results indicated that all novice raters displayed acceptable levels of self-consistency, and that there was no significantly different scoring on the two tasks and overall impression, which provided reasonable fit to the Rasch model. The modified scales associated with the five rating categories and their specific written samples were shown to be mostly comprehensible and usable by raters, and demonstrated that the students' ability was effectively measured using these tasks and rating scales. However, further research is necessary for considering the elimination of inter-rater differences.

Key Words: writing performance, task-based assessment, FACETS, reliability, validity

1. Introduction

In Japan, English has traditionally been taught with a focus on accuracy, and indirect measurement is widely used in the field of assessment. There seems to have been a paradigm shift from accuracy-oriented to fluency-oriented writing instruction, but no significant changes have occurred in the assessment of writing. Judging from the present state of teaching and assessing writing in Japan, it would be meaningful to develop scoring procedures for writing performance assessment in place of traditional indirect tests of writing. The main purpose of this study is 1) to implement a developed writing performance assessment, and 2) to examine the degree of reliability and validity of the assessment tasks and rating scales. It is motivated by the urgent need for an improved assessment of writing, and is conducted in order to develop a task-based writing test for Japanese learners of English.

2. Development of a task-based writing test (TBWT)

2.1 Construct-based processing approach to testing

As Bachman and Palmer (1996) mentioned, the primary purpose of a language test is to make inferences about language ability. The ability that we want to test is defined as a construct, and describing the construct is one of the most fundamental concerns in test development. When assessing writing, it is therefore necessary to address the issue of defining the construct of good writing or the writing ability of our students.

Skehan (1998) claimed that the processing perspective is relevant to how we directly relate underlying abilities to performance, as well as how we conceive of models of language ability. In this view, he defines "ability for use" as a construct, which rationalizes the use of tasks as a central unit within a testing context and in developing a performance test. According to Skehan, such a task-based approach to testing would be "to assume that there is a scale of difficulty and that students with greater levels of underlying ability will then be able to more successfully complete tasks which come higher on such a scale of difficulty" (p. 174). Under this assumption, task difficulty is a major determinant of test performance. Task-based approaches, therefore, need to focus on task difficulty as a precondition for using tasks-as-tests, and on methods of evaluating task-based performance.

Bachman (2002), however, claimed that task difficulty is bound up with the various components in a performance task and with the interactions among them, and thus task difficulty is not a separate factor and can no longer be assumed to be a major determinant of test performance. Therefore, he emphasized that the task-based approach has to consider not only performances on tasks but also the abilities to be assessed. In this way, Bachman argues that the notion of a construct-based approach to testing is also necessary for test development, and that the most important thing is to integrate tasks and constructs in the design and development of a particular assessment.

Here, we notice that there is considerable validity in integrating construct-based task development and task implementation based on the operation of the processing factors and the influences of the processing conditions. In other words, when we develop assessment tasks, it is reasonable to suppose that we should design the task on the basis of construct definition and processing perspectives. Thus, the so-called construct-based processing approach to testing provides a comprehensive framework for our test development. The characteristic features of this approach are: 1) it must consider both constructs and tasks in developing performance assessment (Bachman, 2002); 2) procedures for the design, development and use of language tests must incorporate both a specification of the assessment tasks to be included and definitions of the abilities to be assessed (Alderson et al., 1995; Bachman & Palmer, 1996; Brown, 1996); 3) tasks should be conceptualized as sets of characteristics (Bachman, 2002), and task characteristics should be designed to consider performance on tasks in terms of the operation of the processing factors and the influences of the processing conditions (Skehan, 1998); and 4) the processing factors that affect performance, such as communicative stress, should be utilized in order to control the processing conditions in which the interaction of test-taker attributes is involved.

2.2 Construct definition

The constructs of the task-based writing test developed for this study are assumed to be accuracy and communicability. Both constructs are derived from the Bachman and Palmer (1996) framework and Skehan's (1998) processing perspective on testing.

Figure 1. The construct structure of accuracy
(The figure links accuracy to the rule-based system (accuracy vs. complexity) and to organizational knowledge, which consists of grammatical and textual knowledge.)

As shown in Figure 1, accuracy shares the rule-based system in terms of processing perspectives, and has a deep connection with organizational knowledge, which consists of grammatical and textual knowledge. Grammatical knowledge "is involved in producing or comprehending formally accurate utterances or sentences," and textual knowledge "is involved in producing or comprehending texts that consist of two or more utterances or sentences" (Bachman & Palmer, 1996, p. 68). Based on these two areas of organizational knowledge, it is proposed here that the construct accuracy specialized for writing would be comprised of organizational skills and linguistic accuracy. Specifically, organizational skills can be defined as the ability to organize a logical structure which enables the content to be accurately acquired, and linguistic accuracy concerns errors of vocabulary, spelling, punctuation or grammar (Sugita, 2009).

Figure 2. The construct structure of communicability

Figure 2 indicates the construct structure of communicability. The construct shares the exemplar-based system in terms of processing perspectives, and its basis is pragmatic knowledge, which consists of functional and sociolinguistic knowledge. Functional knowledge "enables us to interpret relationships between utterances or sentences and texts and the intentions of language users" (Bachman & Palmer, 1996, p. 69). Sociolinguistic knowledge enables us to create or interpret language that is appropriate to a particular language use setting (p. 70). Based on these definitions and the processing perspectives, the term communicability is defined as fluency specialized for writing, which is comprised of communicative quality and communicative effect. Communicative quality refers to the ability to communicate without causing the reader any difficulty, and communicative effect concerns the quantity of ideas necessary to develop the response as well as the relevance of the content to the proposed task (Sugita, 2009).

2.3 Procedures for developing the TBWT

Utilizing a construct-based processing approach, the test development proceeded according to the following three stages.

Stage 1: Designing and characterizing writing tasks

With regard to processing perspectives (Skehan, 1998), content-based support and form-focused stakes are necessary for accuracy tasks. An elicitation task (writing a letter) was chosen, and specific topics of self-introduction were given in the task. A situation is supposed in which the student is going to stay with a host family in Britain and is advised to write a letter, so that students can focus on writing accurately. On the other hand, communicability tasks need form-oriented support and meaning-focused stakes in order for students to write with a focus on meaning. A discussion task was designed because it encourages students to write their opinions or ideas about the topic, and it lays emphasis on a meaning-focused response (see the specifications in Appendix A).

According to Bachman and Palmer (1996), characteristics of the input and the expected response in a test task are closely concerned with the operation of the processing factors and the influences of the processing conditions for task implementation. In view of the construct-based processing approach to testing, the TBWT needs to develop such task characteristics in order to adjust students to performance conditions in which they allocate their attention in appropriate ways. Specifically, the characteristics of our accuracy task require students to write a 100-120 word letter in adequate time in order that the rule-based system can be accessed, and the characteristics of our communicability task encourage students to write as many answers to a discussion topic as possible in a very limited time in order that the exemplar-based system might be accessed.

Stage 2: Reviewing existing scoring procedures for assessing writing

In the construction of rating scales, especially when resources are limited as in the case of this study, it seems meaningful to selectively use well-developed and well-researched scales from outside Japan as a reference. For this purpose, existing scoring procedures for assessing writing were considered in order to explore what types of procedures are more suitable for constructing rating scales. We examined and collected the descriptors of the TOEFL Test of Written English (TWE), the Cambridge First Certificate in English (FCE), the ESL Composition Profile, the NAEP Scoring Guide and the Michigan Writing Assessment according to the target context in which the TBWT is administered. As a consequence, the TBWT is as construct-relevant as multiple trait scoring, and its procedure is similar to a combined procedure: the two assessment tasks and their criteria exist independently, and thus raters are required to make only one decision for every script, in the same way as they do in holistic scoring.

Stage 3: Drafting rating scales

The underlying competences served as a useful basis when developing rating scales for accuracy and communicability. The descriptors of the marking categories in each scale were collected from existing writing assessments such as the TOEFL Test of Written English (TWE) and the Cambridge First Certificate in English (FCE). By conforming each construct closely to the definition of its rating scale, it is fair to say that raters would use the scale appropriately and consistently, ensuring the reliability and validity of the writing assessment. According to Alderson et al. (1995), raters should understand the principles behind the particular rating scales they must work with, and be able to interpret their descriptors consistently. Therefore, the rating scales are comprised of clear descriptions of each construct and of 5-point Likert scales (Appendix B). The descriptors of each category are also provided with selected written samples as an explanatory part of the scale, in order that busy school teachers with limited training in writing performance assessment can understand the descriptors and work with them consistently.

3. The Study

3.1 Purposes and research questions

In order to examine the degree of reliability and validity of the task-based writing performance test, the following are focused on: rater severity, bias interactions between writer ability and task difficulty, the reliability of the elicitation tasks and rating scales, and the measure's validity. The specific research questions are as follows:

1) Is student ability effectively measured?
2) Are teacher-raters equally severe?
3) How much do tasks that are designed to be equivalent actually differ in difficulty?
4) How well do scales conform to expectations about their use? Do raters use all parts of them, and use them consistently?
5) Do individual raters show harsh or lenient bias towards particular groups of writers? If so, what are the sub-patterns of ratings in terms of rater-student bias interaction for each rater?
6) Do the raters score one task more harshly or more leniently than the other? If so, what are the sub-patterns of ratings in terms of rater-task bias interaction for each rater?
7) To what extent, statistically, is the task-based writing test a reliable and valid measure?

3.2 Procedure

The data for this study were 40 scripts (20 scripts for each of two tasks) collected from twenty undergraduate students in the first semester of 2008. Each of the scripts was scored by five raters, all experienced Japanese junior high school teachers of English with qualifications of ten or more years of teaching experience.

The TBWT scoring guide was edited for this testing. The first section is the background of the TBWT. The second section is the explanation of the assessment tasks. The third section is the implementation method of the testing. The fourth section is comprised of the rating scales and written samples accompanied by detailed commentary on each sample at five levels, 1-5. Both the scripts and the scoring guide were given to the raters by mail at the beginning of August 2008. Each of the five raters rated the entire set of forty scripts and sent them back by the end of August 2008. They were instructed to rate the 20 scripts of Task 1 first and then to rate the 20 scripts of Task 2. Finally, they were asked to rate each participant's writing proficiency based on their overall impression at five levels, 1-5. A questionnaire about the scoring guide was also enclosed and sent back with the materials.

3.3 Data analysis

Tables 1, 2 and 3 show the descriptive statistics for the scores of the two test tasks and the overall impression. Table 4 summarizes the inter-rater correlation coefficients for the different scorings. Since the average of the coefficients for each scoring is relatively high (0.78, 0.77, 0.79), the five raters appear to be of acceptable reliability.

Table 1. Descriptive statistics for Task 1

          Rater 1  Rater 2  Rater 3  Rater 4  Rater 5
Mean      3.20     3.35     3.00     3.00     3.25
SD        0.92     1.10     1.14     0.70     0.88
Minimum   2.0      2.0      1.0      2.0      2.0
Maximum   5.0      5.0      5.0      4.0      5.0

Table 2. Descriptive statistics for Task 2

          Rater 1  Rater 2  Rater 3  Rater 4  Rater 5
Mean      3.25     3.20     2.80     3.20     3.55
SD        1.08     0.92     1.24     1.16     0.97
Minimum   1.0      2.0      1.0      1.0      2.0
Maximum   5.0      5.0      5.0      5.0      5.0

Table 3. Descriptive statistics for overall impression

          Rater 1  Rater 2  Rater 3  Rater 4  Rater 5
Mean      3.25     3.35     3.05     3.15     3.35
SD        0.88     1.01     1.16     0.85     0.90
Minimum   2.0      2.0      1.0      2.0      2.0
Maximum   5.0      5.0      5.0      5.0      5.0

Table 4. Inter-rater correlation coefficients between pairs of raters

     1/2  1/3  1/4  1/5  2/3  2/4  2/5  3/4  3/5  4/5  Av.
T1   .75  .85  .68  .78  .87  .76  .82  .80  .79  .71  .78
T2   .84  .91  .66  .86  .81  .70  .70  .71  .83  .74  .77
OI   .79  .81  .67  .88  .92  .80  .78  .80  .78  .64  .79
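For readers who want to reproduce statistics of the kind shown in Table 4, the sketch below (Python; not part of the original study, and the score matrix is randomly generated placeholder data rather than the actual scripts) computes the Pearson correlation for every rater pair and the average coefficient:

```python
from itertools import combinations
import numpy as np

# Placeholder data: 5 raters x 20 scripts on the 5-point scale.
rng = np.random.default_rng(0)
base = rng.integers(1, 6, size=20)                       # latent script quality
scores = {r: np.clip(base + rng.integers(-1, 2, size=20), 1, 5)
          for r in range(1, 6)}                          # per-rater noise

def interrater_correlations(scores):
    """Pearson r for every rater pair (cf. Table 4) and the average."""
    pairs = {(a, b): np.corrcoef(scores[a], scores[b])[0, 1]
             for a, b in combinations(sorted(scores), 2)}
    return pairs, float(np.mean(list(pairs.values())))

pairs, average = interrater_correlations(scores)
print({k: round(v, 2) for k, v in pairs.items()}, round(average, 2))
```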

Table 5 reports results for each test task, the overall impression and the scores of Criterion, including their means and standard deviations. In the testing, the twenty students were requested to submit an essay using Criterion in order to discuss criterion-related validity by examining their scores against the highly reliable scores provided by Criterion. The mean scores for the test tasks and the overall impression are very close, ranging from 3.16 to 3.23. The alpha coefficients for the three variables were calculated. Using Davies' cut-off (.90) as an acceptable level of internal consistency on a high-stakes test, each Cronbach's alpha would meet the point: .9434, .9432 and .9480 for Task 1, Task 2 and overall impression, respectively.
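A minimal sketch of the alpha computation, assuming the study's layout of scripts rated by five raters (the example matrix is illustrative, not the study's data):

```python
import numpy as np

def cronbach_alpha(ratings):
    """Cronbach's alpha with raters treated as items.

    ratings: (n_scripts, n_raters) array of 5-point scores.
    """
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                                 # number of raters
    rater_vars = ratings.var(axis=0, ddof=1)             # variance per rater
    total_var = ratings.sum(axis=1).var(ddof=1)          # variance of sum scores
    return k / (k - 1) * (1 - rater_vars.sum() / total_var)

# Illustrative 5-script excerpt; a full 20 x 5 matrix of real ratings
# would be used to obtain coefficients such as the .94 values above.
example = np.array([[3, 3, 2, 3, 3],
                    [5, 5, 4, 4, 5],
                    [2, 3, 1, 2, 2],
                    [4, 4, 3, 3, 4],
                    [3, 4, 3, 3, 3]])
print(cronbach_alpha(example))
```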

Table 5. Descriptive statistics of the different scorings

               N    Mean  SD    Minimum  Maximum
Task 1         100  3.16  0.97  1.0      5.0
Task 2         100  3.20  1.11  1.0      5.0
Impression     100  3.23  0.97  1.0      5.0
Criterion TWE  20   2.30  0.78  1.0      4.0

Table 6. Pearson correlation coefficients

            Task 1   Task 2   Impression
Task 2      .797**
Impression  .924**   .884**
Criterion   .710**   .678**   .734**

Note. ** All correlations significant at the 0.01 level.

As seen in Table 6, the correlation coefficients between each task and the overall impression fall in a range of .797 to .924, all significant at the 0.01 level. The correlation between the two test tasks (.797) is, however, slightly lower than the established estimate of reliability (.80). Table 6 also shows that the two tasks and the overall impression correlate positively with the scores of Criterion (p < .01). Such correlations give criterion-related validity evidence as to the three scores of each rater and the Criterion score. The highest correlation with Criterion is for Impression (r = .734), followed by Task 1 (r = .710) and, finally, Task 2 (r = .678). Since these indicators of effect size are greater than 0.5, indicating a large effect (Cohen, 1988), the test tasks are determined to be valid.

Test data are influenced by errors of measurement resulting from variation in rater harshness and test tasks, as well as by the nature of the rating scale used and by the range of ability of the students being assessed. Therefore, it was necessary to use statistical models which take into account all of the factors that might affect a student's final score. The analyses for the present study were done using FACETS, version 3.63 (Linacre, 2008). To examine the measurement characteristics of this testing, the data were specified as having three facets, namely, the ability of students, the difficulty of tasks and the severity of raters. The partial-credit model was chosen because the scoring criteria for the rating scales were qualitatively different. FACETS also offers a kind of analysis that detects biased scoring of certain types of students or tasks, and the results of a bias analysis will be presented.
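FACETS itself estimates these parameters; purely as an illustration, the category probabilities of the many-facet partial credit model can be written out as below (a sketch, not the FACETS program; the example values are taken from Tables 7-9 reported later):

```python
import numpy as np

def category_probs(ability, severity, difficulty, steps):
    """Many-facet Rasch (partial credit) probabilities for categories 1-5.

    The log-odds of category k over k-1 is
    ability - severity - difficulty - F_k; `steps` holds the four step
    difficulties F2..F5 (the step into category 1 is fixed at 0).
    """
    increments = ability - severity - difficulty - np.asarray(steps, float)
    logits = np.concatenate(([0.0], np.cumsum(increments)))
    p = np.exp(logits - logits.max())                    # max-shift for stability
    return p / p.sum()

# A student at +1.0 logits, rated at the mean severity (-0.92, Table 7)
# on Task 1 (-0.08, Table 8) with the accuracy steps of Table 9:
p = category_probs(1.0, -0.92, -0.08, [-7.07, -1.42, 2.28, 6.22])
print(p.argmax() + 1)   # -> 3, the most likely category
```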

4. Results

4.1 FACETS summary

Figure 1. FACETS summary
(The map places all facets on a common logit scale: the Measure ruler, rater severity, student ability, task difficulty, and the three rating scales (S.1 Accuracy, S.2 Communicability, S.3 Impression) with categories 1-5.)

Figure 1 shows a summary of all facets and their elements. They are positioned on a common logit scale, which appears as "Measure" in the first column. The second column shows the severity variation among raters: the most severe rater (ID: 3) is at the top, and the least severe rater (ID: 5) is at the bottom. The third column shows the ability variation among the 20 students; the student with the highest ability (ID: 9) is at the top and the one with the lowest ability (ID: 11) at the bottom. The fourth column shows the equivalence of the difficulty variation among tasks. The last three columns graphically describe the three rating scales: the most likely scale score for each ability level is shown. For example, a student whose ability estimate is +1.0 logits (ID: 19) is likely to get 3 points on each task and the overall impression when assessed by a rater of average severity.

4.2 FACETS analysis

1) Is student ability effectively measured?

As shown in Figure 1, student ability estimates range from a high of 6.96 logits to a low of -6.27 logits, indicating a spread of 13 logits in terms of students' ability. The student separation value was 6.54, meaning that populations like the students in this study can be spread into about seven levels. The reliability index was .98, which demonstrates that reliable ability scores can be achieved.
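The relation between separation and reliability quoted here is a standard Rasch identity; a brief sketch (the function is generic, not FACETS output):

```python
import numpy as np

def separation_stats(measures, standard_errors):
    """Rasch separation G and reliability for one facet (here, students)."""
    m = np.asarray(measures, dtype=float)
    se = np.asarray(standard_errors, dtype=float)
    rmse = np.sqrt(np.mean(se ** 2))                     # mean error variance
    sd_true = np.sqrt(max(m.var(ddof=1) - rmse ** 2, 0.0))
    g = sd_true / rmse                                   # separation index
    return g, g ** 2 / (1 + g ** 2)                      # (separation, reliability)

# A separation of 6.54 implies a reliability of about .98:
g = 6.54
print(g ** 2 / (1 + g ** 2))                             # 0.977...
```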

2) Are teacher-raters equally severe?

Table 7. FACETS analysis of rater characteristics

         Fair-M average  Severity (logits)  Error  Infit (mean square)
Rater 1  3.17            -1.07              .27     .63
Rater 2  3.25            -1.35              .27     .90
Rater 3  2.90              .09              .26     .91
Rater 4  3.06             -.59              .26    1.25
Rater 5  3.34            -1.70              .27     .96
Mean     3.14             -.92              .26     .93
SD        .15              .62              .00     .20

Note. Reliability of separation index = .82; fixed (all same) chi-square: 27.8, df: 4; significance: p = .00

Table 7 provides information on the characteristics of raters. From the left, the columns show rater IDs, fair average scores, rater severity, error and infit mean square values. The second column indicates that the severity span between the most severe and the most lenient rater was 1.79 logits, and that the difference, based on the fair average scores in the first column, is 0.44 of one grade on the scale. The reliability of the separation index (which indicates the likelihood that raters consistently differ from one another in overall severity) was high (.82). The chi-square of 27.8 with 4 df was significant (p = .00) and, therefore, the null hypothesis that all raters were equally severe must be rejected: there was a significant difference in severity among raters. On the other hand, the Infit Mean Square column indicates that no raters were identified as misfitting: fit values for all raters were within the range of two standard deviations around the mean (0.93 ± [0.20 × 2]). In other words, all raters behaved consistently in the scoring.
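The fixed chi-square in Table 7 can be approximated from the severities and standard errors in the table, assuming the usual information-weighted homogeneity statistic (the exact FACETS computation may differ in minor details):

```python
import numpy as np
from scipy.stats import chi2

severity = np.array([-1.07, -1.35, 0.09, -0.59, -1.70])  # Table 7
se       = np.array([ 0.27,  0.27, 0.26,  0.26,  0.27])

w = 1.0 / se ** 2                                  # information weights
mean_sev = np.sum(w * severity) / np.sum(w)        # precision-weighted mean
x2 = np.sum(((severity - mean_sev) / se) ** 2)     # about 27.9 (reported: 27.8)
df = len(severity) - 1                             # 4
print(x2, chi2.sf(x2, df))                         # p far below .01
```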

3) How much do tasks that are designed to be equivalent actually differ in difficulty?

Table 8. FACETS analysis of the two test tasks and overall impression

            Difficulty (logits)  Error  Infit (mean square)  Estimate of discrimination
Task 1      -.08                 .21     .96                 1.04
Task 2      -.07                 .19    1.04                 1.00
Impression   .16                 .21     .77                 1.24
Mean         .00                 .20     .92
SD           .11                 .01     .12

Note. Reliability of separation index = .00; fixed (all same) chi-square: 0.9, df: 2; significance: p = .65

The analysis of the two test tasks and the overall impression in Table 8 shows that no significant variation in difficulty exists among them. Raters are considered to be self-consistent in scoring, and the tasks do not appear to separate the students to a significant degree, meaning that the difficulties of the two tasks and the overall impression are equivalent.

An estimate of the item discrimination was computed according to the "Generalized Partial Credit Model" approach. The expected value is 1.0, but discriminations in the range of 0.5 to 1.5 provide a reasonable fit to the Rasch model (Linacre, 2007, p. 132).
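As an illustration of the model behind this estimate (a sketch only; the reported values are FACETS estimates, not produced by this function), the GPCM adds a discrimination parameter that scales every step increment of the partial credit model shown earlier:

```python
import numpy as np

def gpcm_category_probs(ability, severity, difficulty, steps, a=1.0):
    """Generalized partial credit sketch: the discrimination `a` scales
    every step increment; a = 1.0 recovers the Rasch partial credit model."""
    increments = a * (ability - severity - difficulty - np.asarray(steps, float))
    logits = np.concatenate(([0.0], np.cumsum(increments)))
    p = np.exp(logits - logits.max())
    return p / p.sum()
```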

4) How well do scales conform to expectations about their use? Do raters use all parts of them, and use them consistently?

Table 9. Rating scale statistics for Accuracy

Category score  Average measure  Outfit (mean square)  Step difficulty
1               -6.26             .4
2               -3.24             .9                   -7.07
3                 .58            1.1                   -1.42
4                3.76             .8                    2.28
5                6.16            1.6                    6.22

Figure 2. Probability curves for Accuracy
(The original figure plots the category probability curves for the five accuracy categories across the -9.0 to +9.0 logit range.)

1"**************##*****-l***#*****i****-*l*"********"*****#*#**

Linacre (2002) has proposed guidelines for a rating scale: (1) average category measures should advance monotonically by category, (2) outfit mean-squares should be less than 2.0, and (3) the step difficulty of each scale should advance by at least 1.4 logits and by no more than 5.0 logits. Table 9 shows the rating scale statistics for accuracy. The average category measures rise monotonically by category, as expected. All outfit mean-squares are less than 2.0, meaning that each of the five categories shows the expected randomness in category use. However, the increase in step difficulties between categories 2 and 3 is 5.65 logits, which does not meet guideline (3).
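These guidelines are mechanical enough to check programmatically; the sketch below applies them to the Table 9 values and flags the oversized step advance between categories 2 and 3:

```python
import numpy as np

def check_rating_scale(avg_measures, outfits, step_difficulties):
    """Check Linacre's (2002) three rating-scale guidelines."""
    advances = np.diff(np.asarray(step_difficulties, dtype=float))
    return {
        "1_measures_advance_monotonically": bool(np.all(np.diff(avg_measures) > 0)),
        "2_outfit_below_2.0": bool(np.all(np.asarray(outfits) < 2.0)),
        "3_steps_advance_1.4_to_5.0": bool(np.all((advances >= 1.4) & (advances <= 5.0))),
    }

# Accuracy scale (Table 9): guideline (3) fails at the 2 -> 3 step.
print(check_rating_scale([-6.26, -3.24, 0.58, 3.76, 6.16],
                         [0.4, 0.9, 1.1, 0.8, 1.6],
                         [-7.07, -1.42, 2.28, 6.22]))
```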

Figure 2 presents the scale structure probability curves, which visually demonstrate the frequency signals of each observed scale. In this figure, obvious hilltops are observed. According to Tyndall and Kenyon (1995), obvious peaks and clear division between the scales indicate that the scales work as intended. Here the division between the scales is clear, but the probability and the range of scale 2 indicate that this scale does not work as intended.

Table 10. Rating scale statistics for Communicability

Category score  Average measure  Outfit (mean square)  Step difficulty
1               -5.57             .6
2               -2.99            1.1                   -5.54
3                 .66            1.3                   -1.61
4                3.05             .6                    2.26
5                5.88            1.3                    4.90

Figure 3. Probability curves for Communicability
(The original figure plots the category probability curves for the five communicability categories across the -9.0 to +9.0 logit range.)

Table 10 shows the rating scale statistics for communicability. All outfit mean-squares are less than 2.0, which meets guideline (2). All step difficulty increases fall within 1.4 and 5.0 logits, which meets guideline (3). In Figure 3, the step difficulties increase monotonically with the rating scale numbers, and obvious hilltops are observed.

Table 11. Rating scale statistics for Overall Impression

Category score  Average measure  Outfit (mean square)  Step difficulty
1               -6.02             .4
2               -3.57             .6                   -6.81
3                 .76             .6                   -1.66
4                3.65             .9                    2.23
5                6.61            1.0                    6.23

Figure 4. Probability curves for Overall Impression
(The original figure plots the category probability curves for the five overall-impression categories across the -9.0 to +9.0 logit range.)

Table 11 shows the rating scale statistics for overall impression. Average measures rise monotonically with each category. All outfit mean-squares are less than 2.0. All step difficulty increases fall within 1.4 and 5.0 logits, which meets guideline (3). In Figure 4, the step difficulties increase monotonically with the rating scale numbers, and obvious hilltops are observed. While category 2 of the accuracy scale was slightly larger than expected, the rating scales for communicability and overall impression conformed to expectations about their use. In sum, these modified 5-point scales could be a reliable tool for novice raters in determining an estimate of students' writing ability.

5) Do individual raters show harsh or lenient bias towards particular groups of writers? If so, what are the sub-patterns of ratings in terms of rater-student bias interaction for each rater?

Since rater-student interactions where z-score values fall below -2.0 or above 2.0 indicate a significant bias, there were a total of nine significantly biased interactions among all raters. Tables 12-16 show the results of the bias analysis in terms of the interaction between rater severity and student ability. In column 9, the infit mean square value shows how consistent the bias pattern is for the rater evaluating that student's ability across all the scoring. In this case, the mean of the infit mean square values was 0.4 and the standard deviation was 0.5; thus fit values above 1.4 suggest misfit (0.4 + [0.5 × 2]). As shown in Table 13, one interaction (Student 10 × Rater 2) was identified as misfitting, its infit mean square value being above 1.4. The mean-square fit statistics report on how much misfit there is in the data after the bias is removed (Linacre, 2002).
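Both flagging rules used in this section (|z| > 2.0 for significant bias, and an infit cutoff of the mean plus two standard deviations, here 1.4) can be stated compactly; a sketch applied to Rater 2's interactions from Table 13:

```python
import numpy as np

def flag_bias(z_scores, infits, fit_cutoff=1.4):
    """Flag significant bias (|z| > 2.0) and inconsistent bias patterns
    (infit mean square above the cutoff; here 0.4 + 2 x 0.5 = 1.4)."""
    z = np.asarray(z_scores, dtype=float)
    fit = np.asarray(infits, dtype=float)
    return np.abs(z) > 2.0, fit > fit_cutoff

# Rater 2 (Table 13): all three interactions are significantly biased,
# and the Student 10 interaction (infit 3.0) is also misfitting.
biased, misfitting = flag_bias([2.39, 2.16, -2.21], [0.0, 3.0, 0.8])
print(biased, misfitting)
```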

Table 12. Bias calibration report: rater-student interaction for Rater 1

Student  Ability (logits)  Observed  Expected  Obs-Exp avg  Bias (logits)  Error  z-score  Infit MnSq
1         0.18             12         9.7       0.78         2.71          1.12    2.41    0.0

Table 13. Bias calibration report: rater-student interaction for Rater 2

Student  Ability (logits)  Observed  Expected  Obs-Exp avg  Bias (logits)  Error  z-score  Infit MnSq
17       -0.07             12         9.7       0.77         2.68          1.12    2.39    0.0
10        0.18             12         9.9       0.70         2.43          1.12    2.16    3.0*
12        3.75             11        12.9      -0.63        -2.32          1.05   -2.21    0.8

Note. * = misfitting

Table 14. Bias calibration report: rater-student interaction for Rater 3

Student  Ability (logits)  Observed  Expected  Obs-Exp avg  Bias (logits)  Error  z-score  Infit MnSq
6         4.00             14        12.0       0.68         2.65          1.24    2.14    0.8

Table 15. Bias calibration report: rater-student interaction for Rater 4

Student  Ability (logits)  Observed  Expected  Obs-Exp avg  Bias (logits)  Error  z-score  Infit MnSq
9         6.96             13        14.5      -0.51        -2.32          1.12   -2.06    0.4
7         2.28              9        11.1      -0.69        -2.52          1.20   -2.10    0.0
1         0.18              6         9.3      -1.10        -4.68          1.45   -3.23    0.0

Table 16. Bias calibration report: rater-student interaction for Rater 5

Student  Ability (logits)  Observed  Expected  Obs-Exp avg  Bias (logits)  Error  z-score  Infit MnSq
11       -6.27              7         5.7       0.44         2.36          1.15    2.05    1.0

Table 17 summarizes the frequencies of rater-student interactions that displayed a significant bias for each rater at various levels of the ability range. The first column shows the ability estimate range, and the second column shows the number of students within that range. In the range of 3.00 or higher, there were four students. Raters 2 and 4 each harshly scored one student, and Rater 3 leniently scored one student. The total number of rater-student bias interactions was three, which was 75% of the number of students within this range (3/4 = 0.75). There were twelve students whose ability estimate was between -2.99 and 2.99. Rater 4 harshly scored two students, Rater 1 leniently scored one student, and Rater 2 leniently scored two students. The total number of rater-student bias interactions was five, which was 42% of the number of students within this range (5/12 = 0.42). In the range of -3.00 or lower, there were five students, and Rater 5 leniently scored one student. This was the only rater-student bias interaction, which was 20% of the number of students within this range (1/5 = 0.20).

Table 17. Frequency of rater-student bias interaction

                       Harsh                 Lenient
Ability           N    R1  R2  R3  R4  R5    R1  R2  R3  R4  R5
3.00 or higher    4        1       1                 1
-2.99 to 2.99     12               2         1   2
-3.00 or lower    5                                          1

Tables 12-16 also indicate that each rater had the following unique rater-student bias pattern:

- Rater 1: There was one more leniently scored student than expected for Rater 1. The leniently scored student was of middle-range ability (between -2.99 and 2.99).
- Rater 2: There were both more harshly and more leniently scored students than expected for Rater 2. The harshly scored student was a high-ability student (3.00 or higher), and the leniently scored students were of middle-range ability.
- Rater 3: As in the case of Rater 1, there was one more leniently scored student than expected for Rater 3. Unlike Rater 1, the leniently scored student was one with high ability.
- Rater 4: There were more harshly scored students than expected for Rater 4. The harshly scored students included one student with high ability and two students with middle-range ability.
- Rater 5: As in the cases of Raters 1 and 3, there was one more leniently scored student than expected for Rater 5. Unlike those two raters, the leniently scored student had low ability.

6) Do the raters score one task more harshly or more leniently than the other? If so, what are the sub-patterns of ratings in terms of rater-task bias interaction for each rater?

Table 18. Bias calibration report: rater-task interaction

Rater  Task             Observed  Expected  Obs-Exp avg  Bias (logits)  Error  z-score  Infit MnSq
5      Communicability  71        68.0       0.15         0.57          0.44    1.31    0.9
2      Accuracy         67        65.1       0.10         0.43          0.47    0.92    1.3
4      Communicability  64        62.2       0.09         0.34          0.43    0.79    1.7*
3      Accuracy         60        58.5       0.07         0.32          0.47    0.68    0.7
3      Impression       61        59.9       0.05         0.23          0.47    0.49    0.9
2      Impression       67        66.4       0.03         0.13          0.47    0.27    0.6
1      Communicability  65        64.7       0.01         0.05          0.44    0.12    0.6
1      Accuracy         64        63.8       0.01         0.04          0.47    0.08    0.7
4      Impression       63        63.0       0.00        -0.01          0.47   -0.01    1.0
1      Impression       65        65.2      -0.01        -0.04          0.47   -0.09    0.6
5      Impression       67        68.0      -0.05        -0.22          0.47   -0.46    0.8
5      Accuracy         65        66.6      -0.08        -0.36          0.47   -0.76    1.1
4      Accuracy         60        61.7      -0.08        -0.36          0.47   -0.78    0.8
2      Communicability  64        66.2      -0.11        -0.41          0.43   -0.95    0.7
3      Communicability  56        58.6      -0.13        -0.49          0.44   -1.12    1.0

Note. * = misfitting

Table 18 shows the results of the bias analysis in terms of the interaction between raters and tasks. It lists all rater-task interactions (5 raters × 3 tasks). In column 8, there is neither a z-score below -2.0 nor one above +2.0, suggesting that no rater shows significantly biased rater-task interactions. In column 9, the mean of the infit mean square values was 0.9 and the standard deviation was 0.3; thus fit values above 1.5 suggest misfit (0.9 + [0.3 × 2]). The value for Rater 4 on communicability was 1.7, which indicates that Rater 4 did not consistently evaluate this task in the identified pattern of bias across all students.

7) To what extent, statistically, is the task-based writing test a reliable and valid measure?

(1) Reliability
In the first analysis (Table 7), the data set was analyzed using FACETS. The table provided information on the characteristics of raters (severity and consistency). All raters displayed acceptable levels of self-consistency. This can be seen from the Infit Mean Square column, by adding two standard deviations to the mean: raters falling within these parameters in their reported infit mean square indices are considered to have behaved consistently. On the other hand, the separation and reliability figures indicate that there were significant differences among raters in terms of severity. However, the difference, based on fair average scores, is 0.44 of one grade on the scale, suggesting that there would be no impact on scores awarded in an operational setting. The analysis of the two tasks and the overall impression in Table 8 shows that no significant difference occurs between the tasks and the impression. The adjacent scale levels on the two tasks may indicate that the tasks do not separate the students to a significant degree.

(2) Validity
In Table 8, an estimate of the item discrimination was computed according to the "Generalized Partial Credit Model" approach. The expected value is 1.0, but discriminations in the range of 0.5 to 1.5 provide a reasonable fit to the Rasch model (Linacre, 2007, p. 132). All the estimates fall in this range (1.04, 1.00, 1.24), which indicates that the randomness in the three sets of data fits the Rasch model. The two tasks and the overall impression were, therefore, of relevance to dependable data acquisition.

There is also criterion-related validity evidence as to the three scores of each rater and the Criterion score. Table 19 shows the resulting correlation coefficients for the relationship between each rater's three scores and the Criterion score; they were statistically significant (p < .01) for Task 1, Task 2 and overall impression. The substantial correlations demonstrate that the task-based writing performance is related to learner performance on Criterion, a widely used writing instruction tool, and thus support its validity.

Table 19. Correlation coefficients between raters' scores and the Criterion score

     Rater 1  Rater 2  Rater 3  Rater 4  Rater 5  Av.
T1   .67      .74      .78      .72      .68      .72
T2   .79      .67      .67      .64      .70      .70
OI   .68      .75      .81      .68      .76      .74

5. Discussion

5.1 Summary of the analyses

The five junior high school teachers in this study were all novice raters in this task-based writing assessment. The inter-rater correlation coefficients between pairs of raters were relatively high, and the five raters appeared to be of acceptable reliability. The FACETS analysis showed that the raters displayed acceptable levels of self-consistency. There were, however, significant differences between raters in terms of severity. The bias analyses indicated that all raters were significantly biased towards certain types of students, and their bias patterns were unique. Moreover, one rater-student interaction and one rater-task interaction were identified as misfitting, so these raters were not consistent in the identified patterns of bias across the students or tasks. The FACETS analysis also showed that there was no significantly different scoring among the two tasks and the overall impression. The adjacent scale levels on the two tasks may indicate that the test tasks did not separate the students to a significant degree, and thus the test tasks are roughly equivalent in difficulty when learner responses to the two tasks are scored on the accuracy and communicability scales. Since the 5-point scales demonstrated acceptable fit, the five categories and their specific written samples were mostly comprehensible and usable by raters. Therefore, it is quite likely that the assessment tasks and rating scales were reliable in determining an estimate of students' writing ability.

5.2 Implications

The findings of this study suggest that the TBWT scoring guide may have effectively given novice raters a shared understanding of the construct of writing ability defined by the test writer, and may have contributed to consistency in scoring and to the reduction of biased interactions with tasks. It is, therefore, reasonable to suppose that the TBWT scoring guide may reduce the differences or biases caused by variation among raters. In order to confirm this, a questionnaire was also administered by mail to the five teacher-raters in this study. It was designed to be completed in a short time, and most of the questions were multiple choice; the last question invited comments and opinions on the whole guidebook. Questions 1 through 3 were answered on a 3-point scale from 1 (No, not useful) to 3 (Yes, very useful). Questions 4 through 9 were answered on a 4-point scale from 1 (Strongly disagree) to 4 (Strongly agree).

As shown in Tables 20 and 21, the results of the questionnaire show that the five teacher-raters felt that the TBWT scoring guide was fairly useful. According to the FACETS analysis, category 2 of the accuracy scale was slightly larger than expected; in other words, with increasing measure, category 2 is the most likely to be observed in accuracy scoring among the five novice raters. This may be in agreement with the finding that novice raters tend to be more severe than experienced raters, as some previous studies have shown (Ruth & Murphy, 1988; Weigle, 1994, 1998). Whereas there seems to be room for improvement in the scale, the questionnaire survey indicated that the five categories and their specific written samples were mostly comprehensible and usable, and the guidelines are supposed to lead the raters to self-consistency and a reduction of biased interactions with tasks. Therefore, it is quite likely that the TBWT scoring guide has effectively given novice raters a shared understanding of the construct of writing ability and has contributed to consistency in determining an estimate of students' writing ability.

Table 20. Results of scaling in Questionnaire (Q. 1-3)

Questions                             Very useful  Useful    Not useful
1. Is the introduction useful?        2 (40%)      3 (60%)   0
2. Is the task explanation useful?    5 (100%)     0         0
3. Is the scoring procedure useful?   5 (100%)     0         0

Table 21. Results of scaling in Questionnaire (Q. 4-9)

Questions                                                 Strongly agree  Agree    Disagree  Strongly disagree
4. The definition of accuracy is understandable.          2 (40%)         3 (60%)  0         0
5. The accuracy scale is easy to evaluate.                1 (20%)         3 (60%)  1 (20%)   0
6. The samples for accuracy are useful.                   3 (60%)         2 (40%)  0         0
7. The definition of communicability is understandable.   2 (40%)         2 (40%)  1 (20%)   0
8. The communicability scale is easy to evaluate.         1 (20%)         4 (80%)  0         0
9. The samples for communicability are useful.            2 (40%)         3 (60%)  0         0

The present study, however, indicates that there were significant biased interactions with students' ability among all five raters. Each rater was found to be self-consistent in scoring the 20 students' writing performances, but all of the raters had a unique bias pattern toward a certain type of student. These findings suggest that the TBWT scoring guide may have contributed to the reduction of biased interactions, but training for certain raters with their unique bias patterns might still be required. On this point, Schaefer (2008), for example, states that the rating process is so complex and error-prone that intentional and programmatic training for rating is necessary.

Moreover, one rater-student interaction and one rater-task interaction were identified as misfitting, so these raters were not consistent in the identified patterns of bias across the students or tasks. As for the rater-student interaction, Rater 2 commented that it was not easy for her to understand the differences between "few errors" and "occasional errors" in the descriptors of the accuracy scale. For the rater-task interaction, Rater 4 commented on the introduction of the scoring guide that the conformity between the task and its underlying construct is not easy to understand, and that a detailed explanation is necessary. In association with this comment, she disagreed in the questionnaire that the definition of communicability is understandable, and reported that she rated the communicability task mostly depending on grammar and vocabulary. Although such a unique response pattern seems to reflect individual rater idiosyncrasies, these findings would be beneficial in improving our understanding of rater behavior and in providing more focused direction in rater training. As Eckes (2008) points out, the consistency with which each rater uses particular scoring criteria should be examined more precisely. For this purpose, research viewed from a rater cognition perspective during the evaluation of written samples must be conducted hereafter.

6. Conclusion

The results of the five novice raters in this study indicated that all raters displayed acceptable levels of self-consistency, and that the students' ability was effectively measured using these tasks and rating scales. The FACETS analysis showed that there was no significantly different scoring on the two tasks and the overall impression, which provided reasonable fit to the Rasch model. The questionnaire survey also indicated that the five categories and their specific written samples were mostly comprehensible and usable by raters, and the 5-point scales demonstrated acceptable fit. This is because the TBWT scoring guide gave novice raters a shared understanding of the construct of writing ability and contributed to consistency in scoring. Therefore, it is quite likely that the assessment tasks and rating scales were reliable in determining an estimate of students' writing ability.

There were, however, relatively small but significant differences between raters in terms of severity. Each rater was found to be self-consistent in scoring, but the bias analyses of this study also indicated that all of the five raters were significantly biased towards certain types of students. These findings suggest that the TBWT scoring guide may have contributed to the reduction of biased interactions, but training for certain raters with their unique bias patterns might still be required. Moreover, the consistency with which each rater uses particular scoring criteria should be examined more precisely. For this purpose, research viewed from a rater cognition perspective during the evaluation of written samples must be conducted. These issues will be examined in a subsequent study.

Acknowledgement
The present research was supported in part by a Grant-in-Aid for Scientific Research for 2010-2012 (No. 22520572) from the Japan Society for the Promotion of Science. I am grateful to the anonymous JLTA reviewers for suggestions in revising the article.

References

Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge: Cambridge University Press.
Bachman, L. F. (2002). Some reflections on task-based language performance assessment. Language Testing, 19, 453-476.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests. Oxford: Oxford University Press.
Brown, J. D. (1996). Testing in language programs. Upper Saddle River, NJ: Prentice Hall.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Davies, A. (1990). Principles of language testing. Basil Blackwell.
Eckes, T. (2008). Rater types in writing performance assessments: A classification approach to rater variability. Language Testing, 25, 155-185.
Linacre, J. (2002). Guidelines for rating scales. MESA Research Note 2. Available at http://www.rasch.org/rn2.htm (accessed 18 March 2008).
Linacre, J. (2007). A user's guide to FACETS: Rasch-model computer program. Chicago, IL: MESA Press.
Linacre, J. (2008). Facets, version 3.63 [Computer program]. Chicago, IL: MESA Press.
Ruth, L., & Murphy, S. (1988). Designing writing tasks for the assessment of writing. Norwood, NJ: Ablex Publishing Corp.
Schaefer, E. (2008). Rater bias patterns in an EFL writing assessment. Language Testing, 25, 465-493.
Skehan, P. (1998). A cognitive approach to language learning. Oxford: Oxford University Press.
Sugita, Y. (2009). The development and implementation of task-based writing performance assessment for Japanese learners of English. Journal of Pan-Pacific Association of Applied Linguistics, 13(2), 77-103.
Tyndall, B., & Kenyon, D. M. (1995). Validation of a new holistic rating scale using Rasch multifaceted analysis. In A. Cumming & R. Berwick (Eds.), Validation in language testing (pp. 39-57). Clevedon, England: Multilingual Matters.
Weigle, S. (1994). Effects of training on raters of ESL compositions. Language Testing, 11, 197-223.
Weigle, S. (1998). Using FACETS to model rater training effects. Language Testing, 15, 263-287.

Appendix A: Assessment tasks for testing

1. Task 1 (accuracy)
- Rubric: This is a test of your ability to write a coherent and grammatically correct paragraph. You will have 20 minutes to complete the test.
- Prompt: You are going to stay with the Parker family in Britain this summer. Write a 100-120 word letter introducing yourself to your host family. Before writing, think of the following topics:
  - Your name and age
  - Your job and major in school
  - Your family and pet
  - Your interests and hobbies
  - Your favorite places, foods and activities
  - Your experience traveling abroad
  - Some things you want to do while you are in Britain

2. Task 2 (communicability)
- Rubric: This is a test of your ability to communicate in writing without causing the reader any difficulties. You will have 10 minutes to complete the test.
- Prompt: You are going to discuss the following topic with your classmates: "Why do you study English?" In order to prepare for the discussion, think of as many answers as possible to the question and write them down, as in "To travel abroad."

Appendix B: Rating scales

[Accuracy]

