Reliability and Validity of a Task-based Writing Performance Assessment for Japanese Learners of English


Yoshihito SUGITA

Yamanashi Prefectural University

Abstract

This article examines the main data of a task-based writing performance test in which five junior high school teachers participated as novice raters. The purpose of this research is to implement a task-based writing test (TBWT) which was developed on the basis of a construct-based processing approach to testing, and to examine the degree of reliability and validity of the assessment tasks and rating scales. Accuracy and communicability were defined as constructs, and the test development proceeded in three stages: designing and characterizing writing tasks, reviewing existing scoring procedures, and drafting rating scales. Each of the forty scripts collected from twenty undergraduate students was scored by five new raters, and the analyses were done using FACETS. The results indicated that all novice raters displayed acceptable levels of self-consistency, and that there was no significant difference in scoring on the two tasks and overall impression, which provided reasonable fit to the Rasch model. The modified scales associated with the five rating categories and their specific written samples were shown to be mostly comprehensible and usable by raters, and demonstrated that the students' ability was effectively measured using these tasks and rating scales. However, further research is necessary to consider how inter-rater differences can be eliminated.

Key Words: writing performance, task-based assessment, FACETS, reliability, validity

1. Introduction

In Japan, the English language has traditionally been taught with a focus on accuracy, and indirect measurement is widely used in the field of assessment. There seems to have been a paradigm shift from accuracy-oriented to fluency-oriented writing instruction, but no significant changes have occurred in the assessment of writing. Judging from the present state of teaching and assessing writing in Japan, it would be meaningful to develop scoring procedures for writing performance assessment in place of traditional indirect tests of writing. The main purpose of this study is 1) to implement a developed writing performance assessment, and 2) to examine the degree of reliability and validity of the assessment tasks and rating scales. It is motivated by an urgent need for an improved assessment of writing, which is conducted in order to develop a task-based writing test for [...]


2. Development of a task-based writing test (TBWT)

2.1 Construct-based processing approach to testing

As Bachman and Palmer (1996) mentioned, the primary purpose of a language test is to make inferences about language ability. The ability that we want to test is defined as a construct, and describing the construct is one of the most fundamental concerns in test development. When assessing writing, it is therefore necessary to address the issue of defining the construct of good writing or the writing ability of our students.

Skehan (1998) claimed that the processing perspective is relevant to how we directly relate underlying abilities to performance, as well as how we conceive of models of language ability. In this view, he defines "ability for use" as a construct, which rationalizes the use of tasks as a central unit within a testing context and in developing a performance test. According to Skehan, such a task-based approach to testing would be "to assume that there is a scale of difficulty and that students with greater levels of underlying ability will then be able to more successfully complete tasks which come higher on such a scale of difficulty" (p. 174). In this assumption, we find that task difficulty is a major determinant of test performance. Task-based approaches, therefore, need to focus on task difficulty as a precondition for using tasks-as-tests, and on methods of evaluating task-based performance.

Bachman (2002), however, claimed that task difficulty can be found with the various components in a performance task and with the interactions among them, and thus task difficulty is not a separate factor and is no longer assumed to be a major determinant of test performance. Therefore, he emphasized that the task-based approach has to consider not only performances on tasks, but also abilities to be assessed. In this way, Bachman argues that the notion of a construct-based approach to testing is also necessary for test development, and mentions that the most important thing is to integrate tasks and constructs in the design and development of a particular assessment.

Here, we notice that there is considerable validity in the integration of construct-based task development and task implementation based on the operation of the processing factors and the influences of the processing conditions. In other words, when we develop assessment tasks, it is reasonable to suppose that we should design the task on the basis of construct definition and processing perspectives. Thus, the so-called construct-based processing approach to testing results in a comprehensive framework for our test development. The characteristic features of this approach are: 1) it must consider both constructs and tasks in developing performance assessment (Bachman, 2002); 2) procedures for design, development and use of language tests must incorporate both a specification of the assessment tasks to be included and definitions of the abilities to be assessed (Alderson et al., 1995; Bachman & Palmer, 1996; Brown, 1996); 3) tasks should be conceptualized as sets of characteristics (Bachman, 2002), and task characteristics should be designed to consider performance on tasks in terms of the operation of the processing factors and the influences of the processing conditions (Skehan, 1998); and 4) the processing factors that affect performance, such as communicative stress, should be utilized in order to control the processing conditions, which involve the interaction of test-taker attributes.


2.2 Construct definition

The constructs of our task-based writing test developed for this study are assumed to be accuracy and communicability. Both constructs are derived from the Bachman and Palmer framework (1996) and Skehan's processing perspective on testing (1998).

[Figure 1. The construct structure of accuracy: the rule-based system (accuracy, complexity) and organizational knowledge (grammatical knowledge, textual knowledge)]

As shown in Figure 1, accuracy shares the rule-based system in terms of processing perspectives, and has a deep connection with organizational knowledge, which consists of grammatical and textual knowledge. Grammatical knowledge "is involved in producing or comprehending formally accurate utterances or sentences," and textual knowledge "is involved in producing or comprehending texts that consist of two or more utterances or sentences" (Bachman & Palmer, 1996, p. 68). Based on these two areas of organizational knowledge, it is proposed here that the construct accuracy specialized for writing would be comprised of organizational skills and linguistic accuracy. Specifically, organizational skills can be defined as the ability to organize a logical structure which enables the content to be accurately acquired, and linguistic accuracy concerns errors of vocabulary, spelling, punctuation or grammar (Sugita, 2009).

[Figure 2. The construct structure of communicability]

Figure 2 indicates the construct structure of communicability. We realize that the construct shares the exemplar-based system in terms of processing perspectives, and its basis is pragmatic knowledge, which consists of functional and sociolinguistic knowledge. Functional knowledge "enables us to interpret relationships between utterances or sentences and texts and the intentions of language users" (Bachman & Palmer, 1996, p. 69). Sociolinguistic knowledge enables us to create or interpret language that is appropriate to a particular language use setting (p. 70). Based on these definitions and the processing perspectives, the term communicability is defined as fluency specialized for writing, which is comprised of communicative quality and effect. Communicative quality refers to the ability to communicate without causing the reader any difficulty, and communicative effect concerns the quantity of ideas necessary to develop the response as well as the relevance of the content to the proposed task (Sugita, 2009).

2.3 Procedures for developing the TBWT

Utilizing a construct-based processing approach, the test development proceeded according to the following three stages:

Stage 1: Designing and characterizing writing tasks

With regard to processing perspectives (Skehan, 1998), content-based support and form-focused stakes are necessary for accuracy tasks. An elicitation task (writing a letter) was chosen, and specific topics of self-introduction were given in the task. A situation is supposed in which the student is going to stay with a host family in Britain and is advised to write a letter, so that students can focus on writing accuracy. On the other hand, communicability tasks need form-oriented support and meaning-focused stakes in order for students to write with a focus on meaning. A discussion task was designed because it encourages students to write their opinions or ideas about the topic, and it lays emphasis on a meaning-focused response (see the specifications in Appendix A).

According to Bachman and Palmer (1996), characteristics of the input and the expected response in a test task are closely concerned with the operation of the processing factors and influences of the processing conditions for task implementation. In view of the construct-based processing approach to testing, the TBWT needs to develop such task characteristics in order to adjust students to performance conditions in which they allocate their attention in appropriate ways. Specifically, the characteristics of our accuracy task require students to write a 100-120 word letter in adequate time in order that the rule-based system can be accessed, and the characteristics of our communicability task encourage students to write as many answers to a discussion topic as possible in a very limited time in order that an exemplar-based system might be appropriate.

Stage 2: Reviewing existing scoring procedures for assessing writing

In the construction of rating scales, especially when resources are limited as is the case in this study, it seems meaningful to selectively use well-developed and well-researched scales from outside Japan as a reference. For this purpose, existing scoring procedures for assessing writing were considered in order to explore what types of procedures are more suitable for constructing rating scales. We examined and collected the descriptors of the TOEFL Test of Written English (TWE), Cambridge First Certificate in English (FCE), ESL Composition Profile, NAEP Scoring Guide and Michigan Writing Assessment according to the target context in which the TBWT is administered. As a consequence, the TBWT is as construct-relevant as multiple trait scoring, and its procedure is similar to [...] combined procedure, the two assessment tasks and their criteria exist independently, and thus raters are required to make only one decision for every script in the same way they do in holistic scoring.

Stage 3: Drafting rating scales

The underlying competences served as a useful basis when developing rating scales for accuracy and communicability. The descriptors of the marking categories in each scale were collected from existing writing assessments such as the TOEFL Test of Written English (TWE) and Cambridge First Certificate in English (FCE). By conforming one construct closely to the definition of its rating scale, it is fair to say that raters would use the scale appropriately and consistently, ensuring the reliability and validity of the writing assessment. According to Alderson et al. (1995), raters should understand the principles behind the particular rating scales they must work with, and be able to interpret their descriptors consistently. Therefore, the rating scales are comprised of clearer descriptions of each construct and of 5-point Likert scales (Appendix B). The descriptors of each category are also provided with the selected written samples as an explanatory part of the scale in order that busy school teachers with limited training in writing performance assessment could understand the descriptors and work with them consistently.

3. The Study

3.1 Purposes and research questions

In order to examine the degree of reliability and validity of the task-based writing performance test, the study focuses on the following: rater severity, bias interactions between writer ability and task difficulty, the reliability of elicitation tasks and rating scales, and the measure's validity.

The specific research questions are as follows:

1) Is student ability effectively measured?

2) Are teacher-raters equally severe?

3) How much do tasks that are designed to be equivalent actually differ in difficulty?

4) How well do scales conform to expectations about their use? Do raters use all parts of them, and use them consistently?

5) Do individual raters show harsh or lenient bias towards particular groups of writers? If so, what are the sub-patterns of ratings in terms of rater-student bias interaction for each rater?

6) Do the raters score one task more harshly or more leniently than the other? If so, what are the sub-patterns of ratings in terms of rater-task bias interaction for each rater?

7) To what extent, statistically, is the task-based writing test a reliable and valid measure?

3.2 Procedure

The data for this study were 40 scripts (20 scripts for each of two tasks) collected from twenty undergraduate students in the first semester of 2008. Each of the scripts was scored by five raters, who were all experienced Japanese junior high school teachers of English. [...] qualifications of ten or more years of teaching experience. The TBWT scoring guide was edited for this testing. The first section is the background of the TBWT. The second section is the explanation of assessment tasks. The third section is the implementation method of the testing. The fourth section is comprised of the rating scales and written samples accompanied by detailed commentary on each sample at five levels, 1-5. Both scripts and the scoring guide were given to the raters by mail at the beginning of August, 2008. Each of the five raters rated the entire set of forty scripts and sent them back by the end of August, 2008. They were instructed to rate the 20 scripts of Task 1 first and then to rate the 20 scripts of Task 2. Finally, they were asked to rate each of the participants' writing proficiency based on their overall impression at five levels, 1-5. A questionnaire about the scoring guide was also enclosed and sent back with the materials.

3.3 Data analysis

Tables 1, 2 and 3 show the descriptive statistics for the scores of the two test tasks and the overall impression. Table 4 summarizes the inter-rater correlation coefficients for the different scoring. Since the average of the coefficients for each scoring is relatively high (0.78, 0.77, 0.79), the five raters appear to be of acceptable reliability.
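The pairwise coefficients and their average can be illustrated with a minimal sketch. It assumes plain Pearson correlations averaged arithmetically, which the text does not state explicitly; the function names and the ratings below are hypothetical and are not the study's data. With five raters the same routine would yield the ten rater pairs listed in Table 4.

```python
import math
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pairwise_correlations(scores_by_rater):
    """Return {(rater_a, rater_b): r} for every rater pair and the mean of those r values."""
    pairs = combinations(sorted(scores_by_rater), 2)
    table = {(a, b): pearson(scores_by_rater[a], scores_by_rater[b]) for a, b in pairs}
    return table, mean(list(table.values()))


# Hypothetical ratings (NOT the study's data): three raters, six scripts, scale 1-5.
demo_scores = {
    "R1": [3, 4, 2, 5, 3, 1],
    "R2": [3, 5, 2, 4, 3, 2],
    "R3": [2, 4, 1, 5, 3, 1],
}
pair_r, average_r = pairwise_correlations(demo_scores)
for pair, r in pair_r.items():
    print(pair, round(r, 2))
print("average:", round(average_r, 2))
```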

Table 1 Descriptive statistics for Task 1

           Rater 1   Rater 2   Rater 3   Rater 4   Rater 5
Mean       3.20      3.35      3.00      3.00      3.25
SD         0.92      1.10      1.14      0.70      0.88
Minimum    2.0       2.0       1.0       2.0       2.0
Maximum    5.0       5.0       5.0       4.0       5.0

Table 2 Descriptive statistics for Task 2

           Rater 1   Rater 2   Rater 3   Rater 4   Rater 5
Mean       3.25      3.20      2.80      3.20      3.55
SD         1.08      0.92      1.24      1.16      0.97
Minimum    1.0       2.0       1.0       1.0       2.0
Maximum    5.0       5.0       5.0       5.0       5.0

Table 3 Descriptive statistics for overall impression

           Rater 1   Rater 2   Rater 3   Rater 4   Rater 5
Mean       3.25      3.35      3.05      3.15      3.35
SD         0.88      1.01      1.16      0.85      0.90
Minimum    2.0       2.0       1.0       2.0       2.0
Maximum    5.0       5.0       5.0       5.0       5.0

Table 4 Inter-rater correlation coefficients between pairs of raters

      1/2   1/3   1/4   1/5   2/3   2/4   2/5   3/4   3/5   4/5   Av.
T1    .75   .85   .68   .78   .87   .76   .82   .80   .79   .71   .78
T2    .84   .91   .66   .86   .81   .70   .70   .71   .83   .74   .77
OI    .79   .81   .67   .88   .92   .80   .78   .80   .78   .64   .79

Note. T1 = Task 1; T2 = Task 2; OI = overall impression.


Table 5 reports results for each test task, the overall impression and the scores of Criterion, including their means and standard deviations. In the testing, twenty students were requested to submit an essay using Criterion in order to discuss the criterion-related validity by examining their scores and those with a high reliability provided by Criterion. The mean scores for the test tasks and the overall impression are very close, ranging from 3.16 to 3.23. The alpha coefficients for the three variables were calculated. Using Davies' cut-off (.90) as an acceptable level of internal consistency on a high-stakes test, each Cronbach's α would meet the point: .9434, .9432 and .9480 for Task 1, Task 2 and overall impression, respectively.
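As an illustration of how such an alpha coefficient is obtained, the following is a minimal sketch that treats the raters as the "items" and the scripts as the cases. That arrangement is an assumption, since the text does not spell out how the coefficients were computed, and the ratings shown are hypothetical rather than the study's data.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a list of rows (one row per script, one score per rater)."""
    k = len(ratings[0])                      # number of raters ("items")
    columns = list(zip(*ratings))            # scores regrouped per rater
    item_variance = sum(variance(col) for col in columns)
    total_variance = variance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - item_variance / total_variance)


# Hypothetical ratings (NOT the study's data): five scripts scored by five raters.
demo_ratings = [
    [3, 3, 3, 3, 4],
    [2, 3, 2, 2, 3],
    [5, 5, 4, 4, 5],
    [1, 2, 1, 2, 2],
    [4, 4, 3, 3, 4],
]
alpha = cronbach_alpha(demo_ratings)
print(round(alpha, 3))
print("meets the .90 cut-off:", alpha >= 0.90)
```

The same function applied to a 20 x 5 matrix of scores for one task would give a single coefficient of the kind reported above.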

Table 5 Descriptive statistics of the different scoring

           Task 1   Task 2   Impression   Criterion TWE
N          100      100      100          20
Mean       3.16     3.20     3.23         2.30
SD         0.97     1.11     0.97         0.78
Minimum    1.0      1.0      1.0          1.0
Maximum    5.0      5.0      5.0          4.0

Table 6 Pearson correlation coefficients

              Task 1    Task 2    Impression
Task 2        .797**
Impression    .924**    .884**
Criterion     .710**    .678**    .734**

Note. ** all correlations significant at 0.01 level

As seen in Table 6, the correlation coefficients among the two tasks and the overall impression fall in a range of .797 to .924, all of which are significant at the 0.01 level. The correlation between the two test tasks (.797) is, however, slightly lower than the established estimate of reliability (.80). Table 6 also shows that the two tasks and overall impression correlate positively with the scores of Criterion (p < .01). Such correlation gives criterion-related validity evidence as to the three scores of each rater and the Criterion score. The highest correlation for Criterion is with Impression (r = .734), followed by Task 1 (r = .710) and finally, Task 2 (r = .678). Since the indicators of effect size are more than 0.5, indicating a large effect (Cohen, 1988), the test tasks are determined to be valid.

Test data are influenced by errors of measurement resulting from variation in rater harshness and test tasks, as well as by the nature of the rating scale used and by the range of ability of the students who are being assessed. Therefore, it was necessary to use statistical models which take into account all of the factors that might affect a student's final score. The analyses for the present study were done using FACETS, version 3.63 (Linacre, 2008). To examine the measurement characteristics of this testing, the data were specified as having three facets, namely, the ability of students, the difficulty of tasks and the severity of raters. The partial-credit model was chosen because the scoring criteria for the rating scales were qualitatively different. This analysis also offers a kind of analysis that detects biased [...] certain types of students or tasks, the results of a bias analysis will be presented.

4. Results

4.1 FACETS summary

[Figure 1. FACETS summary: a vertical logit ruler with columns for Measure, Raters, Students, Tasks (Accuracy, Communicability, Impression) and the three rating scales S.1, S.2 and S.3]

Figure 1 shows a summary of all facets and their elements. They are positioned on a common logit scale, which appears as "measure" in the first column. The second column shows the severity variation among raters. The most severe rater (ID: 3) is at the top, and the least severe rater (ID: 5) is at the bottom. The third column shows the ability variation among the 20 students. The students are ranked with high ability at the top (ID: 9) and low ability at the bottom (ID: 11). The fourth column shows equivalence of the difficulty variation among tasks. The last three columns graphically describe the three rating scales. The most likely scale score for each ability level is shown. For example, the student whose ability estimate is +1.0 logit on the logit scale (ID: 19) is likely to get 3 points on each task and the overall impression when the student is assessed by an average-severity rater.


4.2 FACETS analysis

1) Is student ability effectively measured?

As shown in Figure 1, student ability estimates range from a high of 6.96 logits to a low of -6.27 logits, indicating a spread of about 13 logits in terms of students' ability. The student separation value was 6.54, meaning that populations like the students in this study can be spread into about seven levels. The reliability index was .98, which demonstrates the possibility of achieving reliable ability scores.
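The separation value and the reliability index are two views of the same quantity. As a brief sketch, assuming the standard Rasch relation R = G^2 / (1 + G^2) between separation G and reliability R (the function names are illustrative and are not FACETS output fields), the reported figures can be checked against each other:

```python
import math


def reliability_from_separation(separation):
    """Separation reliability R = G^2 / (1 + G^2)."""
    return separation ** 2 / (1 + separation ** 2)


def separation_from_reliability(reliability):
    """Inverse relation: G = sqrt(R / (1 - R))."""
    return math.sqrt(reliability / (1 - reliability))


# Student facet: reported separation 6.54, reported reliability .98.
print(round(reliability_from_separation(6.54), 2))

# Rater facet: reported reliability of separation .82; the implied separation follows.
print(round(separation_from_reliability(0.82), 2))
```

Under this relation a separation of 6.54 corresponds to a reliability of roughly .98, which agrees with the value reported for the student facet.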

2) Are teacher-raters equally severe?

Table 7 FACETS analysis of rater characteristics

           Fair-M average   Severity (logits)   Error   Infit (mean square)
Rater 1    3.17             -1.07               .27     .63
Rater 2    3.25             -1.35               .27     .90
Rater 3    2.90               .09               .26     .91
Rater 4    3.06              -.59               .26     1.25
Rater 5    3.34             -1.70               .27     .96
Mean       3.14              -.92               .26     .93
SD          .15               .62               .00     .20

Note. Reliability of separation index = .82; fixed (all same) chi-square: 27.8, df: 4; significance: p = .00

Table 7 provides information on the characteristics of raters. From the left, each column shows rater IDs, fair average scores, rater severity, error and fit mean square value. The severity column indicates that the severity span between the most severe rater and the most lenient rater was 1.79 logits, and the difference, based on the fair average scores, is 0.44 of one grade in the scale. The reliability of the separation index (which indicates the likelihood with which raters consistently differ from one another in overall severity) was high (.82). The chi-square of 27.8 with 4 df was significant at p < .01 and, therefore, the null hypothesis that all raters were equally severe must be rejected. There was a significant difference in severity among raters. On the other hand, the Infit Mean Square column indicates that no raters were identified as misfitting: fit values for all raters were within the range of two standard deviations around the mean (0.93 ± [0.20 × 2]). In other words, all raters behaved consistently in the scoring.
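The misfit criterion used here (mean ± 2 SD of the infit mean squares) can be sketched as follows, using the infit values from Table 7. The helper name is hypothetical, and the use of the population standard deviation is an assumption; it appears to reproduce the reported SD of .20.

```python
from statistics import mean, pstdev


def misfit_band(values, width=2.0):
    """Return (center, spread, lower, upper) for a mean +/- width * SD band."""
    center = mean(values)
    spread = pstdev(values)  # population SD (assumed)
    return center, spread, center - width * spread, center + width * spread


# Infit mean squares for Raters 1-5 as reported in Table 7.
infit = {"Rater 1": 0.63, "Rater 2": 0.90, "Rater 3": 0.91, "Rater 4": 1.25, "Rater 5": 0.96}

center, spread, lower, upper = misfit_band(list(infit.values()))
flagged = [rater for rater, value in infit.items() if not lower <= value <= upper]

print("mean:", round(center, 2), "SD:", round(spread, 2))
print("acceptable band:", round(lower, 2), "to", round(upper, 2))
print("misfitting raters:", flagged)
```

With these values no rater falls outside the band, which matches the conclusion drawn in the text.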

3) How much do tasks that are designed to be equivalent actually differ in difficulty?

Table 8 Descriptive statistics on the different scoring

              Difficulty (logits)   Error   Infit (mean square)   Estimate of Discrimination
Task 1        -.08                  .21     .96                   1.04
Task 2        -.07                  .19     1.04                  1.00
Impression     .16                  .21     .77                   1.24
Mean           .00                  .21     .92
SD             .11                  .01     .12

Note. Reliability of separation index = .00; fixed (all same) chi-square: 0.9, df: 2; significance: p = .65

The analysis of the two test tasks and overall impression in Table 8 shows that no significant variation in difficulty exists among them. Raters are considered to be self-consistent in scoring and the tasks do not appear to separate the students to a significant degree, meaning that the difficulty of the two tasks and the overall impression are equivalent. An estimate of the item discrimination was computed according to the "Generalized Partial Credit Model" approach. 1.0 is the expected value, but discriminations in the range 0.5 to 1.5 provide a reasonable fit to the Rasch model (Linacre, 2007: 132).

4) How well do scales conform to expectations about their use? Do raters use all parts of them, and use them consistently?

Table 9 Rating scale statistics for Accuracy

Category Score   Average Measure   Outfit (mean square)   Step Difficulty
1                -6.26              .4
2                -3.24              .9                    -7.07
3                  .58             1.1                    -1.42
4                 3.76              .8                     2.28
5                 6.16             1.6                     6.22

[Figure 2. Probability Curves for Accuracy: category probability (vertical axis, 0 to 1) plotted against the logit measure (horizontal axis, -9.0 to 9.0) for categories 1-5]

Linacre (2002) has proposed guidelines for a rating scale: (1) average category measures should advance monotonically by category, (2) outfit mean-squares should be less than 2.0, and (3) the step difficulty of each scale should advance by at least 1.4 logits and by no more than 5.0 logits. Table 9 shows the rating scale statistics for accuracy. The average category measures rise monotonically by category, as expected. All outfit mean-squares are less than 2.0, meaning that each of the five categories has the expected randomness in choosing categories. However, the increase in step difficulties between 2 and 3 is 5.56, which does not meet (3). Figure 2 presents the scale structure probability curves, which visually demonstrate the frequency signals of each observed scale. In this [...] hilltops are observed. According to Tyndall and Kenyon (1995), the obvious peaks and the division between the scales indicate that the scales work as intended. The division between the scales is clear, but the probability and the range of scale 2 indicate that the scale does not work as intended.

Table 10 Rating Scale Statistics for Communicability

Category Score   Average Measure   Outfit (mean square)   Step Difficulty
1                -5.57              .6
2                -2.99             1.1                    -5.54
3                  .66             1.3                    -1.61
4                 3.05              .6                     2.26
5                 5.88             1.3                     4.90

[Figure 3. Probability Curves for Communicability: category probability (vertical axis, 0 to 1) plotted against the logit measure (horizontal axis, -9.0 to 9.0) for categories 1-5]

Table 10 shows the rating scale statistics for communicability. All outfit mean-squares are less than 2.0, which meets (2). All step difficulty increases fall between 1.4 and 5.0 logits, which meets (3). In Figure 3, the step difficulties increase monotonically with rating scale numbers and obvious hilltops are observed.
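The three guidelines attributed to Linacre (2002) in the text lend themselves to a mechanical check. The sketch below applies them to the communicability values from Table 10; the function name and the returned dictionary keys are hypothetical, and the thresholds (2.0, 1.4 and 5.0) are simply those quoted above.

```python
def check_rating_scale(average_measures, outfit_mean_squares, step_difficulties,
                       max_outfit=2.0, min_advance=1.4, max_advance=5.0):
    """Check a rating scale against the three guidelines quoted in the text."""
    advances = [round(b - a, 2) for a, b in zip(step_difficulties, step_difficulties[1:])]
    return {
        # (1) average category measures advance monotonically by category
        "monotonic_measures": all(a < b for a, b in zip(average_measures, average_measures[1:])),
        # (2) every outfit mean-square is below the ceiling
        "outfit_ok": all(value < max_outfit for value in outfit_mean_squares),
        # (3) each step difficulty advances by 1.4 to 5.0 logits
        "advances_ok": all(min_advance <= adv <= max_advance for adv in advances),
        "advances": advances,
    }


# Communicability scale, values as reported in Table 10.
communicability = check_rating_scale(
    average_measures=[-5.57, -2.99, 0.66, 3.05, 5.88],
    outfit_mean_squares=[0.6, 1.1, 1.3, 0.6, 1.3],
    step_difficulties=[-5.54, -1.61, 2.26, 4.90],
)
print(communicability)
```

For the communicability scale the successive step advances come out at 3.93, 3.87 and 2.64 logits, so all three checks pass, in line with the statement above.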

Table 11 Rating Scale Statistics for Overall Impression

Category Score   Average Measure   Outfit (mean square)   Step Difficulty
1                -6.02              .4
2                -3.57              .6                    -6.81
3                  .76              .6                    -1.66
4                 3.65              .9                     2.23
5                 6.61             1.0                     6.23

[Figure 4. Probability Curves for Overall Impression: category probability (vertical axis, 0 to 1) plotted against the logit measure (horizontal axis, -9.0 to 9.0) for categories 1-5]

Table 11 shows the rating scale statistics for impression. Average measures rise monotonically with each category. All outfit mean-squares are less than 2.0. All step difficulty increases fall between 1.4 and 5.0 logits, which meets (3). In Figure 4, the step difficulties increase monotonically with rating scale numbers, and obvious hilltops are observed. While category 2 of the accuracy scale was slightly larger than expected, the rating scales for communicability and overall impression conformed to expectations about their use. In sum, these modified 5-point scales could be a reliable tool for novice raters in determining the estimate of students' writing ability.

5) Do individual raters show harsh or lenient bias towards particular groups of writers? If so, what are the sub-patterns of ratings in terms of rater-student bias interaction for each rater?

Since rater-student interactions whose z-score values fall below -2.0 or above 2.0 indicate a significant bias, there were a total of nine significantly biased interactions among all raters. Tables 12-16 show the results of the bias analysis in terms of the interaction between rater severity and student ability. In column 9, the infit mean square value shows how consistent the bias pattern is for the rater in evaluating the student's ability across all the scoring. In this case, the mean of the infit mean square values was 0.4 and its standard deviation was 0.5. Thus fit values above 1.4 suggest misfit (0.4 + [0.5 × 2]). As shown in Table 13, one interaction (Student 10 × Rater 2) was identified as misfitting (its infit mean square value was above 1.4). The mean-square fit statistics report on how much misfit there is in the data after the bias is removed (Linacre, 2002). Therefore, it would be [...]


Table 12 Bias calibration report: rater-student interaction for Rater 1

Student   Ability (logits)   Observed score   Expected score   Obs-Exp Average   Bias (logits)   Error   z-score   Infit Mean Square
1          0.18              12                9.7              0.78              2.71           1.12     2.41     0.0

Table 13 Bias calibration report: rater-student interaction for Rater 2

Student   Ability (logits)   Observed score   Expected score   Obs-Exp Average   Bias (logits)   Error   z-score   Infit Mean Square
17        -0.07              12                9.7              0.77              2.68           1.12     2.39     0.0
10         0.18              12                9.9              0.70              2.43           1.12     2.16     3.0*
12         3.75              11               12.9             -0.63             -2.32           1.05    -2.21     0.8

Note. * = misfitting

Table 14 Bias calibration report: rater-student interaction for Rater 3

Student   Ability (logits)   Observed score   Expected score   Obs-Exp Average   Bias (logits)   Error   z-score   Infit Mean Square
6          4.00              14               12.0              0.68              2.65           1.24     2.14     0.8

Table 15 Bias calibration report: rater-student interaction for Rater 4

Student   Ability (logits)   Observed score   Expected score   Obs-Exp Average   Bias (logits)   Error   z-score   Infit Mean Square
9          6.96              13               14.5             -0.51             -2.32           1.12    -2.06     0.4
7          2.28               9               11.1             -0.69             -2.52           1.20    -2.10     0.0
1          0.18               6                9.3             -1.10             -4.68           1.45    -3.23     0.0

Table 16 Bias calibration report: rater-student interaction for Rater 5

Student   Ability (logits)   Observed score   Expected score   Obs-Exp Average   Bias (logits)   Error   z-score   Infit Mean Square
11        -6.27               7                5.7              0.44              2.36           1.15     2.05     1.0

Table 17 summarizes the frequencies of rater-student interactions that displayed a significant bias for each rater at various levels of the ability range. The first column shows the ability estimate range, and the second column shows the number of students within the particular range of ability estimate. In the range of 3.00 or higher, there were four students. Raters 2 and 4 each harshly scored one student, and Rater 3 leniently scored one student. The total number of rater-student bias interactions was three, which was 75% of the total number of students within this range (3/4 = 0.75). There were twelve students whose ability estimate was between -2.99 and 2.99. Rater 4 harshly scored two students. Rater 1 leniently scored one student, and Rater 2 leniently scored two students. The total number of rater-student bias interactions was five, which was 42% of the total number of students within this range [...] scored one student. This is the only rater-student bias interaction, which was 20% of the total number of students within this range (1/5 = 0.20).

Table 17 Frequency of rater-student bias interaction

Ability            N     Harsh                    Lenient
                         R1   R2   R3   R4   R5   R1   R2   R3   R4   R5
3.00 or higher     4     -    1    -    1    -    -    -    1    -    -
-2.99 to 2.99      12    -    -    -    2    -    1    2    -    -    -
-3.00 or lower     5     -    -    -    -    -    -    -    -    -    1

Tables 12-16 also indicate that each rater had the following unique rater-student bias pattern.

- Rater 1: There was one student scored more leniently than expected by Rater 1. The leniently scored student was of middle range ability (between -2.99 and 2.99).

- Rater 2: There were students scored both more harshly and more leniently than expected by Rater 2. The harshly scored student was a high ability student (3.00 or higher) and the leniently scored students were of middle range ability.

- Rater 3: As in the case of Rater 1, there was one student scored more leniently than expected by Rater 3. Unlike Rater 1, the leniently scored student was one with high ability.

- Rater 4: There were students scored more harshly than expected by Rater 4. The harshly scored students included one student with high ability and two students with middle range ability.

- Rater 5: As in the case of Rater 1 and Rater 3, there was one student scored more leniently than expected by Rater 5. Unlike the two raters, the leniently scored student had low ability.

6) Do the raters score one task more harshly or more leniently than the other? If so, what are the sub-patterns of ratings in terms of rater-task bias interaction for each rater?

Table 18 Bias calibration report: rater-task interaction

Rater  Task             Observed  Expected  Obs-Exp  Bias      Error  z-score  Infit
                        score     score     Average  (logits)                  Mean Square
5      Communicability  71        68.0       0.15     0.57     0.44    1.31    0.9
2      Accuracy         67        65.1       0.10     0.43     0.47    0.92    1.3
4      Communicability  64        62.2       0.09     0.34     0.43    0.79    1.7*
3      Accuracy         60        58.5       0.07     0.32     0.47    0.68    0.7
3      Impression       61        59.9       0.05     0.23     0.47    0.49    0.9
2      Impression       67        66.4       0.03     0.13     0.47    0.27    0.6
1      Communicability  65        64.7       0.01     0.05     0.44    0.12    0.6
1      Accuracy         64        63.8       0.01     0.04     0.47    0.08    0.7
4      Impression       63        63.0       0.00    -0.01     0.47   -0.01    1.0
1      Impression       65        65.2      -0.01    -0.04     0.47   -0.09    0.6
5      Impression       67        68.0      -0.05    -0.22     0.47   -0.46    0.8
5      Accuracy         65        66.6      -0.08    -0.36     0.47   -0.76    1.1
4      Accuracy         60        61.7      -0.08    -0.36     0.47   -0.78    0.8
2      Communicability  64        66.2      -0.11    -0.41     0.43   -0.95    0.7
3      Communicability  56        58.6      -0.13    -0.49     0.44   -1.12    1.0
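As a reading aid for Table 18, the sketch below checks how its columns relate to one another. It rests on two assumptions that the article does not state explicitly: that the Obs-Exp Average is the observed minus expected score divided by the number of ratings per rater-task pair (taken here to be 20, one per student), and that the z-score is the bias estimate divided by its standard error. Because the tabulated values are rounded, only approximate agreement can be expected.

```python
# Columns transcribed from Table 18 (15 rater-task pairs, in table order).
observed = [71, 67, 64, 60, 61, 67, 65, 64, 63, 65, 67, 65, 60, 64, 56]
expected = [68.0, 65.1, 62.2, 58.5, 59.9, 66.4, 64.7, 63.8,
            63.0, 65.2, 68.0, 66.6, 61.7, 66.2, 58.6]
avg_reported = [0.15, 0.10, 0.09, 0.07, 0.05, 0.03, 0.01, 0.01,
                0.00, -0.01, -0.05, -0.08, -0.08, -0.11, -0.13]
bias = [0.57, 0.43, 0.34, 0.32, 0.23, 0.13, 0.05, 0.04,
        -0.01, -0.04, -0.22, -0.36, -0.36, -0.41, -0.49]
error = [0.44, 0.47, 0.43, 0.47, 0.47, 0.47, 0.44, 0.47,
         0.47, 0.47, 0.47, 0.47, 0.47, 0.43, 0.44]
z_reported = [1.31, 0.92, 0.79, 0.68, 0.49, 0.27, 0.12, 0.08,
              -0.01, -0.09, -0.46, -0.76, -0.78, -0.95, -1.12]

N_RATINGS = 20  # assumption: one rating per student for each rater-task pair


def max_abs_diff(xs, ys):
    """Largest absolute gap between two equally long sequences."""
    return max(abs(x - y) for x, y in zip(xs, ys))


# Recompute the two derived columns under the stated assumptions.
avg_recomputed = [(o - e) / N_RATINGS for o, e in zip(observed, expected)]
z_recomputed = [b / se for b, se in zip(bias, error)]

avg_gap = max_abs_diff(avg_recomputed, avg_reported)
z_gap = max_abs_diff(z_recomputed, z_reported)
print(f"largest gap in Obs-Exp Average: {avg_gap:.3f}")
print(f"largest gap in z-score: {z_gap:.3f}")
```

Both gaps stay within what rounding of the printed values would produce, which is consistent with the assumed relationships.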


Table 18 shows the results of the bias analysis in terms of the interaction between raters and tasks. It lists all rater-task interactions (5 raters × 3 tasks). In column 8, there is no z-score below -2.0 or greater than +2.0, suggesting that no rater shows significantly biased rater-task interactions. In column 9, the mean of the infit mean square values was 0.9, and the standard deviation was 0.3. Thus, fit values above 1.5 suggest misfit (0.9 + [0.3 × 2]). The value for Rater 4 on 'communicability' was 1.7, which indicates that Rater 4 did not consistently evaluate the task in the identified patterns of bias across all students.
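The two screening rules applied to Table 18 (an absolute z-score above 2.0 for significant bias, and an infit mean square above the mean plus two standard deviations for misfit) can be sketched as follows. This is an illustration rather than the FACETS procedure itself; the helper names are hypothetical, and the use of the sample standard deviation is an assumption, since the article does not say which estimator was used.

```python
from statistics import mean, stdev

# (rater, task, z-score, infit mean square) transcribed from Table 18.
rows = [
    (5, "Communicability", 1.31, 0.9),
    (2, "Accuracy", 0.92, 1.3),
    (4, "Communicability", 0.79, 1.7),
    (3, "Accuracy", 0.68, 0.7),
    (3, "Impression", 0.49, 0.9),
    (2, "Impression", 0.27, 0.6),
    (1, "Communicability", 0.12, 0.6),
    (1, "Accuracy", 0.08, 0.7),
    (4, "Impression", -0.01, 1.0),
    (1, "Impression", -0.09, 0.6),
    (5, "Impression", -0.46, 0.8),
    (5, "Accuracy", -0.76, 1.1),
    (4, "Accuracy", -0.78, 0.8),
    (2, "Communicability", -0.95, 0.7),
    (3, "Communicability", -1.12, 1.0),
]


def flag_bias(rows, z_cut=2.0):
    """Rater-task pairs whose absolute z-score exceeds the cut-off."""
    return [(r, t) for r, t, z, _ in rows if abs(z) > z_cut]


def flag_misfit(rows, n_sd=2.0):
    """Cut-off (mean + n_sd * SD of infit) and the pairs lying above it."""
    infits = [row[3] for row in rows]
    cutoff = mean(infits) + n_sd * stdev(infits)  # sample SD: an assumption
    return cutoff, [(r, t) for r, t, _, ms in rows if ms > cutoff]


biased = flag_bias(rows)
cutoff, misfits = flag_misfit(rows)
print("significantly biased pairs:", biased)
print(f"misfit cut-off: {cutoff:.2f}; misfitting pairs: {misfits}")
```

With the tabulated values, no pair is flagged for significant bias, and only Rater 4 on communicability lies above the misfit cut-off, in line with the reading given in the text.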

7) To what extent, statistically, is the task-based writing test a reliable and valid measure?

(1) Reliability

In the first analysis (Table 7), the dataset was analyzed using FACETS. The table provided information on the characteristics of raters (severity and consistency). All raters displayed acceptable levels of self-consistency. This can be seen from the Infit Mean Square column, by adding two standard deviations to the mean. Raters falling within these parameters in their reported Infit Mean Square indices are considered to have behaved consistently. On the other hand, the separation and reliability figures indicate that there were significant differences among raters in terms of severity. However, the difference, based on fair average scores, is 0.44 of one grade in the scale, suggesting that there would be no impact on scores awarded in an operational setting. The analysis of the two tasks and the overall impression in Table 8 shows that no significant difference occurs between the tasks and the impression. The adjacent scale level on the two tasks may indicate that the tasks do not appear to separate the students to a significant degree.

(2) Validity

In Table 8, an estimate of the item discrimination was computed according to the "Generalized Partial Credit Model" approach. The expected value is 1.0, but discriminations in the range of 0.5 to 1.5 provide a reasonable fit to the Rasch model (Linacre, 2007, p. 132). All the estimates fall in this range (1.04, 1.00, 1.24), which indicates that the randomness in the three sets of data fits the Rasch model. The two tasks and the overall impression were, therefore, of relevance to dependent data acquisition.
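The range check described above is simple enough to state directly. The sketch below uses the three discrimination estimates quoted in the text and the 0.5 to 1.5 band attributed to Linacre (2007); the constant and function names are illustrative assumptions.

```python
# Discrimination estimates quoted in the text for the three score sets.
estimates = [1.04, 1.00, 1.24]

# Band described in the text as giving a reasonable fit to the Rasch model.
LOW, HIGH = 0.5, 1.5


def within_band(value, low=LOW, high=HIGH):
    """True if a discrimination estimate lies inside the accepted band."""
    return low <= value <= high


all_within = all(within_band(v) for v in estimates)
largest_departure = max(abs(v - 1.0) for v in estimates)
print("all estimates within band:", all_within)
print(f"largest departure from the expected value 1.0: {largest_departure:.2f}")
```

The largest departure from the expected value of 1.0 is 0.24, well inside the half-logit margin on either side that the band allows.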

There is also criterion-related validity evidence concerning the three scores of each rater and the Criterion score. Table 19 shows the resulting correlation coefficients for the relationship between each of the raters' three scores and the Criterion score, and they were statistically significant (p < .01) for Task 1, Task 2 and overall impression. The substantial correlations demonstrate that the task-based writing performance is related to learner performance on Criterion, a widely-used writing instructional tool, and thus support its validity.

Table 19 Inter-rater correlation coefficients between raters' scores and the Criterion score

      Rater 1  Rater 2  Rater 3  Rater 4  Rater 5  Av.
T1    .67      .74      .78      .72      .68      .72
T2    .79      .67      .67      .64      .70      .70
OI    .68      .75      .81      .68      .76      .74


5. Discussion

5.1 Summary of the analyses

The five junior high school teachers in this study were all novice raters in this task-based writing assessment. The inter-rater correlation coefficients between pairs of raters were relatively high, and the five raters appeared to be of acceptable reliability. The FACETS analysis showed that the raters displayed acceptable levels of self-consistency. There were, however, significant differences between raters in terms of severity. The bias analyses indicated that all raters were significantly biased towards certain types of students, and their bias patterns were unique. Moreover, it must be said that one rater-student interaction and one rater-task interaction were identified as misfitting, so these raters were not consistent in the identified patterns of bias across the students or tasks. The FACETS analysis also showed that there was no significantly different scoring among the two tasks and the overall impression. The adjacent scale level on the two tasks may indicate that the test tasks did not separate the students to a significant degree, and thus the test tasks are roughly equivalent in difficulty when learner responses to the two tasks were scored based on the accuracy and communicability scales. Since the 5-point scales demonstrated acceptable fit, the five categories and their specific written samples were mostly comprehensible and usable by raters. Therefore, it is quite likely that the assessment tasks and rating scales were reliable in determining an estimate of students' writing ability.

5.2 Implications

The findings of this study suggest that the TBWT scoring guide may have effectively given novice raters a shared understanding of the construct of writing ability defined by the test writer and may have contributed to the consistency in scoring and the reduction in the biased interactions with tasks. It is, therefore, reasonable to suppose that the TBWT scoring guide may possibly reduce the differences or biases caused by variation among raters. In order to confirm this, a questionnaire was also administered by mail to the five teacher-raters in this study. It was designed to be completed in a short time. Most of the questions were of the multiple choice variety. The last question invited comments and opinions on the whole guidebook. Questions 1 through 3 were answered on a 3-point scale from 1 (No, not useful) to 3 (Yes, very useful). Questions 4 through 9 were answered on a 4-point scale from 1 (Strongly disagree) to 4 (Strongly agree).

As shown in Tables 20 and 21, the results of the questionnaire show that the five teacher-raters felt that the TBWT scoring guide was fairly useful. According to the FACETS analysis, category 2 of the accuracy scale was slightly larger than expected. In other words, with increasing measure, category 2 is most likely to be observed for accuracy scoring among the five novice raters. This may be in agreement with the result that novice raters tend to become more severe than experienced raters, as some previous studies have shown (Ruth and Murphy, 1988; Weigle, 1994, 1998). Although the scale seems to leave room for improvement, the questionnaire survey indicated that the five categories and their specific written samples were mostly comprehensible and usable by raters. Such guidelines are supposed to lead the raters to self-consistency and reduction of biased interactions with tasks. Therefore, it is quite likely that the TBWT scoring guide has effectively given novice raters a shared understanding of the construct of writing ability and has contributed to the consistency in determining an estimate of students' writing ability.

Table 20 Results of scaling in Questionnaire (Q. 1-3)

Questions                              Very useful  Useful   Not useful
1. Is the introduction useful?         2 (40%)      3 (60%)  0
2. Is the task explanation useful?     5 (100%)     0        0
3. Is the scoring procedure useful?    5 (100%)     0        0

Table 21 Results of scaling in Questionnaire (Q. 4-9)

Questions                                                  Strongly agree  Agree    Disagree  Strongly disagree
4. The definition of accuracy is understandable.           2 (40%)         3 (60%)  0         0
5. The accuracy scale is easy to evaluate.                 1 (20%)         3 (60%)  1 (20%)   0
6. The samples for accuracy are useful.                    3 (60%)         2 (40%)  0         0
7. The definition of communicability is understandable.    2 (40%)         2 (40%)  1 (20%)   0
8. The communicability scale is easy to evaluate.          1 (20%)         4 (80%)  0         0
9. The samples for communicability are useful.             2 (40%)         3 (60%)  0         0

The present study, however, indicates that there were significant biased interactions with students' ability among all five raters. Each rater was found to be self-consistent in scoring 20 students' writing performances, but all of the raters had a unique bias pattern toward a certain type of student. These findings suggest that the TBWT scoring guide may have contributed to the reduction of biased interactions, but training for certain raters with their own unique bias patterns might still be required. On this point, Schaefer (2008), for example, states that the rating process is so complex and error-prone that intentional and programmatic training for rating is necessary.

Moreover, one rater-student interaction and one rater-task interaction were identified as misfitting, so these raters were not consistent in the identified patterns of bias across the students or tasks. As for the rater-student interaction, Rater 2 commented that it was not easy for her to understand the differences between "few errors" and "occasional errors" in the descriptor of the accuracy scale. For the rater-task interaction, Rater 4 commented on the introduction of the scoring guide that the conformity between the task and its underlying construct is not easy to understand, and that a detailed explanation is necessary. In association with this comment, she disagreed in the questionnaire that the definition of communicability is understandable, and described that she rated the task for communicability mostly depending on grammar and vocabulary. Although such a unique response pattern seems to reflect individual rater idiosyncrasies, these findings would be beneficial in improving our understanding of rater behavior and in providing more focused direction in rater training. As Eckes (2008) points out, the consistency with which each rater uses particular scoring criteria should be examined more precisely. For this purpose, research viewed from a rater cognition perspective during evaluation of written samples must be conducted hereafter.

6. Conclusion

The results of the five novice raters in this study indicated that all raters displayed acceptable levels of self-consistency, and that the students' ability was effectively measured using these tasks and rating scales. The FACETS analysis showed that there was no significantly different scoring on the two tasks and overall impression, which provided reasonable fit to the Rasch model. The questionnaire survey also indicated that the five categories and their specific written samples were mostly comprehensible and usable by raters, and the 5-point scales demonstrated acceptable fit. This is because the TBWT scoring guide has given novice raters a shared understanding of the construct of writing ability, and has contributed to the consistency in scoring. Therefore, it is quite likely that the assessment tasks and rating scales were reliable in determining an estimate of students' writing ability.

There were, however, relatively small but significant differences between raters in terms of severity. Each rater was found to be self-consistent in scoring, but the bias analyses of this study also indicated that all of the five raters were significantly biased towards certain types of students. These findings suggest that the TBWT scoring guide may have contributed to the reduction of biased interactions, but training for certain raters with their own unique bias patterns might still be required. Moreover, the consistency with which each rater uses particular scoring criteria should be examined more precisely. For this purpose, research viewed from a rater cognition perspective during evaluation of written samples must be conducted. These issues will be examined in a subsequent study.

Acknowledgement

The present research was supported in part by a Grant-in-Aid for Scientific Research for 2010-2012 (No. 22520572) from the Japan Society for the Promotion of Science. I am grateful to anonymous JLTA reviewers for suggestions in revising the article.

References

Alderson, J. C., Clapham, C. and Wall, D. (1995). Language test construction and evaluation. Cambridge: Cambridge University Press.

Bachman, L. F. and Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests. Oxford: Oxford University Press.

Bachman, L. F. (2002). Some reflections on task-based language performance assessment. Language Testing 19, 453-76.

Brown, J. D. (1996). Testing in language programs. Upper Saddle River, NJ: Prentice Hall.


Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.

Davies, A. (1990). Principles of language testing. Basil Blackwell.

Eckes, T. (2008). Rater types in writing performance assessments: A classification approach to rater variability. Language Testing 25, 155-85.

Linacre, J. (2002). Guidelines for rating scales. Mesa Research Note 2 (Online). Available at http://www.rasch.org/rn2.htm (accessed 18 March 2008).

Linacre, J. (2007). A user's guide to FACETS: Rasch-model computer program. Chicago, IL: MESA Press.

Linacre, J. (2008). Facets, version no. 3.63. Computer program. Chicago, IL: MESA Press.

Ruth, L., & Murphy, S. (1988). Designing writing tasks for the assessment of writing. Norwood, NJ: Ablex Publishing Corp.

Schaefer, E. (2008). Rater bias patterns in an EFL writing assessment. Language Testing 25, 465-93.

Skehan, P. (1998). A cognitive approach to language learning. Oxford University Press.

Sugita, Y. (2009). The development and implementation of task-based writing performance assessment for Japanese learners of English. Journal of Pan-Pacific Association of Applied Linguistics 13(2), 77-103.

Tyndall, B., & Kenyon, D. M. (1995). Validation of a new holistic rating scale using Rasch multifaceted analysis. In A. Cumming & R. Berwick (Eds.), Validation in language testing (pp. 39-57). Clevedon, England: Multilingual Matters.

Weigle, S. (1994). Effects of training on raters of ESL compositions. Language Testing 11, 197-223.

Weigle, S. (1998). Using FACETS to model rater training effects. Language Testing 15, 263-87.

Appendix A:Assessmenttasksfortesting

1. Task1(aecuracy)

' Rubric:This

is

a testof yourability to write a coherent and grammatically

paragraph.You wi11have20minutes tocornplete thetest.

' Prompt:You are goingtostay with theParlcerFamily

in

Britainthissummer.

100-120

word letterintroducing_yourselftoyourhostfamily.Beforewriting,

the fbllowingtopics: m Yournarne and age

- Yourjoband major inschool

-Your farnilyand _pet

-Your interestsand hobbies

-Your favorite

places,

fbods

and activities

-Your experience travelingabroad

- Some things _youwant to

do

while _youare inBritain

correct

Writea

thinkof

2. Task2(communicability)

(20)

without causing the reader any dienculties.You will have1Ominutes tocomplete the

test.'

Prompt:You are _goingto discussthefo11owingtopic with_yourclassmates, "Why do

youstudy English?"Inorder to

_prepare

forthediscussion,thinkof as many answers as

possibletothequestionand write them as t`To travelabroad."

Appendix B: Rating scales