Chapter 3 Reliability of the Lesson Observation Instrument
This chapter discusses the validity and reliability of our lesson observation
instrument. We follow Guarino, et al. (2001) to verify reliability.
3.1 A previous study testing validity and reliability of the SIOP
Guarino, et al. (2001) tested the validity and reliability of the SIOP. They
defined three major sheltered instruction (SI) dimensions, (1) Preparation,
(2) Instruction, and (3) Review/Evaluation, and examined the consistency of the
ratings among raters (inter-rater reliability). Among the three dimensions,
Preparation included the two main features of lesson preparation and building
background, which consisted of 6 assessment features, including determining the
lesson objectives and content objectives, selecting age-appropriate content
concepts and vocabulary, and assembling supplementary materials to contextualize
the lesson. Instruction included the main features of comprehensible input,
strategies, interaction, practice/application, and lesson delivery. These main
features were further broken down into 20 items, such as making connections with
students' background experiences and prior learning, modulating teacher speech,
emphasizing vocabulary development, using multimodal techniques, promoting
higher-order thinking skills, grouping students appropriately for language and
content development, and providing hands-on materials. Review/Evaluation
consisted of 4 items concerning the assessment of student comprehension and
learning of all lesson objectives.
Guarino, et al. (2001) asked four teachers to observe and rate 6 video-taped
lessons, each of which lasted about 45 minutes. The raters were experienced SI
teachers and the total teaching experience of each teacher was over 30 years.
Three of the video-taped lessons were taught by sheltered instruction
specialists and the other three were taught by non-SI teachers. Cronbach's alpha
was used to confirm the inter-rater reliability for the three dimensions above.
The results indicated that all the correlations among the raters reached .90 or
higher, and thus the ratings were considered highly consistent. They also
compared the scores given to SI-based lessons and non-SI-based lessons. The
results showed that the raters tended to give higher scores to the SI-based
lessons than to the non-SI-based lessons. This suggested that the SIOP was an
appropriate tool for measuring SI-based lessons.
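As an illustration of the kind of consistency check described above, the following is a minimal sketch of Cronbach's alpha computed across raters, treating each rater as an "item" and each lesson as a case. The function name `cronbach_alpha` and all scores are hypothetical and are not taken from Guarino, et al. (2001); the sketch only assumes the standard formula alpha = k/(k-1) * (1 - sum of rater variances / variance of summed scores).

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha with raters treated as items.

    ratings: list of per-rater score lists, one score per lesson,
             all lists of equal length (hypothetical input layout).
    """
    k = len(ratings)
    if k < 2:
        raise ValueError("at least two raters are required")
    n = len(ratings[0])
    if n < 2 or any(len(r) != n for r in ratings):
        raise ValueError("each rater must score the same two or more lessons")

    # Sum of the sample variances of each rater's scores.
    sum_rater_var = sum(variance(r) for r in ratings)
    # Sample variance of the per-lesson totals (summed over raters).
    totals = [sum(scores) for scores in zip(*ratings)]
    total_var = variance(totals)
    if total_var == 0:
        raise ValueError("total scores have zero variance")

    return k / (k - 1) * (1 - sum_rater_var / total_var)


# Hypothetical total scores given by four raters to six lessons.
example_ratings = [
    [110, 105, 98, 60, 55, 48],
    [108, 100, 101, 58, 60, 45],
    [112, 103, 95, 65, 52, 50],
    [105, 107, 99, 62, 57, 47],
]
alpha = cronbach_alpha(example_ratings)
print(f"Cronbach's alpha (hypothetical data): {alpha:.3f}")
```

Values close to 1 indicate that the raters rank the lessons in a very similar way; the .90 threshold mentioned above would be checked against the printed value.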
3.2 Interrater reliability of our observation instrument
To test the reliability of our observation instrument, we put it to use in the
Japanese junior high school context and calculated the interrater reliability, following
the procedure described in Guarino, et al. (2001).
3.2.1 Raters
There were two raters: the author and her supervisor, a faculty member at a
university of teacher education with 16 years of experience in pre- and
in-service teacher education. The author was a pre-service teacher with
virtually no teaching experience. Both raters worked on the present study and
had a fairly deep grasp of the SIOP.
3.2.2 The lessons observed
Three video-recorded English lessons were observed and rated. Each of them
lasted 45 minutes and all of them were taught by a Japanese teacher of English.
Two lessons (Lessons A and B) were recorded on October 20th, 2010 and a third
one (Lesson C) was recorded in February, 2001. In Lesson A, the teacher, who had
14 years of teaching experience, taught the first grade of junior high school.
The main focus of the lesson was to practice and repeat the words previously
learned from the textbook. Most of the lesson was delivered in a
teacher-centered way. Lesson B was taught by a teacher with 12 years of teaching
experience. Since the lesson was conducted just before the final exam, the
teacher had to focus on reviewing what had been learned, and little
teacher-student or student-student interaction was observed. Because Lessons A
and B were teacher-centered and very little interaction was observed, we also
used another video-recorded lesson (Lesson C), which had contrasting features.
Lesson C was taught by a male teacher with 25 years of teaching experience. He
taught third graders and provided many opportunities for communication and for
teacher-student or student-student interaction. The students were engaged in
pair activities, which were effectively configured. He frequently assessed
students' learning during the lesson by checking the worksheets the students
worked on or by asking questions which seemed to assess students' understanding
of the lesson content.
3.2.3 Procedure
The raters watched the three different English lessons individually and scored
them according to the procedure described in Chapter 2. After collecting the
evaluation sheets, the Pearson correlation coefficient between the two raters was calculated for each of the three lessons. The results are shown in Table 2.
Table 2. Pearson correlation coefficients between the two raters
Lesson A   Lesson B   Lesson C   Mean
0.818      0.859      0.943      0.907
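The calculation behind coefficients of this kind can be sketched as follows. This is a minimal illustration, not the authors' actual computation: the function name `pearson_r` and the item scores (on an assumed 0 to 4 scale) are hypothetical, and the real evaluation sheets are not reproduced here.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two raters' item scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("both raters must score the same two or more items")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return sxy / sqrt(sxx * syy)


# Hypothetical item scores given by the two raters to one lesson.
rater_1 = [4, 3, 3, 2, 0, 1, 4, 2, 3, 1]
rater_2 = [4, 3, 2, 2, 1, 1, 4, 3, 3, 0]
print(f"r (hypothetical data): {pearson_r(rater_1, rater_2):.3f}")
```

Computing `pearson_r` once per lesson, with one list of item scores per rater, would give one coefficient for each column of a table like Table 2.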
The mean of the coefficients of the three lessons was .907, which is considered
an appropriate estimate of interrater reliability. However, the individual
coefficients for the three lessons varied from .818 for Lesson A to .943 for
Lesson C. To identify the cause of the discrepancy, we examined how the scores
for Lessons A, B and C were distributed for each main feature. Table 3 below
shows the percentages of congruence between the two raters.
Table 3. The congruence of the two raters
            (1)     (2)     (3)     (4)     (5)     (6)     (7)     (8)     (9)     (10)
Lesson A   100.0%   50.0%    0.0%   66.7%   66.7%   25.0%   66.7%    0.0%   66.7%  100.0%
Lesson B    50.0%  100.0%   33.3%  100.0%    0.0%   50.0%   50.0%   25.0%   66.7%  100.0%
Lesson C   100.0%  100.0%  100.0%   66.7%   66.7%   75.0%   66.7%  100.0%  100.0%  100.0%
[Columns (1) to (10) are the main features of the instrument. Only fragments of the column headings are legible in the source: "lesson", "building", "comprehensible", "strategies" and "scaffolding".]
As is evident, the scores for comprehensible input, scaffolding, interaction and
lesson delivery given for Lessons A and B agreed to a lesser degree than those
for Lesson C. We attribute these results to the fact that the raters were not
yet trained and tended to rely on subjective judgment in determining the degree
of achievement for each feature. In addition, the descriptors for each feature
on the evaluation sheet did not guide the raters to a sound judgment. This point
requires further investigation in order to improve the instrument. That said,
the scores given for the main features in Lesson C showed agreement between the
raters. This is presumably because the lesson style was highly student-centered
and matched the framework of the SIOP and our observation instrument. This
poses another issue: the SIOP and our observation instrument both presuppose a
particular type of lesson, one that is interactive, student-centered, activity-
or task-based, and communicative. The question is whether it is necessary to
incorporate other features which are observed in more traditional teaching
styles. We will discuss these issues in the next chapter.
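Congruence percentages such as those in Table 3 could be obtained along the following lines. This is only a sketch under a stated assumption: congruence is taken here to mean that the two raters gave exactly the same score to an item, which the text does not specify. The function name `percent_agreement` and all scores are hypothetical.

```python
def percent_agreement(scores_a, scores_b):
    """Percentage of items on which two raters gave identical scores.

    Assumption: "congruence" means an exact match on an item.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("both raters must score the same non-empty item set")
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return 100.0 * matches / len(scores_a)


# Hypothetical item scores per main feature: (rater 1 scores, rater 2 scores).
lesson_scores = {
    "comprehensible input": ([3, 2, 4], [3, 2, 1]),
    "interaction": ([1, 2, 3, 4], [1, 0, 0, 0]),
}
for feature, (rater_1, rater_2) in lesson_scores.items():
    print(f"{feature}: {percent_agreement(rater_1, rater_2):.1f}%")
```

Under this assumption, a feature with three items can only yield 0.0%, 33.3%, 66.7% or 100.0%, and a feature with four items only multiples of 25%, so low values may reflect disagreement on just one or two items.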
3.3 Summary
Although some inconsistency of scores between the raters was observed and
further improvement is necessary, we believe our lesson observation instrument
has been adapted to the Japanese context so that it evaluates lessons in a
relatively reliable way. Since our aim in developing the lesson evaluation
instrument is to promote better teacher development, we should have involved
the teachers who taught the lessons in the evaluation process. That way the
instrument could have been refined. This issue will be pursued in future
research.