Investigating measurement accuracy for measures of lexical richness:
Comparison of subjective and objective methodologies
語彙の豊かさの測定における正確性の検証 —主観的測定と客観的測定による比較—
Keywords: language assessment, speaking, lexical richness
MIKAMI Ryosuke 三上 綾介
1. Introduction
In the area of second language acquisition (SLA), researchers are attracted to investigations of constructs such as language abilities (e.g., speaking and listening) and emotions (e.g., motivation, anxiety, and fun). However, such constructs cannot be seen or touched directly. Despite the difficulty of observing these constructs, researchers must somehow measure them as precisely as possible in order to, for example, examine the development of language abilities (e.g., Akiyama & Saito, 2016) or explore their linguistic components (e.g., Crowther, Trofimovich, & Isaacs, 2016). The current study classified methodologies for measuring linguistic features as either objective or subjective, and compared subjectively measured scores of lexical richness with two kinds of objectively measured scores to examine the accuracy of the subjective methodology introduced by Saito, Trofimovich, and Isaacs (2017).
2. Background
As comprehensibility (i.e., ease of understanding a given L2 speech) has attracted researchers’ increasing attention in the area of speech research, they have begun to investigate the linguistic features most strongly related to it, in order to help L2 learners prioritize the linguistic characteristics they need to learn to increase their comprehensibility effectively. Along these lines, Isaacs and Trofimovich (2012) pioneered the objective analysis of 19 linguistic features, spanning phonology, lexis, grammar, and discourse, and examined which of them relate to comprehensibility. Five years later, Saito, Trofimovich, and Isaacs (2017) further developed the precursor study by Isaacs and Trofimovich (2012), and introduced the subjective measurement methodology of analyzing linguistic features.
In more detail, Saito et al. (2017) asked 20 native raters to make an intuitive assessment of 11 linguistic characteristics, including phonology, lexis, grammar, and discourse, on a 1000-point sliding scale, targeting the same data set as that used in Isaacs and Trofimovich (2012). To validate the subjective measurement methodology by using the sliding scale, Saito et al. calculated correlation coefficients between the 19 objectively measured linguistic scores in Isaacs and Trofimovich and the 11 subjectively measured linguistic scores in Saito et al. These calculations elicited moderate to high correlation between the objective and subjective scores, which led Saito et al. to claim that their subjective methodologies were valid enough to quantify the linguistic features.
However, because raters in Saito et al. (2017) were not shown the 1000-point numerical rating scale when they evaluated linguistic features of a given token, raters could not identify the point value they had assigned to the token, which led to uncertainty in differentiating a score of “1” from a score of “1000.” In fact, Isaacs and Thomson (2013), who investigated whether native judges could properly differentiate 5-point and 9-point Likert-type scales when rating speech for comprehensibility, accentedness, and fluency, concluded that it is difficult even for expert native raters to differentiate these two types of measures. Thus, the 1000-point sliding scales complicate the task of exact assessment of linguistic features for native raters.
Based on the literature review above, the current study targeted the lexical richness of L2 speech to revisit the measurement accuracy of the 1000-point sliding scale introduced by Saito et al. (2017). Following Saito, Trofimovich, and Isaacs (2016), the present study defined lexical richness as “the vocabulary used by the speaker” and “how sophisticated this vocabulary is” (p. 236). Also, “if the speaker uses a few simple, unnuanced words the speech lacks lexical richness.” By contrast, “if the speaker’s language is characterized by varied and sophisticated use of English vocabulary, the speech is lexically rich” (p. 236).
3. Method
3.1 Speech samples
The current study elicited speech samples from 45 Japanese learners of English (16 males and 29 females) with an average age of 20.1 years (SD = 1.0, Range = 18–22). All the speakers were university undergraduate students with diverse major fields of study. Speakers with widely varied proficiency levels were intentionally included; their self-reported TOEIC scores (n = 41) had a mean of 738.8 (SD = 130.0, Range = 450–960).
Speakers were involved in individual recording sessions that took place in a sound-treated room at a national university in Japan. First, the author carefully explained the purpose and procedure of the current study. Once participants understood, they completed a consent form that expressed their voluntary participation in the present study and proceeded to the recording session.
During the recording session, participants were first presented with an 8-frame picture story (Derwing, Munro, Thomson, & Rossiter, 2009). They were allowed one minute to understand the story and to plan what they would say about the story. However, they were forbidden to take notes during this preparation phase. The speakers then were instructed to begin describing the story, with no time limitation for their telling. After completing the recording, the participants answered a questionnaire about their language backgrounds, and then were provided with a small payment for their participation. All the procedures lasted approximately 30 minutes.
Upon completion of the recordings, the speech samples were edited using acoustic analysis software Praat, Version 6.0.16 (Boersma & Weenink, 2016). Because of the length difference among samples, with a mean length of 92.9 seconds (SD = 45.5, Range = 24.4–46.3), unfilled and filled pauses were edited out first. Next, an excerpt of approximately 30 seconds was obtained from each edited sample, ending at a natural pause and phrasal or clausal boundary, as identified by the study author. The mean length of the excerpts was 30.8 seconds (SD = 4.3, Range = 20.6–37.3).
Finally, the author transcribed these excerpts verbatim to prepare them for assessment of lexical richness. In addition, a second coder, majoring in applied linguistics at a graduate school in Japan, transcribed five randomly selected samples to validate the accuracy of the author’s transcriptions, yielding an agreement rate of 88.3%, which supported the accuracy of the transcriptions. The descriptive statistics for the number of words in the transcribed samples are shown in Table 1 below.
Table 1
Descriptive Statistics for the Number of Words in Transcribed Speech Samples
Measure      Mean   SD     Median   Minimum   Maximum   Skew   Kurtosis   SE
Word Count   42.2   13.4   41       17        80        0.47   0.21       1.99
3.2 Assessment of lexical richness
Following Saito et al. (2017), the present study analyzed the lexical richness of written stimuli, adopting subjective and objective measurement methodologies. For the subjective methodology, four male native raters were recruited, with a mean age of 60.5 years (SD = 7.5, Range = 46–69). These raters were regarded as expert judges because of their long experience in teaching English at the university level, with an average of 30.0 years (SD = 8.7, Range = 22–38) as instructors. Moreover, three of the raters had earned master’s degrees related to English education (TESL / TESOL).
Each of the raters individually took part in a rating session held in a quiet room. First, the study author carefully described the purpose and procedure of the study. Once the raters understood, they completed a consent form that expressed their voluntary participation in the present study and then proceeded to the rating session.
During the rating session, the raters were first provided with the definition of lexical richness (see Background section for the definition) and the 8-frame picture story used for the speech elicitations, so that they could familiarize themselves with the story described in the texts they would read and evaluate for lexical richness. Once the raters understood the rating procedures, they proceeded to a practice session in which they evaluated three samples for lexical richness using a 1000-point sliding scale programmed by the author using the free programming software Hot Soup Processor, Version 3.4a (Onion Software, 2016). The slider was programmed to resemble the slider used in Saito et al. (2017) as closely as possible (see Appendix A for the on-screen label of the slider). As shown in Appendix A, the cursor was initially set in the middle of the slider.
Next, the author manually presented each of the transcribed samples to the raters in a randomized order. In each rating trial, the raters were instructed to read the given token and use the mouse to evaluate the lexical richness of the token. The rightmost end of the slider was labeled “simple words,” and the leftmost end was labeled “varied words.” If the raters moved the cursor to the rightmost end, a score of “1000” was automatically recorded by the computer. Conversely, if the raters placed the cursor at the leftmost end, a score of “1” was recorded. In addition, the raters were allowed to change the rating value as many times as they needed until they made a final decision and proceeded to the next sample. No time limit was applied to the assessments.
After evaluating all the samples, raters answered a questionnaire about their language backgrounds. Additionally, they were asked to report their understanding of the concept of lexical richness on a 9-point scale (where 1 = “I did not understand this concept at all” and 9 = “I understand this concept well”). The results yielded a mean score of 8.8 (SD = 0.5), which indicated that all the raters understood the concept of lexical richness well. Finally, all raters were provided with a small payment for the completion of the study. Each rating session lasted approximately one hour.
For objective measurement of lexical richness, the current study calculated two kinds of indexes (type frequency and token frequency), targeting the same data set used for the assessments on the 1000-point sliding scale. The two objective measures were adopted from Saito et al. (2017), so that the outcomes of the present study would be comparable to the findings in the precursor.
Before computing the type and token frequencies, the following preparations were conducted. First, periods, commas, and other punctuation marks were removed. Next, all capital letters were lowercased so that the same word would not be counted as different types because of capitalization. However, the 3rd-person singular -(e)s and the past tense -(e)d were maintained. Upon completion of these preparations, the type and token frequencies of the samples were computed using the free programming software RStudio, Version 1.1.463 (RStudio team, 2017). Finally, because the type and token frequencies tend to depend on speech length, the raw type and token frequencies were divided by the speech length in seconds, both to normalize them and because Saito et al. (2017) followed this process.
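The preparation and normalization steps described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study, which is not shown in the paper; the function name `normalized_type_token`, the example sentence, and the 5-second duration are all hypothetical. It also assumes whitespace tokenization and removal of ASCII punctuation only, which the paper does not specify.

```python
import string


def normalized_type_token(transcript: str, duration_sec: float) -> tuple[float, float]:
    """Return (types per second, tokens per second) for one transcript.

    Assumption: words are separated by whitespace and only ASCII
    punctuation is stripped; the study's exact rules are not given.
    """
    # Remove punctuation, then lowercase so "The" and "the" count as one type.
    cleaned = transcript.translate(str.maketrans("", "", string.punctuation)).lower()
    tokens = cleaned.split()
    # Inflected forms such as "walks" and "walked" stay distinct types,
    # mirroring the decision to keep -(e)s and -(e)d.
    types = set(tokens)
    return len(types) / duration_sec, len(tokens) / duration_sec


# Hypothetical example: 10 tokens and 7 types spoken in 5 seconds.
sample = "The man walked. The woman walks, and the man waits."
type_rate, token_rate = normalized_type_token(sample, duration_sec=5.0)
print(type_rate, token_rate)
```

Because both counts are divided by the same duration, a longer sample does not automatically receive higher scores simply for containing more words.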
4. Analysis and results
Before investigating the measurement accuracy for lexical richness, descriptive statistics were calculated, as shown in Table 2 below.
Table 2
Descriptive Statistics for the Objective and Subjective Measurement Scores of Lexical Richness
Measures         Mean   SD     Median   Minimum   Maximum   Skew    Kurtosis   SE
1000-point (a)   624    198    608      205       953       -0.16   -0.78      29.52
Type (b)         0.88   0.26   0.86     0.29      1.79      0.86    2.11       0.04
Token (b)        1.38   0.42   1.39     0.63      2.54      0.38    -0.01      0.06
Note. (a) 1 = varied words, 1000 = simple words.
(b) Normalized by dividing the raw scores by the speech length in seconds.
In addition, the distributions of the three sets of scores were examined by conducting Shapiro-Wilk tests. The results showed that whereas the 1000-point scores (W = 0.97, p = .32) and token scores (W = 0.98, p = .53) were normally distributed, the type scores (W = 0.94, p = .04) were not. Therefore, two separate Spearman’s rank correlation analyses were performed to investigate the measurement accuracy of the subjective methodology (the 1000-point sliding scale), first between the 1000-point scores and type scores and then between the 1000-point scores and token scores. The results yielded significant rho values of -.82 (p < .001) between the 1000-point scores and the type scores, and -.76 (p < .001) between the 1000-point scores and the token scores.
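To make the rank correlation concrete, the sketch below computes Spearman’s rho as the Pearson correlation of average ranks, using only the Python standard library. It is an illustration, not the analysis script used in the study; the helper names and the five data pairs are made up for demonstration. It does not compute p values or the Shapiro-Wilk statistic, for which a statistics package (for example, SciPy’s `scipy.stats.spearmanr` and `scipy.stats.shapiro`) would normally be used.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Made-up illustrative values, NOT the study's data.
# Slider: 1 = varied words, 1000 = simple words.
slider_scores = [900, 750, 600, 400, 250]
type_rates = [0.40, 0.70, 0.65, 1.10, 1.30]
print(spearman_rho(slider_scores, type_rates))
```

A negative rho is the expected direction here: because higher slider scores mean simpler vocabulary, samples with more types per second should receive lower slider scores.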
5. Discussion
The results of correlation analyses showed that two types of objectively measured scores (type frequency and token frequency) correlate significantly with subjectively measured scores (1000-point scores), yielding correlation coefficients of -.82 with type frequency and -.76 with token frequency. The correlation coefficients indicate that the subjective measurement methodology is accurate enough to quantify the lexical richness of L2 speech.
However, some issues remain for further consideration. First, Saito et al. (2017) did not differentiate four dimensions of lexical richness. According to Read (2000), lexical richness covers the following four aspects: (i) lexical variation/lexical diversity, (ii) lexical sophistication, (iii) lexical density, and (iv) number of errors.
Lexical variation/lexical diversity are quantitative measures that illustrate the variety of words produced by a speaker. These measures typically are quantified as the total number of words produced (token frequency) or the total number of unique words produced (type frequency). Lexical sophistication measures the degree of word sophistication observed in L2 production, frequently calculated as the ratio of infrequent words to the total number of words produced. Lexical density is the ratio of content words (e.g., nouns, verbs, adjectives, adverbs, etc.) to the total number of words produced. The number of errors is simply the number of words that contain lexical errors.
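The two ratio-based dimensions described above can be illustrated with a short sketch. The word lists here are tiny hypothetical stand-ins: a real analysis would rely on part-of-speech tagging to identify content words and on a corpus-based frequency list to identify infrequent words, and neither resource is specified in the present study. The function names are likewise illustrative.

```python
def lexical_density(tokens, function_words):
    """Share of tokens that are content words (i.e., not function words)."""
    content = [t for t in tokens if t not in function_words]
    return len(content) / len(tokens)


def lexical_sophistication(tokens, frequent_words):
    """Share of tokens that fall outside a list of frequent words."""
    infrequent = [t for t in tokens if t not in frequent_words]
    return len(infrequent) / len(tokens)


# Hypothetical word lists, for illustration only.
FUNCTION_WORDS = {"the", "a", "and", "to", "of", "in", "on", "with", "is", "was"}
FREQUENT_WORDS = FUNCTION_WORDS | {"man", "woman", "bag", "walked"}

tokens = "the man collided with a woman and dropped the suitcase".split()
print(lexical_density(tokens, FUNCTION_WORDS))         # 5 content words out of 10 tokens
print(lexical_sophistication(tokens, FREQUENT_WORDS))  # 3 infrequent words out of 10 tokens
```

Both ratios share the same denominator (total tokens) but answer different questions, which is why a single rating of "lexical richness" may blend distinct dimensions.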
The definition of lexical richness adopted in Saito et al. (2017), and also adopted in the current study, refers to the sophistication of a speaker’s vocabulary. It also states that if a speaker uses a few simple words, the speech lacks lexical richness. Following Read (2000), the former attribute relates to the second of the four dimensions of lexical richness (lexical sophistication), whereas the latter relates to the first (lexical variation/lexical diversity). At the very least, this indicates that Saito et al. conflated these two dimensions of lexical richness.
Secondly, it is uncertain whether lexical richness can be applied to speech samples of short length. Measures of lexical richness frequently apply to longer written texts. In fact, Hess, Sefton, and Landry (1986) state that at least 350 words are required to elicit reliable scores of lexical richness. However, the speech samples in the current study included only 80 words at most.
Finally, the current study needs to acknowledge one limitation. It was designed to examine the measurement accuracy of the 1000-point sliding scale, targeting the lexical richness of L2 speech, specifically whether native raters can differentiate a score of “1” out of “1000.” Although the findings of the current study replicated those of Saito et al. (2017), the crucial issue of the differentiation of scores could not be resolved. Thus, future studies are expected to examine this issue. For example, intra-rater reliability can be calculated as an approach to exploring that differentiation. More specifically, it is possible to ask the same raters to assess the same samples again after a long interval, then to examine whether the scores of the first trial correspond to those of the second trial.
6. Conclusion
The current study investigated the measurement accuracy of the 1000-point sliding scale introduced by Saito et al. (2017), comparing subjective measures of lexical richness (1000-point scores) with two types of objectively measured scores (type and token frequency scores). Although the results tentatively support the findings of the precursor study, they remain inconclusive because some issues need to be considered (i.e., the length of texts, the four dimensions of lexical richness, and the differentiation of a score of “1” out of “1000”). However, the present study provides a valuable exploration of the measurement accuracy of the 1000-point sliding scale, and sheds the first light on intuitive judgements of linguistic features.
(Graduate School, Nagoya University)
References
Akiyama, Y., & Saito, K. (2016). Development of comprehensibility and its linguistic correlates: A longitudinal study of video-mediated telecollaboration. Modern Language Journal, 100(3), 585–609. doi:10.1111/modl.12338
Boersma, P., & Weenink, D. (2016). Praat: Doing phonetics by computer (Version 6.0.16) [Computer program]. Retrieved April 14, 2016, from www.praat.org.
Crowther, D., Trofimovich, P., & Isaacs, T. (2016). Linguistic dimensions of second language accent and comprehensibility: Nonnative listeners’ perspective. Journal of Second Language Pronunciation, 2(2), 160–182. doi:10.1075/jslp.2.2.02cro
Derwing, T. M., Munro, M. J., Thomson, R. I., & Rossiter, M. J. (2009). The relationship between L1 fluency and L2 fluency development. Studies in Second Language Acquisition, 31(4), 533–557. doi:10.1017/S0272263109990015
Hess, G., Sefton, K., & Landry, R. (1986). Sample size and type-token ratios for oral language of preschool children. Journal of Speech, Language, and Hearing Research, 32(3), 536–540. doi:10.1044/hshr.3203.536
Isaacs, T., & Thomson, R. I. (2013). Rater experience, rating scale length, and judgments of L2 pronunciation: Revisiting research conventions. Language Assessment Quarterly, 10(2), 135–159. doi:10.1080/15434303.2013.769545
Isaacs, T., & Trofimovich, P. (2012). Deconstructing comprehensibility: Identifying the linguistic influences on listeners’ L2 comprehensibility ratings. Studies in Second Language Acquisition, 34(3), 475–505. doi:10.1017/S0272263112000150
Onion Software (2016). Hot Soup Processor (Version 3.4a) [Computer program]. Retrieved April 15, 2016, from hsp.tv.
Read, J. (2000). Assessing vocabulary. Cambridge University Press.
RStudio team (2017). RStudio (Version 1.1.463) [Computer program]. Retrieved October 25, 2018, from www.rstudio.com.
Saito, K., Trofimovich, P., & Isaacs, T. (2016). Second language speech production: Investigating linguistic correlates of comprehensibility and accentedness for learners at different ability levels. Applied Psycholinguistics, 37(2), 217–240. doi:10.1017/S0142716414000502
Saito, K., Trofimovich, P., & Isaacs, T. (2017). Using listener judgments to investigate linguistic influences on L2 comprehensibility and accentedness: A validation and generalization study. Applied Linguistics, 38(4), 439–462. doi:10.1093/applin/amv047
Appendix A