Research
Rubric Use in Technical Communication: Exploring the Process of Creating Valid and Reliable Assessment Tools
Researcher’s note (June 2026). This was my first published article — and the first thing to come out of my dissertation, which, if you’ve survived a doctoral program, you know is about the only thing you think about for a couple of years. It ran in a special issue on assessment whose guest editor was Tom Orr, and having someone like Tom usher me into peer-reviewed research was a bigger deal than I understood at the time. I got to reflect on what he meant to the field years later, in a special-issue editorial I co-wrote with Suguru Ishizaki that closed with a tribute to him after his passing. The editor-in-chief, Jo Mackiewicz, made the whole experience just as memorable — she even had authors record short podcasts about their articles to promote the journal, and mine survives as a little 2010 time capsule.
The whole thing happened a little by happenstance. I first learned about rubrics — holistic and analytic, or trait-based — through the TEACH program at Texas Tech, a graduate professional-development program I’ve since realized was extraordinarily rare; most colleagues I meet never had anything like it, for grad students or even for faculty. Those sessions stuck with me, so when it came time to measure my dissertation’s explicit-teaching experiment (its results are in this Journal of Writing Research article), rubrics were how I compared the treatment and control groups. Most people know rubrics as teaching tools; almost no one had treated them as research tools — and that is really what this paper is about: the rigor and reliability you need when your data is student writing, which never measures as cleanly as phenomena in other fields.
My fondest memory is the four undergraduate coders who evaluated all of that dissertation data — Bryan, Darcy, Natalie, and Ashley, whom I lightly disguised in the paper as Bryant, Marcie, Natalia, and Ainsley (subtlety, clearly, was never my strong suit). They were students I’d taught in first-year composition and intro to technical writing, and I’m still in touch with most of them all these years later. I still smile thinking about our norming sessions — arguing over practice claim letters, refining the rubrics until the traits were genuinely mutually exclusive. I didn’t have much money to pay them, so I often paid them in food. Cooking was a skill I picked up in grad school — Lubbock’s options were thin back then — and those big thank-you dinners and lunches are some of the best memories I have of the whole period.
The piece stays evergreen because rubrics never went away. Learning-management systems now ship with rubric tools that make mutual exclusivity far easier to hold; AI makes building intricate, site-specific trait rubrics easier still — exactly the tailoring I argue for here. And the deeper thread runs straight through the rest of my career: the codebook-driven content analyses I’d later run with Erin Friess all trace back to that one lucky TEACH session. It’s the same question I keep asking in my work on AI and assessment — how do you build a tool rigorous enough to trust with something as human as writing?
Abstract
Assessing the quality of student efforts and products is a continual necessity for academics and practitioners in technical communication; however, the process of constructing valid and reliable rubrics remains an underexplored topic in the field. This paper first addresses some of the assessment concerns and then describes a case study that documents the development and implementation of one holistic and five analytic rubrics to evaluate undergraduate projects. The discussion focuses on identifying site-specific criteria and training effective raters and is intended to help academics respond to their required accreditation mandates and offer practitioners alternatives for evaluating products and services.
Index terms — assessment, inter-rater reliability, rubrics, writing assessment.
Introduction
Assessing the quality of student efforts and products is a continual necessity for technical communicators in academe and industry. Tools that can help with these assessment needs are rubrics, which can be defined as criteria-based scoring schemes that “help us to make the decisions needed to evaluate and assess” [1, p. 2]. Rubrics are similar to checklists, but rather than a simple “yes” or “no” answer to questions, rubrics use numeric scores to differentiate low-, middle-, and high-performance levels.
Evaluators assess writing performances holistically or analytically, depending on their purpose. Holistic rubrics capture an impression of overall quality, which is represented by a single numeric score. This approach is preferred where a snapshot of overall quality is desired. Analytic rubrics assess performances on a series of mutually discrete traits, including content areas, writing style, and document design. Each trait receives a single numeric score that can be averaged into an overall assessment. This approach is recommended for day-to-day use because it offers writers guided feedback for revision [2].
This paper begins by describing uses and research on rubrics and their design considerations with respect to validity and reliability. It then describes a case study of designing six rubrics, including their creation, a description of the raters, and the training process used to achieve a statistically acceptable level of inter-rater agreement. The case study is provided to model a complex process that has been underexplored by the field’s literature and to show instructors and researchers how to address the limitations of their assessment projects and maximize the reliability of their results.
These rubrics were used to compare the effects of teaching technical writing genres through explicit teaching versus more traditional, lecture-based approaches. Explicit teaching is defined as any discussion of a genre’s formal features, including discussion of the cultural, political, or social factors that shape these features [3]. It was hypothesized that students learning via explicit teaching would construct documents that targeted the rhetorical situation more successfully than students learning by traditional methods.
Uses of Rubrics
Rubrics can be used in classroom evaluation, accreditation evaluation, and research projects.
Classroom
In the classroom, rubrics often function as pedagogical tools that outline an instructor’s expectations and educate students on how to meet those expectations. Assessment experts also encourage student participation in rubric construction. This process instills students with a sense of ownership of the assessment process and teaches them the qualities of good writing [1]. The learner-centered environment promoted by this collaboration strengthens students’ individual competencies and the quality of their peer reviews.
Both general and specific rubrics are widely used in technical communication classrooms and were identified as one of the most frequent means for providing feedback to undergraduate and graduate students [4]. General rubrics are used to judge similar types of performances, such as all of the student assignments in a technical writing service course. Thomas created a general rubric for this purpose as well as to encourage a dialogue with students about writing [5]. Similarly, general rubrics have been used to help students assess the persuasion and argumentation levels in their managerial writing [6]. Students later extended this knowledge to assess the quality of their job letters, which often contain bold claims that lack supporting data.
Specific rubrics are used to judge the quality of distinct performances. Taylor developed a specific rubric to help teaching assistants in mechanical engineering assess their students’ lab reports [7]. Engineering TAs were cited as not being well trained before entering the classroom; using the rubric helped them norm their commentary across the course and facilitate learning rather than simply disseminating information.
Finally, rubrics used for classroom evaluation serve as a means for countering claims that students have limited control of their grades because of their instructor’s unpredictable biases and whims. Specifically, college and adult learners can be categorized as either grade motivated or knowledge motivated [1, p. 105]. Well-designed rubrics appeal to the needs of both groups, offering the knowledge-motivated students the qualitative feedback they require to progress and the grade-motivated students the quantitative data they use to assess their progress in the course.
Accreditation
Rubrics create transparency in the accreditation process, teaching participants how standards are evaluated and showing accreditors how systematic assessments are conducted. The No Child Left Behind Act (NCLB) of 2001 mandates that all public K-12 schools define measurable student-achievement levels; these standards are also required by most university and college accreditation councils. Defining these outcomes objectively can provide program administrators and instructors with valuable information that not only meets accreditation demands but also promotes pedagogical or curricular change. Practitioners also benefit from following the accreditation status of academic programs they recruit from.
Rubrics used for accreditation purposes include external and internal assessments. Within the fields of technical and business communication, external accreditation demands seem to motivate most rubric research [5], [8]–[10]. These fields also employ internal assessment to evaluate the success of a program or course [8], [11], [12]. In particular, rubrics are a popular means for assessing course portfolios because they complement the assignment’s main goal of capturing a meaningful application of students’ knowledge and discipline-specific skills. Analytic rubrics can also be developed to evaluate programs and students on a variety of site-specific traits.
Research
In research, rubrics are most frequently used to evaluate documents, written by experimental and control groups, in a consistent and reliable way. Rubrics used to judge writing quality on multiple traits can offer researchers a wealth of information. A recent study used a five-trait rubric to assess the overall writing quality of L2 students who learned the legal memorandum through explicit teaching compared to students who learned the genre through a more context-focused, implicit instruction [13]. Results indicated that the students who received the explicit teaching treatment significantly outperformed the other students in argumentation and local organization (i.e., organization at the sentence level and between ideas), which the cited literature identified as the most genre-specific areas of the legal memorandum.
Rubrics used in research studies also help raters evaluate every document on the same criteria. A recent study in technical communication used multiple-trait rubrics to compare the quality of job materials produced by students whose course had an online component to the materials of students without this component [14]. Further research within business and technical communication can offer practitioners and managers rubrics that yield comparable findings on writing quality and specific rhetorical preferences, such as comparing software documentation across multiple versions.
Strengths and Weaknesses of Rubrics
Holistic Rubrics
An advantage of holistic rubrics is their efficiency in assessing a single skill like writing proficiency. In fact, the timed, impromptu essay, which traditionally employs holistic scoring, is cited as the best-researched assessment type [15, p. 59]. Instructors initially saw the benefit of impressionistic evaluations of their students’ writing, for example, at the beginning of the semester compared to the end. However, the expansion of national and state testing has taken holistic scoring outside the classroom, and the same aspects that made it a success have led to its occasional overuse, misapplication, and distrust concerning its merits. As White observed in one such situation:
A scoring methodology that worked well for defined essay questions, when applied sensitively and collegially by a coherent faculty group, had expanded into a one-size-fits-all scoring system delivering reliable grades for many, sometimes quite inappropriate, purposes. [16, p. 585]
Faculty and researchers have also expressed validity concerns related to holistic scoring. Validity refers to how well the assessment targets its purpose. Holistic scores are often given on impromptu essays written under time constraints, such as the writing portions of the SAT, ACT, and GRE exams. These controlled testing environments may not accurately measure students’ true performance, leading some instructors to question the validity and reliability of the scoring. These holistic essay assessments are also often designed to emphasize students’ strengths, which may not indicate appropriate placement decisions [17], [18]. Finally, holistic scores do not educate students as to why their writing performance merited a specific placement, limiting their usefulness as a pedagogical tool. Students would not know if their score of “4,” for example, was due to poor organization or lack of content.
Holistic scoring has also been scrutinized for its reliability, which depends on the consistent application of the tool across multiple raters. The single score generated from this assessment form can mask raters’ biases [19, p. 195]. Reliability concerns can be addressed by involving multiple raters who are trained with “specimen papers,” a collection of authentic samples that reflect each of the rubric’s achievement scales [19, p. 194]. Raters can refer to these agreed-upon samples as they score the data corpus, increasing the likelihood for mutual agreement. However, this approach can force raters to conform to established criteria, which disfavors their writing expertise [20].
Though holistic scoring potentially limits the inferences researchers can make with their data, the approach can still prove meaningful [21]. When comparing student groups for an experimental, writing-based study, for example, measuring the groups’ initial writing proficiency has been noted as important in determining if the final results reflect the investigated treatment or an initial imbalance between the groups [22, pp. 228–229]. Likewise, using a holistic rubric to assess the writing proficiency of prospective job applicants could serve a purpose beyond screening candidates by providing objective, longitudinal employment standards that remain consistent, regardless of the assessor. The acknowledged limitations of holistic rubrics should not deter their use, provided they are applied in ways that reflect their intended purpose.
Analytic Rubrics
Analytic rubrics compensate for many of the shortcomings of holistic rubrics, but they also have validity and reliability issues. The major advantage of analytic rubrics is their ability to assess multiple traits of a single performance, offering a depth of data that can identify specific strengths and deficiencies within a curriculum or program. Analytic rubrics are also more context-aware because they are constructed to assess specific tasks. While these rubrics cannot target every aspect of writing ability, they allow evaluators to select the most contextually relevant criteria [23]. The site specificity of analytic rubrics creates more complex tools that can grow to the point of becoming “dysfunctionally detailed” [24, p. 98], which can impede student learning if there are multiple approaches to successfully completing the performance under evaluation [25].
However, analytic rubrics function beyond writing assessment purposes. An analytic rubric could guide usability specialists on how to evaluate users’ experiences with navigating a website. Managers could also use analytic approaches to measure the effectiveness of their employee training practices. The data yielded from these assessments can create a checks-and-balances system that illustrates actual effectiveness of products and services rather than presumed effectiveness.
Analytic rubrics also have reliability issues. Overly detailed rubrics can invite rater interpretation and thereby subjective scoring [26]. Results of experimental research, however, suggest that analytic rubrics are highly reliable, specifically with inexperienced raters [23], [27]–[29]. These studies indicate that analytic assessment produces a higher rate of inter-rater reliability than holistic assessment because of how the agreement score is calculated. In a holistic assessment, agreement is calculated on a single score from each rater. With a five-trait analytic rubric, agreement is calculated on a composite score from each rater, which absorbs some disagreement between categories. One study found that the between-rater agreement on individual rubric traits ranged from 61% to 91%, producing consistent composite scores of above 90% [30, p. 350]. Nevertheless, composite scores can inflate reliability because they do not account for chance agreement between the raters [31].
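The gap between trait-level and composite agreement can be illustrated with a minimal sketch. The scores below are invented for illustration, and the sketch assumes that a composite is the sum of a rater's five trait scores and that two composites "agree" only when they are equal; the cited study may have calculated agreement differently.

```python
# Hypothetical scores: two raters, three documents, five traits each (1-5 scale).
rater_a = [[4, 3, 5, 4, 3], [2, 3, 3, 4, 4], [5, 5, 4, 4, 5]]
rater_b = [[3, 4, 5, 4, 3], [2, 3, 4, 3, 4], [5, 5, 4, 4, 5]]

def trait_level_agreement(a, b):
    """Exact agreement for each trait, taken across all documents."""
    n_docs, n_traits = len(a), len(a[0])
    return [
        sum(a[d][t] == b[d][t] for d in range(n_docs)) / n_docs
        for t in range(n_traits)
    ]

def composite_level_agreement(a, b):
    """Share of documents whose summed trait scores are identical (assumed definition)."""
    return sum(sum(x) == sum(y) for x, y in zip(a, b)) / len(a)

trait_agreement = trait_level_agreement(rater_a, rater_b)
composite_agreement = composite_level_agreement(rater_a, rater_b)
print(trait_agreement)      # four traits agree on 2 of 3 documents, one on all 3
print(composite_agreement)  # 1.0: offsetting trait disagreements cancel in the sum
```

In this invented data set, the raters disagree on four of the five traits at least once, yet every composite matches because a higher score on one trait offsets a lower score on another.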
Research on Rubrics in Technical Communication
The impact of NCLB at the K-12 levels has heightened the public’s awareness of assessment and inspired a culture of testing that demands more formalized measures within higher education and the workplace. The field of technical communication offers little research investigating the process of creating valid and reliable assessment rubrics despite their popularity and the presumed acceptance of their pedagogical value. Technical communicators can benefit from learning about the construction of rubrics to improve their curriculum development and to prepare for their own accreditation and certification processes. Likewise, a call for assessment tools sophisticated enough for comprehensive statistical analyses, particularly in experimental and ethnographic studies, has been made [6].
The following section explores how valid and reliable rubrics can be designed for site-specific purposes.
Designing Rubrics
There are two issues to consider when designing a rubric: (1) validity, including the selection of the performance criteria; and (2) reliability, including the training of effective raters and the calculation of their agreement level.
Validity
Ensuring a rubric’s validity includes acknowledging the purpose of the assessment and the strengths and limitations of the form used to collect the data. In rubric construction, defining the areas of assessment is paramount in developing a tool that reflects a discourse community’s goals.
Criteria Selection: Research in technical communication and its closely related fields suggests several approaches for developing site-specific criteria. Johnson, for example, describes a process for assessing the online portfolios produced in a technical writing course, beginning with a revision of the course that required instructors to use an agreed-upon textbook and teach the same five genre-based modules. The consistency implemented at the course level allowed the assessors to develop a rubric that encompassed their shared needs. Identifying themes through interviews with experts and participants in the targeted community [7], [32], [33] as well as reading samples of the actual data corpus [9] are also offered as approaches to defining criteria.
Site-specific criteria can also be adapted from existing information, including textbooks and theoretical models. Textbooks used to teach technical and business writing genres, for example, often articulate consistent standards and practices. Their accompanying instructor manuals also include annotated models of these genres, which emphasize the core function and formal features.
Assessing Validity: Assessing how effectively a rubric responds to its purpose begins with selecting the appropriate form. Again, holistic forms describe an entire performance while analytic forms dissect a performance into multiple, mutually exclusive traits.
The academic fields of business and technical communication appear to favor analytic over holistic forms because of their ability to capture the complexities of workplace writing. Further research may also indicate that business managers prefer the analytic form. Managers no longer rely on “the bottom-line” to make decisions, an approach that might have favored a holistic form. Instead, the “balanced scorecard” has been cited as the most influential and practiced decision-making plan for organizations [34, pp. 2–3]. The approach moves inquiry beyond the bottom line and allows managers to make informed decisions based on a variety of variables, an approach that seems to favor the analytic form.
Finally, the number of achievement scales on a rubric and their accompanying descriptions also influence validity. Scales refer to the word and/or number rating that describes each achievement level (e.g., “weak” or 1, “superior” or 5).
Three to six points is appropriate for measuring most classroom performances, but each point must articulate a distinct level of achievement [2]. Nitko suggests using a scale that is comparable to a grade spread, particularly if the rubric is also being used to help students identify their rhetorical strengths and weaknesses [19]. This five-point scale would also benefit researchers who conduct experimental pedagogical studies and need to compare their data set to existing grade spreads. However, this scale has also been discouraged because five points could cause associative interpretations about the assessment, such as the third point of achievement being of average performance [26]. Therefore, the number of scales to include with a rubric depends on its purpose.
The performance descriptions should use parallel language, such as consistent use of adjectives or adverbs (e.g., proficient, capable, adequate, limited, poor) [26]. The scales must also include clear language to discourage subjectivity. Quantifying descriptors can reduce rater interpretation [1]. A descriptor that reads “very frequent grammatical errors,” for example, invites more subjectivity than “five or more grammatical errors.”
Reliability
There are two issues to consider with respect to rubric reliability: (1) the experience and training of the raters and (2) the method used to calculate the agreement between raters. There are various complexities involved in assessing reliability that are relevant to technical and business communicators. It has been noted that holistic assessment, for example, forces raters to conform to established criteria [20], and training a group of raters on the same criteria may only yield data that reflects attention to superficial features [17]. General criteria may also encourage raters to score based on their personal preferences if they are unable to sort the evaluated writing into one of the defined scales [35]. This same claim has also been made concerning analytic rubrics that include a dysfunctional amount of criteria [24]. Despite these assertions, however, little empirical research has investigated these reliability issues, particularly in the training process.
Rater Experience: The empirical research that has addressed reliability offers contradictory evidence as to what qualities constitute a good rater and whether student raters can be normed to effectively apply an assessment tool. Prior teaching experience and sufficient understanding of what constitutes “good writing” have been identified as ideal qualities in a rater [36]. Experience in the former is especially important in placement assessment. However, some studies suggest there may be limitations to the effectiveness of raters with teaching experience. Thomas and McShane noted that their department’s instructors consistently rated their own students’ portfolios a point higher than the second faculty rater, suggesting that even experience does not eliminate bias [12]. Additionally, Johnson’s findings suggest that teaching experience does not instantly translate into rating efficiency. Many of her department’s instructors experienced difficulty with teaching the agreed-upon modules because of their unfamiliarity with web design. This initial unfamiliarity could have affected the subsequent assessment of the students’ online portfolios. In fact, Johnson reported an initial disconnect between the assessment scores of the portfolio and the course grades. The correlation between these variables eventually grew, suggesting that students were being evaluated with more accuracy because the instructors were more comfortable with the course content.
Research on how student raters use rubrics further complicates identifying the qualities of effective raters. Since most rubric research focuses on the instructor’s perspective, investigations into how students, or “pedagogically naïve raters,” apply this assessment tool are often underrepresented [27, p. 1510]. A recent study measured how college-level biology students used a rubric to assess their peers’ performance on an oral presentation [27]. Students received no training other than an introduction to the rubric and a review of its five achievement scales. Analysis of a three-year data corpus revealed no statistical significance between the student raters’ mean score and the instructors’ score. Likewise, the students’ academic standing in the course did not seem to influence how accurately they applied the rubric to their peers’ work, suggesting this assessment tool’s clarity and effectiveness.
Rater Training: ESL studies have also revealed insight concerning the value of training raters. A recent control-group experiment indicated that a lack of training does not seem to impair peer assessment of oral presentations (when compared to their instructor’s grades); however, training does enhance the usefulness of the peer commentary [37]. Other research indicates that inexperienced raters benefit from extensive training and interaction with experienced raters [29]. Training also significantly helps inexperienced raters develop intra-rater reliability (self-consistency) [28]. These findings complement the earlier referenced literature regarding the value of involving students in rubric creation. This collaboration instills students with a sense of ownership in the assessment tool and process. This ownership is arguably sustained if the students are trained as raters. In fact, collaboration is cited as an important element to rubric design, regardless of the raters’ experience, because the process creates a “ripple effect” that improves the quality of the assessment and, when applicable, the classroom instruction [38, p. 221].
Other research reports varying approaches to rater training. A consistent training process is described in rubric articles within business and technical communication: The raters meet, review the rubric’s criteria, and then independently evaluate 5 to 15 documents. After comparing and discussing their scores, the raters revise the rubric to reflect the performance under evaluation [6], [8], [11], [12]. This process continues until the raters are comfortable with the rubric and/or achieve a predetermined level of agreement. The minimum for inter-rater agreement depends on the source, but most psychometricians suggest a range between 70% and 80% [22], [39].
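The stopping rule in this training cycle can be sketched in a few lines. The rater labels and scores are hypothetical, and the sketch assumes that agreement is measured as mean pairwise exact agreement and that the target is the 70% lower bound of the range cited above; a given project might choose a different statistic or threshold.

```python
from itertools import combinations

def exact_agreement(scores_a, scores_b):
    """Share of documents on which two raters gave the identical score."""
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)

def mean_pairwise_agreement(scores_by_rater):
    """Average exact agreement over every pair of raters."""
    pairs = list(combinations(scores_by_rater.values(), 2))
    return sum(exact_agreement(a, b) for a, b in pairs) / len(pairs)

# Hypothetical practice round: four raters score the same five documents (1-5 scale).
practice_round = {
    "rater_1": [4, 3, 5, 2, 4],
    "rater_2": [4, 3, 5, 2, 3],
    "rater_3": [4, 3, 4, 2, 4],
    "rater_4": [4, 3, 5, 2, 4],
}

TARGET = 0.70  # assumed threshold: lower bound of the 70-80% range

agreement = mean_pairwise_agreement(practice_round)
needs_another_round = agreement < TARGET
print(round(agreement, 2), needs_another_round)
```

If `needs_another_round` were true, the raters would discuss their disagreements, revise the rubric, and score a fresh set of practice documents.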
Assessing Reliability: Assessing reliability establishes the consistency and stability of the assessment tool. Research across the disciplines offers varying approaches to calculating inter-rater agreement. An investigation of composition research identified at least eight different statistics for assessing inter-rater agreement [40]. This study also noted researchers’ tendency to report the agreement percentages but to neglect mentioning the statistic used to calculate that reliability. Rubric research in business and technical communication reflects the same tendency and offers a number of reliability statistics, including percent and adjacent agreement, Pearson correlation coefficient, Cronbach’s alpha, and Cohen’s kappa [6], [8], [11], [12], [14].
Psychometricians offer insight into how a specific reliability statistic can meet the needs of an assessment project. Percent or adjacent agreement (i.e., agreement within a single point) is not considered the most accurate means for assessing reliability because it does not account for chance agreement between the raters’ scores and can inflate the agreement levels [31], [41].
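As a rough illustration of how much the choice between these two statistics can matter, the following sketch computes both on the same invented scores. The function names and data are assumptions made for the example, not part of any cited study.

```python
def exact_agreement(scores_a, scores_b):
    """Percent agreement: share of documents with identical scores."""
    return sum(a == b for a, b in zip(scores_a, scores_b)) / len(scores_a)

def adjacent_agreement(scores_a, scores_b, tolerance=1):
    """Share of documents whose two scores differ by no more than `tolerance` points."""
    within = sum(abs(a - b) <= tolerance for a, b in zip(scores_a, scores_b))
    return within / len(scores_a)

# Hypothetical scores from two raters on ten documents (1-5 scale).
rater_a = [5, 4, 3, 4, 2, 3, 4, 5, 3, 4]
rater_b = [4, 4, 2, 5, 2, 4, 3, 5, 4, 3]

exact = exact_agreement(rater_a, rater_b)
adjacent = adjacent_agreement(rater_a, rater_b)
print(exact, adjacent)  # 0.3 1.0
```

The same pair of raters looks unreliable at 30% exact agreement and perfectly reliable at 100% adjacent agreement, which is one reason the statistic behind a reported percentage should be named. Neither figure corrects for agreement that would occur by chance.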
Both the Pearson correlation coefficient and Cronbach’s alpha are classified as consistency estimates, which recognize that two raters may not interpret the rating scale in the same way but are consistent in their own application of the scale [42]. However, both statistics have been labeled insufficient for assessing reliability because they account for a pattern in score distribution rather than a consistency among raters [43]. This point is particularly important when considering how raters interpret an assessment rubric. Raters’ inability to consistently apply criteria may suggest a validity threat to a rubric’s design.
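The difference between consistency and consensus can be shown with a deliberately extreme, invented case in which one rater always scores exactly one point lower than the other. The sketch below computes the Pearson correlation and Cronbach's alpha from their standard formulas, treating each rater as an "item" for alpha; the data and function names are assumptions for illustration.

```python
from statistics import mean, pvariance

def pearson_r(x, y):
    """Pearson correlation coefficient between two raters' scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def cronbach_alpha(ratings):
    """Cronbach's alpha; `ratings` holds one score list per rater, aligned by document."""
    k = len(ratings)
    sum_of_rater_variances = sum(pvariance(r) for r in ratings)
    totals = [sum(doc_scores) for doc_scores in zip(*ratings)]
    return k / (k - 1) * (1 - sum_of_rater_variances / pvariance(totals))

# Hypothetical raters: the strict rater is always one point below the lenient one.
lenient = [2, 3, 4, 3, 5, 4]
strict = [1, 2, 3, 2, 4, 3]

r = pearson_r(lenient, strict)
alpha = cronbach_alpha([lenient, strict])
exact = sum(a == b for a, b in zip(lenient, strict)) / len(lenient)
print(round(r, 3), round(alpha, 3), exact)  # 1.0 1.0 0.0
```

Both consistency estimates reach their maximum even though the raters never assign the same score, because each rater applies the scale in a stable way. The sketch does not guard against zero variance, which would cause a division error if every document received the same score.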
Similar to percent and adjacent agreement, Cohen’s kappa is a consensus agreement that evaluates raters’ scores for exact agreement [42]. However, this statistic is a more rigorous means of assessing reliability because it accounts for chance when measuring the level of agreement between two raters [41]. The statistic is the most widely used inter-rater measurement in the behavioral sciences and is ideal if researchers fear an inflated agreement level because observations may fall into a single category [42], [44]. Since Cohen’s kappa is designed to assess ordinal or scale data, it may be a particularly useful statistic for rubric assessment.
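The chance correction takes the standard form kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of exact agreement and p_e is the proportion expected by chance from each rater's score distribution. The sketch below implements the unweighted statistic, which treats each score as a category; weighted variants that give partial credit for near misses exist but are not discussed here. The scores are invented so that most observations fall into a single category.

```python
from collections import Counter

def cohens_kappa(scores_a, scores_b):
    """Unweighted Cohen's kappa for two raters scoring the same documents."""
    n = len(scores_a)
    observed = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    counts_a, counts_b = Counter(scores_a), Counter(scores_b)
    # Chance agreement: probability both raters pick the same category independently.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical scores: both raters assign a 4 to most of the ten documents.
rater_a = [4, 4, 4, 4, 4, 4, 4, 4, 3, 5]
rater_b = [4, 4, 4, 4, 4, 4, 4, 3, 4, 5]

percent_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
kappa = cohens_kappa(rater_a, rater_b)
print(percent_agreement, round(kappa, 2))  # 0.8 0.41
```

Here 80% exact agreement shrinks to a kappa of roughly 0.41 because two raters who both favor a score of 4 would agree about 66% of the time by chance alone. The function is undefined when expected agreement equals 1, as when both raters give every document the same score.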
Finally, a minority of researchers argue against assessing reliability. Warnock, for example, did not assess the reliability of his assessment rubric to avoid “an illusionary game of validity in which evaluators struggle to establish idealized versions of the writing traits that meet specific writing criteria” [9, p. 90]. He concedes that norming may prove useful in writing assessment where placement decisions are being made, but
[t]oo often, it seems, writing-assessment research is handcuffed by exaggerated needs to normalize or synchronize the views of assessors; we believed that we could achieve meaningful results without ignoring the effects of context and by respecting the natural subjectivity of the task. [9, p. 98]
While assessing reliability may evolve into an optional choice, it remains an important consideration to any assessment tool and directly impacts the validity of any assessment practice. Specifically, calculating inter-rater reliability can reveal validity flaws with a rubric’s design and lead the researcher to reevaluate the soundness of the instrument.
Case Study
The second half of this paper describes the process of developing valid, reliable, site-specific rubrics. These rubrics were used to compare the effects of teaching technical writing genres through explicit teaching versus more traditional, lecture-based approaches. It was hypothesized that students taught with explicit teaching, which emphasizes a genre’s formal features, would produce writing that better targeted the rhetorical situation compared to students taught through more traditional instructional approaches. Six rubrics were used in this study: a holistic rubric that assessed students’ overall writing proficiency and five analytic rubrics that assessed students’ overall mastery of the job letter, résumé, claim letter, recruitment email, and instruction set. The 192 student subjects were spread across 12 sections of a credit-bearing introductory course in technical writing and taught by one of six instructors. Regardless of the instructor, every course section was organized around a common syllabus.
The experimental design of this case study attempted to emulate an authentic assessment environment, which is relevant to how the scoring rubrics were developed and used. First, the study used a control-group, quasiexperimental design. Quasiexperiments are designed to be used in authentic settings, such as the classroom or workplace, because they have a higher likelihood of capturing an organic response than laboratory-based experiments [45]. Next, students were not aware their writing was being assessed, eliminating any anxiety associated with being tested. (I received exempt status from the Office of Research Services for this study.) Last, the assignment descriptions for the five assessed genres were situated in authentic rhetorical situations. For example, students had to target their job materials toward an actual position that they were currently qualified for rather than their dream job. All of these variables contributed to the rubrics’ construction and application. This quasiexperimental design, where students in real college classes are taught via different methods, is a common experimental design. It is also a useful design for evaluating two versions of a document, website, or training module.
The rubrics were applied by student raters to evaluate papers that had been written by students in the control and experimental sections. Although rubrics have successfully been used to help teach students expectations about assignments, the rubrics described here were created for research purposes. As such, the student authors did not interact with the rubrics. Students’ final rubric scores were never shared with the instructors so they in no way affected the students’ assignment or course grades. These rubrics were used exclusively to standardize evaluation of student documents, much like in an accreditation or program-evaluation scenario.
Raters
I recruited four undergraduates to evaluate the data corpus. These raters, whom I will call Natalia, Ainsley, Marcie, and Bryant, were all previously enrolled in my 3000-level report-writing course. Based on the literature, these raters would be classified as “pedagogically naïve.” None of them had previous rating experience or were degreed professionals; however, they all had taken at least nine hours of college-level writing and were therefore familiar with the assessed genres and the qualities of good writing. Two of the raters were accounting majors, one was a public relations major, and one was an agricultural and applied economics major. Two of the raters were seniors, and two were juniors, with an overall average GPA of 3.42. Raters were paid US$10 an hour with funds from my summer dissertation award. A financial investment arguably produces higher reliability in any assessment project [46]. This incentive also pairs with the observation that students develop a sense of ownership of the instrument when they are involved in the evaluation process [1]. Raters were heavily involved with revising the analytic rubrics, adding to their investment in this study.
Training
Earlier-cited literature also indicated that inexperienced raters could apply rubrics with extensive training. The raters met with me for training sessions every week for six weeks; each session lasted an average of two hours. Raters knew they were evaluating student writing from a technical writing course, but I never informed them of my research hypotheses to help reduce rater bias. Additionally, personal information was removed from the writing so the raters did not know the identity of the student authors. Raters also signed a nondisclosure agreement that barred them from discussing their work. Raters were trained on all rubrics using authentic student examples. Once the assessment of the formal data corpus began, I provided raters with clusters of 10–15 documents. Each cluster was organized into an envelope and included a random sampling of documents across the 12 evaluated course sections. Each document was independently scored by two raters, and their final scores were recorded on spreadsheet software.
Case Study’s Holistic Rubric
Since this case study’s experimental design involved control groups, the writing quality of all students was assessed with a timed, impromptu essay and then evaluated with the holistic rubric to measure the initial between-group equality. All students wrote a memo to their instructor that (1) discussed their academic major and desired career; and (2) described the writing they would likely encounter in their chosen profession. Students completed this assignment in 20 minutes on the first day of class. The time limit ensured that no course section received more or less writing time than the other sections.
Establishing Validity: Raters assessed these memos by using a modified version of White’s holistic rubric, which he created for timed first-draft writing [18]. The rating scale was organized into six achievement scales, ranging from “superior” (6) to “incompetent” (1). Raters were trained on the holistic rubric during the first 2-hour training session. Three other training sessions focused on this rubric but in smaller time increments. At the first training session, the raters received a copy of White’s rubric, the writing prompt, and 10 student examples.
White designed the rubric around the idea that students should be rewarded for what they do well; even the best writing contains errors, though consistent error patterns often indicate a lower score. I began the training session by describing White’s design rationale and emphasizing that the students wrote their memo on the first day of class under a 20-minute time restriction. I believed that contextualizing the testing environment would produce results that reflected the situation. Raters spent considerable time scoring the first examples and familiarizing themselves with the language that defined the six achievement scales. (See Appendix A.) After scoring each example, we discussed our reactions to it and our score. One of the earliest discussions concerned the following passage:
Because Financial Planning is a client based business, in order to attract clients, one must be able to communicate effectively. Even if I have great financial strategies, if I am unable to communicate those strategies effectively to my clients, then how will I manage their portfolios? I will most likely be unable to maintain them as clients.
Raters weighted this particular passage differently, suggesting early signs of bias. Bryant’s “inadequate” (2) score was primarily based on this passage. He was unimpressed with its repetitious ideas and phrases, noting that the student failed to expand on what “communicating effectively” meant or what “effective strategies” might be used to target clients. Bryant justified his score by referring to White’s descriptor that an “inadequate” response likely “show[s] patterns of serious error”; he considered the writer’s use of repetition “serious.” The other raters shared Bryant’s opinion of the passage but scored the memo “competent” (4). Ainsley referred the group to another caveat of White’s description of “inadequate”: that the response misunderstood or confused the question or used superficial or stereotypical language, none of which was present in this particular response. Furthermore, Ainsley argued that the student developed a strong paragraph on the types of writing she expected to do on the job, an explicit criterion of the assessment prompt.
This example illustrated many of the initial issues that arose during the first training session. Bryant admitted that repetition was an issue he struggled with in his writing, and his lower score was likely a result of this heightened awareness. Instructors might strive to instill this type of awareness in their students, but it is not always a desirable trait in research assistants, especially since Bryant seemed to struggle with scoring holistically versus scoring based on his preferences. As mentioned earlier, low inter-rater reliability is often attributed to evaluators responding to elements outside the rubric’s language or weighting criteria differently than explicated by the rubric [47]. However, what Bryant considered a “serious error” versus what the other raters considered serious raised a strong point concerning the appropriateness of White’s original performance descriptors. Though the rubric likely assessed writing proficiency for White’s purposes, it needed to be tailored to my study’s purposes.
The first session then transitioned from training to revision. At my instruction, the raters independently reviewed the student samples and selected one example that illustrated each achievement scale. After discussion and consensus, I annotated these examples and distributed them as “specimen papers,” which raters could reference as they trained and then coded [19, p. 194]. The results of this exercise allowed the raters to interpret White’s language in a manner that was meaningful to them, training them to differentiate between achievement scales while simultaneously expanding their understanding of “good” versus “bad” writing. In addition, Natalia suggested revising the rubric’s language to reflect the competency level used to address the writing prompt’s two content areas. For example, assessments that failed to address one of the two required content areas would receive an “inadequate” (2) score. This revision process was crucial in achieving rater consensus and reliability.
Establishing Reliability: After the raters and I finished discussing students’ samples and revising the holistic rubric, initial inter-rater reliability was measured with the average pairwise percent agreement (APPA). This statistic is calculated by dividing the number of identical pairwise codes by the total number of pair comparisons. This approach measures reliability among three or more raters and is considered a more precise assessment of agreement than an exact percent agreement, which counts only the cases in which every rater assigns the same score. For example, if three raters scored a 5, 5, 4, respectively, on the same performance, one of the three rater pairs would agree, so the pairwise agreement would average 33% (an exact percent agreement would be 0%) [48].
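The calculation can be illustrated with a short Python sketch. The function name and the input layout (one list of scores per rater) are illustrative assumptions, not part of the study’s materials; the example simply reproduces the 5, 5, 4 case described above.

```python
from itertools import combinations


def average_pairwise_percent_agreement(scores_by_rater):
    """Return the share of identical codes across all rater pairs.

    scores_by_rater: one list of scores per rater, all of the same
    length, where position i in every list refers to the same document.
    (Hypothetical input format chosen for this sketch.)
    """
    n_items = len(scores_by_rater[0])
    if any(len(scores) != n_items for scores in scores_by_rater):
        raise ValueError("every rater must score the same documents")

    identical = 0
    comparisons = 0
    # Compare every pair of raters on every document.
    for rater_a, rater_b in combinations(scores_by_rater, 2):
        for score_a, score_b in zip(rater_a, rater_b):
            identical += score_a == score_b
            comparisons += 1
    return identical / comparisons


# Three raters score one performance as 5, 5, and 4.
three_raters = [[5], [5], [4]]
appa = average_pairwise_percent_agreement(three_raters)
print(round(100 * appa, 1))  # 33.3
```

Because every pair rates the same number of documents, pooling the comparisons in this way gives the same result as averaging the separate pair percentages.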
Table I. Average Pairwise Percent Agreement.
| Session | Pct. Agr. | Natalia/Marcie | Natalia/Bryant | Natalia/Ainsley | Ainsley/Marcie | Ainsley/Bryant | Bryant/Marcie |
|---|---|---|---|---|---|---|---|
| A (n=8) | 50% | 37.5% | 37.5% | 75% | 25% | 50% | 75% |
| B (n=8) | 62.5% | 75% | 50% | 87.5% | 87.5% | 37.5% | 37.5% |
| C (n=10) | 65% | 80% | 40% | 90% | 70% | 50% | 60% |
| D (n=12) | 70.8% | 83.3% | 66.67% | 91.67% | 75% | 58.33% | 50% |
Table I shows that the average pairwise percent agreement improved with each training session, beginning with 50% agreement and ending with 71%. Inter-rater reliability among all four raters was only assessed during the training process to ensure the rubric maintained an acceptable level of agreement through its revisions. Table I’s agreement breakdown indicated that some combinations agreed more often than others. Natalia and Ainsley averaged 75% agreement on the first session, an acceptable rate by most statistical standards, and strengthened that reliability to 92% by the final session. Bryant and Marcie also averaged 75% agreement during the first session but could not maintain this level. In every session, one or both of them belonged to the pairs with the lowest agreement.
After the second training session, I worked with Bryant and Marcie individually (two one-hour sessions) to improve their reliability. The individual attention seemed to help Marcie. Table I shows considerable improvement in her agreement level with Natalia and Ainsley. Bryant continued to struggle and, during the last training session, was always one part of the pair that scored the three lowest levels of agreement.
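The figures in Table I can be rechecked with a few lines of Python. The percentages below are transcribed from Table I; the dictionary layout and variable names are illustrative choices for this sketch rather than anything used in the study.

```python
# Pairwise percent agreement per training session, transcribed from Table I.
table_i = {
    "A": {"Natalia/Marcie": 37.5, "Natalia/Bryant": 37.5, "Natalia/Ainsley": 75.0,
          "Ainsley/Marcie": 25.0, "Ainsley/Bryant": 50.0, "Bryant/Marcie": 75.0},
    "B": {"Natalia/Marcie": 75.0, "Natalia/Bryant": 50.0, "Natalia/Ainsley": 87.5,
          "Ainsley/Marcie": 87.5, "Ainsley/Bryant": 37.5, "Bryant/Marcie": 37.5},
    "C": {"Natalia/Marcie": 80.0, "Natalia/Bryant": 40.0, "Natalia/Ainsley": 90.0,
          "Ainsley/Marcie": 70.0, "Ainsley/Bryant": 50.0, "Bryant/Marcie": 60.0},
    "D": {"Natalia/Marcie": 83.3, "Natalia/Bryant": 66.67, "Natalia/Ainsley": 91.67,
          "Ainsley/Marcie": 75.0, "Ainsley/Bryant": 58.33, "Bryant/Marcie": 50.0},
}

# Average the six pair percentages for each session.
session_means = {
    session: sum(pairs.values()) / len(pairs)
    for session, pairs in table_i.items()
}
for session, mean in session_means.items():
    print(session, round(mean, 1))  # A 50.0, B 62.5, C 65.0, D 70.8

# The three lowest-agreeing pairs in the final session all include Bryant.
lowest_d = sorted(table_i["D"], key=table_i["D"].get)[:3]
print(lowest_d)  # ['Bryant/Marcie', 'Ainsley/Bryant', 'Natalia/Bryant']
```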
Before the coding of the data corpus began, I made a few usability-related revisions to the rubric. I reformatted the rubric so that it fit on one page, but I also included a description of the two required content areas and reminders that the assessments were written by students enrolled in a sophomore-level course under a 20-minute time constraint on the first day of class. For the first round of formal coding, I paired Ainsley with Natalia and Bryant with Marcie since these pairs seemed to norm the best during the training. For their first coding batch (i.e., an envelope containing a random selection of 10–15 assessments), Natalia and Ainsley averaged 80% agreement, and Bryant and Marcie averaged 71%.
Cohen’s kappa assessed the inter-rater reliability of the actual data because it is designed to calculate the agreement between two raters. Using three or more raters to assess the formal data corpus would have reduced the likelihood of agreement due to chance, but additional raters were too expensive for this project. The kappa test identified an overall agreement of 85.4% on the writing assessment, indicating a high level of consistency. On average, students scored a 3.75, slightly above the bottom half of the 6-point scale.
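For readers unfamiliar with the statistic, the following Python sketch shows how an unweighted Cohen’s kappa discounts the agreement that two raters would be expected to reach by chance. It is a minimal illustration under stated assumptions: the study does not report which software or which variant of kappa was used, and the two score lists are invented for the example rather than drawn from the data corpus.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same documents."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters must score the same non-empty set")

    # Observed agreement: share of documents given identical scores.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each score, the product of the two raters'
    # marginal proportions, summed over all scores.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_chance = sum(counts_a[s] * counts_b[s] for s in counts_a) / (n * n)

    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical holistic scores (1-6 scale) for ten memos; not study data.
natalia = [4, 4, 3, 5, 2, 4, 3, 4, 5, 3]
ainsley = [4, 4, 3, 5, 3, 4, 3, 4, 4, 3]

kappa = cohens_kappa(natalia, ainsley)
print(round(kappa, 3))  # 0.697
```

In this invented batch the raters agree on 8 of 10 memos (80%), yet kappa is lower because both raters use the scores 3 and 4 so often that some of that agreement would be expected by chance.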
Case Study’s Analytic Rubrics
The five analytic rubrics were used to assess the students’ competency in producing five technical writing genres: (1) the job letter, (2) résumé, (3) claim letter, (4) solicitation email, and (5) instruction set. All rubrics were organized by five achievement scales (to represent the standard A–F grade spread), which ranged from “superior” (5) to “incompetent” (1). I chose a five-point scale to create a comparable representation of the instructors’ grades. Each document was evaluated independently by two raters.
The following discussion is on several rubrics that are also included in Appendices B–F. (Appendices B, C, D, E, and F are reproduced as supplementary material on this site.) Their inclusion is an effort to relay the process of creating quality assessment tools that can be used by instructors, program administrators, and researchers. These rubrics were created for a specific purpose and course, following the literature that states assessment must be site specific [47], [49].
Establishing Validity: The criteria for this study’s analytic rubrics were taken from the standards and practices outlined in the two textbooks used by the students [50], [51]. These textbooks’ accompanying instructor manuals, which provided annotated models of the technical writing genres, were also consulted [52], [53].
I included the four raters in the design process of these rubrics. Collaboration is cited as an important element in rubric design, which can create a “ripple effect” that improves the quality of the assessment [38].
The analytic rubrics proved tedious to construct because the categories had to assess mutually exclusive traits so students would not be rewarded or penalized for the same issue across multiple categories. For example, I used the Attention, Interest, Desire, Action (AIDA) model to construct four of the solicitation email’s six categories because of its defined rhetorical moves and previous use in technical communication research [54]. This model begins by grabbing the reader’s attention. A description of the product’s benefits holds the readers’ interest, arousing their desire to take action and purchase the product [54, p. 475]. However, overlap between these rhetorical moves exists. Does a 10% reduction in membership cost, for example, pique the readers’ interest or arouse their desire to join the solicited organization? Raters asked similar questions while constructing this rubric’s categories and eventually agreed that the interest category included statements that developed the reader’s interest, while the desire category included statements that anticipated the barriers readers might have for joining the organization, such as time or money.
In addition, some of the genres assessed in this study, such as the résumé, are complex, and the rhetorical situation dictates which combination of the standard formal features appears in the document. Table II illustrates how detailed the education category became for the résumé rubric. While this section requires certain specifics (college name, degree, graduation or expected graduation date), including additional information can also impact the overall effectiveness of the document. Including a low GPA (below or at 3.0) could hinder an applicant’s success. Likewise, a discussion of relevant coursework is effective if employers understand the specific projects the applicant completed in these courses.
Table II. Final Education Category of the Analytic Résumé Rubric.
| Category | 5: Superior | 4: Strong | 3: Competent | 2: Weak | 1: Incompetent |
|---|---|---|---|---|---|
| Education | Section placed appropriately (see job listing); Includes college name, degree, and graduation date; If GPA is included, it is above a 3.0; Includes no high school information; No mechanical errors or If relevant coursework section is provided, discussion includes course number, title, and a specific description of each course’s content relevant to the job’s duties with no mechanical errors | Section placed appropriately; Includes college name, degree, and graduation date; If GPA is included, it is above a 3.0; Includes no high school information; Contains 1 mechanical error or If relevant coursework section is provided, discussion includes course number, title, and a specific description of each course’s content relevant to the job’s duties or section includes 1 mechanical error | Section placed appropriately; Student may forget to include graduation date but college name and degree are included; If GPA is included, it is above a 3.0; High school information may be listed but only takes up 1-2 lines or includes 2 mechanical errors or If relevant coursework section is provided, discussion at least includes course title and a specific description of each course’s content relevant to the job’s duties or section includes 2 mechanical errors | Section not placed appropriately or Student forgets to include college name or degree or includes GPA lower than a 3.0 or includes high school information that takes up 3+ lines of information or includes 3+ mechanical errors or If relevant coursework section is provided, discussion includes course title but a vague or general description of each course’s content relevant to the job’s duties or no description of course content is included or section includes 3+ mechanical errors | Includes no education information and education is required for the position (see job listing) |
The rubrics created for genres, such as the claim letter, were easier to construct because of the standardization of the business letter as well as the required traits of the claim itself (e.g., a description of the claim, a request for action).
Consistent, parallel language was also used across the five rubrics to apply a standard evaluative measure and to promote inter-rater agreement. In general, a “superior” (5) score indicated that the student fulfilled the task, and a “strong” (4) score indicated that the student fulfilled the task, but included a minor error. A “competent” (3) score described writing that was “wordy but comprehensible,” while a “weak” (2) score included “muddled” and (or) “confusing” language. The “incompetent” (1) score indicated that the required trait was absent from the document.
Finally, indicators describing each achievement level were tailored to the genre’s specific categories to enhance validity and decrease rater interpretation [24]. Table III exemplifies some indicators used to describe the achievement scales of all rubrics. For example, the sections of the salutation/closing category of the claim letter were written based on the recommendation to address correspondence to a person instead of an organization or with a generic salutation like “Dear Sir or Madam.” Including these descriptive indicators also addressed the subjectivity concern of analytic rubrics, which directly impacts the reliability of the assessment.
Table III. The Italicized Indicators for the Salutation/Closing Category of the Claim Letter.
| Category | 5: Superior | 4: Strong | 3: Competent | 2: Weak | 1: Incompetent |
|---|---|---|---|---|---|
| Salutation/Closing | Includes no errors in capitalization or punctuation and directed to a specific or general person (i.e., Customer Service Manager) | Salutation and closing have 1 error in capitalization or punctuation and directed to a specific or general person | Salutation and closing have 2 errors in capitalization or punctuation or directed to a department (i.e., Customer Service) instead of a person | Salutation and closing have 3+ errors in capitalization or punctuation or directed to the organization (i.e., Sony) instead of a person | Salutation and/or closing missing |
Establishing Reliability: The average pairwise percent agreement for all four raters was calculated to monitor reliability during the training process. Cohen’s kappa determined the reliability for the formal data corpus because it is designed to assess two raters. This statistic also accounts for the possibility that a five-point rubric invites associative interpretations because raters may perceive it as a standard grading scale [26]. Cohen’s kappa is recommended if the researcher fears an inflated agreement level because the raters’ observations may fall into a single category.
Involving the raters in the design process seemed to further their understanding of a rubric’s goals and applications. The training process seemed to progress more smoothly on the analytic rubrics; however, the complexity of the criteria resulted in raters spending considerably more time on this assessment than the holistic evaluation. I even restricted raters from “genre swapping”—assessing more than one genre per coding session during the formal assessment—because the training process indicated that it took raters several student samples to adjust to a new rubric. The raters also appeared to norm more consistently with each other than they did with the holistic rubric. Bryant, who struggled on the holistic scoring, learned how to norm himself with the analytic rubrics. Rather than assessing each individual document for every trait, he evaluated a single category on all documents before moving to the next category. He then recoded the first documents of every cluster because he tended to rank them a point lower than the last documents of every cluster.
A noted strength of assessing reliability in rubrics with multiple categories is that the overall composite score adjusts for some rater disagreement [23]. Table IV illustrates the inter-rater breakdown of the six categories in the instruction set rubric, which achieved 82.3% overall agreement. The reliability for this genre ranged from 100% (the title category) to 58% (the graphic aids category). The content of the latter category assessed the purpose of each instructional graphic as well as its relation to the text and its overall visual aesthetic. The safety information category also assessed the placement and visibility of these cautions and/or recommended measures. Perhaps the design and spatial aspects of these two categories accounted for the low inter-rater reliability. Nonetheless, this distribution of category scores is consistent with the range cited in another study [30].
Table IV. Inter-Rater Breakdown of the Instruction Set Rubric by Category.
| Category | Inter-rater reliability (Cohen’s kappa) |
|---|---|
| Title | 100% |
| Introduction | 85.2% |
| Safety information | 66.2% |
| Required steps | 84.0% |
| Conclusion | 88.9% |
| Graphic aids | 58.0% |
Conclusions and Future Directions
The case study was included to demonstrate the process of constructing assessment rubrics. The rubrics were designed in acknowledgement of the assessment purposes and the limitations of the holistic and analytic forms. The timed, impromptu writing test that was holistically scored only assessed the initial equality between the control and treatment groups and showed that the two groups were similar. To account for holistic scoring weaknesses, the single holistic initial writing score was supported with examinations of the students’ GPA, academic major, and university classification. Similarly, the analytic rubrics were designed for a defined purpose and audience and, therefore, are not without limitations. Their detailed criteria are not recommended for student use; instructors looking to enhance their quality of feedback may need to simplify the content with attention to promoting student learning [25]. On the other hand, the relevant literature making these pedagogical claims does not focus on how adult and college learners apply rubrics. Further research could indicate if these students benefit from the depth of analytic rubrics. Similarly, program assessment might require this level of analytic detail, particularly if the rubrics are used longitudinally—either across multiple sections of the same course or across a period of time. These analytic rubrics also allow researchers to run comprehensive statistical analyses that determine performance on multiple writing traits.
The case study also offered insight into the training of inexperienced raters. Though the pedagogical naiveté of my student raters might be considered a limitation, it is important to note that they encountered the same challenges associated with experienced raters, such as separating their own biases from the rubrics’ criteria. However, the recognition of and subsequent accommodations for these biases are of value to researchers and instructors and merit further study. Bryant’s experience, in particular, suggests that student raters have the ability to recognize and modify these biases in ways that lead to stronger intra- and inter-rater agreement. Contextualizing the testing environment and the process of identifying the specimen papers also likely enhanced this agreement level. The description of how the raters interpreted White’s “inadequate” (2) holistic scale differently and then modified it should also indicate that students are able to recognize the qualities that constitute good writing. The fact that the raters reached this level of reflection without the presence of experienced and pedagogically trained raters (with the exception of myself) suggests that perhaps students can articulate and recognize the qualities of good writing more successfully than they can execute them in their own work.
This paper also suggests ways technical communicators can broaden their inquiry of rubric assessment. The accountability standards outlined by the NCLB Act of 2001 are predicted to continue regardless of a change in presidential administrations, indicating that the academy will continue to articulate measurable outcomes for student learning [55]. In fact, the first full generation of NCLB learners could enter the college classroom as early as 2014 and will arguably be more accustomed to the structured assessment that rubrics provide. Approaches to writing assessment will also need to evolve in light of recent technological advances. The writing components of many standardized tests, including the SAT and ACT, are now being scored online by raters primarily trained through an online tutorial. Research investigating the validity and reliability of this training will be needed. Likewise, technical communication programs are offering hybrid or online undergraduate and graduate degrees. Further research will need to explore how assessment standards, especially rubrics, can accommodate this new generation of learners.
The future success of developing valid and reliable assessment rubrics also depends on feedback and participation from managers and business practitioners. The current economic climate has sharpened the reality of data-driven decision-making and created a pressing need for assessment tools that managers can use to evaluate their products and services as well as their company and employee performance levels. Likewise, the inevitable change in students’ literacies will undoubtedly affect the future of workplace communication. The recently expanded assessment mandates also allow practitioners an organic means of connecting to their relevant academic programs, helping to bridge technical communication as an industry and as an academic discipline. The results of this collaboration could produce a standard assessment rubric that evaluates how students transfer writing knowledge into the workplace.
The academic fields of business and technical communication may carry the brunt of the pedagogical and practical assessment burdens because they are charged with preparing their classroom students to become valuable workplace communicators. Allen acknowledges that technical communicators may disagree on how to select appropriate, site-specific criteria [49, pp. 365–66], but she urges faculty to initiate an assessment discussion rather than fall victim to any “mandated outcomes” [56, p. 94]. While rubrics offer a powerful and versatile approach to improving pedagogy, accreditation, and research, technical communicators must consider how they can be constructed in ways that make their data meaningful. This continued, mindful exploration will enhance the field’s knowledge of assessment practices.
References
- A. M. Quinlan, A Complete Guide to Rubrics: Assessment Made Easy for Teachers, K-12. Lanham, MD: Rowman and Littlefield Education, 2006.
- J. Arter and J. McTighe, Scoring Rubrics in the Classroom: Using Performance Criteria for Assessing and Improving Student Performance. Thousand Oaks, CA: Corwin Press, 2001.
- A. Freedman, “Show and tell? The role of explicit teaching in the learning of new genres,” Res. Teaching English, vol. 27, no. 3, pp. 222–249, 1993.
- K. C. Cook, “How much is enough? The assessment of student work in technical communication courses,” Tech. Commun. Quart., vol. 12, no. 1, pp. 47–65, 2003.
- S. Thomas, “The engineering-technical writing connection: A rubric for effective communication,” in Proc. IEEE International Professional Communication Conf., 2005, pp. 517–523.
- P. S. Rogers, “Analytic measures for evaluating managerial writing,” J. Bus. Tech. Commun., vol. 8, no. 4, pp. 380–407, 1994.
- S. S. Taylor, “Comments on lab reports by mechanical engineering teaching assistants: Typical practices and effects of using a grading rubric,” J. Bus. Tech. Commun., vol. 21, no. 4, pp. 402–424, 2007.
- C. S. Johnson, “A decade of research: Assessing change in the technical communication classroom using online portfolios,” J. Tech. Writing Commun., vol. 36, no. 4, pp. 413–431, 2006.
- S. Warnock, “Methods and results of an accreditation-driven writing assessment in a business context,” J. Bus. Tech. Commun., vol. 23, no. 1, pp. 83–107, 2009.
- L. Fraser, K. Harich, J. Norby, K. Brzovic, T. Rizkallah, and D. Loewy, “Diagnostic and value-added assessment of business writing,” Bus. Commun. Quart., vol. 68, no. 3, pp. 290–305, 2005.
- N. W. Coppola, “Setting the discourse community: Tasks and assessment for the new technical communication service course,” Tech. Commun. Quart., vol. 8, no. 3, pp. 249–267, 1999.
- S. Thomas and B. J. McShane, “Skills and literacies for the 21st century: Assessing an undergraduate professional and technical writing program,” Tech. Commun., vol. 54, no. 4, pp. 412–423, 2007.
- R. Abbuhl, “Hedging and boosting in advanced L2 legal writing,” in Educating for Advanced Foreign Language Capacities, H. Byrnes, H. D. Weger-Guntharp, and K. Sprang, Eds. Washington, DC: Georgetown Univ. Press, 2005.
- S. M. Katz, “Assessing a hybrid format,” J. Bus. Tech. Commun., vol. 22, no. 1, pp. 92–110, 2008.
- S. C. Weigle, Assessing Writing. Cambridge, UK: Cambridge Univ. Press, 2002.
- E. M. White, “The scoring of writing portfolios: Phase 2,” College Comp. Commun., vol. 56, no. 4, pp. 581–600, 2005.
- D. Charney, “The validity of using holistic scoring to evaluate writing,” Res. Teaching English, vol. 18, no. 1, pp. 65–81, 1984.
- E. M. White, Assigning, Responding, Evaluating: A Writing Teacher’s Guide. New York: St. Martin’s Press, 1995.
- A. J. Nitko, Educational Assessment of Students, 4th ed. Upper Saddle River, NJ: Merrill Prentice-Hall, 2004.
- B. Huot, “The literature of direct writing assessment: Major concerns and prevailing trends,” Rev. Educ. Res., vol. 60, no. 2, pp. 237–264, 1990.
- B. Huot, “Reliability, validity, and holistic scoring: What we know and what we need to know,” College Comp. Commun., vol. 41, no. 2, pp. 201–213, 1990.
- R. Beach, “Experimental and descriptive research methods in composition,” in Methods and Methodology in Composition Research, G. Kirsch and P. A. Sullivan, Eds. Carbondale, IL: Southern Illinois Univ. Press, 1992.
- L. Hamp-Lyons, “Scoring procedures for ESL contexts,” in Assessing Second Language Writing in Academic Contexts, L. Hamp-Lyons, Ed. Norwood, NJ: Ablex, 1991, pp. 241–276.
- W. J. Popham, Test Better, Teach Better: The Instructional Role of Assessment. Alexandria, VA: Association for Supervision and Curriculum Development, 2003.
- S. M. Brookhart, Grading. Upper Saddle River, NJ: Pearson, 2004.
- R. Wormeli, Fair Isn’t Always Equal: Assessing & Grading in the Differentiated Classroom. Portland, ME: Stenhouse, 2006.
- J. C. Hafner and P. M. Hafner, “Quantitative analysis of the rubric as an assessment tool: An empirical study of student peer-group rating,” Int. J. Sci. Educ., vol. 25, no. 12, pp. 1509–1528, 2003.
- S. C. Weigle, “Using FACETS to model rater training effects,” Language Testing, vol. 15, no. 2, pp. 263–287, 1998.
- S. C. Weigle, “Effects of training on raters of ESL compositions,” Language Testing, vol. 11, no. 2, pp. 197–223, 1994.
- L. Hamp-Lyons and G. Henning, “Communicative writing profiles: An investigation of the transferability of a multiple-trait scoring instrument across ESL writing assessment contexts,” Language Learning, vol. 41, no. 3, pp. 337–373, 1991.
- R. L. Johnson, J. Penny, and B. Gordon, “The relation between score resolution methods and interrater reliability: An empirical study of an analytic scoring rubric,” Appl. Meas. Educ., vol. 13, no. 2, pp. 121–138, 2000.
- M. Sasaki, “Development of an analytic rating scale for Japanese L1 writing,” Language Testing, vol. 16, no. 4, pp. 457–478, 1999.
- H. Sevian and L. Gonsalves, “Analysing how scientists explain their research: A rubric for measuring the effectiveness of scientific explanations,” Int. J. Sci. Educ., vol. 30, no. 11, pp. 1441–1467, 2008.
- D. B. Reeves, Holistic Accountability: Serving Students, Schools, and Community. Thousand Oaks, CA: Corwin, 2002.
- C. Vaughan, “Holistic assessment: What goes on in the raters’ minds,” in Assessing Second Language Writing in Academic Contexts, L. Hamp-Lyons, Ed. Norwood, NJ: Ablex, 1991, pp. 111–126.
- J. J. Pula and B. A. Huot, “A model of background influences on holistic raters,” in Validating Holistic Scoring for Writing Assessment: Theoretical and Empirical Foundations, M. M. Williamson and B. A. Huot, Eds. Cresskill, NJ: Hampton Press, 1993, pp. 237–265.
- H. Saito, “EFL classroom peer assessment: Training effects on rating and commenting,” Language Testing, vol. 25, no. 4, pp. 553–581, 2008.
- G. S. Hanna and P. A. Dettmer, Assessment for Effective Teaching: Using Context-Adaptive Planning. Boston, MA: Pearson, 2004.
- J. Watt and S. van den Berg, Research Methods for Communication Science. Boston, MA: Allyn and Bacon, 1995.
- R. D. Cherry and P. R. Meyer, “Reliability issues in holistic assessment,” in Validating Holistic Scoring for Writing Assessment: Theoretical and Empirical Foundations, M. M. Williamson and B. A. Huot, Eds. Cresskill, NJ: Hampton Press, 1993, pp. 109–141.
- J. Cohen, “A coefficient of agreement for nominal scales,” Educ. Psychol. Meas., vol. 20, no. 1, pp. 37–46, 1960.
- S. E. Stemler. (2004). A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Practical Assessment, Research & Evaluation. [Online]. Available: http://pareonline.net/getvn.asp?v=9&n=4
- G. T. L. Brown, K. Glasswell, and D. Harland, “Accuracy in the scoring of writing: Studies of reliability and validity using a New Zealand writing assessment system,” Assessing Writing, vol. 9, no. 2, pp. 105–121, 2004.
- W. D. Perreault and L. E. Leigh, “Reliability of nominal data based on qualitative judgments,” J. Market. Res., vol. 26, no. 1, pp. 135–148, 1989.
- D. Charney, “Experimental and quasi-experimental research,” in Research in Technical Communication, L. J. Gurak and M. M. Lay, Eds. Westport, CT: Praeger, 2002, pp. 111–130.
- E. M. White, “Holistic scoring: Past triumphs, future challenges,” in Validating Holistic Scoring for Writing Assessment: Theoretical and Empirical Foundations, M. M. Williamson and B. A. Huot, Eds. Cresskill, NJ: Hampton Press, 1993, pp. 79–108.
- D. E. Tanner, Assessing Academic Achievement. Boston, MA: Allyn and Bacon, 2001.
- R. Larsson, “Case survey methodology: Quantitative analysis of patterns across case studies,” Academy Manage. J., vol. 36, no. 6, pp. 1515–1546, 1993.
- J. Allen, “The role(s) of assessment in technical communication: A review of the literature,” Tech. Commun. Quart., vol. 2, no. 4, pp. 365–388, 1993.
- R. Johnson-Sheehan, Technical Communication Today, 2nd ed. New York: Pearson, 2007.
- M. M. Markel, Technical Communication, 8th ed. New York: Bedford/St. Martin’s, 2007.
- R. Johnson-Sheehan and P. L. Lynch, Instructor’s Manual to Accompany Technical Communication Today, 2nd ed. New York: Pearson, 2007.
- Resources for Technical Communication, 2nd ed. New York: Pearson, 2007.
- Z. Yunxia, “Structural moves reflected in English and Chinese sales letters,” Discourse Studies, vol. 2, no. 4, pp. 473–493, 2000.
- M. L. Yell and E. Drasgow, No Child Left Behind: A Guide for Professionals. Upper Saddle River, NJ: Pearson Education, Inc., 2005.
- J. Allen, “The impact of student learning outcomes assessment on technical and professional communication programs,” Tech. Commun. Quart., vol. 13, no. 1, pp. 93–108, 2004.
© 2010 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.