Inter-Rater Reliability Study with Cohen’s Kappa and Fleiss’ Kappa

This report studies inter-rater reliability based on the manual annotation of 80 different articles by 81 different raters who participated in this online rating exercise.
Cohen’s Kappa
Fleiss’ Kappa
Confusion Matrix
Density Distribution
Data Visualization

Steven Wang


August 22, 2022

1. Introduction

This report studies inter-rater reliability based on the manual annotation of 80 different articles by 81 different raters who participated in this online rating exercise.

This report covers an exploratory analysis of the annotation results, followed by quantitative and qualitative analysis using various inter-rater reliability evaluation techniques, and a discussion.

2. Annotation Results Analysis

2.1 Data Import and Explorative Analysis

The raw data is a CSV file that contains the annotation results for 80 articles from 81 different raters, including myself. All articles have been assigned to one of 10 predefined categories by the different raters. (See Appendix 4.1, Category Details, for the details of these 10 categories.)

Figure 1 illustrates the voting heatmap for each article.

Figure 1: Article Category Voting Heatmap

From the heatmap above, we can see that the majority of voters agree on some articles, while opinions on others are quite diverse. We can also see that a large proportion of articles are categorized as Society. Other is the least favored category based on the hierarchical sorting.

Figure 2: The distribution of # of votes for the agreed category

While more than 50% of articles were agreed on by 61 or more voters, 15% of articles did not reach agreement among a majority of the voters. This suggests that these 15% of articles could plausibly be assigned to multiple categories.
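
The agreed category and its vote share can be derived from the raw votes with a simple count. The following is a minimal sketch, not the procedure used for this report: the `votes` dictionary is hypothetical toy data, and breaking ties alphabetically is an assumption, since the report does not say how ties were resolved.

```python
from collections import Counter

# Hypothetical toy votes: article ID -> one category label per rater.
votes = {
    1: ["Biz", "Biz", "Politics", "Biz", "Society"],
    2: ["Society", "Ent.", "Society", "Sci./Tech", "Ent."],
}

def agreed_category(labels):
    """Return (category, vote count, vote share) for the most-voted category."""
    counts = Counter(labels)
    # Highest count wins; ties are broken alphabetically (an assumption).
    category, n_votes = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return category, n_votes, n_votes / len(labels)

for article_id, labels in votes.items():
    category, n_votes, share = agreed_category(labels)
    status = "majority" if share > 0.5 else "no majority"
    print(article_id, category, n_votes, f"{share:.0%}", status)
```

Article 2 in the toy data has no majority, which is the situation described above for the articles that might fit more than one category.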

Figure 3: The ranking of agreed categories based on the # of articles

Interestingly, only 1 article is categorized as War, and no articles are categorized as Other.

The list of articles with agreed categories (gold categories) can be found in Appendix 4.2, Article List with Gold Category.

Taking the agreed categories as the gold categories and my own annotations as the predictions, we can see the differences in Figure 4.

Figure 4: # of Articles in Gold Category vs Predicted Category

It is clear that the main discrepancy is coming from 3 categories: Society, Science and Technology, and Business.

2.2 Confusion Matrix

From the preceding exploratory analysis, we noticed differences between the predicted categories and the gold categories. Table 1 uses a confusion matrix to show how they differ in each category.

Table 1: Confusion Matrix Table of Gold Category vs Predicted Category
Rows: Gold Category; columns: Predicted Category
Gold Biz Ent. Error Health Other Politics Sci./Tech Society Sports War Total
Biz 3 0 0 0 0 3 0 0 0 0 6
Ent. 3 9 0 0 0 0 0 2 0 0 14
Error 0 1 0 0 0 0 0 0 1 0 2
Health 0 0 0 1 0 0 1 0 0 0 2
Other 0 0 0 0 0 0 0 0 0 0 0
Politics 0 0 0 0 0 7 1 2 0 0 10
Sci./Tech 0 0 0 0 0 0 5 2 0 0 7
Society 4 0 0 1 0 2 3 17 0 1 28
Sports 0 3 0 0 0 0 0 0 7 0 10
War 0 0 0 0 0 0 0 0 0 1 1
Total 10 13 0 2 0 12 10 23 8 2 80

Figure 5 shows the confusion matrix as a heatmap, where each value is normalized by the gold category totals.
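
One plausible reading of this normalization is to divide each row of Table 1 by its gold category total. The sketch below assumes that reading; the plotting code behind Figure 5 is not shown in this report, and leaving the all-zero Other row as zeros is likewise an assumption.

```python
categories = ["Biz", "Ent.", "Error", "Health", "Other",
              "Politics", "Sci./Tech", "Society", "Sports", "War"]

# Counts from Table 1: rows are gold categories, columns are predicted categories.
confusion = [
    [3, 0, 0, 0, 0, 3, 0, 0, 0, 0],   # Biz
    [3, 9, 0, 0, 0, 0, 0, 2, 0, 0],   # Ent.
    [0, 1, 0, 0, 0, 0, 0, 0, 1, 0],   # Error
    [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],   # Health
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],   # Other
    [0, 0, 0, 0, 0, 7, 1, 2, 0, 0],   # Politics
    [0, 0, 0, 0, 0, 0, 5, 2, 0, 0],   # Sci./Tech
    [4, 0, 0, 1, 0, 2, 3, 17, 0, 1],  # Society
    [0, 3, 0, 0, 0, 0, 0, 0, 7, 0],   # Sports
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],   # War
]

def normalize_rows(matrix):
    """Divide each row by its total; all-zero rows are left as zeros."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([cell / total if total else 0.0 for cell in row])
    return normalized

row_share = normalize_rows(confusion)
for name, row in zip(categories, row_share):
    print(name, [round(value, 2) for value in row])
```

With this normalization, each diagonal cell equals the Recall of that category.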

Figure 5: Confusion Matrix Heatmap of Gold Category vs Predicted Category

From the confusion matrix heatmap, we can see that while the diagonal of the matrix is mostly hot, the prediction goes astray in certain categories, particularly Business and Health, where opinions differ strongly.

2.3 Prediction Performance Evaluation

So how good is the overall prediction? I will use common performance evaluation metrics to measure the prediction performance.

Since there are multiple categories, I calculate each metric for every category separately and use a weighted average for the overall score. The weight of a category is its proportion of the total number of articles. Let W_i be the weight of category i and let S_i be its evaluation score; the overall average score OS is then calculated as:

OS = \displaystyle\sum_{i=1}^{n}W_i \times S_i

In this report, I will measure 3 common evaluation metrics:

  1. Recall
  2. Precision
  3. F-Score

The Recall is calculated as:

Recall = \frac{TP}{FN + TP}

The Precision is calculated as:

Precision = \frac{TP}{FP + TP}

The F-Score is calculated as:

FScore = \frac{2\times Recall \times Precision}{Recall + Precision}

Here TP, FP and FN are the numbers of true positives, false positives and false negatives for the category.
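
As a rough check, the per-category scores can be recomputed directly from the counts in Table 1. This is a sketch rather than the code used for Table 2: the helper name `category_scores` is hypothetical, and it returns `None` only when a denominator is zero, so its Recall for Error comes out as 0 (0 of 2 articles) where Table 2 reports NA.

```python
categories = ["Biz", "Ent.", "Error", "Health", "Other",
              "Politics", "Sci./Tech", "Society", "Sports", "War"]

# Counts from Table 1: rows are gold categories, columns are predicted categories.
confusion = [
    [3, 0, 0, 0, 0, 3, 0, 0, 0, 0],
    [3, 9, 0, 0, 0, 0, 0, 2, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 7, 1, 2, 0, 0],
    [0, 0, 0, 0, 0, 0, 5, 2, 0, 0],
    [4, 0, 0, 1, 0, 2, 3, 17, 0, 1],
    [0, 3, 0, 0, 0, 0, 0, 0, 7, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
]

def category_scores(matrix, k):
    """Return (recall, precision, f_score) for category index k."""
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                  # gold k, predicted as something else
    fp = sum(row[k] for row in matrix) - tp   # predicted k, gold is something else
    recall = tp / (tp + fn) if tp + fn else None
    precision = tp / (tp + fp) if tp + fp else None
    if recall is None or precision is None or recall + precision == 0:
        return recall, precision, None
    return recall, precision, 2 * recall * precision / (recall + precision)

def as_percent(value):
    """Format a score as a percentage, or NA when it is undefined."""
    return "NA" if value is None else f"{value:.2%}"

for k, name in enumerate(categories):
    print(name, *[as_percent(score) for score in category_scores(confusion, k)])
```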

Table 2 summarizes the performance of my rating (prediction).

Table 2: Performance evaluation summary for my prediction
measure Biz Ent. Error Health Other Politics Sci./Tech Society Sports War Overall
Recall 50% 64.29% NA 50% NA 70% 71.43% 60.71% 70% 100% 62.71%
Precision 30% 69.23% NA 50% NA 58.33% 50% 73.91% 87.5% 50% 62.15%
F-Score 37.5% 66.67% NA 50% NA 63.64% 58.82% 66.67% 77.78% 66.67% 61.4%

Figure 6: Category and Overall performance metric score

Based on the metric scores, my best-predicted category is Sports, with an F-Score of 77.78%. My predictions for Business articles are the weakest, with an F-Score of 37.5%.

Overall, I achieved a weighted average F-Score of 61.4% (Table 2). I have a few questions about this low score to be discussed in the discussion section.

2.4 Cohen’s Kappa

Cohen’s Kappa is a widely used statistic for measuring the inter-rater agreement between two raters, or between a set of test results and the ground truth (Cohen’s Kappa, Wikipedia 2018).

Cohen’s Kappa coefficient \kappa is calculated as:

\kappa = \frac{P_{(a)} - P_{(e)}}{1 - P_{(e)}}

where P_{(a)} is the observed agreement and P_{(e)} is the expected probability of chance agreement. A confusion matrix such as Table 1 is a good starting point for illustrating how to calculate P_{(e)}. Let i index the rows and j index the columns of Table 1. Then:

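P_{(e)} = \frac{1}{N^2}\sum_{k=1}^{n}\left( \sum_{i=1}^{n}Row_{ik}\times\sum_{j=1}^{n}Col_{kj} \right)

where N is the total number of observations and n is the number of categories. For each category k, the first inner sum is the total of column k and the second is the total of row k.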
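
The calculation can be reproduced by hand from the counts in Table 1. The sketch below is an independent check in plain Python, not the irr implementation; the function name `cohen_kappa` is chosen here for illustration.

```python
# Counts from Table 1: rows are gold categories, columns are predicted categories.
confusion = [
    [3, 0, 0, 0, 0, 3, 0, 0, 0, 0],
    [3, 9, 0, 0, 0, 0, 0, 2, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 7, 1, 2, 0, 0],
    [0, 0, 0, 0, 0, 0, 5, 2, 0, 0],
    [4, 0, 0, 1, 0, 2, 3, 17, 0, 1],
    [0, 3, 0, 0, 0, 0, 0, 0, 7, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
]

def cohen_kappa(matrix):
    """Cohen's kappa from a square confusion matrix of counts."""
    n_obs = sum(sum(row) for row in matrix)
    # Observed agreement P(a): share of observations on the diagonal.
    observed = sum(matrix[k][k] for k in range(len(matrix))) / n_obs
    # Chance agreement P(e): sum over categories of row total * column total / N^2.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n_obs ** 2
    return (observed - expected) / (1 - expected)

kappa = cohen_kappa(confusion)
print(round(kappa, 4))  # 0.5418
```

Here P_{(a)} = 50/80 = 0.625 and P_{(e)} = 1162/6400, which is about 0.1816.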

According to Wikipedia (Wikipedia 2018), Cohen’s Kappa coefficient might be interpreted as follows:

  1. values < 0 indicate no agreement
  2. 0-0.20 indicates none to slight agreement
  3. 0.21-0.40 indicates fair agreement
  4. 0.41-0.60 indicates moderate agreement
  5. 0.61-0.80 indicates substantial agreement
  6. 0.81-1 indicates almost perfect agreement

However, there is no clear or sound evidence to support this interpretation. In some cases these guidelines could be problematic, so they should be used cautiously.
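
For convenience, the bands above can be turned into a small lookup. This is only a sketch: the function name `interpret_kappa` is hypothetical, and assigning values that fall between two listed bands (for example 0.205) to the lower band is an assumption, since the list leaves those gaps undefined.

```python
def interpret_kappa(kappa):
    """Map a kappa value to the verbal label from the list above."""
    if kappa < 0:
        return "no agreement"
    # (upper bound, label) pairs in ascending order.
    bands = [
        (0.20, "none to slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")

for value in (-0.1, 0.15, 0.5418, 0.85):
    print(value, interpret_kappa(value))
```

The same caveat applies to the code as to the list: the labels are a convention, not a statistical test.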

Instead of calculating \kappa by hand, in this report I used the R package irr (Gamer, Lemon & Singh 2012) to perform the calculation. The \kappa value of my prediction against the gold category is 0.5418, which suggests moderate agreement between my annotation and the gold category.

How do the other raters perform, and where does my \kappa value sit within the rater group? For Figure 7 below, I computed every rater’s \kappa value against the gold category.

Figure 7: The distribution of all raters’ Kappa (against gold category)

It is clear that my Kappa is well below the average.
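
A per-rater comparison of this kind can be sketched as follows. The report used the R package irr for the actual values; the code below is a plain-Python stand-in. The `gold` and `raters` data are hypothetical toy values, and returning NaN when chance agreement equals 1 is an assumption.

```python
from collections import Counter

def cohen_kappa_labels(gold, rated):
    """Cohen's kappa between two equally long lists of category labels."""
    if not gold or len(gold) != len(rated):
        raise ValueError("label lists must be non-empty and of equal length")
    n_obs = len(gold)
    observed = sum(g == r for g, r in zip(gold, rated)) / n_obs
    gold_counts, rated_counts = Counter(gold), Counter(rated)
    # Chance agreement: product of the two marginal counts per category.
    expected = sum(gold_counts[c] * rated_counts[c] for c in gold_counts) / n_obs ** 2
    if expected == 1:
        return float("nan")  # kappa is undefined when only one category is used
    return (observed - expected) / (1 - expected)

# Hypothetical toy data: four articles, three raters.
gold = ["Society", "Society", "Sports", "Sports"]
raters = {
    "rater_1": ["Society", "Society", "Sports", "Sports"],
    "rater_2": ["Society", "Society", "Sports", "Society"],
    "rater_3": ["Sports", "Sports", "Society", "Society"],
}

kappas = {name: cohen_kappa_labels(gold, labels) for name, labels in raters.items()}
mean_kappa = sum(kappas.values()) / len(kappas)
for name, value in kappas.items():
    position = "above" if value > mean_kappa else "at or below"
    print(name, round(value, 4), position, "the group mean")
```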

2.5 Fleiss’ Kappa

While Cohen’s Kappa only measures the agreement between two raters, Fleiss’ Kappa can measure the inter-rater reliability among any number of raters assigning categorical ratings (Fleiss’ Kappa, Wikipedia 2018).

Fleiss’ Kappa \kappa can be defined as:

\kappa = \frac{\bar{P} - \bar{P}_{(e)}}{1 - \bar{P}_{(e)}}

where 1 - \bar{P}_{(e)} is the degree of agreement attainable above chance, and \bar{P} - \bar{P}_{(e)} is the degree of agreement actually achieved above chance.

Since Fleiss’ Kappa measures agreement among multiple raters, there is no ground truth category against which to measure the raters’ performance. The equation used to calculate Fleiss’ Kappa \kappa has to take the number of raters into consideration.

Let m be the number of raters, n the number of categories and N the total number of subjects (observations). As in the Cohen’s Kappa section, let i index the rows (subjects) and j index the columns (categories), which gives an N \times n matrix of vote counts. Then:

The proportion of all ratings assigned to a particular category, say the j-th column, denoted p_{j}, is:

p_{j} = \frac{1}{Nm}\sum_{i=1}^{N}Col_{ij}

Let P_{i} be the raters’ agreement on the i-th subject. Since each subject receives m ratings, it is calculated as:

P_{i} = \frac{1}{m(m-1)}\sum_{j=1}^{n}Row_{ij}(Row_{ij} - 1) \\ = \frac{1}{m(m-1)}\sum_{j=1}^{n}(Row_{ij}^2 - Row_{ij}) \\ = \frac{1}{m(m-1)} \left[ \left( \sum_{j=1}^{n}Row_{ij}^2 \right) - m \right]

With P_{i} and p_{j}, \bar{P} and \bar{P}_{(e)} are calculated as:

\bar{P} = \frac{1}{N}\sum_{i=1}^{N}P_{i} \\ = \frac{1}{Nm(m-1)} \left( \sum_{i=1}^{N}\sum_{j=1}^{n}Row_{ij}^2 - Nm \right)

\bar{P}_{(e)} = \sum_{j=1}^{n}p_{j}^2

With all these parameters available, we can use the equation above to calculate Fleiss’ Kappa \kappa. The \kappa value is interpreted in the same way as described in the Cohen’s Kappa section.
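
The steps above can be put together in a few lines. The sketch below is a plain-Python illustration, not the irr implementation, and the `toy` count matrix (3 subjects, 3 raters, 2 categories) is hypothetical data chosen so that the result is easy to verify by hand.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = number of raters assigning subject i to category j."""
    n_subjects = len(counts)
    m = sum(counts[0])  # number of raters per subject
    if any(sum(row) != m for row in counts):
        raise ValueError("every subject needs the same number of ratings")
    n_categories = len(counts[0])
    # p_j: share of all ratings that fall into category j.
    p = [sum(row[j] for row in counts) / (n_subjects * m) for j in range(n_categories)]
    # P_i: agreement among the m raters on subject i.
    agreement = [sum(c * (c - 1) for c in row) / (m * (m - 1)) for row in counts]
    p_bar = sum(agreement) / n_subjects
    p_e = sum(p_j ** 2 for p_j in p)
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical toy data: 3 subjects, 3 raters, 2 categories.
toy = [
    [3, 0],
    [2, 1],
    [0, 3],
]
print(round(fleiss_kappa(toy), 4))  # 0.55
```

For the toy data, \bar{P} = 7/9 and \bar{P}_{(e)} = 41/81, so \kappa = (63/81 - 41/81) / (40/81) = 0.55.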

In this report I used the R package irr (Gamer, Lemon & Singh 2012) to perform the Fleiss’ Kappa calculation. For the 81 raters in this data set, \kappa is 0.5588.

3. References

  1. Gamer, M., Lemon, J., Fellows, I. and Singh, P. (2012). irr: Various Coefficients of Interrater Reliability and Agreement. R package version 0.84.

  2. McHugh, M. L. (2012). Interrater reliability: the kappa statistic. Biochemia Medica, 22(3), 276-282.

  3. Cohen’s Kappa in Plain English, accessed 1st April 2018.

  4. Wikipedia (2018). Cohen’s Kappa.

  5. Wikipedia (2018). Fleiss’ Kappa.

4. Appendix

4.1 Category Details

Category Category_Short Notes
Business Biz Includes: products, marketing, corporate events, business trends
Entertainment Ent. Includes: celebrity, film, TV, popular events, fashion. Excludes: sports, politics.
Health Health Includes: health advice, medical research
Politics Politics Includes: parliamentary debate; elections; diplomacy. Excludes: war and terrorism
Science and Technology Sci./Tech Includes: scientific research, health research, new technology, environment.
Society Society Includes: crime, tragedy, lifestyle, human interest, popular health. Excludes: politics, entertainment, war and terror
Sports Sports Includes: matches; sports gambling; team politics and personnel. Excludes: fashion.
War War Includes: physical and virtual fighting between large entities, terrorism, genocide
Other Other Try to avoid this! Don't use it if the article could fit in multiple categories and you can't decide which.
Error Error Includes: Articles not in English. Articles with no content text in the lead.

4.2 Article List with Gold Category

ID Gold_Category Article
1 Biz 'Impossible' for Saudi to reduce oil output: minister
2 Ent. 'Paddington' director on the 'sad' departure of Colin Firth |
3 Society 'Parents these days' are judged too harshly
4 Biz 'Ship Your Enemies Glitter' business sells for $85K
5 Society 19 Things Scottish People Miss When They Move To London
6 Ent. 30 Items We Just Know You're Going To Want For Fall
7 Society 36 Hours in Strasbourg, France
8 Politics 72-hour wait for Missouri abortions takes effect next month
9 Health After my cancer diagnosis, my anxiety disappeared. Now I'll do anything to keep this body | Clare Atkinson
10 Sci./Tech Anticipating the rise of the machine
11 Sci./Tech Apple CarPlay infotainment system has lots of potential
12 Society Australia's best private pools on the market
13 Politics Azerbaijan prosecutes a prominent human rights defenderon absurd charges.
14 Ent. Book quiz: How well do you know literature? Take the demonic Wigtown Book Festival pub quiz - Telegraph
15 Society Campaigners call for online retailer to remove 'unacceptable' baby clothes branded with 'sexualised and porn-inspired imagery'
16 Error Coal is 'good for humanity', says Tony Abbott at mine opening
17 Sci./Tech Coming Picture Of A Supermassive Black Hole Will Be The 'Image Of The Century'
18 Politics Cong escalates attack on govt through social media - The Times of India
19 Sports Cricket: Associations set plans for bright future | Otago Daily Times Online News : Otago, South...
20 Society Dad crushed to death in lift as a result of 'safety breaches' - court hears -
21 Society Danger Beneath: 'Fracking' Gas, Oil Pipes Threaten Rural Residents
22 Sports Devastated as captain, Tendulkar wanted to quit: Autobiography - The Times of India
23 Society Dewani cop didn't ace it
24 Society DU to go back to reevaluation system : News
25 Ent. DUBS Bring Down The Noise In Night Clubs
26 Ent. Eastenders: Carter clan set to implode in 2015
27 Ent. EW discusses: Does 'Super Smash Bros.' succeed on the 3DS? |
28 Society Ex-NYPD cop slams 'black brunch' with gun-toting selfie
29 Society EXCLUSIVE: MH370 will 'NEVER be found' as 'there's no sensible theory to where it is'
30 Society Father fatally shoots man in daughter's room in Northeast Philadelphia
31 Society FBI investigating explosion outside NAACP office in Colorado -
32 Society From the flower pot to Pantibar: Enda's gay support evolution
33 Health Gene-Therapy Hope for 'Bubble Boy' Syndrome
34 Society Girl killed, 11 injured in fiery crash on 60 Freeway in South El Monte, all lanes closed
35 Biz Google 'committed' to Ireland despite plans to phase out controversial tax rules -
36 Biz Hamish McRae: Lessons for all as America bounces back
37 Society How Britain Made James Foley's Killer
38 Society How to find love in 2015 without it costing the earth
39 Sports
40 Sports Ice hockey boy gets floored
41 Politics If US had a patent law like ours, they would discover many more drugs: Anand Grover - The Times of India
42 Sports Indian women beat Japan, win bronze in Asiad hockey
43 Ent. Interpol's Second Act: Inside the Gloom Kings' Return
44 Ent. INXS relive one of their most successful years ever and their toughest
45 Politics Iraq Reaches Major Oil Deal With Kurds
46 Society Kane's account of sting draws increasing fire
47 Society Kate Malonyay murdered by Elliot Coulson after uncovering web of lies: coroner
48 Sports Leading golfer Ian Poulter accused of being out of touch with reality
49 Sports Louis van Gaal hands out Christmas presents to fans on Boxing Day
50 Society Met policeman cleared after kicking mother tending to her child in hospital
51 Ent. Musicians Besides Frank Who Have Preferred Masks Onstage
52 Sports Nine to Know: Red Sox Stats Pack Good at Losing Edition
53 Politics Obama proposes consumer data protections, even as a military Twitter account is hacked
54 Politics Outrage as EU politicians demand extra PS8BILLION from taxpayers for Brussels budget
55 Sci./Tech Over half of all gadgets used at Christmas were made by Apple
56 Ent. Paul Hollywood & his wife Alexandra put on a united front at the NTAs
57 Society Philippine storm death toll rises to six - The Times of India
58 Society Poppy Appeal: Amputee soldiers join Joss Stone at Cenotaph vigil
59 Biz Potentially interesting
60 Society Prosecutors seek to have McDonnell jailed during appeal
61 Sports Radamel Falcao: Colombian striker 'really wants to stay' at Manchester United - but it depends on how much he plays
62 War Russian armoured vehicles and military trucks cross border into Ukraine - Telegraph
63 Sports Ryan Giggs' brother says footballer's eight-year affair with his wife 'demolished' his family
64 Ent. Snapper star dies
65 Sci./Tech TeradataVoice: The Catch-22 In Cyber Defense: More Isn't Always Better
66 Sci./Tech The Government's bid to boost mobile coverage could easily backfire - Telegraph
67 Society The unborn babies who are already smiling for the camera
68 Error The weirdest moments from 30 years of 'Ninja Turtles'
69 Society The words that ended my marriage
70 Sci./Tech Tim Cook puts personal touch on iPhone 6 launch - The Times of India
71 Ent. TOWIE's Pascal Craymer nearly spills out of plunging red dress
72 Ent. Translators battle bad subtitles that lead to poor perception of Hong Kong films
73 Politics Vic premier has lost the plot says Geoff Shaw after surviving expulsion
74 Politics Victorian Liberals sack candidate over role for porn star in campaign event
75 Society We put your tax return hell questions to HMRC, and this is what they said
76 Biz We should be banking on a brighter future for Scotland
77 Politics What has the Human Rights Act done for you?
78 Society What is a sinkhole?
79 Ent. Win one of 10 luxury holidays at the Telegraph Cruise Show - Telegraph
80 Society Woman dies after airport scanner interferes with her pacemaker - Telegraph