Inter-Rater Reliability Study with Cohen’s Kappa and Fleiss’ Kappa

1. Introduction
2. Annotation Results Analysis
3. References
4. Appendix
- 4.1 Category Details
- 4.2 Article List with Gold Category

1. Introduction

This report is an inter-rater reliability study based on the manual annotation of 80 articles by 81 raters who participated in an online rating exercise.

It covers an exploratory analysis of the annotation results, followed by quantitative and qualitative analysis using several inter-rater reliability measures, and a discussion.

2. Annotation Results Analysis

2.1 Data Import and Explorative Analysis

The raw data is a CSV file containing the annotations of 80 articles by 81 raters, including myself. Each article was annotated with one of 10 predefined categories by the raters. (See Appendix 4.1, Category Details, for the details of these 10 categories.)

Figure 1, Article Category Voting Heatmap

From the heatmap above, we can see that the majority of voters agree on some articles, while opinions on others are quite diverse. We can also see that a large proportion of articles are categorized as Society. Other is the least favored category based on the hierarchical sorting.

Figure 2, The distribution of # of votes for the agreed category

While more than 50% of the articles were agreed on by 61 or more voters, 15% of the articles did not receive a majority vote for any category. This suggests that these articles could plausibly be assigned to multiple categories.

Figure 3, The ranking of agreed categories based on the # of articles

Interestingly, only 1 article is categorized as War, and no articles are categorized as Other.

The list of articles with agreed categories (gold category) can be found in Appendix 4.2, Article List with Gold Category.
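As a rough illustration of how an agreed (gold) category could be derived from the raw votes, here is a minimal Python sketch. It assumes the gold category is simply the most-voted category for an article and that "majority" means more than half of the votes; the function name `gold_category` and the vote lists are hypothetical and are not the study data.

```python
from collections import Counter

def gold_category(votes):
    """Return (category, vote_count, has_majority) for one article's votes."""
    counts = Counter(votes)
    # most_common(1) gives the category with the highest vote count.
    category, top = counts.most_common(1)[0]
    return category, top, top > len(votes) / 2

# Hypothetical votes from 5 raters for two articles (not the study data).
clear = ["Sports", "Sports", "Sports", "Ent.", "Sports"]
split = ["Society", "Biz", "Society", "Politics", "Sci./Tech"]

print(gold_category(clear))  # ('Sports', 4, True)
print(gold_category(split))  # ('Society', 2, False)
```

The second article shows the situation described above: a category can win the vote without being backed by a majority of raters.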

Taking the agreed categories as the gold category and my own annotations as the prediction, we can see the difference in Figure 4.

Figure 4, # of Articles in Gold Category vs Predicted Category

It is clear that the main discrepancy is coming from 3 categories: Society, Science and Technology, and Business.

2.2 Confusion Matrix

From the preceding exploratory analysis, we noticed differences between the predicted categories and the gold categories. Table 1 presents a confusion matrix showing how they differ in each category.

Table 1, Confusion Matrix Table of Gold Category vs Predicted Category

Gold (rows) / Predicted (columns)	Biz	Ent.	Error	Health	Other	Politics	Sci./Tech	Society	Sports	War	Total
Biz	3	0	0	0	0	3	0	0	0	0	6
Ent.	3	9	0	0	0	0	0	2	0	0	14
Error	0	1	0	0	0	0	0	0	1	0	2
Health	0	0	0	1	0	0	1	0	0	0	2
Other	0	0	0	0	0	0	0	0	0	0	0
Politics	0	0	0	0	0	7	1	2	0	0	10
Sci./Tech	0	0	0	0	0	0	5	2	0	0	7
Society	4	0	0	1	0	2	3	17	0	1	28
Sports	0	3	0	0	0	0	0	0	7	0	10
War	0	0	0	0	0	0	0	0	0	1	1
Total	10	13	0	2	0	12	10	23	8	2	80

A confusion matrix heatmap is created as in Figure 5, where the value is normalized based on the gold category values.

Figure 5, Confusion Matrix Heatmap of Gold Category vs Predicted Category

From the confusion matrix heatmap, we can see that while the diagonal of the matrix is mostly heated, the prediction goes astray in certain categories, particularly Business and Health, where opinions differ strongly.
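The normalization behind such a heatmap can be sketched in a few lines of Python. This assumes that "normalized based on the gold category values" means dividing each row of Table 1 by its gold total; the helper name `normalize_rows` is hypothetical, and only three rows of Table 1 are used to keep the example short.

```python
def normalize_rows(matrix):
    """Divide each row by its total; rows that sum to zero stay all zeros."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([value / total if total else 0.0 for value in row])
    return result

# Gold rows for Biz, Health and Other, copied from Table 1.
rows = [
    [3, 0, 0, 0, 0, 3, 0, 0, 0, 0],  # Biz
    [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],  # Health
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # Other (no gold articles)
]

normalized = normalize_rows(rows)
print(normalized[0])  # half of the gold Biz articles were predicted as Politics
```

With this normalization the diagonal cell of each row equals that category's recall, which is why Business and Health both show 0.5 on the diagonal.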

2.3 Prediction Performance Evaluation

So how good is the overall prediction? I will use common performance evaluation metrics to measure the prediction performance.

Since there are multiple categories, I calculate each metric for every category separately and use a weighted average for the overall score. The weight of a category is its proportion of the total number of articles. For instance, let $W_i$ be the weight of category $i$ and $S_i$ be its evaluation score; the overall average score $OS$ is then calculated as:

\[OS = \displaystyle\sum_{i=1}^{n}W_i \times S_i\]

In this report, I will measure 3 common evaluation metrics:

- Recall
- Precision
- F-Score

Recall is calculated as:

\[Recall = \frac{TP}{FN + TP}\]

Precision is calculated as:

\[Precision = \frac{TP}{FP + TP}\]

The F-Score is calculated as:

\[{FScore} = \frac{2\times Recall \times Precision}{Recall + Precision}\]

Here $TP$, $FN$ and $FP$ are the numbers of true positives, false negatives and false positives for a given category.
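To make the formulas concrete, the following Python sketch computes the three metrics per category from the counts in Table 1, plus a weighted overall score. It is only an illustration, not the code used for this report. The function names are hypothetical, the weights are assumed to be the gold-category proportions, and undefined scores (zero denominator) are assumed to be skipped in the overall average.

```python
LABELS = ["Biz", "Ent.", "Error", "Health", "Other",
          "Politics", "Sci./Tech", "Society", "Sports", "War"]

# Counts copied from Table 1: rows are gold categories, columns are predictions.
CM = [
    [3, 0, 0, 0, 0, 3, 0, 0, 0, 0],
    [3, 9, 0, 0, 0, 0, 0, 2, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 7, 1, 2, 0, 0],
    [0, 0, 0, 0, 0, 0, 5, 2, 0, 0],
    [4, 0, 0, 1, 0, 2, 3, 17, 0, 1],
    [0, 3, 0, 0, 0, 0, 0, 0, 7, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
]

def ratio(num, den):
    """Return num / den, or None when the denominator is zero (NA)."""
    return num / den if den else None

def category_scores(cm):
    """Return one (recall, precision, f_score) tuple per category."""
    size = len(cm)
    scores = []
    for k in range(size):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                          # gold k, predicted otherwise
        fp = sum(cm[i][k] for i in range(size)) - tp  # predicted k, gold otherwise
        recall = ratio(tp, tp + fn)
        precision = ratio(tp, tp + fp)
        if recall is None or precision is None or recall + precision == 0:
            f_score = None
        else:
            f_score = 2 * recall * precision / (recall + precision)
        scores.append((recall, precision, f_score))
    return scores

def overall_score(values, weights):
    """OS = sum(W_i * S_i); undefined scores are skipped (an assumption)."""
    return sum(w * v for v, w in zip(values, weights) if v is not None)

scores = category_scores(CM)
total = sum(sum(row) for row in CM)
weights = [sum(row) / total for row in CM]  # W_i: assumed share of gold articles

for label, (recall, precision, f_score) in zip(LABELS, scores):
    print(label, recall, precision, f_score)

overall_recall = overall_score([s[0] for s in scores], weights)
print("overall recall:", overall_recall)
```

How categories with undefined scores enter the weighted average is a design choice. Under the assumptions above, Error gets a recall of 0 (two gold articles, none predicted correctly) and an undefined precision, so overall values from this sketch need not match figures produced with a different treatment of such categories.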

Table 2 summarizes the performance of my ratings (the prediction).

Table 2, Performance evaluation summary for my prediction

measure	Biz	Ent.	Error	Health	Other	Politics	Sci./Tech	Society	Sports	War	Overall
Recall	50%	64.29%	NA	50%	NA	70%	71.43%	60.71%	70%	100%	62.71%
Precision	30%	69.23%	NA	50%	NA	58.33%	50%	73.91%	87.5%	50%	62.15%
Fscore	37.5%	66.67%	NA	50%	NA	63.64%	58.82%	66.67%	77.78%	66.67%	61.4%

Figure 6, Category and Overall performance metric score

Based on the metric scores, my best-predicted category is Sports (F-Score 77.78%), while my predictions for Business articles are the weakest (F-Score 37.5%).

Overall, I achieved a weighted-average Recall of 62.71% and F-Score of 61.4% (Table 2). I have a few questions regarding the low score to be discussed in the discussion section.

2.4 Cohen’s Kappa

Cohen’s Kappa is a very common statistical measure of the inter-rater agreement between two raters, or between a set of test results and the ground truth (Cohen’s Kappa, Wikipedia 2018).

Cohen’s Kappa coefficient $\kappa$ is calculated as:

\[\kappa = \frac{P_{(a)} - P_{(e)}}{1 - P_{(e)}}\]

where $P_{(a)}$ is the observed agreement and $P_{(e)}$ is the agreement expected by chance. A confusion matrix such as Table 1 is a good starting point for illustrating how $P_{(e)}$ is calculated. If $i$ indexes the rows and $j$ indexes the columns of Table 1, then:

\[P_{(e)} = \frac{1}{N^2}\sum_{k=1}^{n}( \displaystyle\sum_{i=1}^{n}Row_{ik}\times\sum_{j=1}^{n}Col_{kj} )\]

where $N$ is the total number of observations and $n$ is the number of categories. In other words, for each category $k$ the total of column $k$ is multiplied by the total of row $k$, and the products are summed.

Based on Wikipedia (Wikipedia 2018), Cohen’s Kappa coefficient might be interpreted as:

- < 0: no agreement
- 0-0.20: none to slight agreement
- 0.21-0.40: fair agreement
- 0.41-0.60: moderate agreement
- 0.61-0.80: substantial agreement
- 0.81-1: almost perfect agreement

However, there is no clear evidence to support this interpretation. In some cases these guidelines can be problematic, so they should be used cautiously.

Instead of calculating $\kappa$ by hand, in this report I used the R package irr (Gamer, Lemon & Singh 2012) to perform the calculation. The $\kappa$ value of my prediction against the gold category is 0.5418, which suggests moderate agreement between my annotations and the gold categories.
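As a cross-check of the formulas in this section, the Python sketch below computes $\kappa$ directly from the counts in Table 1. It is not the irr call used for this report; the function name `cohens_kappa` is hypothetical.

```python
# Counts copied from Table 1: rows are gold categories, columns are predictions.
CM = [
    [3, 0, 0, 0, 0, 3, 0, 0, 0, 0],
    [3, 9, 0, 0, 0, 0, 0, 2, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 7, 1, 2, 0, 0],
    [0, 0, 0, 0, 0, 0, 5, 2, 0, 0],
    [4, 0, 0, 1, 0, 2, 3, 17, 0, 1],
    [0, 3, 0, 0, 0, 0, 0, 0, 7, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
]

def cohens_kappa(cm):
    """Cohen's kappa from a square confusion matrix of counts."""
    size = len(cm)
    n_obs = sum(sum(row) for row in cm)
    # P(a): share of observations on the diagonal (observed agreement).
    p_a = sum(cm[k][k] for k in range(size)) / n_obs
    # P(e): sum over categories of row total times column total, over N^2.
    row_totals = [sum(row) for row in cm]
    col_totals = [sum(cm[i][k] for i in range(size)) for k in range(size)]
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / n_obs ** 2
    return (p_a - p_e) / (1 - p_e)

kappa = cohens_kappa(CM)
print(round(kappa, 4))  # 0.5418
```

Here $P_{(a)} = 50/80 = 0.625$ and $P_{(e)} = 1162/6400 \approx 0.1816$, which reproduces the value reported above.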

How do the other raters perform, and where does my $\kappa$ value sit within the rater group? Figure 7 below shows the $\kappa$ values of all raters, each computed against the gold category.

Figure 7, The distribution of all raters’ Kappa (against gold category)

It is clear that my Kappa is well below the average.

2.5 Fleiss’ Kappa

While Cohen’s Kappa only measures the agreement between two raters, Fleiss’ Kappa can measure the inter-rater reliability of any number of raters assigning categorical ratings (Fleiss’s Kappa, Wikipedia 2018).

Fleiss’ Kappa $\kappa$ can be defined as:

\[\kappa = \frac{\bar{P} - \bar{P}_{(e)}}{1 - \bar{P}_{(e)}}\]

where $1 - \bar{P}_{(e)}$ is the agreement attainable above chance, and $\bar{P} - \bar{P}_{(e)}$ is the agreement actually achieved above chance.

Since Fleiss’ Kappa measures the agreement among multiple raters, there is no ground truth category against which to measure the raters’ performance. The equation used to calculate Fleiss’ Kappa $\kappa$ has to take the number of raters into consideration.

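Let $m$ be the number of raters, $n$ the number of categories, and $N$ the total number of observations (subjects). As in the Cohen’s Kappa section, if $i$ indexes the rows (subjects) and $j$ indexes the columns (categories), we have an $N \times n$ matrix in which each entry is the number of raters who assigned subject $i$ to category $j$. The formulas below write this entry as $Row_{ij}$ when summing along a row and as $Col_{ij}$ when summing down a column. Then: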

The proportion of a particular category, say $j$-th column, expressed as $p_{j}$ is:

\[p_{j} = \frac{1}{Nm}\sum_{i=1}^{N}Col_{ij}\]

Let $P_{i}$ be the raters’ agreement on the $i$-th subject. It is calculated as:

\[\begin{aligned} P_{i} &= \frac{1}{m(m-1)}\sum_{j=1}^{n}Row_{ij}(Row_{ij} - 1) \\ &= \frac{1}{m(m-1)}\sum_{j=1}^{n}(Row_{ij}^2 - Row_{ij}) \\ &= \frac{1}{m(m-1)}\left[\sum_{j=1}^{n}Row_{ij}^2 - m\right] \end{aligned}\]

The last step uses the fact that each row sums to the number of raters, $\sum_{j=1}^{n}Row_{ij} = m$.

With $P_{i}$ and $p_{j}$, $\bar{P}$ and $\bar{P}_{(e)}$ are calculated as:

\[\begin{aligned} \bar{P} &= \frac{1}{N}\sum_{i=1}^{N}P_{i} \\ &= \frac{1}{Nm(m-1)}\left(\sum_{i=1}^{N}\sum_{j=1}^{n}Row_{ij}^2 - Nm\right) \end{aligned}\]

\[\bar{P}_{(e)} = \sum_{j=1}^{n}p_{j}^2\]

With all these quantities available, we can use the equation above to calculate Fleiss’ Kappa $\kappa$. The interpretation of the $\kappa$ value is the same as stated in the Cohen’s Kappa section.
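The steps above can be sketched in Python as follows. This is only an illustration of the formulas, not the irr routine used for this report. The function name `fleiss_kappa` is hypothetical, and the count matrix is a small illustrative example (10 subjects, 14 raters, 5 categories), not the study data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = raters who put subject i in category j."""
    n_subjects = len(counts)
    m = sum(counts[0])  # raters per subject, assumed constant
    if any(sum(row) != m for row in counts):
        raise ValueError("every subject must be rated by the same number of raters")
    n_categories = len(counts[0])
    # p_j: share of all assignments that went to category j.
    p_j = [sum(row[j] for row in counts) / (n_subjects * m)
           for j in range(n_categories)]
    # P_i: agreement among the raters on subject i.
    p_i = [(sum(c * c for c in row) - m) / (m * (m - 1)) for row in counts]
    p_bar = sum(p_i) / n_subjects
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Illustrative counts only: each row sums to 14 raters.
example = [
    [0, 0, 0, 0, 14],
    [0, 2, 6, 4, 2],
    [0, 0, 3, 5, 6],
    [0, 3, 9, 2, 0],
    [2, 2, 8, 1, 1],
    [7, 7, 0, 0, 0],
    [3, 2, 6, 3, 0],
    [2, 5, 3, 2, 2],
    [6, 5, 2, 1, 0],
    [0, 2, 2, 3, 7],
]

kappa = fleiss_kappa(example)
print(f"{kappa:.3f}")  # 0.210

# Complete agreement on every subject gives kappa = 1.
perfect = fleiss_kappa([[3, 0], [0, 3]])
print(perfect)  # 1.0
```

For the example matrix, $\bar{P} = 688/1820 \approx 0.378$ and $\bar{P}_{(e)} = 4170/19600 \approx 0.213$, giving $\kappa \approx 0.210$. The sketch requires every subject to be rated by the same number of raters, which matches the definition of $m$ above.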

In this report I used the R package irr (Gamer, Lemon & Singh 2012) to perform the Fleiss’ Kappa calculation. For the 81 raters in this data set, $\kappa$ is 0.5588.

3. References

Matthias Gamer, Jim Lemon and Ian Fellows Puspendra Singh (2012). irr: Various Coefficients of Interrater Reliability and Agreement. R package version 0.84. https://CRAN.R-project.org/package=irr
McHugh, M. L. (2012). Interrater reliability: the kappa statistic. Biochemia Medica, 22(3), 276-282.
Stats.stackexchange.com, Cohen’s Kappa in Plain English, accessed 1st April 2018, https://stats.stackexchange.com/questions/82162/cohens-kappa-in-plain-english
Wikipedia (2018). Cohen’s Kappa. https://en.wikipedia.org/wiki/Cohen%27s_kappa
Wikipedia (2018). Fleiss’s Kappa. https://en.wikipedia.org/wiki/Fleiss%27_kappa

4. Appendix

4.1 Category Details

Category	Category_Short	Notes
Business	Biz	Includes: products, marketing, corporate events, business trends
Entertainment	Ent.	Includes: celebrity, film, TV, popular events, fashion. Excludes: sports, politics.
Health	Health	Includes: health advice, medical research
Politics	Politics	Includes: parliamentary debate; elections; diplomacy. Excludes: war and terrorism
Science and Technology	Sci./Tech	Includes: scientific research, health research, new technology, environment.
Society	Society	Includes: crime, tragedy, lifestyle, human interest, popular health. Excludes: politics, entertainment, war and terror
Sports	Sports	Includes: matches; sports gambling; team politics and personnel. Excludes: fashion.
War	War	Includes: physical and virtual fighting between large entities, terrorism, genocide
Other	Other	Try to avoid this! Don’t use it if the article could fit in multiple categories and you can’t decide which.
Error	Error	Includes: Articles not in English. Articles with no content text in the lead.

4.2 Article List with Gold Category

ID	Gold_Category	Article
1	Biz	‘Impossible’ for Saudi to reduce oil output: minister
2	Ent.	‘Paddington’ director on the ‘sad’ departure of Colin Firth | EW.com
3	Society	‘Parents these days’ are judged too harshly
4	Biz	‘Ship Your Enemies Glitter’ business sells for $85K
5	Society	19 Things Scottish People Miss When They Move To London
6	Ent.	30 Items We Just Know You’re Going To Want For Fall
7	Society	36 Hours in Strasbourg, France
8	Politics	72-hour wait for Missouri abortions takes effect next month
9	Health	After my cancer diagnosis, my anxiety disappeared. Now I’ll do anything to keep this body | Clare Atkinson
10	Sci./Tech	Anticipating the rise of the machine
11	Sci./Tech	Apple CarPlay infotainment system has lots of potential
12	Society	Australia’s best private pools on the market
13	Politics	Azerbaijan prosecutes a prominent human rights defenderon absurd charges.
14	Ent.	Book quiz: How well do you know literature? Take the demonic Wigtown Book Festival pub quiz - Telegraph
15	Society	Campaigners call for online retailer to remove ‘unacceptable’ baby clothes branded with ‘sexualised and porn-inspired imagery’
16	Error	Coal is ‘good for humanity’, says Tony Abbott at mine opening
17	Sci./Tech	Coming Picture Of A Supermassive Black Hole Will Be The ‘Image Of The Century’
18	Politics	Cong escalates attack on govt through social media - The Times of India
19	Sports	Cricket: Associations set plans for bright future | Otago Daily Times Online News : Otago, South…
20	Society	Dad crushed to death in lift as a result of ‘safety breaches’ - court hears - Independent.ie
21	Society	Danger Beneath: ‘Fracking’ Gas, Oil Pipes Threaten Rural Residents
22	Sports	Devastated as captain, Tendulkar wanted to quit: Autobiography - The Times of India
23	Society	Dewani cop didn’t ace it
24	Society	DU to go back to reevaluation system : News
25	Ent.	DUBS Bring Down The Noise In Night Clubs
26	Ent.	Eastenders: Carter clan set to implode in 2015
27	Ent.	EW discusses: Does ‘Super Smash Bros.’ succeed on the 3DS? | EW.com
28	Society	Ex-NYPD cop slams ‘black brunch’ with gun-toting selfie
29	Society	EXCLUSIVE: MH370 will ‘NEVER be found’ as ‘there’s no sensible theory to where it is’
30	Society	Father fatally shoots man in daughter’s room in Northeast Philadelphia
31	Society	FBI investigating explosion outside NAACP office in Colorado - CNN.com
32	Society	From the flower pot to Pantibar: Enda’s gay support evolution
33	Health	Gene-Therapy Hope for ‘Bubble Boy’ Syndrome
34	Society	Girl killed, 11 injured in fiery crash on 60 Freeway in South El Monte, all lanes closed
35	Biz	Google ‘committed’ to Ireland despite plans to phase out controversial tax rules - Independent.ie
36	Biz	Hamish McRae: Lessons for all as America bounces back
37	Society	How Britain Made James Foley’s Killer
38	Society	How to find love in 2015 without it costing the earth
39	Sports	http://www.scotsman.com/news/celebrity/andy-murray-courts-controversy-with-ss-logo-1-3662942
40	Sports	Ice hockey boy gets floored
41	Politics	If US had a patent law like ours, they would discover many more drugs: Anand Grover - The Times of India
42	Sports	Indian women beat Japan, win bronze in Asiad hockey
43	Ent.	Interpol’s Second Act: Inside the Gloom Kings’ Return
44	Ent.	INXS relive one of their most successful years ever and their toughest
45	Politics	Iraq Reaches Major Oil Deal With Kurds
46	Society	Kane’s account of sting draws increasing fire
47	Society	Kate Malonyay murdered by Elliot Coulson after uncovering web of lies: coroner
48	Sports	Leading golfer Ian Poulter accused of being out of touch with reality
49	Sports	Louis van Gaal hands out Christmas presents to fans on Boxing Day
50	Society	Met policeman cleared after kicking mother tending to her child in hospital
51	Ent.	Musicians Besides Frank Who Have Preferred Masks Onstage
52	Sports	Nine to Know: Red Sox Stats Pack Good at Losing Edition
53	Politics	Obama proposes consumer data protections, even as a military Twitter account is hacked
54	Politics	Outrage as EU politicians demand extra PS8BILLION from taxpayers for Brussels budget
55	Sci./Tech	Over half of all gadgets used at Christmas were made by Apple
56	Ent.	Paul Hollywood & his wife Alexandra put on a united front at the NTAs
57	Society	Philippine storm death toll rises to six - The Times of India
58	Society	Poppy Appeal: Amputee soldiers join Joss Stone at Cenotaph vigil
59	Biz	Potentially interesting
60	Society	Prosecutors seek to have McDonnell jailed during appeal
61	Sports	Radamel Falcao: Colombian striker ‘really wants to stay’ at Manchester United - but it depends on how much he plays
62	War	Russian armoured vehicles and military trucks cross border into Ukraine - Telegraph
63	Sports	Ryan Giggs’ brother says footballer’s eight-year affair with his wife ‘demolished’ his family
64	Ent.	Snapper star dies
65	Sci./Tech	TeradataVoice: The Catch-22 In Cyber Defense: More Isn’t Always Better
66	Sci./Tech	The Government’s bid to boost mobile coverage could easily backfire - Telegraph
67	Society	The unborn babies who are already smiling for the camera
68	Error	The weirdest moments from 30 years of ‘Ninja Turtles’
69	Society	The words that ended my marriage
70	Sci./Tech	Tim Cook puts personal touch on iPhone 6 launch - The Times of India
71	Ent.	TOWIE’s Pascal Craymer nearly spills out of plunging red dress
72	Ent.	Translators battle bad subtitles that lead to poor perception of Hong Kong films
73	Politics	Vic premier has lost the plot says Geoff Shaw after surviving expulsion
74	Politics	Victorian Liberals sack candidate over role for porn star in campaign event
75	Society	We put your tax return hell questions to HMRC, and this is what they said
76	Biz	We should be banking on a brighter future for Scotland
77	Politics	What has the Human Rights Act done for you?
78	Society	What is a sinkhole?
79	Ent.	Win one of 10 luxury holidays at the Telegraph Cruise Show - Telegraph
80	Society	Woman dies after airport scanner interferes with her pacemaker - Telegraph