This report presents an inter-rater reliability analysis based on the manual annotation of 80 articles by 81 raters who participated in an online rating exercise.
It covers an exploratory analysis of the annotation results, a quantitative and qualitative analysis using several inter-rater reliability measures, and a discussion.
The raw data is a CSV file containing the annotations of 80 articles by 81 raters, including myself. All articles have been assigned to one of 10 predefined categories by the raters. (See Appendix Category Details for the details of these 10 categories.)
Figure 1, Article Category Voting Heatmap
From the heatmap above, we can see that the majority of voters agree on some articles, while opinions on others are quite diverse. We can also see that a large proportion of articles are categorized as Society. Other is the least favored category based on the hierarchical sorting.
Figure 2, The distribution of # of votes for the agreed category
While more than 50% of articles were agreed on by 61 or more voters, 15% of articles were not agreed on by a majority of the voters. This suggests that these 15% of articles could plausibly be assigned to multiple categories.
Figure 3, The ranking of agreed categories based on the # of articles
It is interesting that only 1 article is categorized as War. No articles are categorized as Other.
The list of articles with agreed categories (gold category) can be found in Appendix Article List with Gold Category.
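The agreed (gold) category can be derived from the raw votes with a simple tally. The sketch below assumes that the gold category is the most-voted category for an article and that "agreed by the majority" means more than half of the raters chose it. The function names and the five-rater votes are hypothetical and are not the report's data; ties are broken by whichever category was seen first, which is also an assumption.

```python
from collections import Counter

def gold_category(votes):
    """Return (category, vote_count) for the most-voted category of one article."""
    category, count = Counter(votes).most_common(1)[0]
    return category, count

def has_majority(votes):
    """True when the top category was chosen by more than half of the raters."""
    _, count = gold_category(votes)
    return count > len(votes) / 2

# Hypothetical votes for two articles from five raters (not the report's data).
ratings = {
    "article_a": ["Society", "Society", "Society", "Politics", "Society"],
    "article_b": ["Biz", "Politics", "Society", "Biz", "Sci./Tech"],
}

for article, votes in ratings.items():
    print(article, gold_category(votes), has_majority(votes))
```

In this toy input, the first article has a clear majority category, while the second has a most-voted category that fewer than half of the raters chose, which is the situation described for the 15% of articles above.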
Taking the agreed categories as the gold category and my own annotations as the prediction, we can see the difference in Figure 4.
Figure 4, # of Articles in Gold Category vs Predicted Category
It is clear that the main discrepancy comes from 3 categories: Society, Science and Technology, and Business.
From the preceding exploratory analysis, we have seen that the predicted categories differ from the gold categories. I will use a confusion matrix to show how they differ in each category, as in Table 1.
Table 1, Confusion Matrix Table of Gold Category vs Predicted Category
Gold (rows) / Predicted (columns) | Biz | Ent. | Error | Health | Other | Politics | Sci./Tech | Society | Sports | War | Total |
---|---|---|---|---|---|---|---|---|---|---|---|
Biz | 3 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 6 |
Ent. | 3 | 9 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 14 |
Error | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 |
Health | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 2 |
Other | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Politics | 0 | 0 | 0 | 0 | 0 | 7 | 1 | 2 | 0 | 0 | 10 |
Sci./Tech | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 2 | 0 | 0 | 7 |
Society | 4 | 0 | 0 | 1 | 0 | 2 | 3 | 17 | 0 | 1 | 28 |
Sports | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 0 | 10 |
War | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
Total | 10 | 13 | 0 | 2 | 0 | 12 | 10 | 23 | 8 | 2 | 80 |
A confusion matrix heatmap is shown in Figure 5, where each value is normalized by the total of its gold category.
Figure 5, Confusion Matrix Heatmap of Gold Category vs Predicted Category
From the confusion matrix heatmap, we can see that while the diagonal of the matrix is mostly heated, the prediction goes astray in certain categories, particularly Business and Health, where opinions differ strongly.
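The counts in a confusion matrix and the gold-normalized values used for a heatmap can be produced in a few lines. This is a minimal sketch under the assumption that normalization means dividing each cell by its gold-row total. The helper names and the six toy labels are hypothetical and are not the report's data.

```python
from collections import Counter

def confusion_counts(gold, predicted):
    """Count (gold, predicted) label pairs; rows are gold, columns are predicted."""
    return Counter(zip(gold, predicted))

def row_normalised(counts, categories):
    """Divide each cell by its gold-row total; rows with no gold articles stay at 0."""
    table = {}
    for g in categories:
        row_total = sum(counts[(g, p)] for p in categories)
        table[g] = {
            p: (counts[(g, p)] / row_total if row_total else 0.0)
            for p in categories
        }
    return table

# Hypothetical labels for six articles (not the report's data).
categories = ["Biz", "Politics", "Society"]
gold = ["Biz", "Biz", "Politics", "Society", "Society", "Society"]
predicted = ["Biz", "Politics", "Politics", "Society", "Biz", "Society"]

counts = confusion_counts(gold, predicted)
normalised = row_normalised(counts, categories)
print(normalised["Biz"])
```

With row normalization, each non-empty row sums to 1, so the diagonal cell of a row can be read directly as that category's recall.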
So how good is the overall prediction? I will use common performance evaluation metrics to measure the prediction performance.
Since there are multiple categories, I will calculate the scores for each category separately and use a weighted average to calculate the overall scores. I calculate the weight of a category from its proportion of the total articles. For instance, if we let \(W_i\) be the weight of category \(i\) and \(S_i\) be any evaluation score for that category, then the overall average score \(OS\) is calculated as:
\[OS = \displaystyle\sum_{i=1}^{n}W_i \times S_i\]
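In this report, I will measure 3 common evaluation metrics: Recall, Precision, and F-Score. For a given category, let \(TP\) be the number of true positives, \(FN\) the number of false negatives, and \(FP\) the number of false positives.
The Recall is calculated as:
\[Recall = \frac{TP}{FN + TP}\]
The Precision is calculated as:
\[Precision = \frac{TP}{FP + TP}\]
The F-Score is calculated as:
\[{FScore} = \frac{2\times Recall \times Precision}{Recall + Precision}\]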
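The three formulas and the weighted average can be applied directly to the counts in Table 1. The sketch below is illustrative only: the function names are hypothetical, a score with a zero denominator is returned as `None`, and undefined scores are skipped in the weighted sum without re-normalizing the weights. How undefined categories are treated is an assumption, so the overall figures from this sketch need not match the overall figures computed for this report.

```python
# Rows are gold categories, columns are predicted categories (copied from Table 1).
CATEGORIES = ["Biz", "Ent.", "Error", "Health", "Other",
              "Politics", "Sci./Tech", "Society", "Sports", "War"]
TABLE_1 = [
    [3, 0, 0, 0, 0, 3, 0, 0, 0, 0],
    [3, 9, 0, 0, 0, 0, 0, 2, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 7, 1, 2, 0, 0],
    [0, 0, 0, 0, 0, 0, 5, 2, 0, 0],
    [4, 0, 0, 1, 0, 2, 3, 17, 0, 1],
    [0, 3, 0, 0, 0, 0, 0, 0, 7, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
]

def safe_div(num, den):
    """Return num / den, or None when the denominator is zero."""
    return num / den if den else None

def per_category_scores(matrix):
    """Recall, Precision and F-Score for every category of a square matrix."""
    scores = {}
    for k, name in enumerate(CATEGORIES):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                 # gold k, predicted something else
        fp = sum(row[k] for row in matrix) - tp  # predicted k, gold something else
        recall = safe_div(tp, tp + fn)
        precision = safe_div(tp, tp + fp)
        if recall is None or precision is None or recall + precision == 0:
            f_score = None
        else:
            f_score = 2 * recall * precision / (recall + precision)
        scores[name] = {"recall": recall, "precision": precision, "f_score": f_score}
    return scores

def weighted_overall(matrix, scores, metric):
    """OS = sum(W_i * S_i), with W_i the gold share of category i (assumed weighting)."""
    total = sum(sum(row) for row in matrix)
    return sum(
        (sum(matrix[k]) / total) * scores[name][metric]
        for k, name in enumerate(CATEGORIES)
        if scores[name][metric] is not None
    )

scores = per_category_scores(TABLE_1)
print(scores["Sports"])
print(weighted_overall(TABLE_1, scores, "recall"))
```

For example, Business has 3 true positives out of 6 gold articles and 10 predicted articles, which gives a Recall of 50% and a Precision of 30%.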
Table 2 summarizes the performance of my ratings (predictions).
Table 2, Performance evaluation summary for my prediction
measure | Biz | Ent. | Error | Health | Other | Politics | Sci./Tech | Society | Sports | War | Overall |
---|---|---|---|---|---|---|---|---|---|---|---|
Recall | 50% | 64.29% | NA | 50% | NA | 70% | 71.43% | 60.71% | 70% | 100% | 62.71% |
Precision | 30% | 69.23% | NA | 50% | NA | 58.33% | 50% | 73.91% | 87.5% | 50% | 62.15% |
Fscore | 37.5% | 66.67% | NA | 50% | NA | 63.64% | 58.82% | 66.67% | 77.78% | 66.67% | 61.4% |
Figure 6, Category and Overall performance metric score
Based on the metric scores, my best-predicted category is Sports (F-Score 77.78%). My predictions for Business articles are the weakest (F-Score 37.5%).
Overall, I achieved an averaged F-Score of 62.71%. A few questions regarding this low score are raised in the discussion section.
Cohen’s Kappa is a very common statistic for measuring the inter-rater agreement between two raters, or between a set of test results and the ground truth (Cohen’s Kappa, Wikipedia 2018).
Cohen’s Kappa coefficient \(\kappa\) is calculated as:
\[\kappa = \frac{P_{(a)} - P_{(e)}}{1 - P_{(e)}}\]
where \(P_{(a)}\) is the observed agreement and \(P_{(e)}\) is the expected probability of chance agreement. A confusion matrix like Table 1 is a good starting point for illustrating how to calculate \(P_{(e)}\). If we use \(i\) to index the rows and \(j\) to index the columns of the categories in Table 1, so that \(\sum_{i}Row_{ik}\) is the column total and \(\sum_{j}Col_{kj}\) is the row total of category \(k\), then:
\[P_{(e)} = \frac{1}{N^2}\sum_{k=1}^{n}\left( \sum_{i=1}^{n}Row_{ik}\times\sum_{j=1}^{n}Col_{kj} \right)\]
where \(N\) is the total number of observations and \(n\) is the number of categories.
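Based on Wikipedia (Wikipedia 2018), Cohen’s Kappa coefficient might be interpreted using the commonly cited bands: below 0 as no agreement, 0 to 0.20 as slight, 0.21 to 0.40 as fair, 0.41 to 0.60 as moderate, 0.61 to 0.80 as substantial, and 0.81 to 1 as almost perfect agreement.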
However, there is no clear or sound evidence to support such an interpretation. In some cases, these guidelines could be problematic, and therefore they should be used cautiously.
Instead of calculating \(\kappa\) by hand, in this report I used the R package irr (Gamer, Lemon & Singh 2012) to perform the calculation. The \(\kappa\) value of my prediction against the gold category is 0.5418. This score suggests moderate agreement between my annotation and the gold category.
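As a cross-check, the formulas for \(P_{(a)}\) and \(P_{(e)}\) can be evaluated by hand on the counts in Table 1. The sketch below uses plain Python rather than the irr package, and the function name is hypothetical. From Table 1, the diagonal sums to 50 out of 80 articles, so \(P_{(a)} = 0.625\), and the products of matching row and column totals sum to 1162, so \(P_{(e)} = 1162 / 6400\).

```python
# Rows are gold categories, columns are predicted categories (copied from Table 1).
TABLE_1 = [
    [3, 0, 0, 0, 0, 3, 0, 0, 0, 0],
    [3, 9, 0, 0, 0, 0, 0, 2, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 7, 1, 2, 0, 0],
    [0, 0, 0, 0, 0, 0, 5, 2, 0, 0],
    [4, 0, 0, 1, 0, 2, 3, 17, 0, 1],
    [0, 3, 0, 0, 0, 0, 0, 0, 7, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
]

def cohen_kappa(matrix):
    """Cohen's Kappa from a square confusion matrix of counts."""
    n_categories = len(matrix)
    n_obs = sum(sum(row) for row in matrix)
    # Observed agreement: share of observations on the diagonal.
    p_a = sum(matrix[k][k] for k in range(n_categories)) / n_obs
    # Chance agreement: row total times column total per category, over N^2.
    p_e = sum(
        sum(matrix[k]) * sum(row[k] for row in matrix)
        for k in range(n_categories)
    ) / n_obs ** 2
    return (p_a - p_e) / (1 - p_e)

kappa = cohen_kappa(TABLE_1)
print(round(kappa, 4))
```

Rounded to four decimals, this hand calculation gives 0.5418, the same value as the one reported above.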
How do the other raters perform, and where does my \(\kappa\) value sit within the rater group? For Figure 7 below, I computed every rater’s \(\kappa\) value against the gold category.
Figure 7, The distribution of All raters’ Kappa(against gold category)
It is clear that my Kappa is way below the average.
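Comparing every rater with the gold category amounts to repeating the two-rater calculation once per rater. The sketch below assumes each rater's labels are available as a list aligned with the gold labels. The rater names and labels are made up for illustration and are not the report's data, and returning `None` when chance agreement equals 1 is a choice made here for the undefined case.

```python
from collections import Counter
from statistics import mean

def cohen_kappa_from_labels(gold, rated):
    """Cohen's Kappa between two aligned label lists."""
    n_obs = len(gold)
    p_a = sum(g == r for g, r in zip(gold, rated)) / n_obs
    gold_counts, rated_counts = Counter(gold), Counter(rated)
    p_e = sum(gold_counts[c] * rated_counts[c] for c in gold_counts) / n_obs ** 2
    if p_e == 1:
        return None  # undefined: both lists use a single identical category
    return (p_a - p_e) / (1 - p_e)

# Hypothetical gold labels and three hypothetical raters.
gold = ["Society", "Society", "Sports", "Sports"]
raters = {
    "rater_1": ["Society", "Society", "Sports", "Sports"],
    "rater_2": ["Society", "Sports", "Sports", "Sports"],
    "rater_3": ["Sports", "Sports", "Society", "Society"],
}

kappas = {name: cohen_kappa_from_labels(gold, labels) for name, labels in raters.items()}
average = mean(kappas.values())
below_average = sorted(name for name, k in kappas.items() if k < average)
print(kappas, average, below_average)
```

Plotting the values in `kappas` as a histogram and marking one rater's value gives the kind of comparison shown in Figure 7.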
While Cohen’s Kappa only measures the agreement between two raters, Fleiss’ Kappa can measure the inter-rater reliability between any number of raters on categorical ratings (Fleiss’ Kappa, Wikipedia 2018).
Fleiss’ Kappa \(\kappa\) can be defined as:
\[\kappa = \frac{\bar{P} - \bar{P}_{(e)}}{1 - \bar{P}_{(e)}}\]
where \(1 - \bar{P}_{(e)}\) is the agreement achievable above chance, while \(\bar{P} - \bar{P}_{(e)}\) is the agreement actually achieved above chance.
Since Fleiss’ Kappa is used to measure the agreement among multiple raters, there is no ground truth category available here to measure the raters’ performance. The equation used to calculate Fleiss’ Kappa \(\kappa\) has to take the number of raters into consideration.
Let \(m\) be the number of raters, \(n\) the number of categories, and \(N\) the total number of observations (subjects). As in the Cohen’s Kappa section, if we use \(i\) to index the rows for subjects and \(j\) to index the columns for categories, we have an \(N \times n\) matrix whose entry in row \(i\) and column \(j\) is the number of raters who assigned subject \(i\) to category \(j\). Then:
The proportion of a particular category, say \(j\)-th column, expressed as \(p_{j}\) is:
\[p_{j} = \frac{1}{Nm}\sum_{i=1}^{N}Col_{ij}\]
Let \(P_{i}\) be the raters’ agreement on the \(i\)-th subject. It is calculated as:
\[\begin{aligned}P_{i} &= \frac{1}{m(m-1)}\sum_{j=1}^{n}Row_{ij}(Row_{ij} - 1) \\ &= \frac{1}{m(m-1)}\sum_{j=1}^{n}(Row_{ij}^2 - Row_{ij}) \\ &= \frac{1}{m(m-1)}\left[\left(\sum_{j=1}^{n}Row_{ij}^2\right) - m\right]\end{aligned}\]
With \(P_{i}\) and \(p_{j}\), \(\bar{P}\) and \(\bar{P}_{(e)}\) are calculated as:
\[\begin{aligned}\bar{P} &= \frac{1}{N}\sum_{i=1}^{N}P_{i} \\ &= \frac{1}{Nm(m-1)}\left(\sum_{i=1}^{N}\sum_{j=1}^{n}Row_{ij}^2 - Nm\right)\end{aligned}\]
\[\bar{P}_{(e)} = \sum_{j=1}^{n}p_{j}^2\]
With all these parameters available, we can use the equation for \(\kappa\) above to calculate Fleiss’ Kappa. The interpretation of the \(\kappa\) value is the same as stated in the Cohen’s Kappa section.
In this report, I used the R package irr (Gamer, Lemon & Singh 2012) to perform the Fleiss’ Kappa calculation. For the 81 raters in this data set, \(\kappa\) is 0.5588.
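The steps of the Fleiss' Kappa calculation can also be written out directly from the definitions of \(p_{j}\), \(P_{i}\), \(\bar{P}\) and \(\bar{P}_{(e)}\). This is a minimal plain-Python sketch, not the irr implementation. The function name and the toy matrix of 4 subjects, 3 raters and 2 categories are hypothetical, and the sketch assumes every subject is rated by the same number of raters.

```python
def fleiss_kappa(counts):
    """Fleiss' Kappa; counts[i][j] = number of raters who put subject i in category j."""
    n_subjects = len(counts)
    n_categories = len(counts[0])
    m = sum(counts[0])  # raters per subject
    if any(sum(row) != m for row in counts):
        raise ValueError("every subject must be rated by the same number of raters")
    # p_j: share of all assignments that went to category j.
    p_j = [
        sum(row[j] for row in counts) / (n_subjects * m)
        for j in range(n_categories)
    ]
    # P_i: agreement among the raters on subject i.
    p_i = [(sum(c * c for c in row) - m) / (m * (m - 1)) for row in counts]
    p_bar = sum(p_i) / n_subjects
    p_e_bar = sum(p * p for p in p_j)
    return (p_bar - p_e_bar) / (1 - p_e_bar)

# Hypothetical counts: 4 subjects, 3 raters, 2 categories.
toy_counts = [[3, 0], [0, 3], [2, 1], [3, 0]]
toy_kappa = fleiss_kappa(toy_counts)
print(toy_kappa)
```

For the toy matrix, \(\bar{P} = 5/6\) and \(\bar{P}_{(e)} = 5/9\), so \(\kappa = (5/6 - 5/9) / (1 - 5/9) = 5/8\).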
Gamer, M., Lemon, J., Fellows, I. & Singh, P. (2012). irr: Various Coefficients of Interrater Reliability and Agreement. R package version 0.84. https://CRAN.R-project.org/package=irr
McHugh, M. L. (2012). Interrater reliability: the kappa statistic. Biochemia Medica, 22(3), 276-282.
Stats.stackexchange.com, Cohen’s Kappa in Plain English, accessed 1st April 2018, https://stats.stackexchange.com/questions/82162/cohens-kappa-in-plain-english
Wikipedia (2018). Cohen’s Kappa. https://en.wikipedia.org/wiki/Cohen%27s_kappa
Wikipedia (2018). Fleiss’s Kappa. https://en.wikipedia.org/wiki/Fleiss%27_kappa
Appendix Category Details
Category | Category_Short | Notes |
---|---|---|
Business | Biz | Includes: products, marketing, corporate events, business trends |
Entertainment | Ent. | Includes: celebrity, film, TV, popular events, fashion. Excludes: sports, politics. |
Health | Health | Includes: health advice, medical research |
Politics | Politics | Includes: parliamentary debate; elections; diplomacy. Excludes: war and terrorism |
Science and Technology | Sci./Tech | Includes: scientific research, health research, new technology, environment. |
Society | Society | Includes: crime, tragedy, lifestyle, human interest, popular health. Excludes: politics, entertainment, war and terror |
Sports | Sports | Includes: matches; sports gambling; team politics and personnel. Excludes: fashion. |
War | War | Includes: physical and virtual fighting between large entities, terrorism, genocide |
Other | Other | Try to avoid this! Don’t use it if the article could fit in multiple categories and you can’t decide which. |
Error | Error | Includes: Articles not in English. Articles with no content text in the lead. |
Appendix Article List with Gold Category
ID | Gold_Category | Article |
---|---|---|
1 | Biz | ‘Impossible’ for Saudi to reduce oil output: minister |
2 | Ent. | ‘Paddington’ director on the ‘sad’ departure of Colin Firth | EW.com |
3 | Society | ‘Parents these days’ are judged too harshly |
4 | Biz | ‘Ship Your Enemies Glitter’ business sells for $85K |
5 | Society | 19 Things Scottish People Miss When They Move To London |
6 | Ent. | 30 Items We Just Know You’re Going To Want For Fall |
7 | Society | 36 Hours in Strasbourg, France |
8 | Politics | 72-hour wait for Missouri abortions takes effect next month |
9 | Health | After my cancer diagnosis, my anxiety disappeared. Now I’ll do anything to keep this body | Clare Atkinson |
10 | Sci./Tech | Anticipating the rise of the machine |
11 | Sci./Tech | Apple CarPlay infotainment system has lots of potential |
12 | Society | Australia’s best private pools on the market |
13 | Politics | Azerbaijan prosecutes a prominent human rights defenderon absurd charges. |
14 | Ent. | Book quiz: How well do you know literature? Take the demonic Wigtown Book Festival pub quiz - Telegraph |
15 | Society | Campaigners call for online retailer to remove ‘unacceptable’ baby clothes branded with ‘sexualised and porn-inspired imagery’ |
16 | Error | Coal is ‘good for humanity’, says Tony Abbott at mine opening |
17 | Sci./Tech | Coming Picture Of A Supermassive Black Hole Will Be The ‘Image Of The Century’ |
18 | Politics | Cong escalates attack on govt through social media - The Times of India |
19 | Sports | Cricket: Associations set plans for bright future | Otago Daily Times Online News : Otago, South… |
20 | Society | Dad crushed to death in lift as a result of ‘safety breaches’ - court hears - Independent.ie |
21 | Society | Danger Beneath: ‘Fracking’ Gas, Oil Pipes Threaten Rural Residents |
22 | Sports | Devastated as captain, Tendulkar wanted to quit: Autobiography - The Times of India |
23 | Society | Dewani cop didn’t ace it |
24 | Society | DU to go back to reevaluation system : News |
25 | Ent. | DUBS Bring Down The Noise In Night Clubs |
26 | Ent. | Eastenders: Carter clan set to implode in 2015 |
27 | Ent. | EW discusses: Does ‘Super Smash Bros.’ succeed on the 3DS? | EW.com |
28 | Society | Ex-NYPD cop slams ‘black brunch’ with gun-toting selfie |
29 | Society | EXCLUSIVE: MH370 will ‘NEVER be found’ as ‘there’s no sensible theory to where it is’ |
30 | Society | Father fatally shoots man in daughter’s room in Northeast Philadelphia |
31 | Society | FBI investigating explosion outside NAACP office in Colorado - CNN.com |
32 | Society | From the flower pot to Pantibar: Enda’s gay support evolution |
33 | Health | Gene-Therapy Hope for ‘Bubble Boy’ Syndrome |
34 | Society | Girl killed, 11 injured in fiery crash on 60 Freeway in South El Monte, all lanes closed |
35 | Biz | Google ‘committed’ to Ireland despite plans to phase out controversial tax rules - Independent.ie |
36 | Biz | Hamish McRae: Lessons for all as America bounces back |
37 | Society | How Britain Made James Foley’s Killer |
38 | Society | How to find love in 2015 without it costing the earth |
39 | Sports | http://www.scotsman.com/news/celebrity/andy-murray-courts-controversy-with-ss-logo-1-3662942 |
40 | Sports | Ice hockey boy gets floored |
41 | Politics | If US had a patent law like ours, they would discover many more drugs: Anand Grover - The Times of India |
42 | Sports | Indian women beat Japan, win bronze in Asiad hockey |
43 | Ent. | Interpol’s Second Act: Inside the Gloom Kings’ Return |
44 | Ent. | INXS relive one of their most successful years ever and their toughest |
45 | Politics | Iraq Reaches Major Oil Deal With Kurds |
46 | Society | Kane’s account of sting draws increasing fire |
47 | Society | Kate Malonyay murdered by Elliot Coulson after uncovering web of lies: coroner |
48 | Sports | Leading golfer Ian Poulter accused of being out of touch with reality |
49 | Sports | Louis van Gaal hands out Christmas presents to fans on Boxing Day |
50 | Society | Met policeman cleared after kicking mother tending to her child in hospital |
51 | Ent. | Musicians Besides Frank Who Have Preferred Masks Onstage |
52 | Sports | Nine to Know: Red Sox Stats Pack Good at Losing Edition |
53 | Politics | Obama proposes consumer data protections, even as a military Twitter account is hacked |
54 | Politics | Outrage as EU politicians demand extra PS8BILLION from taxpayers for Brussels budget |
55 | Sci./Tech | Over half of all gadgets used at Christmas were made by Apple |
56 | Ent. | Paul Hollywood & his wife Alexandra put on a united front at the NTAs |
57 | Society | Philippine storm death toll rises to six - The Times of India |
58 | Society | Poppy Appeal: Amputee soldiers join Joss Stone at Cenotaph vigil |
59 | Biz | Potentially interesting |
60 | Society | Prosecutors seek to have McDonnell jailed during appeal |
61 | Sports | Radamel Falcao: Colombian striker ‘really wants to stay’ at Manchester United - but it depends on how much he plays |
62 | War | Russian armoured vehicles and military trucks cross border into Ukraine - Telegraph |
63 | Sports | Ryan Giggs’ brother says footballer’s eight-year affair with his wife ‘demolished’ his family |
64 | Ent. | Snapper star dies |
65 | Sci./Tech | TeradataVoice: The Catch-22 In Cyber Defense: More Isn’t Always Better |
66 | Sci./Tech | The Government’s bid to boost mobile coverage could easily backfire - Telegraph |
67 | Society | The unborn babies who are already smiling for the camera |
68 | Error | The weirdest moments from 30 years of ‘Ninja Turtles’ |
69 | Society | The words that ended my marriage |
70 | Sci./Tech | Tim Cook puts personal touch on iPhone 6 launch - The Times of India |
71 | Ent. | TOWIE’s Pascal Craymer nearly spills out of plunging red dress |
72 | Ent. | Translators battle bad subtitles that lead to poor perception of Hong Kong films |
73 | Politics | Vic premier has lost the plot says Geoff Shaw after surviving expulsion |
74 | Politics | Victorian Liberals sack candidate over role for porn star in campaign event |
75 | Society | We put your tax return hell questions to HMRC, and this is what they said |
76 | Biz | We should be banking on a brighter future for Scotland |
77 | Politics | What has the Human Rights Act done for you? |
78 | Society | What is a sinkhole? |
79 | Ent. | Win one of 10 luxury holidays at the Telegraph Cruise Show - Telegraph |
80 | Society | Woman dies after airport scanner interferes with her pacemaker - Telegraph |