## 1. Introduction

This report studies inter-rater reliability based on the manual annotation of 80 articles by 81 raters who participated in an online rating exercise.

It covers an exploratory analysis of the annotation results, a quantitative and qualitative analysis using various inter-rater reliability evaluation techniques, and a discussion.

## 2. Annotation Results Analysis

### 2.1 Data Import and Explorative Analysis

The raw data is a CSV file that contains the annotation results of 80 articles by 81 raters, including myself. All articles were annotated with one of 10 predefined categories by the raters. (See Appendix Category Details for the details of these 10 categories.)

Figure 1, Article Category Voting Heatmap

From the heatmap above, we can see that some articles were agreed on by the majority of voters, while others drew quite diverse opinions. We can also see that a large proportion of articles were categorized as Society. Other is the least favored category based on the hierarchical sorting.

Figure 2, The distribution of # of votes for the agreed category

While more than 50% of the articles were agreed on by 61 or more voters, 15% of the articles were not agreed on by a majority of the voters. This suggests that these 15% of articles could plausibly be assigned to multiple categories.
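The agreed category and its vote count can be derived with a simple plurality vote. The sketch below is a minimal illustration, assuming the ratings are held as a mapping from article ID to the list of labels given by the raters. The names `votes` and `agreed_category` and the toy data are hypothetical and are not taken from the actual CSV file.

```python
from collections import Counter

def agreed_category(labels):
    """Return (category, votes, share) for the most frequently chosen label."""
    counts = Counter(labels)
    category, n_votes = counts.most_common(1)[0]
    return category, n_votes, n_votes / len(labels)

# Hypothetical toy data: 3 articles, 5 raters each (the real exercise had 81 raters)
votes = {
    "article_1": ["Society", "Society", "Society", "Politics", "Society"],
    "article_2": ["Business", "Sci./Tech", "Society", "Business", "Sci./Tech"],
    "article_3": ["Sports", "Sports", "Sports", "Sports", "Sports"],
}

summary = {article: agreed_category(labels) for article, labels in votes.items()}

# Articles whose top category did not reach a strict majority of the votes
no_majority = [article for article, (_, _, share) in summary.items() if share <= 0.5]
```

In this toy data only `article_2` lacks a majority, which is the kind of article that could reasonably belong to more than one category.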

Figure 3, The ranking of agreed categories based on the # of articles

Interestingly, only 1 article is categorized as War, and no articles are categorized as Other.

The list of articles with agreed categories (gold category) can be found in Appendix Article List with Gold Category.

Taking the agreed categories as the gold categories and my own annotations as the predictions, we can see the difference in Figure 4.

Figure 4, # of Articles in Gold Category vs Predicted Category

It is clear that the main discrepancy comes from 3 categories: Society, Science and Technology, and Business.
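The per-category gap can be made explicit by subtracting the gold counts from the predicted counts. The following sketch uses the marginal totals reported in Table 1 (Section 2.2); the variable names are illustrative only.

```python
categories = ["Biz", "Ent.", "Error", "Health", "Other",
              "Politics", "Sci./Tech", "Society", "Sports", "War"]
gold_counts = [6, 14, 2, 2, 0, 10, 7, 28, 10, 1]   # row totals of Table 1
pred_counts = [10, 13, 0, 2, 0, 12, 10, 23, 8, 2]  # column totals of Table 1

# Positive values: I used the category more often than the gold standard did
diff = {c: p - g for c, g, p in zip(categories, gold_counts, pred_counts)}

# The three categories with the largest absolute gap
largest = sorted(diff, key=lambda c: abs(diff[c]), reverse=True)[:3]
```

Sorting by absolute difference puts Society, Biz, and Sci./Tech at the top, consistent with the observation above.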

### 2.2 Confusion Matrix

From the preceding exploratory analysis, we noticed differences between the predicted categories and the gold categories. I will use a confusion matrix to show how they differ in each category, as in Table 1.

Table 1, Confusion Matrix Table of Gold Category vs Predicted Category
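
Rows are gold categories; columns are predicted categories.

| Gold \ Predicted | Biz | Ent. | Error | Health | Other | Politics | Sci./Tech | Society | Sports | War | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Biz | 3 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 6 |
| Ent. | 3 | 9 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 14 |
| Error | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 |
| Health | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 2 |
| Other | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Politics | 0 | 0 | 0 | 0 | 0 | 7 | 1 | 2 | 0 | 0 | 10 |
| Sci./Tech | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 2 | 0 | 0 | 7 |
| Society | 4 | 0 | 0 | 1 | 0 | 2 | 3 | 17 | 0 | 1 | 28 |
| Sports | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 0 | 10 |
| War | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| Total | 10 | 13 | 0 | 2 | 0 | 12 | 10 | 23 | 8 | 2 | 80 |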

A confusion matrix heatmap is shown in Figure 5, where each value is normalized by the total of its gold category.
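The normalization can be sketched as dividing each gold-category row by its row total. The example below applies this to three rows of Table 1; the function name `normalize_rows` is illustrative, and leaving an all-zero row (such as Other) as zeros is an assumption, since the report does not state how empty categories are handled.

```python
def normalize_rows(matrix):
    """Divide each row by its total; all-zero rows are left as zeros."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized

# Three gold-category rows from Table 1 (columns in Table 1 order)
rows = {
    "Biz":    [3, 0, 0, 0, 0, 3, 0, 0, 0, 0],
    "Health": [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    "Other":  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}
normalized = dict(zip(rows, normalize_rows(list(rows.values()))))
```

After normalization every non-empty row sums to 1, so the diagonal cell of a row can be read as the share of that gold category that was predicted correctly.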

Figure 5, Confusion Matrix Heatmap of Gold Category vs Predicted Category

From the confusion matrix heatmap, we can see that while the diagonal of the matrix is mostly heated, the prediction goes astray in certain categories, particularly Business and Health, where opinions differ strongly.

### 2.3 Prediction Performance Evaluation

So how good is the overall prediction? I will use common performance evaluation metrics to measure the prediction performance.

Since there are multiple categories, I will calculate each metric for every category separately and use a weighted average to obtain the overall score. The weight of a category is its proportion of the total number of articles. Let $W_i$ be the weight of category $i$, $S_i$ its evaluation score, and $n$ the number of categories; the overall average score $OS$ is then calculated as:

$OS = \displaystyle\sum_{i=1}^{n}W_i \times S_i$
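As a minimal sketch of this weighted average, the function below derives the weights from per-category article counts and combines them with per-category scores. The function name `overall_score` and the three-category numbers are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def overall_score(article_counts, scores):
    """Weighted average OS = sum_i W_i * S_i, with W_i = count_i / total count."""
    total = sum(article_counts)
    weights = [count / total for count in article_counts]
    return sum(w * s for w, s in zip(weights, scores))

# Hypothetical example: three categories with 2, 6 and 2 articles
example_counts = [2, 6, 2]
example_scores = [0.5, 1.0, 0.0]

# 0.2 * 0.5 + 0.6 * 1.0 + 0.2 * 0.0
os_value = overall_score(example_counts, example_scores)
```

Because the weights sum to 1, large categories dominate the overall score, while a poor score in a small category has only a limited effect.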

In this report, I will measure 3 common evaluation metrics:

1. Recall
2. Precision
3. F-Score

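For a given category, let $TP$, $FN$, and $FP$ be the numbers of true positives, false negatives, and false positives.

The Recall is calculated as:

$Recall = \frac{TP}{FN + TP}$

The Precision is calculated as:

$Precision = \frac{TP}{FP + TP}$

The F-Score is calculated as:

$FScore = \frac{2\times Recall \times Precision}{Recall + Precision}$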
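The three formulas can be applied per category directly to the counts in Table 1. The sketch below is one possible implementation, not necessarily the procedure used for this report. The names `CONFUSION`, `safe_div`, and `per_category_metrics` are illustrative, and returning `None` when a denominator is zero is an assumed convention.

```python
CATEGORIES = ["Biz", "Ent.", "Error", "Health", "Other",
              "Politics", "Sci./Tech", "Society", "Sports", "War"]

# Table 1: rows = gold category, columns = predicted category
CONFUSION = [
    [3, 0, 0, 0, 0, 3, 0, 0, 0, 0],   # Biz
    [3, 9, 0, 0, 0, 0, 0, 2, 0, 0],   # Ent.
    [0, 1, 0, 0, 0, 0, 0, 0, 1, 0],   # Error
    [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],   # Health
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],   # Other
    [0, 0, 0, 0, 0, 7, 1, 2, 0, 0],   # Politics
    [0, 0, 0, 0, 0, 0, 5, 2, 0, 0],   # Sci./Tech
    [4, 0, 0, 1, 0, 2, 3, 17, 0, 1],  # Society
    [0, 3, 0, 0, 0, 0, 0, 0, 7, 0],   # Sports
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],   # War
]

def safe_div(numerator, denominator):
    """Return the ratio, or None when the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else None

def per_category_metrics(matrix, i):
    """Recall, precision and F-score for the category at index i."""
    tp = matrix[i][i]
    fn = sum(matrix[i]) - tp                  # gold i, predicted as something else
    fp = sum(row[i] for row in matrix) - tp   # predicted i, gold is something else
    recall = safe_div(tp, tp + fn)
    precision = safe_div(tp, tp + fp)
    if recall is None or precision is None or recall + precision == 0:
        f_score = None
    else:
        f_score = 2 * recall * precision / (recall + precision)
    return recall, precision, f_score

metrics = {c: per_category_metrics(CONFUSION, i) for i, c in enumerate(CATEGORIES)}
```

For Biz, for example, this gives a recall of 3/6 = 0.5, a precision of 3/10 = 0.3, and an F-score of 0.375.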

Table 2 summarizes the performance of my ratings (predictions).

Table 2, Performance evaluation summary for my prediction

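| Measure | Biz | Ent. | Error | Health | Other | Politics | Sci./Tech | Society | Sports | War | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Recall | 50% | 64.29% | NA | 50% | NA | 70% | 71.43% | 60.71% | 70% | 100% | 62.71% |
| Precision | 30% | 69.23% | NA | 50% | NA | 58.33% | 50% | 73.91% | 87.5% | 50% | 62.15% |
| F-Score | 37.5% | 66.67% | NA | 50% | NA | 63.64% | 58.82% | 66.67% | 77.78% | 66.67% | 61.4% |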

Figure 6, Category and Overall performance metric score