WikiHarass

From Algolit


Revision as of 20:42, 30 October 2017

Type: Dataset
Number of words: 1,039,789
Unique words: 64,136
Source: English Wikipedia
Developed by: Wikimedia Foundation

The Detox dataset comes from a project by Wikimedia and Perspective API to train a neural network that detects the level of toxicity of a comment.

The original dataset consists of:

  • A corpus of all 95 million user and article talk diffs made between 2001 and 2015, scored by the personal attack model.
  • A human-annotated dataset of 1 million crowd-sourced annotations that cover 100,000 talk page diffs (with 10 judgements per diff).
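
To make the "10 judgements per diff" structure concrete, here is a minimal Python sketch of how such annotations could be collapsed into one score per diff. The function name, the toy diff ids, and the 0.5 majority threshold are all assumptions for illustration; the source does not say how the Detox project itself aggregates judgements.

```python
from collections import defaultdict

# Scale check from the description above: 100,000 diffs x 10 judgements each.
assert 100_000 * 10 == 1_000_000


def aggregate_judgements(annotations):
    """Group (diff_id, is_attack) pairs by diff and return, for each diff,
    the fraction of annotators who flagged it as a personal attack."""
    votes = defaultdict(list)
    for diff_id, is_attack in annotations:
        votes[diff_id].append(is_attack)
    return {diff_id: sum(v) / len(v) for diff_id, v in votes.items()}


# Hypothetical toy data: two diffs with 10 judgements each (1 = attack, 0 = not).
toy_annotations = (
    [("diff_a", 1)] * 7 + [("diff_a", 0)] * 3
    + [("diff_b", 1)] * 1 + [("diff_b", 0)] * 9
)

scores = aggregate_judgements(toy_annotations)

# Assumed rule: a diff counts as an attack if more than half the annotators say so.
labels = {diff_id: score > 0.5 for diff_id, score in scores.items()}
```

Keeping the fraction rather than only the majority label preserves how much the annotators disagreed, which can matter when the scores are used as training targets.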

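For Algolit, a smaller section of the Detox dataset was used. It was taken from Jigsaw's GitHub (https://conversationai.github.io/wikidetox/testdata/tox-sorted/Wikipedia%20Toxicity%20Sorted%20%28Toxicity%405%5BAlpha%5D%29.html) and contains both constructive and vandalizing edits.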