WikiHarass
From Algolit
Revision as of 20:43, 30 October 2017
Type: Dataset
Number of words: 1.039.789
Unique words: 64.136
Source: English Wikipedia
Developed by: Wikimedia Foundation
The Detox dataset comes out of a project by Wikimedia and Perspective API to train a neural network that detects the level of toxicity of a comment.
The original dataset consists of:
- A corpus of all 95 million user and article talk diffs made between 2001 and 2015, scored by the personal attack model.
- A human-annotated dataset of 1 million crowd-sourced annotations that cover 100,000 talk page diffs (with 10 judgements per diff).
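With 10 judgements per diff, the individual annotations have to be combined into one value per diff before they can serve as a training signal. The following is only a minimal sketch of one possible way to do this, namely taking the share of annotators who flagged a diff as an attack. The function name `attack_fraction`, the `(diff_id, is_attack)` pair layout and the toy data are assumptions made for illustration; they are not the Detox project's actual schema or aggregation method.

```python
from collections import defaultdict


def attack_fraction(annotations):
    """Return, per diff, the share of annotators who flagged it as an attack.

    `annotations` is an iterable of (diff_id, is_attack) pairs.
    This layout is hypothetical, not the real Detox file format.
    """
    votes = defaultdict(list)
    for diff_id, is_attack in annotations:
        votes[diff_id].append(1 if is_attack else 0)
    return {diff_id: sum(v) / len(v) for diff_id, v in votes.items()}


# Toy data: two diffs with 10 judgements each (invented for this example).
sample = (
    [("diff_a", True)] * 7
    + [("diff_a", False)] * 3
    + [("diff_b", False)] * 10
)
scores = attack_fraction(sample)
```

A fraction keeps the disagreement between annotators visible, whereas a majority vote would reduce each diff to a single yes or no.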
For Algolit, a smaller subset of the Detox dataset was used. It was taken from Jigsaw's GitHub and contains both constructive and vandalizing edits.
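The word counts listed at the top of this page depend on how the text is split into words, which is not documented here. As a rough sketch, assuming lowercasing and a simple split on runs of word characters, figures of this kind could be computed as follows. The function name `word_stats` and the sample sentence are illustrative only.

```python
import re
from collections import Counter


def word_stats(text):
    """Return (number of words, number of unique words) for a text.

    Assumption: words are lowercased runs of word characters; the
    tokenization used for the figures on this page may differ.
    """
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    return len(tokens), len(counts)


total, unique = word_stats("You are wrong. You are very, very wrong!")
```

A different tokenizer, or keeping the original casing, would change both numbers, so such figures are only comparable across datasets when they are counted the same way.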