|Number of words:||1.039.789|
|Developed by:||Wikimedia Foundation|
The original dataset consists of:
- A corpus of all 95 million user and article talk diffs made between 2001 and 2015, each scored by the personal attack model.
- A human-annotated dataset of 1 million crowd-sourced annotations covering 100,000 talk page diffs (with 10 judgements per diff).
For Algolit, a smaller subset of the Detox dataset was used, taken from Jigsaw's GitHub repository. This subset contains both constructive and vandalizing edits.