WikiHarass: Difference between revisions
From Algolit
Latest revision as of 13:55, 2 November 2017
Type: Dataset
Number of words: 1,039,789
Unique words: 64,136
Source: English Wikipedia
Developed by: Wikimedia Foundation
The Detox dataset (https://meta.wikimedia.org/wiki/Research:Detox) comes out of a project by Wikimedia and Perspective API to train a neural network that detects the level of toxicity of a comment.
The original dataset (https://figshare.com/projects/Wikipedia_Talk/16731) consists of:
- A corpus of all 95 million user and article talk diffs made between 2001 and 2015, each scored by the personal attack model.
- A human-annotated dataset of 1 million crowd-sourced annotations covering 100,000 talk page diffs (10 judgements per diff).
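With 10 judgements per diff, the annotations have to be combined into one score per diff before they can serve as training labels. The following is a minimal sketch of one common way to do that: taking the fraction of annotators who flagged a diff. The function name, the `(diff_id, is_attack)` tuple layout, the toy data and the 0.5 threshold are all assumptions made for illustration; this is not necessarily how the Detox project aggregated its annotations.

```python
from collections import defaultdict


def aggregate_judgements(annotations):
    """Return, for each diff, the fraction of annotators who flagged it.

    `annotations` is an iterable of (diff_id, is_attack) pairs.
    This layout is hypothetical, not the actual Detox file format.
    """
    votes = defaultdict(list)
    for diff_id, is_attack in annotations:
        votes[diff_id].append(1 if is_attack else 0)
    return {diff_id: sum(v) / len(v) for diff_id, v in votes.items()}


# Toy data (invented): two diffs with ten judgements each.
annotations = [("diff_a", i < 7) for i in range(10)]
annotations += [("diff_b", i < 2) for i in range(10)]

scores = aggregate_judgements(annotations)

# Assumed threshold: a diff counts as an attack if most annotators flagged it.
labels = {diff_id: score > 0.5 for diff_id, score in scores.items()}
```

Keeping the fraction rather than only the majority label preserves how strongly the annotators agreed, which can be useful when a model is meant to predict a level of toxicity rather than a yes/no answer.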
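Algolit used a smaller section of the Detox dataset, taken from Jigsaw's GitHub (https://conversationai.github.io/wikidetox/testdata/tox-sorted/Wikipedia%20Toxicity%20Sorted%20%28Toxicity%405%5BAlpha%5D%29.html); it contains both constructive and vandalizing edits.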
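The figures listed for this dataset (number of words and unique words) are the kind of statistics that can be computed with a few lines of code. The sketch below shows one way to do it, assuming a simple lowercase tokenizer based on the regular expression `\w+`. The tokenization Algolit actually used is not documented here, so a different tokenizer would give somewhat different counts; the function name and sample text are hypothetical.

```python
import re
from collections import Counter


def word_stats(text):
    """Return (total_words, unique_words) for a text.

    Tokenization is an assumption: lowercase the text and treat each
    run of word characters as one word.
    """
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    return len(tokens), len(counts)


# Invented sample text, standing in for the comments in the dataset.
sample = "Thanks for the edit. The edit was helpful."
total, unique = word_stats(sample)
```

For a corpus stored in a file, the same function could be applied to the file's contents after reading it with `open(path, encoding="utf-8").read()`.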