An Ethnography of Datasets
Revision as of 12:29, 20 March 2019
by Algolit
We often start the monthly Algolit meetings by searching for datasets or trying to create them. One of the easiest ways is to use existing corpora, made available through toolkits like nltk (http://www.nltk.org/) or scikit-learn (https://scikit-learn.org). NLTK contains, among others, the Universal Declaration of Human Rights, inaugural speeches from US presidents, and movie reviews from IMDb. Each style of writing conjures different relations between the words.
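As a rough illustration of how such a corpus might be opened and how "relations between the words" could be inspected, here is a minimal Python sketch. It assumes nltk is installed and that the corpus can be downloaded; the helper names `load_corpus_words` and `neighbours`, and the choice of a simple co-occurrence window, are our own assumptions for illustration and not part of the Algolit work.

```python
from collections import Counter


def load_corpus_words(name):
    """Return lowercased alphabetic tokens from an NLTK corpus.

    `name` is an NLTK corpus identifier such as "inaugural" or
    "movie_reviews". Assumes nltk is installed (pip install nltk)
    and that the corpus data can be fetched.
    """
    import nltk  # imported here so the rest of the sketch runs without nltk

    nltk.download(name, quiet=True)
    corpus = getattr(nltk.corpus, name)
    return [w.lower() for w in corpus.words() if w.isalpha()]


def neighbours(words, target, window=2):
    """Count the words that appear within `window` positions of `target`.

    A deliberately simple stand-in for "relations between words":
    plain co-occurrence counts, nothing more.
    """
    counts = Counter()
    for i, word in enumerate(words):
        if word == target:
            start = max(0, i - window)
            counts.update(words[start:i])                   # words before
            counts.update(words[i + 1:i + 1 + window])      # words after
    return counts
```

Calling `neighbours(load_corpus_words("inaugural"), "people")` and then the same call with `"movie_reviews"` would be one way to compare how a single word sits among different neighbours in two styles of writing; the target word here is only an example.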
With this work, we look at the datasets most commonly used by data scientists to train machine learning algorithms. What material do they consist of? Who collected them? When?
Concept & interface: Cristina Cochior