An Ethnography of Datasets: Difference between revisions
From Algolit
Revision as of 21:06, 21 March 2019
by Algolit
We often start the monthly Algolit meetings by searching for datasets or trying to create them. Sometimes we use already-existing corpora, made available through the Natural Language Toolkit (NLTK). NLTK contains, among others, the Universal Declaration of Human Rights, inaugural speeches by US presidents, and movie reviews from the popular site Internet Movie Database (IMDb). Each style of writing conjures different relations between the words and reflects the moment in time from which it originates. In this sense, this Python library for natural language processing could be regarded as a time capsule. The material that was selected for inclusion was deemed useful for at least one community, yet it is perceived as a universal default through the ease with which it is made available.
With this work, we look at the datasets most commonly used by data scientists to train machine learning algorithms. What material do they consist of? Who collected them? When?
Concept & interface: Cristina Cochior