An Ethnography of Datasets

From Algolit

 
What seems to be overlooked in the transfer of bias from a societal level to the machine level is the dataset as an intermediate stage in decision making: the parameters into which a social environment is boxed are determined by various factors. In the creation of datasets that form the basis on which computer models function, conflict and ambiguity are neglected in favour of making reality computable. Data collection is political, but its politics are rendered invisible in the way it is presented and visualised. Datasets are not a distilled version of reality, nor simply a technology in themselves. But like any technology, datasets encode their goal, their purpose and the world view of their makers.
 
  
Latest revision as of 22:06, 21 March 2019

by Algolit

We often start the monthly Algolit meetings by searching for datasets or trying to create them. Sometimes we use already-existing corpora, made available through the Natural Language Toolkit (NLTK, http://www.nltk.org/). NLTK contains, among others, The Universal Declaration of Human Rights, inaugural speeches from US presidents, and movie reviews from the popular site Internet Movie Database (IMDb). Each style of writing will conjure different relations between the words and will reflect the moment in time from which they originate. The material included in NLTK was selected because it was judged useful for at least one community of researchers. In spite of specificities related to the initial context of each document, they become universal documents by default, via their inclusion into a collection of publicly available corpora. In this sense, this Python package for natural language processing could be regarded as a time capsule. The main reason why The Universal Declaration of Human Rights was included may have been the multiplicity of its translations, but it also paints a picture of the types of human writing that algorithms train on.
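To make the point about style and word relations concrete, here is a minimal sketch using only the Python standard library. It counts which words occur close to each other in a text and compares the neighbours of one word across two styles of writing. The function names (`tokenize`, `neighbour_counts`), the window size and the review sentence are our own illustrative assumptions; the first sample is the opening sentence of Article 1 of the Universal Declaration of Human Rights. With NLTK installed, the same comparison could be run on texts loaded from its corpus collection instead of these two sentences.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def neighbour_counts(text, window=2):
    """Count how often two words appear within `window` tokens of each other."""
    tokens = tokenize(text)
    pairs = Counter()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                # Store each pair in alphabetical order so (a, b) == (b, a).
                pairs[tuple(sorted((word, other)))] += 1
    return pairs


# Two tiny stand-ins for corpora. The review sentence is invented for
# illustration and is not taken from any dataset.
samples = {
    "declaration": "All human beings are born free and equal in dignity and rights.",
    "review": "The film is free of dignity and the plot is equal parts dull and loud.",
}

for name, text in samples.items():
    pairs = neighbour_counts(text)
    related = sorted(
        other
        for pair in pairs
        if "free" in pair
        for other in pair
        if other != "free"
    )
    print(name, "-> words near 'free':", related)
```

Even at this scale, the word "free" keeps different company in each text, which is the kind of relation a model trained on one corpus or the other would absorb.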

With this work, we look at the datasets most commonly used by data scientists to train machine learning algorithms. What material do they consist of? Who collected them? When?


Concept & execution: Cristina Cochior