Difference between revisions of "An Ethnography of Datasets"

From Algolit

Line 1:

by Algolit

- In the transfer of bias from a societal level to the machine level, the dataset seems to be overlooked as an intermediate stage in decision-making: the parameters into which a social environment is boxed are determined by various factors. In the creation of datasets that form the basis on which computer models function, conflict and ambiguity are neglected in favour of making reality computable. Data collection is political, but its politics are rendered invisible in the way it is presented and visualized. Datasets are not a distilled version of reality, nor simply a technology in themselves. But like any technology, datasets encode their goal, their purpose and the world view of their makers.
+ We often start the monthly Algolit meetings by searching for datasets or trying to create them. One of the easiest ways is to use already-existing corpora, made available through toolkits like [nltk] or [scikit-learn]. NLTK contains, among others, the Universal Declaration of Human Rights, inaugural speeches from US presidents, and movie reviews from IMDb. Each style of writing will conjure different relations between the words.

- With this work, we look at the datasets most commonly used by data scientists to train machine algorithms. What material do they consist of? Who collected them? When? For what reason?
+ With this work, we look at the datasets most commonly used by data scientists to train machine algorithms. What material do they consist of? Who collected them? When?

------------------------------------

Revision as of 23:33, 19 March 2019

by Algolit

We often start the monthly Algolit meetings by searching for datasets or trying to create them. One of the easiest ways is to use already-existing corpora, made available through toolkits like [nltk] or [scikit-learn]. NLTK contains, among others, the Universal Declaration of Human Rights, inaugural speeches from US presidents, and movie reviews from IMDb. Each style of writing will conjure different relations between the words.
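As a rough illustration of how a style of writing shapes the relations between words, the sketch below counts which words appear near each other in two short texts. It is a minimal example using only the Python standard library, not the method used in this work: the helper names `tokenize` and `cooccurrences`, the window size, and both sample texts are assumptions made up for the illustration. No real corpus is loaded.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cooccurrences(tokens, window=2):
    """Count unordered word pairs that appear within `window` tokens of each other."""
    pairs = Counter()
    for i, word in enumerate(tokens):
        # Look only ahead, so each pair of positions is counted once.
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                pairs[tuple(sorted((word, other)))] += 1
    return pairs


# Made-up sample sentences (hypothetical, not taken from any real corpus).
legal_style = "Everyone has the right to life. Everyone has the right to liberty."
review_style = "The film was dull. The plot was dull and the acting was worse."

legal_pairs = cooccurrences(tokenize(legal_style))
review_pairs = cooccurrences(tokenize(review_style))

# Print the most frequent neighbouring pairs for each style.
print(legal_pairs.most_common(3))
print(review_pairs.most_common(3))
```

The same functions could be applied to the token list of any larger corpus. A pair that is frequent in one text can be absent from the other, which is one simple way to see that a dataset carries the habits of the writing it was built from.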

With this work, we look at the datasets most commonly used by data scientists to train machine algorithms. What material do they consist of? Who collected them? When?


Concept & interface: Cristina Cochior