Difference between revisions of "Data Workers"
|Line 59:||Line 59:|
===== Works =====
===== Works =====
* [[The Algoliterator]]
* [[The Algoliterator]]
* [[Words ]]
* [[Classifying the World]]
* [[Classifying the World]]
Revision as of 20:24, 19 March 2019
Exhibition at the Mundaneum in Mons from 28 March until 29 April 2019.
The opening is on Thursday 28 March from 18:00 until 22:00. As part of the exhibition, we have invited Allison Parrish, an algoliterary poet from New York. She will give a talk in Passa Porta on Thursday evening 25 April and a workshop in the Mundaneum on Friday 26 April.
Data Workers is an exhibition of algoliterary works, of stories told from an ‘algorithmic storyteller point of view’. The exhibition was created by members of Algolit, a group from Brussels involved in artistic research on algorithms and literature. Every month they gather to experiment with F/LOSS code and texts. Some works are by students of Arts² and external participants to the workshop on machine learning and text organized by Algolit in October 2018 at the Mundaneum.
Companies create artificial intelligence (AI) systems to serve, entertain, record and learn about humans. The work of these machinic entities is usually hidden behind interfaces and patents. In the exhibition, algorithmic storytellers leave their invisible underworld to become interlocutors. The data workers operate in different collectives. Each collective represents a stage in the design process of a machine learning model: there are the Writers, the Cleaners, the Informants, the Readers, the Learners and the Oracles. The boundaries between these collectives are not fixed; they are porous and permeable. At times, Oracles are also Writers. At other times Readers are also Oracles. Robots voice experimental literature, while algorithmic models read data, turn words into numbers, make calculations that define patterns and are able to endlessly process new texts ever after.
The exhibition foregrounds data workers who impact our daily lives, but are either hard to grasp and imagine or removed from the imagination altogether. It connects stories about algorithms in mainstream media to the storytelling that is found in technical manuals and academic papers. Robots are invited to engage in dialogue with human visitors and vice versa. In this way we might understand our respective reasonings, demystify each other's behaviour, encounter multiple personalities, and value our collective labour. It is also a tribute to the many machines that Paul Otlet and Henri La Fontaine imagined for their Mundaneum, showing their potential but also their limits.
Data Workers was created by Algolit.
Works by: Cristina Cochior, Gijs de Heij, Sarah Garcin, An Mertens, Javier Lloret, Louise Dekeuleneer, Florian Van de Weyer, Laetitia Trozzi, Rémi Forte, Guillaume Slizewicz, Michael Murtaugh, Manetta Berends, Mia Melvær.
Thanks to: Mike Kestemont, Michel Cleempoel, François Zajéga, Raphaèle Cornille, Vincent Desfromont, Kris Rutten, Anne-Laure Buisson, David Stampfli.
At the Mundaneum
In the late nineteenth century two young Belgian jurists, Paul Otlet (1868–1944), the 'father of documentation’, and Henri La Fontaine (1854-1943), statesman and Nobel Peace Prize winner, created the Mundaneum. The project aimed to gather all the world’s knowledge and to file it using the Universal Decimal Classification (UDC) system that they had invented. At first it was an International Institutions Bureau dedicated to international knowledge exchange. In the twentieth century the Mundaneum became a universal centre of documentation. Its collections are made up of thousands of books, newspapers, journals, documents, posters, glass plates and postcards indexed on millions of cross-referenced cards. The collections were exhibited and kept in various buildings in Brussels, including the Palais du Cinquantenaire. The remains of the archive only moved to Mons in 1998.
Based on the Mundaneum, the two men designed a World City for which Le Corbusier made scale models and plans. The aim of the World City was to gather, at a global level, the institutions of knowledge: libraries, museums and universities. This project was never realized. It suffered from its own utopia. The Mundaneum is the result of a visionary dream of what an infrastructure for universal knowledge exchange could be. It attained mythical dimensions at the time. When looking at the concrete archive that was developed, that collection is rather eclectic and specific.
Artifical intelligence systems today come with their own dreams of universality and knowledge production. When reading about these systems, the visionary dreams of their makers were there from the beginning of their development in the 1950s. Nowadays, their promise has also attained mythical dimensions. When looking at their concrete applications, the collection of tools is truly innovative and fascinating, but at the same time, rather eclectic and specific. For Data Workers, Algolit combined some of the applications with 10 per cent of the digitized publications of the International Institutions Bureau. In this way, we hope to poetically open up a discussion about machines, algorithms, and technological infrastructures.
Data workers need data to work with. The data that used in the context of Algolit is written language. Machine learning relies on many types of writing. Many authors write in the form of publications, such as books or articles. These are part of organized archives and are sometimes digitized. But there are other kinds of writing too. We could say that every human being who has access to the Internet is a writer each time they interact with algorithms. We chat, write, click, like and share. In return for free services, we leave our data that is compiled into profiles and sold for advertising and research purposes.
Machine learning algorithms are not critics: they take whatever they're given, no matter the writing style, no matter the CV of the author, no matter the spelling mistakes. In fact, mistakes make it better: the more variety, the better they learn to anticipate unexpected text. But often, human authors are not aware of what happens to their work.
Most of the writing we use is in English, some in French, some in Dutch. Most often we find ourselves writing in Python, the programming language we use. Algorithms can be writers too. Some neural networks write their own rules and generate their own texts. And for the models that are still wrestling with the ambiguities of natural language, there are human editors to assist them. Poets, playwrights or novelists start their new careers as assistants of AI.
Machine learning is mainly used to analyse and predict situations based on existing cases. In this exhibition we focus on machine learning models for text processing or Natural Language Processing' (NLP). These models have learned to perform a specific task on the basis of existing texts. The models are used for search engines, machine translations and summaries, spotting trends in new media networks and news feeds. They influence what you get to see as a user, but also have their say in the course of stock exchanges worldwide, the detection of cybercrime and vandalism, etc.
There are two main tasks when it comes to language understanding. Information extraction looks at concepts and relations between concepts. This allows for recognizing topics, places and persons in a text, summarization and questions & answering. The other task is text classification. You can train an oracle to detect whether an email is spam or not, written by a man or a woman, rather positive or negative.
In this zone you can see some of those models at work. During your further journey through the exhibition you will discover the different steps that a human-machine goes through to come to a final model.
Algolit chooses to work with texts that are free of copyright. This means that they have been published under a Creative Commons 4.0 license – which is rare - or that they are in the public domain because the author died more than 70 years ago. This is the case for the publications of the Mundaneum. We received 203 documents that we helped turn into datasets. They are now available for others online. Sometimes we had to deal with poor text formats, and we often dedicated a lot of time to cleaning up documents. We were not alone in doing this.
Books are scanned at high resolution, page by page. This is time-consuming, laborious human work and often the reason why archives and libraries transfer their collections and leave the job to companies like Google. The photos are converted into text via OCR (Optical Character Recognition), a software that recognizes letters, but often makes mistakes, especially when it has to deal with ancient fonts and wrinkled pages. Yet more wearisome human work is needed to improve the texts. This is often carried out by poorly-paid freelancers via micro-payment platforms like Amazon's Mechanical Turk; or by volunteers, like the community around the Distributed Proofreaders Project, which does fantastic work. Whoever does it, or wherever it is done, cleaning up texts is a towering job for which no structural automation yet exists.
Machine learning algorithms need guidance, whether they are supervised or not. In order to separate one thing from another, they need material to extract patterns from. One should carefully choose the study material, and adapt it to the machine's task. It doesn't make sense to train a machine with nineteenth-century novels if its mission is to analyse tweets. A badly written textbook can lead a student to give up on the subject altogether. A good textbook is preferably not a textbook at all.
This is where the dataset comes in: arranged as neatly as possible, organized in disciplined rows and lined-up columns, waiting to be read by the machine. Each dataset collects different information about the world, and like all collections, they are imbued with collectors' bias. You will hear this expression very often: 'data is the new oil'. If only data were more like oil! Leaking, dripping and heavy with fat, bubbling up and jumping unexpectedly when in contact with new matter. Instead, data is supposed to be clean. With each process, each questionnaire, each column title, it becomes cleaner and cleaner, chipping distinct characteristics until it fits the mould of the dataset.
Some datasets combine the machinic logic with the human logic. The models that require supervision multiply the subjectivities of both data collectors and annotators, then propagate what they've been taught. You will encounter some of the datasets that pass as default in the machine learning field, as well as other stories of humans guiding machines.
We communicate with computers through language. We click on icons that have a description in words, we tap words on keyboards, use our voice to give them instructions. Sometimes we trust our computer with our most intimate thoughts and forget that they are extensive calculators. A computer understands every word as a combination of zeros and ones. A letter is read as a specific ASCII number: capital 'A' is 001.
In all models, rule-based, classical machine learning, and neural networks, words undergo some type of translation into numbers in order to understand the semantic meaning of language. This is done through counting. Some models count the frequency of single words, some might count the frequency of combinations of words, some count the frequency of nouns, adjectives, verbs or noun and verb phrases. Some just replace the words in a text by their index numbers. Numbers optimize the operative speed of computer processes, leading to fast predictions, but they also remove the symbolic links that words might have. Here we present a few techniques that are dedicated to making text readable to a machine.
- The Book of Tomorrow in a Bag of Words
- Growing a tree
- Algorithmic readings of Bertillon's portrait parlé
Learners are the algorithms that distinguish machine learning practices from other types of practices. They are pattern finders, capable of crawling through data and generating some kind of specific 'grammar'. Learners are based on statistical techniques. Some need a large amount of training data in order to function, others can work with a small annotated set. Some perform well in classification tasks, like spam identification, others are better at predicting numbers, like temperatures, distances, stock market values, and so on.
The terminology of machine learning is not yet fully established. Depending on the field, whether statistics, computer science or the humanities, different terms are used. Learners are also called classifiers. When we talk about Learners, we talk about the interwoven functions that have the capacity to generate other functions, evaluate and readjust them to fit the data. They are good at understanding and revealing patterns. But they don't always distinguish well which of the patterns should be repeated.
In software packages, it is not always possible to distinguish the characteristic elements of the classifiers, because they are hidden in underlying modules or libraries. Programmers can invoke them using a single line of code. For this exhibition, we therefore developed three table games that show in detail the learning process of simple, but frequently used classifiers.
This is a non-exhaustive wordlist, based on terms that are frequently used in the exhibition. It might help visitors who are not familiar with the vocabulary related to the field of Natural Language Processing (NLP), Algolit or the Mundaneum.
* Algolit: a group from Brussels involved in artistic research on algorithms and literature. Every month they gather to experiment with code and texts that are published under free licenses. http://www.algolit.net
* Algoliterary: word invented by Algolit for works that explore the point of view of the algorithmic storyteller. What kind of new forms of storytelling do we make possible in dialogue with machinic agencies?
* Algorithm: A set of instructions in a specific programming language, that takes an input and produces an output.
* Annotation: The annotation process is a crucial step in supervised machine learning where the algorithm is given examples of what it needs to learn. A spam filter in training will be fed examples of spam and real messages. These examples are entries, or rows from the dataset with a label, spam or non-spam. The labelling of a dataset is work executed by humans, they pick a label for each row of the dataset. To ensure the quality of the labels multiple annotators see the same row and have to give the same label before an example is included in the training data.
* AI or artificial intelligences: In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and other animals. Computer science defines AI research as the study of ‘intelligent agents’: any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals. More specifically, Kaplan and Haenlein define AI as ‘a system’s ability to correctly interpret external data, to learn from such data, and to use those learnings to achieve specific goals and tasks through flexible adaptation’. Colloquially, the term ‘artificial intelligence’ is used to describe machines that mimic ‘cognitive’ functions that humans associate with other human minds, such as ‘learning’ and ‘problem solving’. (from Wikipedia: https://en.wikipedia.org/wiki/Artificial_intelligence)
* Bag of Words: The bag-of-words model is a simplifying representation of text used in Natural Language Processing (NLP). In this model, a text is represented as a collection of its unique words, disregarding grammar, punctuation and even word order. The model transforms the text into a unique list of words and how many times they're used in the text, or quite literally a bag of words. Bag of words is often used as a baseline, on which the new model has to perform better.
* Character n-gram: a technique that is used for authorship recognition. When using character n-grams, texts are considered as sequences of characters. Let's consider the character trigram. All the overlapping sequences of three characters are isolated. For example, the character 3-grams of 'Suicide', would be, 'Sui', 'uic', 'ici', 'cid' etc. Patterns found with character n-grams focus on stylistic choices that are unconsciously made by the author. The patterns remain stable over the full length of the text.
* Classical Machine Learning: Naive Bayes, Support Vector Machines and Linear Regression are called classical machine learning algorithms. They perform well when learning with small datasets. But they often require complex Readers. The task the Readers do, is also called feature-engineering (see below). This means that a human needs to spend time on a deep exploratory data analysis of the dataset.
* Constant: Constant is a non-profit, artist-run organisation based in Brussels since 1997 and active in the fields of art, media and technology. Algolit started as a project of Constant in 2012. http://constantvzw.org
* Data workers: artificial intelligences that are developed to serve, entertain, record and know about humans. The work of these machinic entities is usually hidden behind interfaces and patents. In the exhibition, algorithmic storytellers leave their invisible underworld to become interlocutors.
* Dump: According to the English dictionary, a dump is an accumulation of refuse and discarded materials or the place where such materials are dumped. In computing a dump refers to a ‘database dump’, a record of data from a database used for easy downloading or for backing up a database. Database dumps are often published by free software and free content projects, such as Wikipedia, to allow reuse or forking of the database.
* Feature engineering: the process of using domain knowledge of the data to create features that make machine learning algorithms work. This means that a human needs to spend time on a deep exploratory data analysis of the dataset. In Natural Language Processing (NLP) features can be the frequency of words or letters, but also syntactical elements like nouns, adjectives, or verbs. The most significant features for the task to be solved, must be carefully selected and passed over to the classical machine learning algorithm.
* FLOSS or Free Libre Open Source Software: software that anyone is freely licensed to use, copy, study, and change the software in any way, and the source code is openly shared so that people are encouraged to voluntarily improve the design of the software. This is in contrast to proprietary software, where the software is under restrictive copyright licensing and the source code is usually hidden from the users. (From Wikipedia: https://en.wikipedia.org/wiki/Free_and_open-source_software)
* git: a software system for tracking changes in source code during software development. It is designed for coordinating work among programmers, but it can be used to track changes in any set of files. Before starting a new project, programmers create a "git repository" in which they will publish all parts of the code. The git repositories of Algolit can be found on https://gitlab.constantvzw.org/algolit.
* gutenberg.org: Project Gutenberg is an online platform run by volunteers to ‘encourage the creation and distribution of eBooks’. It was founded in 1971 by American writer Michael S. Hart and is the oldest digital library. Most of the items in its collection are the full texts of public domain books. The project tries to make these as free as possible, in long-lasting, open formats that can be used on almost any computer. As of 23 June 2018, Project Gutenberg reached 57,000 items in its collection of free eBooks. (From Wikipedia:https://en.wikipedia.org/wiki/Project_Gutenberg)
* Henri La Fontaine: Henri La Fontaine (1854-1943) is a Belgian politician, feminist and pacifist. He was awarded the Nobel Peace Prize in 1913 for his involvement in the International Peace Bureau and his contribution to the organization of the peace movement. In 1895, together with Paul Otlet, he created the International Bibliography Institute, which became the Mundaneum. Within this institution, which aimed to bring together all the world's knowledge, he contributed to the development of the Universal Decimal Classification (CDU) system.
* Kaggle: an online platform where users find and publish data sets, explore and build machine learning models, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges. About half a million data scientists are active on Kaggle. It was founded by Goldbloom and Ben Hamner in 2010 and acquired by Google in March 2017.
* Literature: Algolit understands the notion of literature in the way a lot of other experimental authors do: it includes all linguistic production, from the dictionary to the Bible, from Virginia Woolf's entire work to all versions of Terms of Service published by Google since its existence.
* Machine learning models: Algorithms based on statistics, mainly used to analyse and predict situations based on existing cases. In this exhibition we focus on machine learning models for text processing or Natural language processing', in short, 'nlp'. These models have learned to perform a specific task on the basis of existing texts. The models are used for search engines, machine translations and summaries, spotting trends in new media networks and news feeds. They influence what you get to see as a user, but also have their word to say in the course of stock exchanges worldwide, the detection of cybercrime and vandalism, etc.
* Markov Chain: algorithm that scans the text for the transition probability of letter or word occurrences, resulting in transition probability tables which can be computed even without any semantic or grammatical natural language understanding. It can be used for analyzing texts, but also for recombining them. It is is widely used in spam generation.
* Mechanical Turk: The Amazon Mechanical Turk is an online platform for humans to execute tasks that algorithms cannot. Examples include annotating sentences as being positive or negative, spotting number plates, discriminating between face and non-face. The jobs posted on this platform are often paid less than a cent per task. Tasks that are more complex or require more knowledge can be paid up to several cents. Many academic researchers use Mechanical Turk as an alternative to have their students execute these tasks.
* Mundaneum: In the late nineteenth century two young Belgian jurists, Paul Otlet (1868-1944), ‘the father of documentation’, and Henri La Fontaine (1854-1943), statesman and Nobel Peace Prize winner, created The Mundaneum. The project aimed at gathering all the world’s knowledge and file it using the Universal Decimal Classification (UDC) system that they had invented.
* Natural Language: Following Wikipedia, "a natural language or ordinary language is any language that has evolved naturally in humans through use and repetition without conscious planning or premeditation. Natural languages can take different forms, such as speech or signing. They are different from constructed and formal languages such as those used to program computers or to study logic.
* NLP or Natural Language Processing: Natural language processing (NLP) is a collective term referring to automatic computational processing of human languages. This includes algorithms that take human-produced text as input, and attempt to generate text that resembles it.
* Neural Networks: computing systems inspired by the biological neural networks that constitute animal brains. The neural network itself is not an algorithm, but rather a framework for many different machine learning algorithms to work together and process complex data inputs. Such systems ‘learn’ to perform tasks by considering examples, generally without being programmed with any task-specific rules. For example, in image recognition, they might learn to identify images that contain cats by analyzing example images that have been manually labeled as ‘cat’ or ‘no cat’ and using the results to identify cats in other images. They do this without any prior knowledge about cats, for example, that they have fur, tails, whiskers and cat-like faces. Instead, they automatically generate identifying characteristics from the learning material that they process. (From Wikipedia: https://en.wikipedia.org/wiki/Artificial_neural_network)
* Optical Character Recognition (OCR): computer processes for translating images of scanned texts into manipulable text files.
* Oracle: Oracles are prediction or profiling machines, are a specific type of algorithmic models, mostly based on statistics. They are widely used in smartphones, computers, tablets.
* Oulipo: Oulipo stands for Ouvroir de litterature potentielle (Workspace for Potential Literature). Oulipo was created in Paris by the French writers Raymond Queneau and François Le Lionnais. They rooted their practice in the European avant-garde of the twentieth century and in the experimental tradition of the 1960s. For Oulipo, the creation of rules becomes the condition to generate new texts, or what they call potential literature. Later, in 1981, they also created [http://www.alamo.free.fr/ ALAMO, Atelier de littérature assistée par la mathématique et les ordinateurs (Workspace for literature assisted by maths and computers).
* Paul Otlet: Paul Otlet (1868 – 1944) was a Belgian author, entrepreneur, visionary, lawyer and peace activist; he is one of several people who have been considered the father of information science, a field he called "documentation". Otlet created the Universal Decimal Classification, that was widespread in libraries. Together with Henri La Fontaine he created the Palais Mondial (World Palace), later, the Mundaneum to house the collections and activities of their various organizations and institutes.
* Python: the main programming language that is globally used for natural language processing, was invented in 1991 by the Dutch programmer Guido Van Rossum.
* Rule-Based models: Oracles can be created using different techniques. One way is to manually define rules for them. As prediction models they are then called rule-based models, opposed to statistical models. Rule-based models are handy for tasks that are specific, like detecting when a scientific paper concerns a certain molecule. With very little sample data, they can perform well.
* Sentiment analysis: Also called 'opinion mining'. A basic task in sentiment analysis is classifying a given text as positive, negative, or neutral. Advanced, 'beyond polarity' sentiment classification looks, for instance, at emotional states such as 'angry', 'sad', and 'happy'. Sentiment analysis is widely applied to user materials such as reviews and survey responses, comments and posts on social media, and healthcare materials for applications that range from marketing to customer service, from stock exchange transactions to clinical medicine.
* Supervised machine learning models: for the creation of supervised machine learning models, humans annotate sample text with labels before feeding it to a machine to learn. Each sentence, paragraph or text is judged by at least 3 annotators: whether it is spam or not spam, positive or negative etc.
* Training data: Machine learning algorithms need guidance. In order to separate one thing from another, they need texts to extract patterns from. One should carefully choose the training material, and adapt it to the machine's task. It doesn't make sense to train a machine with nineteenth-century novels if its mission is to analyze tweets.
* Unsupervised Machine Learning Models: Unsupervised machine learning models don't need the step of annotation of the data by humans. This saves a lot of time, energy, money. Instead, they need a large amount of training data, which is not always available and can take a long cleaning time beforehand.
* Word embeddings: language modelling techniques that through multiple mathematical operations of counting and ordering, plot words into a multi-dimensional vector space. When embedding words, they transform from being distinct symbols into mathematical objects that can be multiplied, divided, added or substracted.
* Wordnet: Wordnet is a combination of a dictionary and a thesaurus that can be read by machines. According to Wikipedia it was created in the Cognitive Science Laboratory of Princeton University starting in 1985. The project was initially funded by the US Office of Naval Research and later also by other US government agencies including DARPA, the National Science Foundation, the Disruptive Technology Office (formerly the Advanced Research and Development Activity), and REFLEX.