Revision as of 10:28, 1 March 2019 by Cristina
Exhibition at the Mundaneum in Mons from 28 March to 29 April 2019.
Data Workers is an exhibition of algoliterary works, of stories told from an ‘algorithmic storyteller point of view’. The works are created by members of Algolit, a group from Brussels involved in artistic research on algorithms and literature. Every month they gather to experiment with F/LOSS code and texts.
Companies create artificial intelligences to serve, entertain, record and know about humans. The work of these machinic entities is usually hidden behind interfaces and patents. In the exhibition, algorithmic storytellers leave their invisible underworld to become interlocutors. The data workers operate in different collectives. Each collective represents a stage in the design process of a machine learning model: there are the Writers, the Cleaners, the Informants, the Readers, the Learners and the Oracles. The boundaries between these collectives are not fixed; they are porous and permeable. Sometimes oracles are also writers. Other times readers are also oracles. Robots voice experimental literature, algorithmic models read data, turn words into numbers, make calculations that define patterns and are able to endlessly process new texts ever after.
The exhibition foregrounds data workers who impact our daily lives, but are either hard to grasp and imagine or removed from the imaginary altogether. It connects stories about algorithms in mainstream media to the storytelling that is found in technical manuals and academic papers. Robots are invited to go into dialogue with human visitors and vice versa. In this way we might understand our respective reasonings, demystify each other's behaviour, encounter multiple personalities, and value our collective labour. It is also a tribute to the many machines that Paul Otlet and Henri La Fontaine imagined for their Mundaneum, showing their potential but also their limits.
Data Workers is a creation by Algolit.
Works by: Cristina Cochior, Gijs de Heij, Sarah Garcin, An Mertens, Javier Lloret, Louise Dekeuleneer, Florian Van de Weyer, Laetitia Trozzi, Rémi Forte, Guillaume Slizewicz, Michael Murtaugh, Manetta Berends, Mia Melvær.
Thanks to: Mike Kestemont, Michel Cleempoel, François Zajéga, Raphaèle Cornille, Kris Rutten, Anne-Laure Buisson, David Stampfli.
In the late nineteenth century two young Belgian jurists, Paul Otlet (1868-1944), ‘the father of documentation’, and Henri La Fontaine (1854-1943), statesman and Nobel Peace Prize winner, created The Mundaneum. The project aimed to gather all the world’s knowledge and file it using the Universal Decimal Classification (UDC) system that they had invented. At first it was an International Institutions Bureau dedicated to international knowledge exchange. In the 20th century the Mundaneum became a universal centre of documentation. Its collections are made up of thousands of books, newspapers, journals, documents, posters, glass plates and postcards indexed on millions of cross-referenced cards. The collections were exhibited and kept in various buildings in Brussels, including the Palais du Cinquantenaire. The remains of the archive only moved to Mons in 1998.
Based on the Mundaneum, the two men designed a World City for which Le Corbusier made scale models and plans. The aim of the World City was to gather, at a global level, the institutions of intellectual work: libraries, museums and universities. This project was never realised. It suffered from its own utopia. The Mundaneum is the result of a visionary dream of what an infrastructure for universal knowledge exchange could be. It attained mythical dimensions at the time. When looking at the concrete archive that was developed, that collection is rather eclectic and situated.
Artificial intelligences today come with their own dreams of universality and practice of knowledge. Reading about them, one finds that the visionary dreams of their makers have been there since the beginning of their development in the 1950s. Nowadays, their promise has also attained mythical dimensions. When looking at their concrete applications, the collection of tools is truly innovative and fascinating, but similarly, rather eclectic and situated. For Data Workers, Algolit combined some of the applications with 10% of the digitized publications of the International Institutions Bureau. In this way, we hope to poetically open up a discussion about machines, algorithms, and technological infrastructures.
Data workers need data to work with. The data used in the context of Algolit is written language. Machine learning relies on many types of writing. Many authors write in the form of publications, like books or articles. These are part of organised archives and are sometimes digitized. But there are other kinds of writing too. We could say that every human being who has access to the internet is a writer each time they interact with algorithms: adding reviews, writing emails or Wikipedia articles, clicking and liking.
Machine learning algorithms are not critics: they take whatever they're given, no matter the writing style, no matter the CV of the author, no matter their spelling mistakes. In fact, mistakes help: the more variety, the better they learn to anticipate unexpected text. But often, human authors are not aware of what happens to their work.
Most of the writing we use is in English, some is in French, some in Dutch. Most often we find ourselves writing in Python, the programming language we use. Algorithms can be writers too. Some neural networks write their own rules and generate their own texts. And for the models that are still wrestling with the ambiguities of natural language, there are human editors to assist them. Poets, playwrights or novelists start their new careers as assistants of AI.
Machine learning is mainly used to analyse and predict situations based on existing cases. In this exhibition we focus on machine learning models for text processing or 'natural language processing', in short 'NLP'. These models have learned to perform a specific task on the basis of existing texts. The models are used for search engines, machine translations and summaries, spotting trends in new media networks and news feeds. They influence what you get to see as a user, but also have their say in the course of stock exchanges worldwide, the detection of cybercrime and vandalism, etc.
There are two main tasks when it comes to language understanding. Information extraction looks at concepts and relations between concepts. This allows for recognizing topics, places and persons in a text, summarization and question answering. The other task is text classification. You can train an oracle to detect whether an email is spam or not, written by a man or a woman, rather positive or negative.
In this zone you can see some of those models at work. During your further journey through the exhibition you will discover the different steps that a human-machine goes through to come to a final model.
Algolit chooses to work with texts that are free of copyright. This means that they are published under a Creative Commons 4.0 license (which is rare), or that they are in the public domain because the author died more than 70 years ago. This is the case for the publications of the Mundaneum. We received 203 documents that we helped turn into datasets. They are now available for others online. Sometimes we have to deal with poor text formats, and we often dedicate a lot of time to cleaning up documents. We are not alone in this.
Books are scanned at high resolution, page by page. This is time-consuming, laborious human work and often the reason why archives and libraries transfer their collections and leave the job to companies like Google. The photos are converted into text via OCR (Optical Character Recognition), software that recognizes letters but often makes mistakes, especially when it has to deal with ancient fonts and wrinkled pages. Yet more wearisome human work is needed to improve the texts. This is often done by poorly paid freelancers via micro-payment platforms like Amazon's Mechanical Turk, or by volunteers, such as the community around the Distributed Proofreaders Project, which does fantastic work. Whoever does it, and wherever it is done, cleaning up texts is a towering job for which there is no structural automation yet.
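To give a sense of the small part of this cleaning that can be scripted, here is a minimal sketch in Python. The function name `clean_ocr_text` and the sample sentence are our own illustrations, not a tool used for the Mundaneum documents. The sketch only handles mechanical damage (line-break hyphens, stray line breaks, repeated spaces); misrecognized letters still need a human reader.

```python
import re

def clean_ocr_text(raw: str) -> str:
    """Apply a few mechanical repairs to OCR output (illustrative only)."""
    # Rejoin words split by a hyphen at a line break: "docu-\nments" -> "documents".
    # Caveat: this also glues genuine compounds such as "well-\nknown".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Replace single line breaks with spaces; blank lines stay as paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

# Hypothetical sample of scanned text with typical layout damage.
sample = "The Mundaneum gathers docu-\nments from   all over\nthe world.\n\nA second paragraph."
print(clean_ocr_text(sample))
```

Even this short script makes choices that can be wrong, which is one reason why human proofreading remains part of the process.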
Machine learning algorithms need guidance, whether they are supervised or not. In order to separate one thing from another, they need material to extract patterns from. One should carefully choose the study material, and adapt it to the machine's task. It doesn't make sense to train a machine with 19th-century novels if its mission is to analyze tweets. A badly written textbook can lead a student to give up on the subject altogether. A good textbook is preferably not a textbook at all.
This is where the dataset comes in: arranged as neatly as possible, organised in disciplined rows and lined up columns, waiting to be read by the machine. Each dataset collects different information about the world and, like all collections, is imbued with collectors' bias. You will hear this expression very often: 'data is the new oil'. If only data were more like oil! Leaking, dripping and heavy with fat, bubbling up and jumping unexpectedly when in contact with new matter. Instead, data is supposed to be clean. With each process, each questionnaire, each column title, it becomes cleaner and cleaner, chipping away distinct characteristics until it fits the mould of the dataset.
Some datasets combine the machinic logic with the logic of humans. The models that require supervision multiply the subjectivities of both data collectors and annotators, then propagate what they've been taught. You will encounter some of the datasets that pass as default in the machine learning field, as well as other stories of humans guiding machines.
We communicate with computers through language. We click on icons that have a description in words, we tap words on keyboards, use our voice to give them instructions. Sometimes we trust our computers with our most intimate thoughts and forget that they are extensive calculators. A computer understands every word as a combination of zeros and ones. A letter is read as a specific ASCII number: capital "A" is 65, stored as the bits 01000001.
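The translation from letters to numbers can be checked in a few lines of Python. In the ASCII table, capital "A" has the decimal code 65, which is 01000001 when written as eight bits. The helper names `to_codes` and `to_bits` are ours, chosen for this illustration.

```python
def to_codes(text: str) -> list[int]:
    """Return the ASCII code of every character in the text."""
    return [ord(character) for character in text]

def to_bits(text: str) -> list[str]:
    """Return every character as a string of eight zeros and ones."""
    return [format(byte, "08b") for byte in text.encode("ascii")]

print(to_codes("A"), to_bits("A"))
print(to_codes("data"))
print(to_bits("data"))
```

Note that `text.encode("ascii")` raises an error for characters outside the ASCII table, such as "é", which need a wider encoding like UTF-8.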
In all models, whether rule-based, classical machine learning or neural networks, words undergo some type of translation into numbers so that the model can process the semantic meaning of language. This is done through counting. Some models count the frequency of single words, some might count the frequency of combinations of words, some count the frequency of nouns, adjectives, verbs or noun and verb phrases. Some just replace the words in a text by their index numbers. Numbers optimize the operative speed of computer processes, leading to fast predictions, but they also remove the symbolic links that words might have. Here we present a few techniques that are dedicated to making text readable to a machine.
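Three of the counting strategies mentioned above can be sketched with Python's standard library: counting single words, counting combinations of two words, and replacing words by index numbers. The sample sentence and the alphabetical vocabulary are assumptions made for this example; real pipelines usually also handle punctuation and capitalisation.

```python
from collections import Counter

# A made-up sentence, split naively on whitespace.
text = "the data workers read the data and the workers count"
tokens = text.split()

# Frequency of single words.
word_counts = Counter(tokens)

# Frequency of combinations of two neighbouring words (bigrams).
bigram_counts = Counter(zip(tokens, tokens[1:]))

# Replace every word by its index in an alphabetically sorted vocabulary.
vocabulary = {word: index for index, word in enumerate(sorted(set(tokens)))}
indexed = [vocabulary[word] for word in tokens]

print(word_counts.most_common(3))
print(bigram_counts.most_common(1))
print(indexed)
```

Once the sentence has become a list of numbers, nothing in it recalls that "data" and "workers" belong together, which is the loss of symbolic links described above.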
Learners are the algorithms that distinguish machine learning practices from other types of practices. They are pattern finders, capable of crawling through data and generating some kind of specific 'grammar'. Learners are based on statistical techniques. Some need a large amount of training data in order to function, others can make do with a small annotated set. Some perform well in classification tasks, like spam identification, others are better at predicting numbers, like temperatures, distances, stock market values, and so on.
The terminology of machine learning is not yet fully established. Depending on the field, statistics, computer science or the humanities, varying terms are used. Learners are also called classifiers. When we talk about Learners, we talk about the interwoven functions that have the capacity to generate other functions, evaluate and readjust them to fit the data. They are good at understanding and revealing patterns. But they don't always distinguish well which of the patterns should be repeated.
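The idea of a function that generates another function, evaluates it and readjusts it can be sketched as follows. This is a minimal illustration, assuming a straight-line model fitted by gradient descent. The name `learn`, the learning rate, the number of steps and the toy measurements are all our own choices, not a method taken from the exhibition.

```python
def learn(xs, ys, rate=0.05, steps=2000):
    """Fit y = slope * x + intercept by gradient descent; return the fitted function."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Evaluate: how far off is the current function on each example?
        errors = [(slope * x + intercept) - y for x, y in zip(xs, ys)]
        # Readjust: move both parameters against the gradient of the mean squared error.
        slope -= rate * (2 / n) * sum(e * x for e, x in zip(errors, xs))
        intercept -= rate * (2 / n) * sum(errors)

    def predict(x):
        # The generated function: it only knows the pattern found in the data.
        return slope * x + intercept

    return predict

# Made-up measurements that follow y = 2x + 1.
hours = [0, 1, 2, 3]
temperatures = [1, 3, 5, 7]
model = learn(hours, temperatures)
print(round(model(4), 2))
```

The learner will repeat the line it found for any input, including values far outside what it has seen, whether or not the pattern still holds there.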
In software packages, it is often not possible to distinguish the characteristic elements of the classifiers, because they are hidden in underlying modules or libraries, which programmers can invoke using single lines of code. For this exhibition, we have therefore developed three table games that show the learning process of simple, but frequently used classifiers and their evaluators, in detail.
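To show what such a single line of code can hide, here is a sketch of one simple classifier written out in plain Python. We chose Naive Bayes with add-one smoothing as the example. This is our assumption for illustration and not necessarily one of the classifiers in the table games. The four training messages and the labels "spam" and "ham" are made up.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count labels and words per label from a list of (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary

def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Start from the share of training examples carrying this label.
        score = math.log(label_counts[label] / total)
        words_in_label = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            # Add-one smoothing keeps unseen word/label pairs above zero.
            score += math.log(
                (word_counts[label][word] + 1) / (words_in_label + len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training messages.
training_data = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting at the library", "ham"),
    ("notes from the meeting", "ham"),
]
model = train(training_data)
print(predict(model, "cheap money"))
print(predict(model, "library meeting notes"))
```

With a library, training and prediction would each take one line. Written out, it becomes visible that the classifier does little more than count words per label and compare the totals.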