i-could-have-written-that

i-could-have-written-that* is a practice-based research project about text-based machine learning, questioning the readerly nature of the techniques and proposing to represent them as writing machines. The project includes the poster series from Myth (-1.00) to Power (+1.00) and three writing-systems (writing from Myth (-1.00) to Power (+1.00), Supervised writing & Cosine Similarity morphs) that translate technical elements from machine learning into graphical user interfaces in the browser. The interfaces enable their users to explore the techniques and do a series of test-runs themselves with a textual data source of choice.

After processing the textual source of choice, the writing-systems offer the option to export their outputs to a PDF document.

from Myth (-1.00) to Power (+1.00)

'from myth (-1.00) to power (+1.00)' is a poster series and a linguistic mirror reflecting on the subject of certainty in text mining.

The series of statements is the product of a poetic translation exercise based on modality.py, a script included in the text mining software package Pattern (University of Antwerp). This rule-based script calculates the degree of certainty of a sentence, expressed as a value between -1.00 and +1.00. The concept of certainty is divided into nine values, each linked to a set of words, of which this set of nouns is an example:


	-1.00: d("fantasy", "fiction", "lie", "myth", "nonsense"),
	-0.75: d("controversy"),
	-0.50: d("criticism", "debate", "doubt"),
	-0.25: d("belief", "chance", "faith", "luck", "perception", "speculation"),
	 0.00: d("challenge", "guess", "feeling", "hunch", "opinion", "possibility", "question"),
	+0.25: d("assumption", "expectation", "hypothesis", "notion", "others", "team"),
	+0.50: d("example", "proces", "theory"),
	+0.75: d("conclusion", "data", "evidence", "majority", "proof", "symptom", "symptoms"),
	+1.00: d("fact", "truth", "power")


The posters are a poetic translation exercise, made from an interest in a numerical perception of human language, while bending its strict categories.

writing from Myth (-1.00) to Power (+1.00)

The writing-system writing from Myth (-1.00) to Power (+1.00) is based on the certainty-detection script modality.py, the same script the poster series is based on. The interface is a rule-based reading tool that highlights the effect of the rules written by the scientists at the University of Antwerp. The interface also offers the option to change the rules and create a custom reading-rule-set that is applied to a text of choice.

The default framework of rules in this writing-system comes from modality.py. The rules are extracted from the Bioscope dataset and Wikipedia articles provided by the CoNLL2010 Shared Task 1, combined with weasel words (words that are tagged by the Wikipedia community as terms that increase the vagueness of a text) and other words that seemed to make sense. A rough sketch of such a custom reading-rule-set follows below.
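As an illustration of what a custom reading-rule-set could look like; the words and weights here are hypothetical and not the defaults extracted from modality.py:

	# Hypothetical custom reading-rule-set: each word is linked to a
	# certainty value between -1.00 and +1.00.
	custom_rules = {
	    "myth": -1.00,
	    "doubt": -0.50,
	    "guess": 0.00,
	    "evidence": +0.75,
	    "fact": +1.00,
	}

	def annotate(text, rules):
	    """Yield each word of the text together with its certainty value, if it has one."""
	    for word in text.lower().split():
	        word = word.strip(".,;:!?")
	        if word in rules:
	            yield word, rules[word]

	text = "It is a myth that there is no doubt about this fact."
	for word, value in annotate(text, custom_rules):
	    print("{:+.2f}  {}".format(value, word))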

Supervised writing

The writing-system Supervised writing is built with a set of techniques that are often used in a supervised machine learning project. In a series of steps, the user is guided through a language processing system to create a custom counted-vocabulary writing exercise. Along the way, the user meets the bag-of-words counting principle by exploring its numerical view on human language. With the option to work with text material from three external input sources (Twitter, DuckDuckGo or Wikipedia), this writing-system offers an alternative numerical view on these well-known sources of textual data.
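A minimal sketch of the bag-of-words counting principle, using SciKit Learn's CountVectorizer; the input sentences are hypothetical placeholders for text collected from Twitter, DuckDuckGo or Wikipedia:

	# Sketch: turning a few sentences into a table of word counts.
	from sklearn.feature_extraction.text import CountVectorizer

	documents = [
	    "a numerical view on human language",
	    "language as a bag of counted words",
	]

	vectorizer = CountVectorizer()
	counts = vectorizer.fit_transform(documents)

	# Each document becomes a row of word counts: the numerical view
	# that the writing-system lets its users explore.
	# (get_feature_names_out() requires a recent scikit-learn; older
	# versions use get_feature_names() instead.)
	print(vectorizer.get_feature_names_out())
	print(counts.toarray())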

Cosine Similarity morphs

The writing-system Cosine Similarity morphs works with unsupervised similarity measurements on sentence level. The textual source of choice is first transformed into a corpus and a vector matrix, after which the cosine similarity function from SciKit Learn is applied. The cosine similarity function is often used in unsupervised machine learning practices to extract hidden semantic information from text. Since the textual data is shown to the computer without any label, this technique is often referred to as unsupervised learning.

The interface lets the user select from a set of possible counting methods, also called features, to create a spectrum of the four most similar sentences. While creating multiplicity as a result, the interface includes numerical information on the similarity calculations that have been made. The user, the cosine similarity function, the author of the text of choice, and the maker of this writing-system collectively create a quartet of sentences that morphs between a linguistic and a numerical understanding of similarity.
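A minimal sketch of the similarity measurement on sentence level, using SciKit Learn's CountVectorizer and cosine_similarity; the sentences and the choice of query sentence are hypothetical:

	# Sketch: build a vector matrix from a small corpus of sentences and
	# keep the four sentences most similar to the first one.
	from sklearn.feature_extraction.text import CountVectorizer
	from sklearn.metrics.pairwise import cosine_similarity

	sentences = [
	    "The writing-system measures similarity between sentences.",
	    "Similarity is measured between sentences by the machine.",
	    "A vector matrix is built from the textual source of choice.",
	    "Hidden semantic information is extracted from the text.",
	    "The user selects a counting method, also called a feature.",
	]

	# The counting method (feature) could also be a TfidfVectorizer;
	# here plain word counts are used.
	vectorizer = CountVectorizer()
	matrix = vectorizer.fit_transform(sentences)

	# Compare the first sentence against all others and keep the four
	# most similar ones, together with their similarity scores.
	scores = cosine_similarity(matrix[0], matrix)[0]
	ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
	for index, score in ranked[1:5]:
	    print("{:.2f}  {}".format(score, sentences[index]))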

Colophon

i-could-have-written-that is a project by Manetta Berends and is kindly supported by CBK Rotterdam. The project uses, among others, Python, SciKit Learn, Pattern, nltk, cgi & jinja2.


* The title 'i-could-have-written-that' is derived from the paper ELIZA--A Computer Program For the Study of Natural Language Communication Between Man and Machine, written by Joseph Weizenbaum and published in 1966.