
Cleaning for Poems: Difference between revisions

From Algolit

Line 1: Line 1:
 
by Algolit
 
 
    
 
    
In a typical digitization process, the documents of an archive are scanned or photographed. These processes, however, produce only pictures. To make the documents searchable, they are often transformed into text using Optical Character Recognition (OCR) software. If we want to use the archive of the Mundaneum to train machine learning models, we need these texts rather than the underlying pictures. Luckily, the documents were transformed into text when they were scanned. Unfortunately, the software often makes mistakes: it might recognize the wrong character, or it might get confused by a stain, an unusual font, or the other side of the page shining through.
+
For this exhibition we're working with 3% of the Mundaneum's archive. These documents were first scanned or photographed. To make them searchable, they were transformed into text using Optical Character Recognition (OCR) software. OCR systems are algorithmic models that are trained on other texts. They have learned to identify characters, words, sentences and paragraphs.
 
+
The software often makes 'mistakes'. It might recognize the wrong character, or it might get confused by a stain, an unusual font, or the other side of the page shining through.
In the case of the Mundaneum we recognized several kinds of mistakes: some characters are wrongly recognized, sometimes a space is put between every character, and often words are split up where there was a line break. Most importantly, because the books were scanned as spreads, with the left and right pages together, the texts are mixed up: the first sentence of the right page is put after the first sentence of the left page.
+
These mistakes can also be seen as poetic interpretations by the algorithm. They tell us something about how it has been constructed, what it has learned from, what its standards are, and how you can explore the limits of a machine. In this installation you can choose how you treat the algorithm's misreadings, pick your degree of poetic cleanness, print your poem and take it home.
 
 
In this interface we ask you to help us clean up our dataset. We show you what we detected as separate pages or mistakes and ask you to verify or improve our solution. Your corrections are used directly in retraining the model and will also be part of the dataset when it is published.
 
  
 
------------------------------------------
 

Revision as of 20:11, 4 March 2019

by Algolit

For this exhibition we're working with 3% of the Mundaneum's archive. These documents were first scanned or photographed. To make them searchable, they were transformed into text using Optical Character Recognition (OCR) software. OCR systems are algorithmic models that are trained on other texts. They have learned to identify characters, words, sentences and paragraphs. The software often makes 'mistakes'. It might recognize the wrong character, or it might get confused by a stain, an unusual font, or the other side of the page shining through. These mistakes can also be seen as poetic interpretations by the algorithm. They tell us something about how it has been constructed, what it has learned from, what its standards are, and how you can explore the limits of a machine. In this installation you can choose how you treat the algorithm's misreadings, pick your degree of poetic cleanness, print your poem and take it home.
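
As a rough illustration of what choosing a "degree of poetic cleanness" could mean in code, the Python sketch below repairs two of the misreadings named in the earlier revision above: a space put between every character, and words split at a line break. It is not the installation's actual code. The function names, the numeric `level` parameter and the assumption that a split word ends in a hyphen are all hypothetical.

```python
import re

# Three or more single characters separated by single spaces,
# e.g. "a r c h i v e". Crude on purpose: a one-letter word directly
# before such a run would be swallowed into it.
SPACED_LETTERS = re.compile(r"\b(?:\w ){2,}\w\b")

# A word character, a hyphen, a newline, then another word character.
# Assumption: words split at a line break end in a hyphen.
SPLIT_WORD = re.compile(r"(\w)-\n(\w)")


def join_spaced_letters(text: str) -> str:
    """Remove the spaces inside runs of single, space-separated characters."""
    return SPACED_LETTERS.sub(lambda m: m.group(0).replace(" ", ""), text)


def rejoin_line_breaks(text: str) -> str:
    """Glue together words that were hyphenated across a line break."""
    return SPLIT_WORD.sub(r"\1\2", text)


def clean(text: str, level: int) -> str:
    """Apply more repairs as the (hypothetical) cleanness level rises.

    level 0: leave the OCR output untouched
    level 1: rejoin words split at a line break
    level 2: additionally join letters separated by spaces
    """
    if level >= 1:
        text = rejoin_line_breaks(text)
    if level >= 2:
        text = join_spaced_letters(text)
    return text


if __name__ == "__main__":
    sample = "docu-\nments of the a r c h i v e"
    for level in range(3):
        print(level, repr(clean(sample, level)))
```

At level 0 the misreadings stay in the poem exactly as the OCR software produced them; each higher level removes one more kind of trace.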


Concept, code, interface: Gijs de Heij