From Algolit

Revision as of 16:27, 28 February 2019

Cleaning for Poems by Gijs de Heij

In a typical digitization process, the documents of an archive are scanned or photographed. These processes, however, produce only images. To make the documents searchable, the images are often transformed into text using Optical Character Recognition (OCR) software. If we want to use the archive of the Mundaneum to train machine learning models, we need these texts rather than the underlying images. Luckily, the documents were transformed into text when they were scanned. Unfortunately, the software often makes mistakes: it might recognize a wrong character, or it might get confused by a stain, an unusual font or the other side of the page shining through.
   In the case of the Mundaneum we recognized several kinds of mistakes: some characters are wrongly recognized, sometimes the software puts a space between every character, and often words are split in two where there was a line break. But most importantly, as the books were scanned in spreads (the left and right page together), the texts of the two pages are mixed up: the first sentence of the right page is put after the first sentence of the left page.
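   The kinds of mistakes listed above could be approached with a few simple heuristics. The Python sketch below is only an illustration and not the code used for the Mundaneum dataset: the function names are hypothetical, and it assumes that split words end in a hyphen at the line break and that the lines of a spread alternate strictly between the left and the right page, which real OCR output will often violate.

```python
import re

# Hypothetical helpers, for illustration only; not the project's actual code.

# Four or more single characters, each separated by exactly one space.
SPACED_RUN = re.compile(r"\b(?:\w ){3,}\w\b")


def collapse_spaced_letters(line):
    """Rejoin runs such as 'M u n d a n e u m' into 'Mundaneum'.

    Limitation: two spaced-out words separated by a single space
    would be merged into one word.
    """
    return SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), line)


def join_hyphenated_linebreaks(text):
    """Rejoin words split at a line break.

    Assumption: the first half of the word ends in a hyphen.
    """
    return re.sub(r"(\w)-\n\s*(\w)", r"\1\2", text)


def deinterleave_spread(lines):
    """Separate the mixed-up lines of a scanned spread.

    Assumption: lines alternate strictly (left, right, left, right, ...).
    Returns all left-page lines followed by all right-page lines.
    """
    left = lines[0::2]
    right = lines[1::2]
    return left + right


if __name__ == "__main__":
    print(collapse_spaced_letters("the M u n d a n e u m archive"))
    print(join_hyphenated_linebreaks("docu-\nments are scanned"))
    print(deinterleave_spread(["left 1", "right 1", "left 2", "right 2"]))
```

   Heuristics like these inevitably misfire on some inputs, for example when one page of a spread has more lines than the other, which is why their results still need to be verified by a reader.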
   In this interface we ask you to help us clean up our dataset. We show you what we detected as separate pages or as mistakes and ask you to verify or improve our solution. Your corrections are used directly in the retraining of the model, and will also be part of the dataset when it is published.