  1. Islandora
  2. ISLANDORA-2012

Make OCR/HOCR derivative generation opt-out




      Currently, OCR/HOCR derivative generation is a globally configured setting that causes Tesseract to run on every page ingested into Islandora. However, there are cases where OCR and/or HOCR derivative generation is not appropriate for specific pages or groups of pages: for example, illustration pages, handwritten text, or pages whose OCR has already been generated by another application or by hand.

      Additionally, folks may want to generate less-than-perfect OCR to clean up later while skipping HOCR entirely, since HOCR is far more difficult to clean up and may not be wanted at all.

      In all of these cases, the only options currently are to toggle the global settings for each ingest or to remove the OCR and/or HOCR datastreams after the fact. The former is problematic, and the latter is a huge time sink.

      In forms where pages are added individually or in bulk to paged content objects, as well as in drush batches, we should make it possible to opt out of OCR and HOCR derivative generation independently of each other.
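
      The opt-out behaviour proposed here can be sketched roughly as follows. This is only an illustration of the decision logic, not Islandora code: the names DerivativeSettings, derivatives_to_generate, skip_ocr and skip_hocr are hypothetical, and the sketch assumes the global settings stay the default while each ingest (form submission or drush batch) may only switch a derivative off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DerivativeSettings:
    """Hypothetical stand-in for the global OCR/HOCR settings."""
    ocr: bool = True
    hocr: bool = True


def derivatives_to_generate(global_settings, skip_ocr=False, skip_hocr=False):
    """Return the datastream IDs to generate for one page or batch.

    The global setting remains the default. The per-ingest flags can only
    opt out; they never enable a derivative that is globally disabled.
    """
    wanted = []
    if global_settings.ocr and not skip_ocr:
        wanted.append("OCR")
    if global_settings.hocr and not skip_hocr:
        wanted.append("HOCR")
    return wanted
```

      For example, derivatives_to_generate(DerivativeSettings(), skip_hocr=True) returns only the OCR datastream, which covers the "rough OCR now, no HOCR" case described above.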




            • Assignee:
              dpinokrayon Diego Pino Navarro
              daitken Daniel Aitken
            • Votes:
              1
            • Watchers:
              4


              • Created: