  1. DSpace
  2. DS-183

XPDF support for filtering PDFs for text extraction/search.

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.1, 1.5.2
    • Fix Version/s: 1.5.2
    • Component/s: DSpace API
    • Labels:
      None
    • Environment:
      Unix and Linux
    • Attachments:
      3
    • Comments:
      5

      Description

      See original description here...

      https://sourceforge.net/tracker/?func=detail&aid=2745393&group_id=19984&atid=319984

      Here is a pair of media filters that process PDF files with the
      XPDF suite (see http://www.foolabs.com/xpdf/ ), replacing the
      one based on PDFBox. They invoke an external command, which
      must be configured. The filters have been tested on Unix, and the
      concept ought to work on Windows (and certainly on MacOS X).

      XPDF2Text is a replacement for the existing PDF media filter. It
      extracts text using the pdftotext program. I've observed it to be
      about 3 times as fast as, and much more reliable than, PDFBox.
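
      As a rough illustration of the approach described above (this is
      not the code in XPDFFilters.patch), a filter of this kind could
      shell out to pdftotext roughly as follows. The class and method
      names are hypothetical, and the executable path is passed in as a
      parameter as a stand-in for whatever configuration mechanism the
      real filter uses. The -q and -enc options and the "-" output
      argument are standard pdftotext usage.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; NOT the XPDF2Text class shipped in the patch.
final class PdfToTextSketch {

    // Build the pdftotext argument list. The executable path is a parameter
    // because the external command has to be configured per installation.
    static List<String> buildCommand(String pdftotextPath, Path pdf) {
        return Arrays.asList(
                pdftotextPath,
                "-q",             // do not print messages or errors
                "-enc", "UTF-8",  // encoding of the extracted text
                pdf.toString(),
                "-");             // "-" writes the text to stdout
    }

    // Run pdftotext and return the extracted text, failing on a non-zero exit.
    static String extractText(String pdftotextPath, Path pdf)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(pdftotextPath, pdf));
        // Let stderr pass through so an unread pipe cannot block the child.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = pb.start();

        ByteArrayOutputStream text = new ByteArrayOutputStream();
        try (InputStream in = process.getInputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                text.write(buffer, 0, n);
            }
        }

        int status = process.waitFor();
        if (status != 0) {
            throw new IOException("pdftotext exited with status " + status);
        }
        return new String(text.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

      Keeping the command construction separate from process handling
      makes the argument list easy to check without having XPDF
      installed.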

      XPDF2Thumbnail creates a thumbnail image for the first page of
      the PDF. This is especially effective for 3D PDF renderings of
      engineering models, but works fine for any document.
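
      The thumbnail step can be sketched in the same hedged way. The
      description does not say which XPDF program the filter calls, so
      the use of pdftoppm here is an assumption, as are the class and
      method names. The sketch only builds a command that renders page 1
      and computes a thumbnail size that preserves the aspect ratio.
      Decoding the rendered PPM file is left out, since that is where
      the additional image library would come in.

```java
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; NOT the XPDF2Thumbnail class shipped in the patch.
final class PdfThumbnailSketch {

    // Assumption: pdftoppm renders the page. -f and -l select the first and
    // last page to render, so "-f 1 -l 1" limits output to the first page.
    static List<String> buildFirstPageCommand(String pdftoppmPath, Path pdf, Path outputRoot) {
        return Arrays.asList(
                pdftoppmPath,
                "-q",                   // do not print messages or errors
                "-f", "1",
                "-l", "1",
                pdf.toString(),
                outputRoot.toString()); // root of the generated PPM file name
    }

    // Scale (width, height) down to fit inside the given box while keeping
    // the aspect ratio. Images that already fit are not enlarged.
    static int[] fitWithin(int width, int height, int maxWidth, int maxHeight) {
        if (width <= 0 || height <= 0 || maxWidth <= 0 || maxHeight <= 0) {
            throw new IllegalArgumentException("dimensions must be positive");
        }
        double scale = Math.min(1.0,
                Math.min((double) maxWidth / width, (double) maxHeight / height));
        int w = Math.max(1, (int) Math.round(width * scale));
        int h = Math.max(1, (int) Math.round(height * scale));
        return new int[] {w, h};
    }
}
```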

      See the instructions in xpdf-filters.html to install it.
      The thumbnail filter needs an additional image library, but
      the text extractor doesn't need anything else.

      This code has been tested with DSpace 1.5.1.

        Attachments

        1. xpdf-filters.html
          9 kB
        2. XPDFFilters.patch
          21 kB
        3. xpdf-filters.xml
          6 kB

          Activity

            People

            Assignee:
            mdiggory Mark Diggory
            Reporter:
            mdiggory Mark Diggory
            Votes:
            0
            Watchers:
            1
