DSpace / DS-704

Update pdfbox library to improve performance and out-of-box support for pdf extraction

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.7.0
    • Component/s: DSpace API
    • Labels: None
    • Attachments: 2
    • Comments: 5
    • Documentation Status: Not Required

      Description

      We have found that updating the pdfbox library to the latest stable version (1.2.1) solves all of our current issues with PDF text extraction and improves performance.
      This could help people who want to rely on the DSpace "out-of-box" PDF extractor without using XPDF.

      Below are some samples of the exceptions that go away after updating the pdfbox version. A patch against trunk r5439 is attached.

      ===

      java.io.IOException: Error: Could not find font(COSName{F1.0}) in map={}
      at org.pdfbox.util.operator.SetTextFont.process(SetTextFont.java:83)
      at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
      at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
      at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
      at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
      at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
      at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
      at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)

      ===

      java.lang.ClassCastException: org.pdfbox.cos.COSArray cannot be cast to org.pdfbox.cos.COSDictionary
      at org.pdfbox.filter.FlateFilter.decode(FlateFilter.java:70)
      at org.pdfbox.cos.COSStream.doDecode(COSStream.java:290)
      at org.pdfbox.cos.COSStream.doDecode(COSStream.java:243)
      at org.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:170)
      at org.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:101)
      at org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:132)
      at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:202)
      at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
      at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
      at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
      at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
      at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)

      ===

      java.io.IOException: Unknown colorspace array type:COSName{DeviceRGB}
      at org.pdfbox.pdmodel.graphics.color.PDColorSpaceFactory.createColorSpace(PDColorSpaceFactory.java:116)
      at org.pdfbox.pdmodel.PDResources.getColorSpaces(PDResources.java:264)
      at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:193)
      at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
      at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
      at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
      at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
      at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)

      ===

      java.lang.NullPointerException
      at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:194)
      at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:182)
      at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:226)
      at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
      at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)

      ===
      java.util.zip.ZipException: unknown compression method
      at java.util.zip.InflaterInputStream.read(Unknown Source)
      at org.pdfbox.filter.FlateFilter.decode(FlateFilter.java:97)
      at org.pdfbox.cos.COSStream.doDecode(COSStream.java:290)
      at org.pdfbox.cos.COSStream.doDecode(COSStream.java:235)
      at org.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:170)
      at org.pdfbox.pdfparser.PDFObjectStreamParser.<init>(PDFObjectStreamParser.java:66)
      at org.pdfbox.cos.COSDocument.dereferenceObjectStreams(COSDocument.java:450)
      at org.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:908)
      at org.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:489)
      at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:204)
      at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)

      ===

      java.lang.ArrayIndexOutOfBoundsException
      at java.lang.System.arraycopy(Native Method)
      at java.io.PushbackInputStream.unread(Unknown Source)
      at org.pdfbox.pdfparser.BaseParser.parseCOSString(BaseParser.java:524)
      at org.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:873)
      at org.pdfbox.pdfparser.PDFObjectStreamParser.parse(PDFObjectStreamParser.java:94)
      at org.pdfbox.cos.COSDocument.dereferenceObjectStreams(COSDocument.java:451)
      at org.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:908)
      at org.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:489)
      at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:204)
      at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)

      ===

      java.io.EOFException: Unexpected end of ZLIB input stream
      at java.util.zip.InflaterInputStream.fill(Unknown Source)
      at java.util.zip.InflaterInputStream.read(Unknown Source)
      at org.pdfbox.filter.FlateFilter.decode(FlateFilter.java:97)
      at org.pdfbox.cos.COSStream.doDecode(COSStream.java:290)
      at org.pdfbox.cos.COSStream.doDecode(COSStream.java:235)
      at org.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:170)
      at org.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:101)
      at org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:132)
      at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:202)
      at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
      at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
      at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
      at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
      at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)
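
The last trace above is raised by the JDK's `java.util.zip.InflaterInputStream`, which the stack frames show being called from pdfbox's `FlateFilter` while it decodes a Flate-compressed stream. As a minimal, standard-library-only sketch (not pdfbox code, and not the attached patch), the following reproduces the same exception by inflating a deliberately truncated zlib stream. The class `TruncatedFlateDemo` and its method names are hypothetical, chosen only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical demo class: shows how a truncated Flate (zlib) stream fails in the JDK.
public class TruncatedFlateDemo {

    // Compress raw bytes into a complete zlib stream.
    static byte[] deflate(byte[] raw) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(sink)) {
            out.write(raw);
        }
        return sink.toByteArray();
    }

    // Inflate the given bytes to the end; return "ok" or a description of the failure.
    static String inflateOutcome(byte[] compressed) {
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[256];
            while (in.read(buffer) != -1) {
                // Discard the decoded bytes; only the outcome matters here.
            }
            return "ok";
        } catch (IOException e) {
            return e.getClass().getName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        // Arbitrary sample payload, loosely shaped like a PDF content stream.
        byte[] full = deflate("BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.US_ASCII));
        // Keep only the first half of the compressed bytes to simulate a cut-off stream.
        byte[] cut = Arrays.copyOf(full, full.length / 2);

        System.out.println("complete stream:  " + inflateOutcome(full));
        System.out.println("truncated stream: " + inflateOutcome(cut));
    }
}
```

A PDF whose compressed stream is cut short in this way would hand the same exception to whichever library decodes it. How a given pdfbox version recovers from that condition is outside what this sketch shows.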

            People

            • Assignee: Andrea Bollini (bollini)
            • Reporter: Andrea Bollini (bollini)