  1. DSpace
  2. DS-704

Update pdfbox library to improve performance and out-of-box support for pdf extraction

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.7.0
    • Component/s: DSpace API
    • Labels:
      None
    • Attachments:
      2
    • Comments:
      5
    • Documentation Status:
      Not Required

      Description

      We have found that updating the pdfbox library to the latest stable version (1.2.1) solves all our current issues with PDF text extraction and improves performance.
      This could help people who want to rely on the DSpace "out-of-box" PDF extractor without using XPDF.

      Below are some samples of exceptions that go away after updating the pdfbox version. A patch against trunk r5439 is attached.

      ===

      java.io.IOException: Error: Could not find font(COSName{F1.0}) in map={}
      at org.pdfbox.util.operator.SetTextFont.process(SetTextFont.java:83)
      at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
      at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
      at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
      at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
      at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
      at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
      at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)

      ===

      java.lang.ClassCastException: org.pdfbox.cos.COSArray cannot be cast to org.pdfbox.cos.COSDictionary
      at org.pdfbox.filter.FlateFilter.decode(FlateFilter.java:70)
      at org.pdfbox.cos.COSStream.doDecode(COSStream.java:290)
      at org.pdfbox.cos.COSStream.doDecode(COSStream.java:243)
      at org.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:170)
      at org.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:101)
      at org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:132)
      at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:202)
      at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
      at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
      at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
      at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
      at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)

      ===

      java.io.IOException: Unknown colorspace array type:COSName{DeviceRGB}
      at org.pdfbox.pdmodel.graphics.color.PDColorSpaceFactory.createColorSpace(PDColorSpaceFactory.java:116)
      at org.pdfbox.pdmodel.PDResources.getColorSpaces(PDResources.java:264)
      at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:193)
      at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
      at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
      at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
      at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
      at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)

      ===

      java.lang.NullPointerException
      at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:194)
      at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:182)
      at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:226)
      at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
      at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)

      ===
      java.util.zip.ZipException: unknown compression method
      at java.util.zip.InflaterInputStream.read(Unknown Source)
      at org.pdfbox.filter.FlateFilter.decode(FlateFilter.java:97)
      at org.pdfbox.cos.COSStream.doDecode(COSStream.java:290)
      at org.pdfbox.cos.COSStream.doDecode(COSStream.java:235)
      at org.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:170)
      at org.pdfbox.pdfparser.PDFObjectStreamParser.<init>(PDFObjectStreamParser.java:66)
      at org.pdfbox.cos.COSDocument.dereferenceObjectStreams(COSDocument.java:450)
      at org.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:908)
      at org.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:489)
      at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:204)
      at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)

      ===

      java.lang.ArrayIndexOutOfBoundsException
      at java.lang.System.arraycopy(Native Method)
      at java.io.PushbackInputStream.unread(Unknown Source)
      at org.pdfbox.pdfparser.BaseParser.parseCOSString(BaseParser.java:524)
      at org.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:873)
      at org.pdfbox.pdfparser.PDFObjectStreamParser.parse(PDFObjectStreamParser.java:94)
      at org.pdfbox.cos.COSDocument.dereferenceObjectStreams(COSDocument.java:451)
      at org.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:908)
      at org.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:489)
      at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:204)
      at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)

      ===

      java.io.EOFException: Unexpected end of ZLIB input stream
      at java.util.zip.InflaterInputStream.fill(Unknown Source)
      at java.util.zip.InflaterInputStream.read(Unknown Source)
      at org.pdfbox.filter.FlateFilter.decode(FlateFilter.java:97)
      at org.pdfbox.cos.COSStream.doDecode(COSStream.java:290)
      at org.pdfbox.cos.COSStream.doDecode(COSStream.java:235)
      at org.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:170)
      at org.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:101)
      at org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:132)
      at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:202)
      at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
      at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
      at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
      at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
      at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:139)
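
The traces above all come from the old org.pdfbox package, while the traces reported later in this issue, after the patch was applied, come from org.apache.pdfbox. One possible way to confirm which generation a deployment actually loads is to probe the classpath for both. The following is only a sketch: the class PdfBoxProbe and its methods are hypothetical helpers, not part of DSpace or PDFBox, and the two probed class names are simply the ones that appear in the stack traces in this issue.

```java
/** Reports which PDFBox package generation, if any, is visible on the classpath. */
final class PdfBoxProbe {

    // Class names as they appear in the stack traces in this issue.
    static final String OLD_CLASS = "org.pdfbox.pdmodel.PDDocument";
    static final String NEW_CLASS = "org.apache.pdfbox.pdmodel.PDDocument";

    /** Returns true if the named class can be located (without initializing it). */
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, PdfBoxProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Turns the two probe results into a short human-readable summary. */
    static String describe(boolean oldPresent, boolean newPresent) {
        if (oldPresent && newPresent) {
            return "both (a stale pre-upgrade jar may still be deployed)";
        }
        if (newPresent) {
            return "org.apache.pdfbox (upgraded)";
        }
        if (oldPresent) {
            return "org.pdfbox (pre-upgrade)";
        }
        return "none";
    }

    public static void main(String[] args) {
        System.out.println("PDFBox on classpath: "
                + describe(isPresent(OLD_CLASS), isPresent(NEW_CLASS)));
    }
}
```

Run with the same classpath as the media filter; a result of "both" would suggest that an old jar was left behind after the upgrade.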

          Activity

          tdonohue Tim Donohue added a comment -

          +1 I'd vote to go ahead and upgrade the version of PDFBox we are using in DSpace 1.7.0. I know there were several issues with the older version.

          bollini Andrea Bollini added a comment -

          From the devel mailing list:
          > I tried using this patch and I get the following message:
          >
          >
          > 1) com.ibm.icu:icu4j:jar:3.8.1
          >
          > Try downloading the file manually from the project website.
          >
          > Then, install it using the command:
          > mvn install:install-file -DgroupId=com.ibm.icu -DartifactId=icu4j -Dversion=3.8.1 -Dpackaging=jar -Dfile=/path/to/file
          >
          > Where do I get icu4j:jar:3.8.1?
          >
          > Thank you!
          > Jose

          bollini Andrea Bollini added a comment -

          It looks as though this dependency was already installed in my local repository; it probably comes from other projects where we also use the JBoss Maven repository.
          We have two possibilities here:

          1) add the JBoss repository to the DSpace pom:
          <repository>
              <id>jboss-public-repository-group</id>
              <name>JBoss Public Maven Repository Group</name>
              <url>https://repository.jboss.org/nexus/content/groups/public-jboss/</url>
              <layout>default</layout>
              <releases>
                  <enabled>true</enabled>
                  <updatePolicy>never</updatePolicy>
              </releases>
              <snapshots>
                  <enabled>true</enabled>
                  <updatePolicy>never</updatePolicy>
              </snapshots>
          </repository>

          2) change the dependency to version 3.8, which is present in the main Maven repository

          blancojose Jose Blanco added a comment -

          Andrea, Thank you for your help with this. I had 872 items that could not be indexed before I put your patch in place and now I only have 63. Many thanks! Jose

          Here is a sample of the ones that are still giving me errors:

          java.lang.IllegalArgumentException: Width (80) and height (0) cannot be <= 0
          at java.awt.image.DirectColorModel.createCompatibleWritableRaster(DirectColorModel.java:999)
          at java.awt.image.BufferedImage.<init>(BufferedImage.java:312)
          at

          AND

          java.util.NoSuchElementException
          at java.util.AbstractList$Itr.next(AbstractList.java:350)
          at org.textmining.text.extraction.WordExtractor.extractText(WordExtractor.java:150)
          at org.dspace.app.mediafilter.WordFilter.getDestinationStream(WordFilter.java:95)
          at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:674)
          at

          AND

          java.io.IOException: expected='obj' actual='l?w' org.apache.pdfbox.io.PushBackInputStream@1536eec
          at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:509)
          at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:179)
          at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:859)
          at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:826)
          at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:131)

          AND

          java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@3bc1a1
          at

          AND

          java.lang.RuntimeException: java.io.IOException: Error: Expected hex number, actual='D<'
          at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:148)
          at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:157)
          at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:241)
          at

          AND

          java.io.IOException: expected true actual='t?{<' org.apache.pdfbox.io.PushBackInputStream@970110
          at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:968)
          at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:157)
          at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:233)
          at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:929)
          at

          AND

          java.io.IOException: Invalid header signature; read 3605422911097735433, expected -2226271756974174256
          at org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:88)
          at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:83)
          at org.textmining.text.extraction.WordExtractor.extractText(WordExtractor.java:48)
          at org.dspace.app.mediafilter.WordFilter.getDestinationStream(WordFilter.java:95)
          at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:674)
          at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:575)

          AND

          java.io.IOException: Unknown dir object c=')' cInt=41 peek=')' peekInt=41 org.apache.pdfbox.io.PushBackInputStream@36d036
          at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1027)
          at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:157)
          at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:233)
          at

          AND

          java.lang.IndexOutOfBoundsException: off < 0 || len < 0 || off+len > b.length || off+len < 0!
          at javax.imageio.stream.FileCacheImageInputStream.read(FileCacheImageInputStream.java:157)
          at com.sun.imageio.plugins.gif.GIFImageReader.getCode(GIFImageReader.java:306)
          at

          AND

          java.io.IOException: Invalid header signature; read 5789751444030890300, expected -2226271756974174256
          at org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:88)
          at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:83)
          at org.textmining.text.extraction.WordExtractor.extractText(WordExtractor.java:48)
          at org.dspace.app.mediafilter.WordFilter.getDestinationStream(WordFilter.java:95)
          at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:674)
          at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:575)
          at

          AND

          java.io.IOException: Invalid header signature; read 5789751444030890300, expected -2226271756974174256
          at org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:88)
          at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:83)
          at

          AND

          javax.imageio.IIOException: Unsupported Image Type
          at com.sun.imageio.plugins.jpeg.JPEGImageReader.readInternal(JPEGImageReader.java:922)
          at com.sun.imageio.plugins.jpeg.JPEGImageReader.read(JPEGImageReader.java:897)
          at javax.imageio.ImageIO.read(ImageIO.java:1422)
          at javax.imageio.ImageIO.read(ImageIO.java:1326)
          at org.dspace.app.mediafilter.JPEGFilter.getDestinationStream(JPEGFilter.java:97)
          at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:674)
          at

          AND

          java.io.IOException: Error: End-of-File, expected line
          at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1176)
          at org.apache.pdfbox.pdfparser.PDFParser.parseHeader(PDFParser.java:302)
          at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:162)
          at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:859)
          at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:826)
          at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:131)
          at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:674)
          at

          AND

          java.io.IOException: expected='>' actual='?'
          at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:268)
          at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:929)
          at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:519)
          at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:179)
          at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:859)
          at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:826)
          at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:131)
          at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:674)
          at
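
The remaining failures above are spread over several media filters (PDF, Word, JPEG). To see roughly how they are distributed, traces like these could be tallied by the filter class named in the getDestinationStream frame. This is a minimal sketch, assuming the log is available as a single string; FilterFailureTally is a hypothetical helper, not part of DSpace, and truncated traces that lack a filter frame are simply not counted.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Counts media-filter failures per filter class from stack-trace text. */
final class FilterFailureTally {

    // Matches frames such as
    // "org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:131)"
    // and captures the filter class name ("PDFFilter").
    private static final Pattern FILTER_FRAME = Pattern.compile(
            "org\\.dspace\\.app\\.mediafilter\\.(\\w+)\\.getDestinationStream\\(");

    /** Returns a sorted map from filter class name to number of matching frames. */
    static Map<String, Integer> tally(String log) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = FILTER_FRAME.matcher(log);
        while (m.find()) {
            counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Three frames copied from the traces in this comment.
        String sample = String.join("\n",
                "at org.dspace.app.mediafilter.WordFilter.getDestinationStream(WordFilter.java:95)",
                "at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:131)",
                "at org.dspace.app.mediafilter.JPEGFilter.getDestinationStream(JPEGFilter.java:97)");
        System.out.println(tally(sample));
    }
}
```

For the three sample frames this prints {JPEGFilter=1, PDFFilter=1, WordFilter=1}. Frames from MediaFilterManager are ignored because they do not call getDestinationStream.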

          bollini Andrea Bollini added a comment -

          Patch committed to trunk, using version 3.8 of the icu library.


            People

            • Assignee:
              bollini Andrea Bollini
              Reporter:
              bollini Andrea Bollini