  1. DSpace
  2. DS-49

Add support for DjVu-documents - ID: 2234659


    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.5.0, 1.5.1, 1.5.2
    • Fix Version/s: None
    • Component/s: None
    • Labels:
    • Attachments:
    • Comments:


      Hello All

      This patch based on

      For DSpace 1.5.0+, do the following before compilation:

      1) Install the djvutxt utility (package djvulibre). On Debian:
      apt-get install djvulibre-bin
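
A quick way to confirm that step 1 worked is to check that an executable named djvutxt is reachable through PATH. The following is a minimal sketch, not part of the patch; the class name `CommandLookup` and the method `isOnPath` are hypothetical, and the sketch ignores Windows executable extensions such as `.exe`.

```java
import java.io.File;

final class CommandLookup {

    /**
     * Returns true if an executable file called name exists in one of the
     * directories listed in pathValue (a PATH-style string).
     */
    static boolean isOnPath(String name, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return false;
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue; // skip empty PATH entries
            }
            File candidate = new File(dir, name);
            if (candidate.isFile() && candidate.canExecute()) {
                return true;
            }
        }
        return false;
    }
}
```

For example, `CommandLookup.isOnPath("djvutxt", System.getenv("PATH"))` should return true once the package is installed and the Tomcat user sees the same PATH.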

      2) Edit [dspace-source]/dspace/config/dspace.cfg, section "### Media
      Filter / Format Filter plugins",
      and add DjVu support in 3 places:

      filter.plugins = ... \
      DjVu Text Extractor

      plugin.named.org.dspace.app.mediafilter.FormatFilter = ... \
      org.dspace.app.mediafilter.DjVuFilter = DjVu Text Extractor

      filter.org.dspace.app.mediafilter.DjVuFilter.inputFormats = DjVu
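
The three edits in step 2 can be sanity-checked programmatically. This is a minimal sketch, assuming dspace.cfg parses as a standard Java properties file (including the backslash line continuations); the class `DjvuConfigCheck` and its method are hypothetical names, not part of DSpace.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Properties;

final class DjvuConfigCheck {

    static final String FILTER_NAME = "DjVu Text Extractor";
    static final String FILTER_CLASS = "org.dspace.app.mediafilter.DjVuFilter";

    /** Returns true if all three DjVu entries from step 2 are present. */
    static boolean hasDjvuSupport(Reader cfg) throws IOException {
        Properties p = new Properties();
        p.load(cfg); // joins lines that end with a backslash

        String plugins = p.getProperty("filter.plugins", "");
        String named = p.getProperty(
                "plugin.named.org.dspace.app.mediafilter.FormatFilter", "");
        String formats = p.getProperty(
                "filter." + FILTER_CLASS + ".inputFormats", "");

        return plugins.contains(FILTER_NAME)
                && named.contains(FILTER_CLASS)
                && named.contains(FILTER_NAME)
                && formats.contains("DjVu");
    }
}
```

Passing a `FileReader` opened on the edited dspace.cfg should return true only when none of the three places was missed.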

      3) Edit [dspace-source]/dspace/config/registries/bitstream-formats.xml
      and add the following:


      4) Create file
      with the following content:

      /*
       * Version: 0.1
       * DSpace version: 1.4.2 beta
       * Author: Ivan Penev
       * e-mail: inpenev at gmail.com
       */

      package org.dspace.app.mediafilter;

      import java.io.InputStream;
      import java.io.FileInputStream;
      import java.io.BufferedInputStream;
      import java.io.ByteArrayInputStream;
      import java.io.OutputStream;
      import java.io.FileOutputStream;
      import java.io.BufferedOutputStream;
      import java.io.FileReader;
      import java.io.BufferedReader;
      import java.io.File;


      /**
       * This class provides a media filter for processing files of type DjVu.
       * <p>The current implementation uses a program called
       * <code>djvutxt</code>, which extracts the text layer from a previously
       * OCR-ed DjVu file and saves it into a UTF-8 text document. The program
       * is distributed with the <code>djvulibre</code> package, which is freely
       * available under the GPL license
       * for both Unix and Windows operating systems. Hence, for the media
       * filter to work it is required that <code>djvutxt</code> is a valid
       * command (in the working environment).</p>
       */
      public class DjVuFilter extends MediaFilter
      {

      /**
       * Get a filename for a newly created filtered bitstream.
       *
       * @param sourceName
       *            name of source bitstream
       * @return filename generated by the filter - for example, document.djvu
       *         becomes document.djvu.txt
       */
      public String getFilteredName(String sourceName)
      {
          return sourceName + ".txt";
      }


      /**
       * Get name of the bundle this filter will stick its generated
       * bitstreams in.
       *
       * @return "TEXT"
       */
      public String getBundleName()
      {
          return "TEXT";
      }


      /**
       * Get name of the bitstream format returned by this filter.
       *
       * @return "Text"
       */
      public String getFormatString()
      {
          return "Text";
      }


      /**
       * Get a string describing the newly-generated bitstream.
       *
       * @return "Extracted text"
       */
      public String getDescription()
      {
          return "Extracted text";
      }


      /**
       * Get a bitstream filled with the extracted text from a DjVu bitstream.
       * <p>The bitstream supplied as a parameter is written to a DjVu
       * file on the file system (in the working directory), and the system
       * command <code>djvutxt</code> is called on the latter to produce a
       * UTF-8 text file containing the extracted text. The file is then copied
       * to a bitstream. Finally, the auxiliary files are removed from the file
       * system, and the generated bitstream is returned as a result.</p>
       * <p>WARNING! Write access to the working directory is needed for
       * this method to operate! No exception handling provided!</p>
       *
       * @param source
       *            input stream
       * @return result of filter's transformation, written out to a bitstream
       */
      public InputStream getDestinationStream(InputStream source) throws Exception
      {
          /* Some convenience initializations. */
          final String cmd = "djvutxt";
          final String fileName = "aux";
          final String djvuFileName = fileName + ".djvu";
          final String txtFileName = fileName + ".txt";

          /* Store input bitstream to auxiliary DjVu file. */
          File djvuFile = streamToFile(source, djvuFileName);

          /* Invoke external command djvutxt with appropriate arguments
             to do the actual job... */
          final String[] cmdArray = {cmd, djvuFileName, txtFileName};
          Process p = Runtime.getRuntime().exec(cmdArray);
          /* ...and wait for it to terminate. */
          p.waitFor();

          /* Copy extracted text from file to an independent bitstream,
             and optionally print the text to standard output. */
          File txtFile = new File(txtFileName);
          InputStream dest = fileToStream(txtFile, MediaFilterManager.isVerbose);

          /* Then remove auxiliary files... */
          djvuFile.delete();
          txtFile.delete();

          /* ...and return resulting bitstream. */
          return dest;
      }


      /**
       * Write given input stream to a file on the file system.
       * <p>WARNING! No exception handling!</p>
       *
       * @param inStream input stream
       * @param fileName name of the file to be generated
       * @return <code>File</code> object associated with the generated file
       * @throws Exception
       */
      private File streamToFile(InputStream inStream, String fileName)
          throws Exception
      {
          /* Data will be read from input stream in chunks of size e.g. 4KB. */
          final int chunkSize = 4096;
          byte[] byteArray = new byte[chunkSize];

          /* Open the stream for buffered reading. */
          InputStream bufInStream = new BufferedInputStream(inStream);

          /* Create an empty file (if the file already exists, it will be
             left untouched) to store the supplied bitstream... */
          File file = new File(fileName);
          file.createNewFile();

          /* ...and associate a buffered output stream with it. */
          OutputStream bufOutStream =
              new BufferedOutputStream(new FileOutputStream(file));

          /* Copy data from input stream to newly generated file. */
          int readBytes = -1;
          while ((readBytes = bufInStream.read(byteArray, 0, chunkSize)) != -1)
              bufOutStream.write(byteArray, 0, readBytes);

          /* Stop transactions to the file system... */
          bufOutStream.close();

          /* ...and return result. */
          return file;
      }

      /**
       * Produce input stream from a given file on the file system.
       * <p>WARNING! No exception handling!</p>
       *
       * @param file <code>File</code> object associated with the given file
       * @param verbose whether to also print the file contents to standard output
       * @return input stream containing the data read from file
       * @throws Exception
       */
      private InputStream fileToStream(File file, boolean verbose)
          throws Exception
      {
          /* Open the stream for reading. */
          InputStream inStream = new FileInputStream(file);

          /* Allocate necessary memory for data buffer. */
          byte[] byteArray = new byte[(int)file.length()];

          /* Load file contents into buffer. */
          inStream.read(byteArray);

          /* And immediately close transactions with the file system. */
          inStream.close();

          /* If required to send the retrieved data to standard output... */
          if (verbose)
          {
              /* Open the file again, but this time handle it as a
                 character stream... */
              BufferedReader bufReader = new BufferedReader(new FileReader(file));

              /* ...then print its contents line by line to the standard
                 output... */
              String lineOfText = null;
              while ((lineOfText = bufReader.readLine()) != null)
                  System.out.println(lineOfText);

              /* ...and close connection to the file. */
              bufReader.close();
          }

          /* Finally, generate and return input stream containing desired data. */
          return new ByteArrayInputStream(byteArray);
      }
      }
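
The filter above writes to fixed file names ("aux.djvu", "aux.txt") in the working directory and does not check whether djvutxt succeeded. As a possible alternative, here is a minimal sketch that uses temporary files, checks the exit status, and always cleans up. It is not the author's code: the class `TempFileExtraction` and the method `extract` are hypothetical, and the sketch assumes Java 7 or later (for `java.nio.file.Files`), which is newer than the Java version DSpace 1.5 was written against.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

final class TempFileExtraction {

    /**
     * Runs "commandPrefix... inputFile outputFile" on a temporary copy of
     * source and returns the contents of outputFile as an in-memory stream.
     * For the DjVu case, commandPrefix would be a list holding "djvutxt".
     */
    static InputStream extract(InputStream source, List<String> commandPrefix)
            throws IOException, InterruptedException {
        File in = File.createTempFile("dspace-djvu", ".djvu");
        File out = File.createTempFile("dspace-djvu", ".txt");
        try {
            // createTempFile already created the file, so overwrite it.
            Files.copy(source, in.toPath(), StandardCopyOption.REPLACE_EXISTING);

            List<String> command = new ArrayList<>(commandPrefix);
            command.add(in.getAbsolutePath());
            command.add(out.getAbsolutePath());

            // inheritIO keeps the child from blocking on unread output.
            Process p = new ProcessBuilder(command).inheritIO().start();
            int exit = p.waitFor();
            if (exit != 0) {
                throw new IOException("command exited with status " + exit);
            }
            return new ByteArrayInputStream(Files.readAllBytes(out.toPath()));
        } finally {
            // Remove auxiliary files whether or not the command succeeded.
            in.delete();
            out.delete();
        }
    }
}
```

Inside `getDestinationStream`, a call such as `TempFileExtraction.extract(source, java.util.Arrays.asList("djvutxt"))` would replace the fixed-name file handling. Unique temporary names also avoid clashes when two filter runs share a working directory.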

      5) Compilation/recompilation
      cd [dspace-source]/dspace/dspace-1.5.0-src-release/dspace/
      mvn package

      6) Install, or, when recompiling for an existing installation:

      edit the working bitstream-formats.xml and dspace.cfg as above, and replace dspace-api-1.5.0.jar in the folders webapps/jspui/WEB-INF/lib/, lib/, webapps/lni/WEB-INF/lib/, webapps/oai/WEB-INF/lib/ and webapps/xmlui/WEB-INF/lib/ with the compiled [dspace-source]/dspace-api/target/dspace-api-1.5.0.jar
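
Replacing the jar in five folders by hand is easy to get wrong, so the copy can be scripted. This is a minimal sketch under assumptions: the class `JarReplacer` and the method `replaceJar` are hypothetical, `installDir` stands for the root of the DSpace installation that contains `lib/` and `webapps/`, and folders that do not exist are skipped.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class JarReplacer {

    /** Library folders from step 6, relative to the installation root. */
    static final String[] LIB_DIRS = {
        "webapps/jspui/WEB-INF/lib",
        "lib",
        "webapps/lni/WEB-INF/lib",
        "webapps/oai/WEB-INF/lib",
        "webapps/xmlui/WEB-INF/lib"
    };

    /**
     * Copies newJar over the jar of the same name in every existing
     * library folder and returns the number of folders updated.
     */
    static int replaceJar(Path newJar, Path installDir) throws IOException {
        int replaced = 0;
        for (String dir : LIB_DIRS) {
            Path libDir = installDir.resolve(dir);
            if (!Files.isDirectory(libDir)) {
                continue; // this webapp is not deployed here
            }
            Files.copy(newJar, libDir.resolve(newJar.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
            replaced++;
        }
        return replaced;
    }
}
```

The returned count gives a quick check that all expected folders were updated before restarting Tomcat.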

      7) Don't forget to restart Tomcat and run

      With best regards
      Serhij Dubyk




            • Assignee:
              kipkorir2008 Charles Kiplagat