Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.5.0, 1.5.1, 1.5.2
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Attachments:
      0
    • Comments:
      1

      Description

      Hello All

      This patch is based on
      http://mailman.mit.edu/pipermail/dspace-general/2007-May/001513.html

      For DSpace 1.5.0 and later, do the following before compiling:

      1) Install the djvutxt utility (part of the djvulibre package). On Debian:
      apt-get install djvulibre-bin

      2) Edit [dspace-source]/dspace/config/dspace.cfg. In the block "### Media
      Filter / Format Filter plugins", add DjVu support in three places:

      filter.plugins = ... \
      DjVu Text Extractor

      plugin.named.org.dspace.app.mediafilter.FormatFilter = ... \
      org.dspace.app.mediafilter.DjVuFilter = DjVu Text Extractor

      filter.org.dspace.app.mediafilter.DjVuFilter.inputFormats = DjVu

      3) Edit [dspace-source]/dspace/config/registries/bitstream-formats.xml
      and add the following entry:

      <bitstream-type>
        <mimetype>image/vnd.djvu</mimetype>
        <short_description>DjVu</short_description>
        <description>DjVu</description>
        <support_level>1</support_level>
        <internal>false</internal>
        <extension>djvu</extension>
        <extension>djv</extension>
      </bitstream-type>

      4) Create the file
      [dspace-source]/dspace-api/src/main/java/org/dspace/app/mediafilter/DjVuFilter.java
      with the following content:

      /*
      DjVuFilter.java
      Version: 0.1
      DSpace version: 1.4.2 beta
      Author: Ivan Penev
      e-mail: inpenev at gmail.com
      */

      package org.dspace.app.mediafilter;

      import java.io.InputStream;
      import java.io.FileInputStream;
      import java.io.BufferedInputStream;
      import java.io.ByteArrayInputStream;
      import java.io.OutputStream;
      import java.io.FileOutputStream;
      import java.io.BufferedOutputStream;
      import java.io.FileReader;
      import java.io.BufferedReader;
      import java.io.File;

      /**
       * This class provides a media filter for processing files of type DjVu.
       * <p>The current implementation uses a program called
       * <code>djvutxt</code>, which extracts the text layer from a previously
       * OCR-ed DjVu file and saves it into a UTF-8 text document. The program
       * is distributed with the <code>djvulibre</code> package which is freely
       * available under the GPL license from <a
       * href="http://djvu.sourceforge.net/">http://djvu.sourceforge.net/</a>
       * for both Unix and Windows operating systems. Hence, for the media
       * filter to work it is required that <code>djvutxt</code> is a valid
       * command (in the working environment).</p>
       */
      public class DjVuFilter extends MediaFilter
      {
          /**
           * Get a filename for a newly created filtered bitstream.
           *
           * @param sourceName
           *            name of source bitstream
           * @return filename generated by the filter - for example, document.djvu
           *         becomes document.djvu.txt
           */
          public String getFilteredName(String sourceName)
          {
              return sourceName + ".txt";
          }

          /**
           * Get the name of the bundle in which this filter will place its
           * generated bitstreams.
           *
           * @return "TEXT"
           */
          public String getBundleName()
          {
              return "TEXT";
          }

          /**
           * Get name of the bitstream format returned by this filter.
           *
           * @return "Text"
           */
          public String getFormatString()
          {
              return "Text";
          }

          /**
           * Get a string describing the newly-generated bitstream.
           *
           * @return "Extracted text"
           */
          public String getDescription()
          {
              return "Extracted text";
          }

          /**
           * Get a bitstream filled with the extracted text from a DjVu bitstream.
           * <p>The bitstream supplied as a parameter is written to a DjVu
           * file on the file system (in the working directory), and the system
           * command <code>djvutxt</code> is called on the latter to produce a
           * UTF-8 text file containing the extracted text. The file is then copied
           * to a bitstream. Finally, the auxiliary files are removed from the file
           * system, and the generated bitstream is returned as a result.</p>
           * <p>WARNING! Write access to the working directory is needed for
           * this method to operate! No exception handling provided!</p>
           *
           * @param source
           *            input stream
           *
           * @return result of filter's transformation, written out to a bitstream
           */
          public InputStream getDestinationStream(InputStream source)
                  throws Exception
          {
              /* Some convenience initializations. */
              final String cmd = "djvutxt";
              final String fileName = "aux";
              final String djvuFileName = fileName + ".djvu";
              final String txtFileName = fileName + ".txt";

              /* Store input bitstream to auxiliary DjVu file. */
              File djvuFile = streamToFile(source, djvuFileName);

              /* Invoke external command djvutxt with appropriate arguments
              to do the actual job... */
              final String[] cmdArray = {cmd, djvuFileName, txtFileName};
              Process p = Runtime.getRuntime().exec(cmdArray);
              /* ...and wait for it to terminate */
              p.waitFor();

              /* Copy extracted text from file to an independent bitstream,
              and optionally print the text to standard output. */
              File txtFile = new File(txtFileName);
              InputStream dest = fileToStream(txtFile, MediaFilterManager.isVerbose);

              /* Then remove auxiliary files...*/
              djvuFile.delete();
              txtFile.delete();
              /* ...and return resulting bitstream. */
              return dest;
          }

          /**
           * Write given input stream to a file on the file system.
           * <p>WARNING! No exception handling!</p>
           *
           * @param inStream input stream
           * @param fileName name of the file to be generated
           *
           * @return <code>File</code> object associated with the generated file
           *
           * @throws Exception
           */
          private File streamToFile(InputStream inStream, String fileName)
                  throws Exception
          {
              /* Data will be read from input stream in chunks of size e.g. 4KB. */
              final int chunkSize = 4096;
              byte[] byteArray = new byte[chunkSize];

              /* Open the stream for buffered reading. */
              InputStream bufInStream = new BufferedInputStream(inStream);

              /* Create an empty file (if the file already exists, it will be left
              untouched) to store the supplied bitstream... */
              File file = new File(fileName);
              file.createNewFile();
              /* ...and associate a buffered output stream with it. */
              OutputStream bufOutStream =
                      new BufferedOutputStream(new FileOutputStream(file));

              /* Copy data from input stream to newly generated file. */
              int readBytes = -1;
              while ((readBytes = bufInStream.read(byteArray, 0, chunkSize)) != -1)
                  bufOutStream.write(byteArray, 0, readBytes);

              /* Stop transactions to the file system... */
              bufOutStream.close();
              /* ...and return result. */
              return file;
          }

          /**
           * Produce input stream from a given file on the file system.
           * <p>WARNING! No exception handling!</p>
           *
           * @param file <code>File</code> object associated with the given file
           * @param verbose whether to also print the file contents to standard
           *            output
           *
           * @return input stream containing the data read from file
           *
           * @throws Exception
           */
          private InputStream fileToStream(File file, boolean verbose)
                  throws Exception
          {
              /* Open the stream for reading. */
              InputStream inStream = new FileInputStream(file);
              /* Allocate necessary memory for data buffer. */
              byte[] byteArray = new byte[(int)file.length()];
              /* Load file contents into buffer. A single read() call may return
              fewer bytes than requested, so loop until the buffer is full. */
              int offset = 0;
              int readBytes = 0;
              while (offset < byteArray.length
                      && (readBytes = inStream.read(byteArray, offset,
                              byteArray.length - offset)) != -1)
                  offset += readBytes;
              /* And immediately close transactions with the file system. */
              inStream.close();

              /* If required to send the retrieved data to standard output... */
              if (verbose)
              {
                  /* Open the file again, but this time handle it as a character
                  stream... */
                  BufferedReader bufReader = new BufferedReader(new FileReader(file));
                  /* ...then print its contents line by line to the standard
                  output... */
                  String lineOfText = null;
                  while ((lineOfText = bufReader.readLine()) != null)
                      System.out.println(lineOfText);
                  /* ...and close connection to the file. */
                  bufReader.close();
              }

              /* Finally, generate and return input stream containing desired data. */
              return new ByteArrayInputStream(byteArray);
          }
      }

      5) Compile (or recompile):
      cd [dspace-source]/dspace/dspace-1.5.0-src-release/dspace/
      mvn package

      6) Install DSpace. If you are instead recompiling for an existing
      installation, apply the bitstream-formats.xml and dspace.cfg changes
      above to the working copies, then replace dspace-api-1.5.0.jar in each
      of these folders with the compiled
      [dspace-source]/dspace-api/target/dspace-api-1.5.0.jar:
      lib/
      webapps/jspui/WEB-INF/lib/
      webapps/lni/WEB-INF/lib/
      webapps/oai/WEB-INF/lib/
      webapps/xmlui/WEB-INF/lib/

      7) Don't forget to restart Tomcat, then run:
      /usr/share/dspace/bin/filter-media

      With best regards
      Serhij Dubyk
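
The getDestinationStream method in step 4 writes to fixed file names (aux.djvu and aux.txt) in the working directory and ignores the exit status of djvutxt. Two filter runs at the same time would overwrite each other's files, and AUX is a reserved device name on Windows. The following is a minimal sketch of one possible alternative, assuming Java 7 or later: it uses uniquely named temporary files and treats a non-zero exit status as an error. The class DjvuTextSketch and its extract method are hypothetical names introduced here for illustration; they are not part of DSpace or of the patch above. The command is passed in as a parameter so the helper can be tried with something other than djvutxt.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of DSpace or of the original patch. */
class DjvuTextSketch
{
    /**
     * Runs "command... inputFile outputFile" on a copy of the source
     * stream and returns the contents of the output file.
     */
    static InputStream extract(List<String> command, InputStream source)
            throws IOException, InterruptedException
    {
        // Unique temporary files avoid clashes between concurrent runs.
        Path in = Files.createTempFile("djvufilter", ".djvu");
        Path out = Files.createTempFile("djvufilter", ".txt");
        try
        {
            Files.copy(source, in, StandardCopyOption.REPLACE_EXISTING);

            List<String> args = new ArrayList<String>(command);
            args.add(in.toString());
            args.add(out.toString());

            // inheritIO() keeps the child from blocking on unread pipes.
            Process p = new ProcessBuilder(args).inheritIO().start();
            int status = p.waitFor();
            if (status != 0)
            {
                throw new IOException(command + " exited with status " + status);
            }

            // Load the result into memory so the temp files can be removed.
            return new ByteArrayInputStream(Files.readAllBytes(out));
        }
        finally
        {
            Files.deleteIfExists(in);
            Files.deleteIfExists(out);
        }
    }
}
```

Inside DjVuFilter, getDestinationStream could then presumably be reduced to a single call such as DjvuTextSketch.extract(Arrays.asList("djvutxt"), source). This has not been tested against a DSpace installation.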

        Attachments

          Activity

          vly Van Ly added a comment - - edited

          [14:12] <stuartlewis> DS-49 - Major/Improvement - Add support for DjVu-documents - ID: 2234659 - http://jira.dspace.org/jira/browse/DS-49 - [unassigned / Charles Kiplagat]
          [14:12] <canderson34> 0
          [14:12] <ClaudiaJuergen> -1 it's up to the admin to provide other media filter plugins, we should stick to a basic default set
          [14:12] <stuartlewis> -1 out of scope (should be in wiki_
          [14:12] <fnkepler> 0
          [14:12] <mhwood> 0
          [14:12] <jat_ysu> 0
          [14:13] <bollini> -1
          [14:13] <lcs> -1 out of scope, make it an add-on
          [14:13] <pketienne> 0
          [14:13] <bradmc> One minute: -3 out of scope, mark 'won't fix' on DS-49

          See also:
          http://wiki.dspace.org/index.php/JIRA_Cleanup#2009-08-25


            People

            • Assignee:
              Unassigned
              Reporter:
              kipkorir2008 Charles Kiplagat
            • Votes:
              0
              Watchers:
              0

              Dates

              • Created:
                Updated:
                Resolved: