DSpace / DS-638

check files on input for viruses, and verify file format

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 1.8.0
    • Component/s: JSPUI
    • Labels:
      None
    • Attachments:
      5
    • Comments:
      17

      Description

      This patch uses JHOVE to provide rough-and-ready format checking by identifying that the file/bitstream extension matches formats verifiable by JHOVE. (Currently DSpace accepts a deposit's file extension as gospel, so a user could tack a ".txt" extension onto a GIF and DSpace would assign the incorrect format to the file based on that incorrect extension.)
      This patch also contains code to check the file for the presence of viruses.

      To use this patch you must have JHOVE and ClamAV installed on your system.

      Important notes:

      (1) JHOVE's HTML identification has proved unreliable, so this patch does not return accurate results for that file format.
      (2) This code does not fully incorporate JHOVE's validation functions; it only verifies that what depositors intended to submit is in fact what they submitted.
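
As an illustration of the kind of mismatch described above (a GIF renamed to ".txt", or a non-GIF file named ".gif"), here is a minimal Java sketch. It is not the patch's code and does not call JHOVE; it only compares the file extension with the GIF signature bytes. The class and method names are hypothetical, and `readNBytes` assumes Java 11 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

// Hypothetical helper, not part of DSpace or JHOVE.
final class ExtensionCheck {

    /** True if the bytes start with a GIF signature ("GIF87a" or "GIF89a"). */
    static boolean hasGifSignature(byte[] head) {
        if (head.length < 6) {
            return false;
        }
        String sig = new String(head, 0, 6, StandardCharsets.US_ASCII);
        return sig.equals("GIF87a") || sig.equals("GIF89a");
    }

    /**
     * True if the extension and the content agree with respect to GIF:
     * a ".gif" name must carry a GIF signature, and any other name must not.
     * Formats other than GIF are not examined in this sketch.
     */
    static boolean extensionMatchesContent(String fileName, byte[] head) {
        boolean namedGif = fileName.toLowerCase(Locale.ROOT).endsWith(".gif");
        return namedGif == hasGifSignature(head);
    }

    /** Reads the first six bytes of a file (fewer if the file is shorter). */
    static byte[] readHead(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return in.readNBytes(6);
        }
    }
}
```

A caller would pass the uploaded file's name and `readHead(path)` to `extensionMatchesContent` and show the "could not verify" message when it returns false. A real validator such as JHOVE inspects far more than the first six bytes.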

      The following messages are returned when an error is detected.

      Text in [brackets] is a returned value; text in ALLCAPS can and should be modified to reflect your current installation.

      Questionable AIFF, GIF, JPG, PDF, TIF, WAVE, XML:

      DSPACE could not verify that your file is a valid [file_format_extension]. Please check the file format and ".[file_format_extension]" extension.

      Questionable TXT:

      DSPACE found the text file you are trying to upload is neither UTF-8 nor ASCII. Please verify that your file is in the format you wanted.
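
The TXT check above amounts to asking whether the bytes decode as UTF-8 (ASCII is a subset of UTF-8, so one test covers both). The patch delegates this to JHOVE; the following is only a rough Java sketch of the same idea using the standard library, with a hypothetical class name.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of DSpace or JHOVE.
final class TextEncodingCheck {

    /** True if the bytes are well-formed UTF-8 (which includes plain ASCII). */
    static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Note that a file in a legacy encoding such as ISO-8859-1 fails this test only if it contains non-ASCII bytes, and some binary files can pass it by chance, so this is a rough filter rather than a proof of format.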

      Spaces in filenames ( this is an additional check ):

      The file name contains spaces; this is not recommended. If possible, please replace spaces with underscores: "_".
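
A sketch of the filename check and the suggested replacement, assuming only the literal space character is of interest (the class and method names are hypothetical, not taken from the patch):

```java
// Hypothetical helper, not part of DSpace.
final class FileNameCheck {

    /** True if the file name contains at least one space character. */
    static boolean containsSpaces(String fileName) {
        return fileName.indexOf(' ') >= 0;
    }

    /** Returns the name with every space replaced by an underscore. */
    static String suggestName(String fileName) {
        return fileName.replace(' ', '_');
    }
}
```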

      Virus detected:

      DSPACE detected a virus in this file. Please repair it and resume the deposit. If you need assistance, please contact us: EMAIL_ADDRESS.
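
One way to scan an uploaded file with ClamAV from Java is to run the `clamscan` command-line tool and inspect its exit status (0: no virus found, 1: virus found, 2: an error occurred). The sketch below assumes `clamscan` is on the PATH and Java 9 or later; the class and method names are hypothetical, and this is not necessarily how the patch or DSpace 1.8 invokes ClamAV.

```java
import java.io.IOException;
import java.nio.file.Path;

// Hypothetical helper, not part of DSpace.
final class ClamScanSketch {

    enum Result { CLEAN, INFECTED, ERROR }

    /** Maps a clamscan exit status to a result; anything other than 0 or 1 is an error. */
    static Result interpretExitCode(int code) {
        switch (code) {
            case 0:
                return Result.CLEAN;
            case 1:
                return Result.INFECTED;
            default:
                return Result.ERROR;
        }
    }

    /** Runs clamscan on one file, discards its output, and interprets the exit status. */
    static Result scan(Path file) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("clamscan", "--no-summary", file.toString())
                .redirectErrorStream(true)                       // merge stderr into stdout
                .redirectOutput(ProcessBuilder.Redirect.DISCARD) // drop the scan report
                .start();
        return interpretExitCode(process.waitFor());
    }
}
```

The upload step would show the "virus detected" message on `INFECTED` and should treat `ERROR` as "could not scan" rather than as a clean result.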

      To get the patch working:

      Add the JHOVE conf files to the

      [dspace]/jhove directory

      Here are the conf files:

      jhove-aiff.conf
      jhove-ascii.conf
      jhove-gif.conf
      jhove-jpeg.conf
      jhove-pdf.conf
      jhove-tiff.conf
      jhove-utf8.conf
      jhove-wave.conf
      jhove-xml.conf

      Also the following files were changed:

      dspace-api/src/main/java/org/dspace/submit/step/UploadStep.java
      dspace-jspui/dspace-jspui-api/src/main/java/org/dspace/app/webui/submit/step/JSPUploadStep.java
      dspace-api/src/main/java/org/dspace/content/FormatIdentifier.java

      dspace/modules/jspui/src/main/webapp/submit/get-file-format.jsp ( locally customized )
      dspace/modules/jspui/src/main/webapp/submit/upload-error-virus.jsp ( new file - placed in locally modified area for the jspui interface)

      These files are attached with this patch.

          Activity

          Robin Taylor added a comment -
          I have attached the modified UploadStep just to ease the review process.

          Cheers.
          Mark Diggory added a comment -
          We are actually using the FileUpload step code, so I'm all for its addition to the codebase... If there were any way to harness the Curation task for ClamAV at that point in the Submission WF, it might help with consolidating the code a bit.
          Mark H. Wood added a comment -
          I agree that workflow seems like a logical place to do this. What I was trying to say (and this definitely wants a ticket of its own) is that there's no need for sites to reconfigure for this if we rework ingestion so that *every submission enters a workflow*. There can be mandatory workflow steps with no user group assigned, which are carried out without human contact, such as ingestion-time curation activities (such as virus scanning). If no interactive steps are configured then that's all that happens.
          Mark Diggory added a comment -
          Actually, the point of putting it in the file upload step is so the submitter is made aware of the problem with the file and can correct it. It shouldn't even get into the Bitstream; the file should be tested prior to addition to DSpace. Most of the webapps now cache the upload in a file object behind the HttpServletRequest object or inside Cocoon. The point of doing a virus scan in file upload is so the user knows there's a problem and fixes it before it ever reaches DSpace or the Reviewer Workflow.

          Also, Configurable Reviewer workflow will certainly be able to support automated steps that do things like JHOVE detection etc.
          Robin Taylor added a comment -
          This is half done. 1.8 makes use of the existing Curation Framework to perform virus checking at the point of file upload.

          I will raise a new Jira issue for the file format checking to allow this one to be assigned to version 1.8.

          Thanks to Jose Blanco and others for their contributions.

            People

            • Assignee: Robin Taylor
            • Reporter: Jose Blanco
            • Votes: 1
            • Watchers: 2
