Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Ready for Test
    • Fix Version/s: 7.x-1.4
    • Component/s: PDF Solution Pack
    • Labels:
      None

      Description

      We should pass documents through a command-line utility to sanitize them prior to ingest.

      PDFs in particular can cause a lot of problems. The two most important are:

      1. Non-conforming PDFs cause ingest failures (ImageMagick not creating thumbnails, or Solr not indexing).
      2. Security issues: PDFs can carry malicious JavaScript or launch actions that trigger a payload or allow vulnerabilities to be exploited.
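To illustrate the second problem, here is a minimal sketch of a pre-ingest check that counts PDF name tokens commonly associated with active content. The token list and the function name `find_risky_names` are assumptions made for illustration, not part of any existing module. A raw byte scan like this misses tokens hidden inside compressed object streams or written with hex-escaped names, so it is a rough triage step rather than a sanitizer.

```python
import re

# Name tokens often tied to scripts or automatic actions in a PDF.
# This list is an illustrative assumption, not an exhaustive catalogue.
RISKY_NAMES = (b"/JavaScript", b"/JS", b"/OpenAction", b"/AA", b"/Launch")


def find_risky_names(data: bytes) -> dict:
    """Return a mapping of risky name token -> number of occurrences in data."""
    counts = {}
    for name in RISKY_NAMES:
        # The lookahead stops "/JS" from matching a longer name such as "/JSFoo".
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        hits = len(re.findall(pattern, data))
        if hits:
            counts[name.decode("ascii")] = hits
    return counts


if __name__ == "__main__":
    sample = (
        b"1 0 obj << /OpenAction 2 0 R >> endobj "
        b"2 0 obj << /S /JavaScript /JS (app.alert(1)) >> endobj"
    )
    print(find_risky_names(sample))
```

A file that reports any hits could be routed to a stricter sanitization step instead of being ingested directly.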

      There is no way we can have every user conform to a particular PDF format, target an early enough compatibility level, use a restricted set of fonts, embed fonts, remove all invalid links, check for malicious JavaScript, etc.

      We *could* make a simple change and convert every incoming PDF to PDF/A format, which will drop all JavaScript, encryption, audio/video, external links, etc.

      Or we could use another sanitization process such as (but not limited to) a Python command-line utility called ExeFilter: http://www.decalage.info/exefilter
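The PDF/A conversion option could be sketched as follows, assuming Ghostscript is the command-line utility and is installed as `gs`. The ticket does not name a tool, so that choice, along with the helper names `build_pdfa_command` and `sanitize_pdf`, is an assumption made for illustration. A strictly conformant PDF/A result typically also needs a PDFA_def.ps definition file with an ICC profile, which is omitted here.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdfa_command(src, dst, gs_binary="gs"):
    """Return the Ghostscript argument list that rewrites src as PDF/A-2 at dst."""
    return [
        gs_binary,
        "-dSAFER",                        # restrict file access during conversion
        "-dBATCH",
        "-dNOPAUSE",
        "-dPDFA=2",                       # request PDF/A-2 output
        "-dPDFACompatibilityPolicy=1",    # drop features that PDF/A does not allow
        "-sColorConversionStrategy=RGB",
        "-sDEVICE=pdfwrite",
        f"-sOutputFile={dst}",
        str(src),
    ]


def sanitize_pdf(src, dst, gs_binary="gs", timeout=120):
    """Run the conversion and return the output path, raising on failure."""
    if shutil.which(gs_binary) is None:
        raise FileNotFoundError(f"{gs_binary} was not found on PATH")
    result = subprocess.run(
        build_pdfa_command(src, dst, gs_binary),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0 or not Path(dst).is_file():
        raise RuntimeError(f"Conversion failed: {result.stderr.strip()}")
    return Path(dst)
```

An ingest hook could call `sanitize_pdf("upload.pdf", "clean.pdf")` and ingest only the rewritten file, rejecting the upload if the conversion raises.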

            People

            • Assignee:
              dmoses Don Moses
              Reporter:
              krisbulman Kris Bulman
            • Votes:
              0
              Watchers:
              2
