  1. DSpace
  2. DS-4007

PDF Text Extractor can cause strings like "content-type" to show up in search snippets

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 5.9
    • Fix Version/s: 5.10
    • Component/s: filter-media
    • Labels:
      None
    • Attachments:
      0
    • Comments:
      4
    • Documentation Status:
      Needed

      Description

      Terry Brady [11:29 AM]
      We run the PDF Text Extractor. For items with no extractable text, we end up with something like the following in SOLR. Have others worked around this issue?
      Sample Solr FullText
      "handle": "10822/559380",
      "SolrIndexer.lastIndexed": "2018-09-13T17:21:42Z",
      "fulltext": [
      " \n \nstream_source_info whistleblowing.pdf.txt \nstream_content_type text/plain \nstream_size 222 \nContent-Encoding ISO-8859-1 \nstream_name whistleblowing.pdf.txt \nContent-Type text/plain; charset=ISO-8859-1 \n \n \n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n "

      As a result, strings such as "content-type" from this stream metadata can sometimes appear in search snippets.

      Tom Desair [11:34 AM]
      The "content-type" part is added by SOLR, so I'm not sure you can exclude that.
      You can adjust the `FullTextContentStreams.buildFullTextList` method to ignore extracted text bitstreams with a size < 500 or 1000.
      https://github.com/DSpace/DSpace/blob/master/dspace-api/src/main/java/org/dspace/discovery/FullTextContentStreams.java#L82
      That class is passed to SOLR when indexing:
      https://github.com/DSpace/DSpace/blob/master/dspace-api/src/main/java/org/dspace/discovery/SolrServiceImpl.java#L772

      Terry Brady [11:36 AM]
      Thanks @tom_desair. I like that suggestion!
      I will try it out locally. If I like the result, I will post a PR as an option.

      Tom Desair [11:38 AM]
      Maybe you can make the "minimal required size" for extracted text bitstreams to be indexed configurable.
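
The size-threshold idea discussed above can be sketched roughly as follows. This is a minimal, standalone illustration and not the actual DSpace change: it does not call `FullTextContentStreams` or any other DSpace API. The `ExtractedText` class, the `filterIndexable` helper, and the `fulltext.min.size` property name are all hypothetical, and the threshold is assumed to be in bytes, with 500 as the default taken from the suggestion in the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MinSizeFullTextFilter {

    /** Hypothetical stand-in for an extracted-text bitstream: a name and a size (assumed bytes). */
    public static final class ExtractedText {
        public final String name;
        public final long sizeBytes;

        public ExtractedText(String name, long sizeBytes) {
            this.name = name;
            this.sizeBytes = sizeBytes;
        }
    }

    /** Keeps only the bitstreams whose size is at least minSizeBytes. */
    public static List<ExtractedText> filterIndexable(List<ExtractedText> candidates,
                                                      long minSizeBytes) {
        List<ExtractedText> kept = new ArrayList<>();
        for (ExtractedText text : candidates) {
            if (text.sizeBytes >= minSizeBytes) {
                kept.add(text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical property name; a real deployment would read its own configuration.
        long minSize = Long.parseLong(System.getProperty("fulltext.min.size", "500"));

        List<ExtractedText> candidates = Arrays.asList(
                // Name and size taken from the sample above (stream_size 222).
                new ExtractedText("whistleblowing.pdf.txt", 222),
                // Made-up example of a file with real extracted text.
                new ExtractedText("example-with-text.pdf.txt", 48213));

        // Print the names of the bitstreams that would still be sent for indexing.
        for (ExtractedText text : filterIndexable(candidates, minSize)) {
            System.out.println(text.name);
        }
    }
}
```

With a threshold of 500, the 222-byte `whistleblowing.pdf.txt` from the sample would be skipped, so its near-empty text and the accompanying stream metadata would not be indexed for that bitstream. Reading the threshold from configuration lets each site choose between values such as 500 and 1000.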

              People

              • Assignee:
                terrywbrady Terry Brady
                Reporter:
                terrywbrady Terry Brady
              • Votes:
                0
                Watchers:
                2
