  1. DSpace
  2. DS-790

SOLR - Spider detection to match on hostname or useragent

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.6.0, 1.6.1, 1.6.2, 1.7.0
    • Fix Version/s: 4.0
    • Component/s: Solr
    • Environment:
      solr
    • Attachments:
      0
    • Comments:
      14
    • Documentation Status:
      Complete or Committed

      Description

      Spiders are currently detected by matching their IP address against the addresses listed in /dspace/config/spiders/ip-list-X.txt. However, when spiders change IP addresses or the IP lists go unmaintained, many spiders slip through. They usually keep their user agent or hostname intact, so those are more stable things to match on.

      I've noticed a sore point in my Solr data: msnbot is completely unfiltered by Solr. There is an additional IP list for it at http://www.iplists.com/nw/msn.txt, but it is very old. With additional bingbots on the horizon, it would be easier to detect these spiders and filter them out of the logs by user agent than to maintain all of the IP address ranges. The code to do this in Solr is unimplemented, and this ticket is a placeholder to encourage finishing the work to filter based on user agent / DNS hostname.
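      As a rough illustration of the idea above, the sketch below flags a hit as a spider when its user agent or its reverse-DNS hostname matches any configured regular expression. The class name SpiderPatternMatcher, its methods, and the sample patterns are hypothetical and chosen only for this example; this is not the code that was eventually committed to DSpace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper (not part of DSpace): decides whether a hit came from
// a spider by testing its user agent and hostname against regex patterns.
class SpiderPatternMatcher {
    private final List<Pattern> agentPatterns;
    private final List<Pattern> hostPatterns;

    SpiderPatternMatcher(List<String> agentRegexes, List<String> hostRegexes) {
        this.agentPatterns = compile(agentRegexes);
        this.hostPatterns = compile(hostRegexes);
    }

    // Compile once, case-insensitively, so lookups per hit stay cheap.
    private static List<Pattern> compile(List<String> regexes) {
        List<Pattern> compiled = new ArrayList<>();
        for (String regex : regexes) {
            compiled.add(Pattern.compile(regex, Pattern.CASE_INSENSITIVE));
        }
        return compiled;
    }

    // True when either the user agent or the hostname matches a pattern.
    // Null or empty values never match, so a missing field is not a spider.
    boolean isSpider(String userAgent, String hostname) {
        return matchesAny(agentPatterns, userAgent)
                || matchesAny(hostPatterns, hostname);
    }

    private static boolean matchesAny(List<Pattern> patterns, String value) {
        if (value == null || value.isEmpty()) {
            return false;
        }
        for (Pattern pattern : patterns) {
            if (pattern.matcher(value).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Sample patterns are assumptions for illustration only.
        SpiderPatternMatcher matcher = new SpiderPatternMatcher(
                Arrays.asList("msnbot", "bingbot"),
                Arrays.asList("^msnbot-.*\\.search\\.msn\\.com$"));
        System.out.println(matcher.isSpider("msnbot/2.0b", null));
        System.out.println(matcher.isSpider(null, "msnbot-65-55-3-1.search.msn.com"));
        System.out.println(matcher.isSpider("Mozilla/5.0", "pc1.example.edu"));
    }
}
```

      Keeping the patterns in plain lists means they could be loaded from a config file next to the existing IP lists, which would be one way to avoid recompiling when a new bot appears.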

      To see all of the hits from msnbot that are unfiltered, look at: http://localhost:8080/solr/statistics/select?q=dns:msnbot*&facet=true&facet.field=dns&facet.mincount=1&facet.limit=5000
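      The same facet query can be built for any hostname prefix. The sketch below assembles the URL shown above with the query value URL-encoded; the class and method names are hypothetical, and the base URL is simply the localhost address used in this ticket.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds the facet-by-dns query from this ticket
// for an arbitrary hostname prefix.
class DnsFacetQuery {
    static String build(String solrBase, String dnsPrefix) {
        String query;
        try {
            // Encodes ':' as %3A; '*' is left as-is by URLEncoder.
            query = URLEncoder.encode("dns:" + dnsPrefix + "*", "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(e);
        }
        return solrBase + "/select?q=" + query
                + "&facet=true&facet.field=dns&facet.mincount=1&facet.limit=5000";
    }

    public static void main(String[] args) {
        System.out.println(build("http://localhost:8080/solr/statistics", "msnbot"));
    }
}
```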

              People

              • Assignee:
                mwood Mark H. Wood
                Reporter:
                peterdietz Peter Dietz
                Reviewer:
                Peter Dietz
              • Votes:
                0
                Watchers:
                8
