Details

    • Roadmap Theme:
      Search

      Description

      Apache Tika is a toolkit that extracts text and metadata from a wide variety of file formats, identified by MIME type (including PDF, via PDFBox). Using Tika as the extraction engine in GSearch would greatly expand the range of material that GSearch can operate on. Going forward, GSearch would also benefit from any new or better-performing parsers that the Tika project produces.

            People

            • Assignee:
              gertsp Gert Schmeltz Pedersen
              Reporter:
              ajs6f@virginia.edu A. Soroka
              Reviewer:
              A. Soroka
            • Votes:
              0
              Watchers:
              1
