We should pass documents through a command-line utility to sanitize them prior to ingest.
PDFs specifically can create a lot of problems. The two most important problems are:
1. Non-conforming PDFs cause a failure to ingest (ImageMagick not creating thumbnails, or Solr not indexing).
Or, we could use another sanitization process such as (but not limited to) a Python command-line utility called ExeFilter: http://www.decalage.info/exefilter
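
To make the proposal concrete, here is a minimal sketch of a pre-ingest step that shells out to a sanitizer and only reports success if a cleaned file was actually produced. The function name `sanitize_before_ingest`, the `{src}`/`{dest}` placeholders, and the timeout are all hypothetical. The sanitizer command is supplied by the caller, since the exact command-line arguments of ExeFilter (or any other tool) are not assumed here.

```python
import subprocess
import sys
from pathlib import Path


def sanitize_before_ingest(src, dest, command, timeout=120):
    """Run a sanitizer command on src; return True only if it produced dest.

    `command` is a list of arguments in which the placeholders "{src}" and
    "{dest}" are replaced with the file paths. The command is an assumption
    supplied by the caller; nothing here is specific to ExeFilter.
    """
    src, dest = Path(src), Path(dest)
    # Remove any stale output so an old file cannot be mistaken for success.
    dest.unlink(missing_ok=True)
    argv = [arg.format(src=src, dest=dest) for arg in command]
    try:
        result = subprocess.run(argv, capture_output=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        # Sanitizer missing, not executable, or hung: treat as a failure.
        return False
    # Require a zero exit status and a non-empty output file.
    return result.returncode == 0 and dest.is_file() and dest.stat().st_size > 0
```

The ingest code would then hand the sanitized copy to ImageMagick and Solr only when this returns True, and could set aside the original for manual review otherwise.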