I recently implemented a system to index, and hence search, PDF files. The principle is that a MODx plugin tied to the OnDocFormSave event calls pdftotext (part of the xpdf-utils package) and puts the output into a template variable, which is then searched.
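To give an idea of the extraction step, here is a rough sketch of the kind of command involved. The plugin's actual command line is not reproduced here, so treat this as an assumption: pdf_to_text is a hypothetical helper name and the example path is purely illustrative. It relies on the fact that pdftotext writes to standard output when the output file is given as "-".

```shell
# Hypothetical helper (not taken from the plugin): print the text
# content of a PDF to stdout so that the caller can capture it.
# Passing "-" as the output file makes pdftotext write to stdout.
pdf_to_text() {
    pdftotext "$1" -
}

# Example call (illustrative path only):
# pdf_to_text assets/files/report.pdf
```

A PHP plugin would capture the same output through shell_exec and store it in the text template variable.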
You will need pdftotext on your server. If you are using shared hosting this could be a problem, as only the server administrator can install it. Another server issue is that the plugin uses shell_exec, which is disabled when PHP is running in safe mode.
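If you have shell access to the server, a quick check along the following lines should tell you whether pdftotext is available. This is only a sketch: have_pdftotext is a hypothetical helper name, and it only looks at the PATH of your own login shell, which may differ from the one PHP sees.

```shell
# Hypothetical helper: succeed if pdftotext can be found on the PATH.
have_pdftotext() {
    command -v pdftotext >/dev/null 2>&1
}

if have_pdftotext; then
    echo "pdftotext found at: $(command -v pdftotext)"
else
    echo "pdftotext not found - ask your server administrator to install xpdf-utils"
fi
```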
You will also need to install docmanager within your MODx assets directory (not to be confused with the MODx module of the same name). I am using version 0.5.3b. Note that at the time of writing (17 May 2009) the latest version of docmanager is for the MODx Revolution alpha release, but previous releases are still available.
- Download the plugin. Go to Resources » Plugins, and create a new Plugin. Paste in the code and under the "System Events" tab tick the OnDocFormSave event.
- Create a template variable of type file, named 'download' - this is for the actual PDF file. If you already have a template variable holding your PDF files and do not wish to change its name, or just want to call it something else anyway, you can change the configuration line within the plugin that defines PDF_FILE_TV.
- Create a template variable of type textarea, named 'downloadText'. This is to hold the automatically generated text version of the PDF file. As above, if you wish to call it something else just change the configuration line within the plugin that defines TEXT_TV. Make sure you assign this template variable to the same templates as the file template variable.
That's it! If you already have PDFs stored you will have to open (edit) those documents and save them again (no other action or editing is actually required). The downloadText template variable will then be filled automatically.
Don't forget that for the search snippet to actually search the downloadText template variable, you will need to pass a &searchFields parameter in the snippet call, e.g.
[!Search? &searchFields=`pagetitle, longtitle, introtext, content, downloadtext`!]
Credit: Thanks to Ben Carter for the suggestion to use pdftotext to populate a template variable.