eZ Find doesn't index PDFs with special chars like "ç"
eric figo
Wednesday 19 November 2008 6:12:14 am
Hi,
I'm using eZ Find and pstotext to index PDF files. Some are indexed and some are not.
I tried many files. For example, a PDF containing only this text is indexed: "Website un test d’indexation pour voir si ca marche ….. hsdhhdhhd"
But if I change the "c" to a "ç", as in "Website un test d’indexation pour voir si ça marche ….. hsdhhdhhd", the PDF is not indexed.
Any ideas? My database is in UTF-8, and I haven't changed the charset configuration in Exponential.
Thanks for your responses
Paul Borgermans
Wednesday 26 November 2008 10:59:20 am
pstotext is not the best solution for converting PDFs to raw text. I guess that it fails to convert the PDF file in question (try it on the command line to see what happens).
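One thing such a command-line test could check is whether the converter's output is valid UTF-8. This is only a sketch under the assumption that the failure is encoding-related, which is a guess and not confirmed here. The helper name is_valid_utf8 and the file name test.pdf are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of eZ Find): reads text on stdin and
# succeeds only if it is valid UTF-8. iconv exits non-zero when it
# meets a byte sequence that is illegal in the source encoding.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Example (test.pdf is a placeholder file name):
#   pstotext test.pdf | is_valid_utf8 || echo "output is not valid UTF-8"
```

If the check fails for the PDF with "ç" but passes for the one with "c", the converter is probably emitting Latin1 bytes rather than UTF-8.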
It is better to use pdftotext from the xpdf project. Create a new script, for example called ezpdftotext, with the following content (change the path to pdftotext to match your installation):
#!/bin/sh
<path to >/pdftotext -enc "UTF-8" "$1" -
And configure this script in binaryfile.ini
Note that the default installation will "normalize" Latin1 characters, so eZ Find/Solr will transform "reçu" to "recu" (and likewise for other accented characters), so searching for either form will produce the hit.
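To illustrate the idea of this normalization, here is a toy sketch in shell. It is not the filter Solr actually uses, it covers only a handful of characters, and the function name fold_accents is made up for the example.

```shell
#!/bin/sh
# Toy accent folding: maps a few accented letters to their unaccented
# form. Each letter gets its own substitution so the UTF-8 byte
# sequences are matched literally, whatever the locale.
fold_accents() {
    sed -e 's/ç/c/g' \
        -e 's/é/e/g' -e 's/è/e/g' -e 's/ê/e/g' \
        -e 's/à/a/g' -e 's/ù/u/g'
}

# Example:
#   printf 'reçu\n' | fold_accents
```

Because the same folding is applied to the indexed text and to the search terms, "reçu" and "recu" end up as the same token and both match.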
Best regards
eZ Publish, eZ Find, Solr expert consulting and training http://twitter.com/paulborgermans
Monday 01 December 2008 2:27:53 am
Hi,
Thanks for the response.
To clarify: when I run pstotext on the command line with my PDF, I get the plain text without trouble.
The problem is that when I use the script for indexing, the files with special chars are not indexed. I can't find them, even if I search for another word in the PDF that has no special chars.
I tried your solution with pdftotext but I have the same problem.
Friday 05 December 2008 1:17:21 pm
Which versions are you using (eZ Find, Exponential)?
Tuesday 09 December 2008 1:09:54 am
I'm using Exponential 4.0.1 with eZ Find 1.0.0beta2