Accented characters are not working in solr search

Praveen Kumar

Tuesday 16 August 2011 5:00:23 pm

Hi,
This is Praveen. I am using Apache Solr in our project to support searching on cities, and I am having a problem with accented characters in searches.
For example:
My city name is 'vrély'.
If I search for 'vr*', it returns the result.
But if I search for 'vrél*', it returns no results.
And if I search without the accented character, as in 'vre*', it returns results again.
My city field type is "text", and its definition in schema.xml is as follows:
        <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                
                
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.ASCIIFoldingFilterFactory"/>
                <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.ASCIIFoldingFilterFactory"/>
                <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
            </analyzer>
        </fieldType>
Any suggestions or solutions to resolve my problem would be appreciated.
Thanks in advance.
Regards, 
Praveen Kumar 

Ivo Lukac

Wednesday 17 August 2011 12:59:15 am

There could be two causes:

- either your index and query analyzers are not the same (e.g. there is a small difference: the query analyzer has catenateWords="0" catenateNumbers="0"), so the tokens are not the same in both situations, or

- the "é" character is somehow badly encoded when it is sent to Solr as a query

I had a similar problem before when I used Jetty, which didn't support UTF-8 queries very well, so I switched to Tomcat. Newer Jetty versions may have resolved those issues; I didn't check.
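
To make the encoding point concrete, here is a minimal Python sketch of how the same query term looks on the wire under two charsets. It only illustrates percent-encoding; the helper name encode_query is hypothetical, and nothing here is specific to Solr, Jetty, or Tomcat.

```python
from urllib.parse import quote, unquote


def encode_query(term: str, charset: str = "utf-8") -> str:
    """Percent-encode a search term for use in a URL query string."""
    # Keep '*' literal so the wildcard stays readable in the URL.
    return quote(term, safe="*", encoding=charset)


utf8_form = encode_query("vrél*")               # 'é' becomes two bytes: C3 A9
latin1_form = encode_query("vrél*", "latin-1")  # 'é' becomes one byte: E9

print(utf8_form)
print(latin1_form)

# A server that decodes the Latin-1 form as UTF-8 does not get 'é' back;
# the invalid byte is replaced with U+FFFD instead.
print(unquote(latin1_form, encoding="utf-8", errors="replace"))
```

If the client and the servlet container disagree on the charset in this way, the term that reaches the analyzer is no longer 'vrél*', so it may be worth checking the raw request URL before changing the schema.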

Anyway, you need to be aware that "vrély" is always tokenized as "vrely"; that is why you are finding it with vr* and vre*.
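
The folding described above can be sketched in plain Python. This is a rough imitation of lowercasing plus accent folding, not Solr's actual filter chain: it ignores the stemmer and the other filters, and the names fold and prefix_matches are hypothetical. It also assumes, as one possible explanation of the reported behaviour, that the prefix of a wildcard query is compared against the index without being folded first.

```python
import unicodedata


def fold(text: str) -> str:
    """Lowercase and strip combining accents (a rough stand-in for
    LowerCaseFilterFactory followed by ASCIIFoldingFilterFactory)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The term as it would be stored in the index under this simplified model.
indexed_term = fold("vrély")


def prefix_matches(raw_prefix: str, analyze: bool) -> bool:
    """Check a wildcard prefix against the indexed term.

    analyze=False models a prefix that bypasses folding (an assumption);
    analyze=True models a prefix that is folded like the indexed text.
    """
    prefix = fold(raw_prefix) if analyze else raw_prefix
    return indexed_term.startswith(prefix)


print(indexed_term)
print(prefix_matches("vr", analyze=False))
print(prefix_matches("vre", analyze=False))
print(prefix_matches("vrél", analyze=False))
print(prefix_matches("vrél", analyze=True))
```

Under these assumptions the unfolded prefix 'vrél' cannot match the folded term, while 'vr' and 'vre' can. Folding the prefix on the client side before sending the wildcard query would be one way to test whether this is what is happening.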

http://www.linkedin.com/in/ivolukac
http://www.netgen.hr/eng/blog
http://twitter.com/ilukac

Philippe VINCENT-ROYOL

Wednesday 17 August 2011 1:24:20 am

Just a question: which version of Solr do you use?

Certified Developer (4.1): http://auth.ez.no/certification/verify/272607
Certified Developer (4.4): http://auth.ez.no/certification/verify/377321

G+ : http://plus.tl/dspe
Twitter : http://twitter.com/dspe