The ICU configuration files contain a rule to remove control characters:
<transform rule="[:Control:] Any-Remove"/>
This rule is applied before tokenization.
The problem is that the "[:Control:]" character class includes line feed, carriage return and tab. See http://www.regular-expressions.info/posixbrackets.html.
So when a field containing several lines is indexed, the last word of each line is joined with the first word of the next line. Those words are then not searchable.
For example:
First line
Second line
This will become "First lineSecond line", tokenized as "First", "lineSecond" and "line".
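The effect can be reproduced outside Zebra with a minimal sketch. It assumes that "[:Control:]" corresponds to the Unicode general category Cc, and it uses Python's whitespace split as a stand-in for the ICU tokenizer. The function names are hypothetical and are not part of Zebra or ICU.

```python
import re
import unicodedata


def remove_all_controls(text):
    # Approximates the old rule "[:Control:] Any-Remove":
    # drop every character whose general category is Cc,
    # which includes tab, line feed and carriage return.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")


# Same character class as the patched rule: control characters
# other than \t (0x09), \n (0x0A) and \r (0x0D).
NARROW_CONTROLS = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]")


def remove_narrow_controls(text):
    # Approximates the new rule; line separators are kept.
    return NARROW_CONTROLS.sub("", text)


text = "First line\nSecond line"

# Whitespace splitting is only a stand-in for ICU tokenization.
old_tokens = remove_all_controls(text).split()
new_tokens = remove_narrow_controls(text).split()

print(old_tokens)
print(new_tokens)
```

With the old behaviour the token "Second" never exists, which is why a search on it finds nothing.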
Test plan:
- Use ICU in Zebra configuration
- Choose an indexed field, like 300$a
- Create a new record
- Enter several lines in the chosen field, like:
First line
Second line
- Index this record
=> Without the patch, a search on "Second" does not return the record
=> With the patch, a search on "Second" returns the record
- Repeat the same tests with tab and carriage return instead of line feed
Signed-off-by: Chris Cormack <chris@bigballofwax.co.nz>
Signed-off-by: Kyle M Hall <kyle@bywatersolutions.com>
Signed-off-by: Tomas Cohen Arazi <tomascohen@gmail.com>
<icu_chain locale="">
- <transform rule="[:Control:] Any-Remove"/>
+ <!-- Remove control characters except \t\n\r -->
+ <transform rule="[\x00-\x08\x0B\x0C\x0E-\x1F\x7F] Any-Remove"/>
<tokenize rule="l"/>
<transform rule="[:Punctuation:] Remove"/>
<transform rule="NFD"/>
<icu_chain locale="">
<transliterate rule="\'>\ "/>
<transliterate rule="[:Number:] { '-' > '' "/>
- <transform rule="[:Control:] Any-Remove"/>
+ <!-- Remove control characters except \t\n\r -->
+ <transform rule="[\x00-\x08\x0B\x0C\x0E-\x1F\x7F] Any-Remove"/>
<tokenize rule="l"/>
<transform rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>
<transform rule="NFD"/>