[LUCENE-1488] Multilingual analyzer based on ICU.
DM, I really appreciate your review.
You have brought up some good ideas that I haven't yet thought about.
All I see is a bit of JavaDoc and an extraneous unused variable (ICUTokenizer: private PositionIncrementAttribute posIncAtt).
Yeah, there are some TODOs, and cleanup on the tokenstreams and the API in general. It's not easy to customize the way it's supposed to be: where you as a user can actually supply BreakIterator impls to the tokenizer and say "use these rules/dictionary/whatever for tokenizing XYZ script only".
I'm wondering whether it would make sense to have multiple representations of a token with the same position in the index.

[LUCENE-1343] A replacement for ISOLatin1AccentFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
Robert Muir, would it make sense to have a Greek filter that strips diacritics?
My thought is that if the letter is Greek then the diacritics would be removed, but otherwise they would not.
The GreekLowerCaseFilter (incorrectly named) does this also, somewhat. It removes tone marks... but this might not be what you "want" (depending on what that is) if you are dealing with polytonic Greek (sorry for my ignorance of the biblical text you are looking at, but I think it is ancient Greek?).
Yes, I'm referring to ancient Greek (grc, not el), and they are tone and breathing marks.

Solr - User - Best way to index without diacritics.
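To make the "strip diacritics only when the letter is Greek" idea concrete, here is a minimal sketch using only the JDK (java.text.Normalizer and Character.UnicodeScript). It is not the LUCENE-1343 patch and not an existing Lucene filter. The class and method names are made up for illustration, and treating every non-spacing mark that follows a Greek base letter as removable (tone marks, breathing marks, and also the iota subscript) is an assumption that may be too aggressive for some texts.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Lucene: removes combining marks
// from Greek letters only and leaves other scripts untouched.
public class GreekDiacriticStripper {

    public static String strip(String input) {
        // Decompose so that each accented letter becomes base letter + combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(decomposed.length());
        boolean baseIsGreek = false; // script of the most recent non-mark code point

        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            i += Character.charCount(cp);

            if (Character.getType(cp) == Character.NON_SPACING_MARK) {
                // Assumption: any non-spacing mark on a Greek base is dropped.
                if (!baseIsGreek) {
                    out.appendCodePoint(cp);
                }
            } else {
                baseIsGreek =
                    Character.UnicodeScript.of(cp) == Character.UnicodeScript.GREEK;
                out.appendCodePoint(cp);
            }
        }
        // Recompose so that untouched accented letters (e.g. Latin) return to their usual form.
        return Normalizer.normalize(out.toString(), Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Polytonic Greek word followed by an accented Latin word.
        System.out.println(strip("\u1F04\u03BD\u03B8\u03C1\u03C9\u03C0\u03BF\u03C2 caf\u00E9"));
    }
}
```

In a Lucene analysis chain the same per-character logic would sit inside a TokenFilter operating on the term text. Whether to apply it at index time, query time, or both depends on whether accent-sensitive matching is still needed for the same field.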