This page last changed on Jul 08, 2008 by smaddox.

When searching for content based on search terms entered by the user, Confluence splits the text of the content into tokens, and then filters and modifies those tokens according to the following rules.

Tokenisation

Confluence uses Lucene's Standard Tokenizer. This splits the text into tokens as follows:

  • Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by white space is considered part of a token.
  • Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
  • Recognises email addresses and internet host names as one token.

An example: The string 'foo-bar5' contains a number, so it is kept as the single token 'foo-bar5' and is not split into 'foo' and 'bar5'. As a result, a search for 'bar5' or 'bar*' will not find that content.
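
The rules above can be sketched in a few lines of Python. This is only a toy approximation for illustration and is not Lucene's Standard Tokenizer. The function name `tokenize` and the `EMAIL` and `HOST` patterns are assumptions made for this example, and real tokenizer behaviour will differ in edge cases.

```python
import re

# Hypothetical, simplified patterns (assumptions for this sketch only).
EMAIL = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")
HOST = re.compile(r"^[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")


def tokenize(text):
    """Toy approximation of the three tokenisation rules; not Lucene itself."""
    tokens = []
    for chunk in text.split():
        # Punctuation at the edges of a chunk is next to white space, so drop it.
        chunk = chunk.strip(".,;:!?()[]{}\"'")
        if not chunk:
            continue
        if EMAIL.match(chunk) or HOST.match(chunk):
            # Email addresses and host names stay as one token.
            tokens.append(chunk)
        elif "-" in chunk and any(c.isdigit() for c in chunk):
            # A hyphenated token containing a number is treated like a
            # product number and is not split.
            tokens.append(chunk)
        else:
            # Split at punctuation (including hyphens), keeping interior dots.
            tokens.extend(p for p in re.split(r"[^\w.]+", chunk) if p)
    return tokens


print(tokenize("foo-bar5"))   # the whole string stays one token
print(tokenize("foo-bar"))    # no number, so it is split at the hyphen
print(tokenize("Mail jane@example.com, please."))
```

Because 'bar5' never appears as a token of its own in the first case, a search for 'bar5' has nothing to match against.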

Filtering

Confluence then:

  • Removes "'s" from the ends of words.
  • Removes the dots from acronyms, e.g. I.B.M. becomes IBM.
  • Converts everything to lower case.
  • Removes common words like 'the' and 'or'.
  • Converts words to their stems. For example, 'fishing' and 'fishes' both become 'fish'.
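
The filtering steps listed above can be sketched as a small Python pipeline. This is a rough illustration under stated assumptions, not Confluence's or Lucene's implementation: the names `filter_tokens`, `naive_stem` and `STOP_WORDS` are hypothetical, the stop-word list is a small made-up subset, and the stemmer is a crude suffix stripper standing in for a real stemming algorithm.

```python
import re

# Illustrative subset only; the real list of common words is an assumption here.
STOP_WORDS = {"the", "or", "a", "an", "and", "of"}


def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def filter_tokens(tokens):
    """Apply the filtering steps in the order they are listed."""
    result = []
    for tok in tokens:
        if tok.endswith("'s"):
            tok = tok[:-2]                      # remove trailing 's
        if re.fullmatch(r"(?:[A-Za-z]\.)+", tok):
            tok = tok.replace(".", "")          # I.B.M. becomes IBM
        tok = tok.lower()                       # convert to lower case
        if tok in STOP_WORDS:
            continue                            # drop common words
        result.append(naive_stem(tok))          # reduce to a stem
    return result


print(filter_tokens(["The", "I.B.M.", "fishing", "fishes", "Bob's"]))
# prints ['ibm', 'fish', 'fish', 'bob']
```

Since 'fishing' and 'fishes' both reduce to 'fish', a search for either word can match content containing the other.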

RELATED TOPICS

Searching Confluence

Document generated by Confluence on Mar 16, 2011 18:19