Date: 2014-05-16

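Translation of Lucene's NumericRangeQuery javadoc
We have developed an extension to Lucene that stores numeric values in a special string-encoded format with variable precision (all numeric values such as double, long, float, and int are converted to lexicographically sortable string representations and stored at different precisions; for details of how the values are stored, see NumericUtils). A range is then split recursively into multiple smaller intervals for searching: the middle of the range is searched in the trie at low precision, while the boundaries are searched at high precision. This sharply reduces the number of terms.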
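The conversion to lexicographically sortable strings described above can be illustrated with a small sketch. This is not the encoding that NumericUtils actually uses (Lucene has its own prefix-coded format); the class name `SortableEncodingSketch`, the method names, and the fixed-width hex output are assumptions chosen only to show why such an encoding lets string comparison agree with numeric comparison.

```java
final class SortableEncodingSketch {

    // Maps a double to a long whose signed order matches the numeric order
    // of the doubles: for negative inputs, all bits except the sign bit are
    // flipped so that "more negative" becomes "smaller".
    static long toSortableLong(double value) {
        long bits = Double.doubleToLongBits(value);
        if (bits < 0) {
            bits ^= 0x7fffffffffffffffL;
        }
        return bits;
    }

    // Hypothetical string form: flip the sign bit so signed order becomes
    // unsigned order, then print as 16 fixed-width hex digits. Fixed-width
    // hex strings compare lexicographically in the same order as the
    // unsigned numbers they represent.
    static String encode(double value) {
        return String.format("%016x", toSortableLong(value) ^ Long.MIN_VALUE);
    }

    public static void main(String[] args) {
        double[] samples = {-3.0, -1.5, 0.0, 2.0, 3.5};
        for (double d : samples) {
            System.out.println(encode(d) + "  <-  " + d);
        }
    }
}
```

Because every encoded string has the same length, dropping trailing digits gives a lower-precision prefix shared by a whole block of neighboring values, which is the property the trie relies on.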

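For the variant that stores long values at 8 different precisions (each reduced by 8 bits), the lowest precision is a single byte, so the index contains at most 256 distinct values at the lowest precision. Overall, a range can consist of at most 7*255*2 + 255 = 3825 distinct terms (this happens when the index holds a term for every distinct value of an 8-byte number and the range covers almost all of them; at most 255 distinct values are used, because a full set of 256 values can always be reduced to a single term at lower precision). In practice, we have seen up to 300 terms in most cases (an index with 500,000 metadata records and a uniform value distribution).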

We have developed an extension to Apache Lucene that stores the numerical values in a special string-encoded format with variable precision (all numerical values like doubles, longs, floats, and ints are converted to lexicographic sortable string representations and stored with different precisions, for a more detailed description of how the values are stored, see NumericUtils). A range is then divided recursively into multiple intervals for searching: The center of the range is searched only with the lowest possible precision in the trie, while the boundaries are matched more exactly. This reduces the number of terms dramatically.
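The recursive splitting described above can be sketched as follows. This is a simplified illustration, not Lucene's implementation: the class `TrieRangeSketch` and its methods are hypothetical names, it assumes non-negative values with a value size of at most 32 bits (so overflow handling is omitted), and the demo uses 16-bit values with a 4-bit precision step instead of 64-bit values with an 8-bit step to keep the numbers readable.

```java
import java.util.ArrayList;
import java.util.List;

final class TrieRangeSketch {

    // Splits [min, max] into sub-ranges stored as {lower, upper, shift}.
    // A sub-range at a given shift is matched by terms that keep only the
    // bits above that shift, i.e. lower-precision prefixes.
    // Assumes 0 <= min, max < 2^valSize and valSize <= 32.
    static List<long[]> split(long min, long max, int valSize, int step) {
        List<long[]> parts = new ArrayList<>();
        if (min > max) {
            return parts;
        }
        for (int shift = 0; ; shift += step) {
            long diff = 1L << (shift + step);
            long mask = ((1L << step) - 1L) << shift;
            // A boundary needs its own sub-range at this precision when it
            // does not sit exactly on the edge of the next coarser block.
            boolean hasLower = (min & mask) != 0L;
            boolean hasUpper = (max & mask) != mask;
            long nextMin = (hasLower ? min + diff : min) & ~mask;
            long nextMax = (hasUpper ? max - diff : max) & ~mask;
            if (shift + step >= valSize || nextMin > nextMax) {
                // Lowest available precision, or nothing left in the center.
                parts.add(part(min, max, shift));
                break;
            }
            if (hasLower) {
                parts.add(part(min, min | mask, shift));
            }
            if (hasUpper) {
                parts.add(part(max & ~mask, max, shift));
            }
            min = nextMin;
            max = nextMax;
        }
        return parts;
    }

    // Fills in the bits below the shift on the upper bound so that each
    // sub-range reports the full span of values it covers.
    private static long[] part(long lower, long upper, int shift) {
        return new long[] {lower, upper | ((1L << shift) - 1L), shift};
    }

    // Upper bound on the number of terms visited: one per distinct prefix.
    static long totalTerms(List<long[]> parts) {
        long total = 0;
        for (long[] p : parts) {
            int shift = (int) p[2];
            total += (p[1] >>> shift) - (p[0] >>> shift) + 1;
        }
        return total;
    }

    public static void main(String[] args) {
        List<long[]> parts = split(0x0123L, 0x0456L, 16, 4);
        for (long[] p : parts) {
            System.out.printf("shift=%d  [0x%04X, 0x%04X]%n", p[2], p[0], p[1]);
        }
        System.out.println("terms: " + totalTerms(parts));
    }
}
```

For the range 0x0123 to 0x0456 this sketch yields five sub-ranges touching at most 40 prefix terms, although the range spans 820 distinct values: the two boundaries are handled at full precision and the center at the coarsest precision that still fits.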

For the variant that stores long values in 8 different precisions (each reduced by 8 bits) that uses a lowest precision of 1 byte, the index contains only a maximum of 256 distinct values in the lowest precision. Overall, a range could consist of a theoretical maximum of 7*255*2 + 255 = 3825 distinct terms (when there is a term for every distinct value of an 8-byte-number in the index and the range covers almost all of them; a maximum of 255 distinct values is used because it would always be possible to reduce the full 256 values to one term with degraded precision). In practice, we have seen up to 300 terms in most cases (index with 500,000 metadata records and a uniform value distribution).
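The arithmetic behind 7*255*2 + 255 can be written out as a small helper. One reading of the formula is that each of the 7 higher-precision levels contributes at most 255 terms on each of the two boundaries, and the lowest-precision level contributes at most 255 terms for the center. The class `TrieTermBound`, the method `maxTerms`, and the generalization to other value sizes and precision steps are assumptions made for illustration; the text above only states the 8-byte, 8-bit case.

```java
final class TrieTermBound {

    // Theoretical maximum number of distinct terms for one range, assuming
    // the value size is an exact multiple of the precision step.
    static long maxTerms(int valSize, int step) {
        if (step < 1 || valSize % step != 0) {
            throw new IllegalArgumentException("valSize must be a multiple of step");
        }
        int levels = valSize / step;          // number of stored precisions
        long perLevel = (1L << step) - 1L;    // 2^step values minus the one that collapses
        // Two boundaries on every level except the coarsest, plus the center.
        return (levels - 1) * perLevel * 2 + perLevel;
    }

    public static void main(String[] args) {
        System.out.println("64-bit values, 8-bit step: " + maxTerms(64, 8));
        System.out.println("64-bit values, 4-bit step: " + maxTerms(64, 4));
    }
}
```

Under this reading, a smaller precision step means more levels but far fewer terms per level, which is why the bound drops as the step shrinks.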