日期:2014-05-16  浏览次数:20451 次

coreseek一元切分模式中英文单词不切分问题
        网站搜索使用coreseek(sphinx),采用的一元分词模式,但按照官方网站的文档说明,却不支持英文单词、数字串一元分词,如:光华路SOHO,输入soho中任一字母不能查找出soho;输入soho可以查出,如标题中仅一个字母时,是可以的,如光华路h,输入“h”,可以查出,由此推断英文单词没有做一元分词索引,仔细查看文档:
(http://www.coreseek.cn/products-install/ngram_len_cjk/ 文档地址,此处仅列出主要部分)
#部分文档:

     ngram_chars = U+4E00..U+9FBF, U+3400..U+4DBF, U+20000..U+2A6DF, U+F900..U+FAFF,\
U+2F800..U+2FA1F, U+2E80..U+2EFF, U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF,\
U+3040..U+309F, U+30A0..U+30FF, U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF,\
U+3130..U+318F, U+A000..U+A48F, U+A490..U+A4CF
     
charset_table = U+FF10..U+FF19->0..9, 0..9, U+FF41..U+FF5A->a..z, U+FF21..U+FF3A->a..z,\
A..Z->a..z, a..z, U+0149, U+017F, U+0138, U+00DF, U+00FF, U+00C0..U+00D6->U+00E0..U+00F6,\ ......略..


# end

   其中: ngram_chars 表示要进行一元字符切分模式的字符集;
          charset_table 表示可被一元字符切分模式认可的有效字符集;

    仔细对比字符集开头,发现ngram_chars中没有数字与英文字母的集合,呵呵!终于找到原因了,将charset_table字符集开头:“U+FF10..U+FF19->0..9,0..9,U+FF41..U+FF5A->a..z,U+FF21..U+FF3A->a..z,A..Z->a..z, a..z,”部分,复制到ngram_char字符集前头如下:
    ngram_chars =U+FF10..U+FF19->0..9, 0..9, U+FF41..U+FF5A->a..z, U+FF21..U+FF3A->a..z,\
A..Z->a..z, a..z, U+4E00..U+9FBF, U+3400..U+4DBF, U+20000..U+2A6DF, U+F900..U+FAFF,\
U+2F800..U+2FA1F, U+2E80..U+2EFF, U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF,\
U+3040..U+309F, U+30A0..U+30FF, U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF,\
U+3130..U+318F, U+A000..U+A48F, U+A490..U+A4CF
     
charset_table = U+FF10..U+FF19->0..9, 0..9, U+FF41..U+FF5A->a..z, U+FF21..U+FF3A->a..z,\
A..Z->a..z, a..z, U+0149, U+017F, U+0138, U+00DF, U+00FF, U+00C0..U+00D6->U+00E0..U+00F6,\ ......略..
重新执行索引,问题解决。