Sphinx in Practice

ExtraMan

Article link: https://blog.csdn.net/ExtraMan/article/details/70102992

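The project needed fuzzy search. With tens of millions of rows, LIKE performs very poorly: it matches character by character, and a pattern such as '%keyword%' cannot use an index, so every row has to be scanned. MySQL's MyISAM engine does offer full-text indexes, but their performance is unimpressive. A database is not a dedicated search engine, and this kind of work is better handled by a purpose-built program, so I decided to use Sphinx as the full-text indexing tool for MySQL.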

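A single Sphinx index can hold up to 100 million records, and with 10 million records a query takes 0.x seconds (on the order of milliseconds). Index build speed: indexing 1 million records takes only 3 to 4 minutes, indexing 10 million records finishes within 50 minutes, and a delta index that holds only the newest 100,000 records can be rebuilt in a few tens of seconds.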

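Installation

Sphinx is called through its API here rather than through SphinxSE, which avoids recompiling MySQL. Because Chinese text has to be searched, I use coreseek (Sphinx + the LibMMSeg Chinese word segmentation package). Recent coreseek releases bundle the dictionary and the Sphinx sources in one package, so downloading the coreseek package alone is enough: http://www.coreseek.cn/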

1. Unpack and install mmseg-3.2.14

./bootstrap # warnings in the output can be ignored; errors must be resolved
./configure --prefix=/usr/local/mmseg3
make && make install

2. Install csft-3.2.14

sh buildconf.sh # warnings in the output can be ignored; errors must be resolved
./configure --prefix=/usr/local/coreseek --without-unixodbc --with-mmseg --with-mmseg-includes=/usr/local/mmseg3/include/mmseg/ --with-mmseg-libs=/usr/local/mmseg3/lib/ --with-mysql # if configure reports MySQL problems, see the MySQL data source installation notes
make && make install

3. Test mmseg segmentation and coreseek search

cd testpack
cat var/test/test.xml # the Chinese text should display correctly here
/usr/local/mmseg3/bin/mmseg -d /usr/local/mmseg3/etc var/test/test.xml
/usr/local/coreseek/bin/indexer -c etc/csft.conf --all # build the indexes
/usr/local/coreseek/bin/search -c etc/csft.conf 网络搜索 # run a query

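Configuration file

Copy the sample configuration file: cp /usr/local/coreseek/etc/sphinx-min.conf.dist /usr/local/coreseek/etc/sphinx.conf. The complete file can be found in the attachment.

For a detailed walkthrough of every option, see http://www.cnblogs.com/yjf512/p/3598332.html

Only the key points are covered here.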

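1. source (data source)

The source name is arbitrary, but it must match the source referenced later in the index section.

type is set to mysql.

sql_query_pre = SET NAMES utf8 sets the character encoding of the connection.

sql_query = the MySQL statement that fetches the documents. The first column must be an integer document ID. Every other column that appears in the statement and is not declared with an "sql_attr_<type>" setting is a full-text field.

sql_attr_<type> declares attributes, which serve two purposes: their stored values are returned with each match, and they can be used as filter conditions.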

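2. index

   source      = main_src            # data source for this index
   path        = /data/coreseek/var/brief_info_all_main # path where the index files are stored
   docinfo     = extern  # document info storage mode: extern keeps document IDs and attributes in the .spa file, separate from the document lists in the .spd file
   mlock       = 0  # memory locking of cached data. searchd preloads the .spa and .spi files into RAM, but data that is not accessed for a long time may be swapped out to disk; 1 locks it in RAM, 0 disables locking
   morphology   = none # morphology preprocessor
   min_word_len = 1 # minimum indexed word length; shorter words are not indexed
   html_strip   = 0 # HTML stripping: whether to remove HTML markup from the incoming full-text data
   # Chinese word segmentation settings, see http://www.coreseek.cn/products-install/coreseek_mmseg/
   charset_dictpath = /usr/local/mmseg3/etc/ # directory that holds the Chinese segmentation dictionary
   charset_type  = zh_cn.utf-8
   ngram_len     = 0 # required: disables the built-in single-character (unigram) splitting so that it does not interfere with Chinese segmentation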

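Client access

The api directory contains client APIs for several languages; all of them connect to the searchd service over TCP. The example here uses the C client library from C++.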

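#include <stdio.h>
#include <stdint.h>
#include <sphinxclient.h>

int main(){
	const char * query="庄进发";
	const char * index="brief_info_all_main";

	// Placeholder values; in a real program these come from the request
	uint32_t uipageIndex=1;    // 1-based page number
	uint32_t uipageLimit=20;   // page size
	uint32_t uiStartTime=0;            // lower bound of enter_time
	uint32_t uiEndTime=0xFFFFFFFFu;    // upper bound of enter_time

	sphinx_client * client=sphinx_create ( SPH_TRUE );
	if ( !client ){
		fprintf ( stderr, "sphinx_create failed\n" );
		return 1;
	}
	// Host and port of searchd
	sphinx_set_server ( client, "localhost", 9312 );
	// Match mode: a document must contain all query words
	sphinx_set_match_mode ( client, SPH_MATCH_ALL );
	// Sort order: by weight, then by document id, both descending
	sphinx_set_sort_mode ( client, SPH_SORT_EXTENDED, "@weight DESC, @id DESC" );
	// offset, limit, max_matches, cutoff; to go beyond the maximum,
	// max_matches in the searchd configuration has to be raised as well
	sphinx_set_limits ( client, (uipageIndex-1)*uipageLimit, uipageLimit, 10000, 0 );

	// Clear any previous filters
	sphinx_reset_filters ( client );
	// Filter on an attribute value
	sphinx_int64_t values[]={1002};
	sphinx_add_filter ( client, "product", 1, values, SPH_FALSE );
	// Filter on a time range
	sphinx_add_filter_range ( client, "enter_time", uiStartTime, uiEndTime, SPH_FALSE );

	// Run the query
	sphinx_result * res = sphinx_query ( client, query, index, NULL );
	if ( !res ){
		fprintf ( stderr, "query failed: %s\n", sphinx_error ( client ) );
		sphinx_destroy ( client );
		return 1;
	}

	printf ( "Query '%s' retrieved %d of %d matches in %d.%03d sec.\n", query, res->total, res->total_found, res->time_msec/1000, res->time_msec%1000 );

	// The result is a two-dimensional table: one row per match, one column per attribute
	for ( int i=0; i<res->num_matches; i++ ){
		for ( int j=0; j<res->num_attrs; j++ ){
			if ( res->attr_types[j]==SPH_ATTR_STRING ){
				printf ( "%s\t", sphinx_get_string ( res, i, j ) );
			}else{
				printf ( "%u\t", (unsigned int)sphinx_get_int ( res, i, j ) );
			}
		}
		printf ( "\n" );
	}

	sphinx_destroy ( client );
	return 0;
}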
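
The offset passed to sphinx_set_limits above is computed as (page - 1) * page size using unsigned values, so a page number of 0 would wrap around to a huge offset, and an offset at or beyond max_matches is rejected by searchd. The following is a minimal sketch of a guard for that arithmetic. PageWindow and page_window are hypothetical names introduced here for illustration; they are not part of the Sphinx client library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper type: the offset/limit pair to hand to sphinx_set_limits.
struct PageWindow {
    int offset;  // index of the first match to return
    int limit;   // number of matches to return
};

// Computes the window for a 1-based page number.
// Returns false when the page cannot be served: page 0, an empty page size,
// a non-positive max_matches, or an offset at or beyond max_matches.
bool page_window(uint32_t page_index, uint32_t page_limit,
                 int max_matches, PageWindow* out) {
    assert(out != nullptr);
    if (page_index == 0 || page_limit == 0 || max_matches <= 0) {
        return false;
    }
    // 64-bit arithmetic so the multiplication cannot overflow.
    const uint64_t offset = static_cast<uint64_t>(page_index - 1) * page_limit;
    const uint64_t cap = static_cast<uint64_t>(max_matches);
    if (offset >= cap) {
        return false;
    }
    // Clamp the last page so that offset + limit never exceeds max_matches.
    const uint64_t remaining = cap - offset;
    out->offset = static_cast<int>(offset);
    out->limit = static_cast<int>(std::min<uint64_t>(page_limit, remaining));
    return true;
}
```

With this in place, the call in the client could become sphinx_set_limits(client, w.offset, w.limit, 10000, 0) once page_window(uipageIndex, uipageLimit, 10000, &w) has returned true; a false return can be reported to the caller as an empty page.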

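Additional tips:

1. What if a filter attribute is a string, for example the creator's RTX name?

sphinx_add_filter does not support strings. I convert the string to an integer with crc32 and filter on that integer.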
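
As an illustration of the crc32 approach, here is a minimal sketch of the standard CRC-32 (the variant used by zlib) in C++. The function names crc32_of and crc32_self_check are made up for this example. It assumes that the indexed attribute was filled with the same CRC-32 value on the indexing side, for instance by applying MySQL's CRC32() to the column in sql_query and declaring the result as an integer attribute; that part of the setup is an assumption, not something shown in the configuration above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Standard CRC-32 with the reflected polynomial 0xEDB88320, computed bit by bit.
// Slower than a table-driven version, but short and easy to verify.
uint32_t crc32_of(const std::string& text) {
    uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char byte : text) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit) {
            // mask is all ones when the lowest bit is set, otherwise zero
            const uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

// Sanity check against the published CRC-32 check value for "123456789".
void crc32_self_check() {
    assert(crc32_of("123456789") == 0xCBF43926u);
}
```

The resulting number can then be placed in a sphinx_int64_t array and passed to sphinx_add_filter like any other integer value. Two different strings can map to the same CRC-32, so if exact matching matters the returned rows may need a final check against the original string.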
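2. How should newly added data be handled?

Rebuilding the whole index is expensive, so a main index plus a delta index is used: the main index is rebuilt once a day, the delta index is rebuilt every 5 minutes, and merge is used to fold the delta index into the main index:
/usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/sphinx_brief_all.conf --merge brief_info_all_main brief_info_all_delta --rotate

3. What if Chinese matching is inaccurate?

Indexing relies on mmseg word segmentation, so if the default dictionary is not sufficient, custom words can be added to it; see https://www.mawenbao.com/note/mmseg-custom-dict.html
Dictionaries can be downloaded from Sogou: http://pinyin.sogou.com/dict/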

# Convert the Sogou dictionary to mmseg's text format
./extract-sougou-dict.py leader.scel -o sougou-dict.txt -mmseg

# Merge the dictionaries
./merge-mmseg-dict.py -a unigram.txt -b sougou-dict.txt -o merged.txt

# Build the dictionary library
mv merged.txt /usr/local/mmseg3/etc/unigram.txt
/usr/local/mmseg3/bin/mmseg -u /usr/local/mmseg3/etc/unigram.txt

# Replace the existing uni.lib
cd /usr/local/mmseg3/etc
mv unigram.txt.uni uni.lib