最近在做新闻评论数据的网络爬虫,需要将爬去到的评论数据存储在MySQL数据库中,提供评论的查询和展示功能。后来随着数据量的上升,MySQL查询性能急剧下降,有时一个查询要等几分种才有结果,无奈之下只好尝试下数据库分区。
开始评论表的结构如下:
CREATE TABLE `comment` (
`appID` int(11) NOT NULL,
`commentID` varchar(50) NOT NULL,
`uname` varchar(100) DEFAULT NULL,
`content` text,
`cdate` datetime DEFAULT NULL,
`ucity` varchar(50) DEFAULT NULL,
`upnum` int(11) DEFAULT NULL,
`downnum` int(11) DEFAULT NULL,
`newsID` varchar(50) DEFAULT NULL,
`isDeleted` tinyint(1) DEFAULT '0',
PRIMARY KEY (`commentID`,`appID`),
KEY `abc` (`cdate`) USING BTREE,
KEY `bcd` (`commentID`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
开始以为这样搞,就万事大吉,后来随着数据的增多,达到5千万级别,发现查询越来越慢,因为评论的查询大部分查询的是评论的内容content,使用like ‘%expression%’做的,当然也可以使用lucene或者solr来创建倒排索引进行查询,本文目标不在此。经过调研发现,评论表的数据具有很高的时效性,也可以对评论表按时间进行分区,来加快查询速度。
创建分区的步骤如下:
1.增加时间cdate为主键
SQL语句如下:
ALTER TABLE `comment`
MODIFY COLUMN `cdate` datetime NOT NULL AFTER `content`,
DROP PRIMARY KEY,
ADD PRIMARY KEY (`commentID`, `appID`, `cdate`);
2.执行分区SQL语句,这里直接创建两年内的分区。自动维护比较麻烦,需要创建存储过程和事件调度器(可以简单理解为时间触发器,按时间周期触发某种操作,这里特指,执行存储过程创建新的分区操作),其中存储过程比较难写,时间计算逻辑复杂。
分区SQL语句:
alter table comment_1 PARTITION BY RANGE(TO_DAYS (cdate))
(
PARTITION p20150115 VALUES LESS THAN (TO_DAYS('2015-01-15')),
PARTITION p20150131 VALUES LESS THAN (TO_DAYS('2015-01-31')),
PARTITION p20150215 VALUES LESS THAN (TO_DAYS('2015-02-15')),
PARTITION p20150228 VALUES LESS THAN (TO_DAYS('2015-02-28')),
PARTITION p20150315 VALUES LESS THAN (TO_DAYS('2015-03-15')),
PARTITION p20150331 VALUES LESS THAN (TO_DAYS('2015-03-31')),
PARTITION p20150415 VALUES LESS THAN (TO_DAYS('2015-04-15')),
PARTITION p20150430 VALUES LESS THAN (TO_DAYS('2015-04-30')),
PARTITION p20150515 VALUES LESS THAN (TO_DAYS('2015-05-15')),
PARTITION p20150531 VALUES LESS THAN (TO_DAYS('2015-05-31')),
PARTITION p20150615 VALUES LESS THAN (TO_DAYS('2015-06-15')),
PARTITION p20150630 VALUES LESS THAN (TO_DAYS('2015-06-30')),
PARTITION p20150715 VALUES LESS THAN (TO_DAYS('2015-07-15')),
PARTITION p20150731 VALUES LESS THAN (TO_DAYS('2015-07-31')),
PARTITION p20150815 VALUES LESS THAN (TO_DAYS('2015-08-15')),
PARTITION p20150831 VALUES LESS THAN (TO_DAYS('2015-08-31')),
PARTITION p20150915 VALUES LESS THAN (TO_DAYS('2015-09-15')),
PARTITION p20150930 VALUES LESS THAN (TO_DAYS('2015-09-30')),
PARTITION p20151015 VALUES LESS THAN (TO_DAYS('2015-10-15')),
PARTITION p20151031 VALUES LESS THAN (TO_DAYS('2015-10-31')),
PARTITION p20151115 VALUES LESS THAN (TO_DAYS('2015-11-15')),
PARTITION p20151130 VALUES LESS THAN (TO_DAYS('2015-11-30')),
PARTITION p20151215 VALUES LESS THAN (TO_DAYS('2015-12-15')),
PARTITION p20151231 VALUES LESS THAN (TO_DAYS('2015-12-31')),
PARTITION p20160115 VALUES LESS THAN (TO_DAYS('2016-01-15')),
PARTITION p20160131 VALUES LESS THAN (TO_DAYS('2016-01-31')),
PARTITION p20160215 VALUES LESS THAN (TO_DAYS('2016-02-15')),
PARTITION p20160228 VALUES LESS THAN (TO_DAYS('2016-02-28')),
PARTITION p20160315 VALUES LESS THAN (TO_DAYS('2016-03-15')),
PARTITION p20160331 VALUES LESS THAN (TO_DAYS('2016-03-31')),
PARTITION p20160415 VALUES LESS THAN (TO_DAYS('2016-04-15')),
PARTITION p20160430 VALUES LESS THAN (TO_DAYS('2016-04-30')),
PARTITION p20160515 VALUES LESS THAN (TO_DAYS('2016-05-15')),
PARTITION p20160531 VALUES LESS THAN (TO_DAYS('2016-05-31')),
PARTITION p20160615 VALUES LESS THAN (TO_DAYS('2016-06-15')),
PARTITION p20160630 VALUES LESS THAN (TO_DAYS('2016-06-30')),
PARTITION p20160715 VALUES LESS THAN (TO_DAYS('2016-07-15')),
PARTITION p20160731 VALUES LESS THAN (TO_DAYS('2016-07-31')),
PARTITION p20160815 VALUES LESS THAN (TO_DAYS('2016-08-15')),
PARTITION p20160831 VALUES LESS THAN (TO_DAYS('2016-08-31')),
PARTITION p20160915 VALUES LESS THAN (TO_DAYS('2016-09-15')),
PARTITION p20160930 VALUES LESS THAN (TO_DAYS('2016-09-30')),
PARTITION p20161015 VALUES LESS THAN (TO_DAYS('2016-10-15')),
PARTITION p20161031 VALUES LESS THAN (TO_DAYS('2016-10-31')),
PARTITION p20161115 VALUES LESS THAN (TO_DAYS('2016-11-15')),
PARTITION p20161130 VALUES LESS THAN (TO_DAYS('2016-11-30')),
PARTITION p20161215 VALUES LESS THAN (TO_DAYS('2016-12-15')),
PARTITION p20161231 VALUES LESS THAN (TO_DAYS('2016-12-31'))
);
创建分区耗时:
在122万数据集上新建分区耗时接近36分钟(ps:本地磁盘为机械硬盘,性能较差,曾一度认为分区失败,不过最后还是创建分区成功)
查看分区结果:
查看方式:执行SQL语句:
select
partition_name part,
partition_expression expr,
partition_description descr,
table_rows
from information_schema.partitions where
table_schema = schema()
and table_name='comment';
程序查询验证:
执行SQL语句:
select * from comment where content like '%南京%' and cdate > '2015-11-01 00:00:00'
查询结果:
折线图如下
结论
1.分区查询时间平均比未分区查询快上3~4倍。
2.对已有大量数据的数据表添加分区比较耗时。