3

日常问题: SQL优化 - Ryan.Miao

 2 years ago
source link: https://www.cnblogs.com/woshimrf/p/16632961.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

日常问题: SQL优化

日常开发中,除了开辟新项目,业务需求开发,一般还要做负责系统的日常运维。比如线上告警了,出bug了,必须及时修复。这天,运维反馈mysql cpu告警了,然后抓了该时间节点的慢sql日志,要开发分析解决。

686418-20220828160719967-83619073.png

686418-20220828160725191-565153009.png

拿到的慢sql日志:

# Query 1: 1.16 QPS, 1.96x concurrency, ID 0x338A0AEE1CFE3C1D at byte 7687104
# This item is included in the report because it matches --limit.
# Scores: V/M = 0.02
# Time range: 2022-08-12T16:30:00 to 2022-08-12T17:11:32
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count         99    2880
# Exec time     99   4893s      1s      2s      2s      2s   172ms      2s
# Lock time     99   187ms    52us   343us    64us    84us    11us    60us
# Rows sent     97     248       0       1    0.09    0.99    0.28       0
# Rows examine  96 871.46M 308.56k 311.13k 309.85k 298.06k       0 298.06k
# Query size    99 812.81k     289     289     289     289       0     289
# String:
# Hosts        10.22.9.183 (742/25%), 10.26.9.126 (730/25%)... 2 more
# Users        order
# Query_time distribution
#   1us
#  10us
# 100us
#   1ms
#  10ms
# 100ms
#    1s  ################################################################
#  10s+
# Tables
#    SHOW TABLE STATUS LIKE 'serial_number_store'\G
#    SHOW CREATE TABLE `serial_number_store`\G
# EXPLAIN /*!50100 PARTITIONS*/
select *
        from serial_number_store sn
        where 1=1
          and company_code = '8511378117' 
          and warehouse_code = '851' 
          and sku_no = '6902952880' 
          and (serial_no = '5007894' or sub_serial_no = 'v')\G

查询数据库定义,发现定义了几个index

  PRIMARY KEY (`ID`),
  KEY `IDX_SERIAL_NUMBER_2` (`WAREHOUSE_CODE`),
  KEY `IDX_SERIAL_NUMBER_3` (`SKU_NO`),
  KEY `IDX_SERIAL_NUMBER_4` (`SERIAL_NO`),
  KEY `IDX_SERIAL_NUMBER_5` (`SUB_SERIAL_NO`),
  KEY `IDX_SERIAL_NUMBER_6` (`SKU_NAME`),
  KEY `IDX_SERIAL_NUMBER_1` (`COMPANY_CODE`,`WAREHOUSE_CODE`,`SKU_NO`,`SERIAL_NO`) USING BTREE

按最左匹配原则,这条sql应该只会命中一个索引。因为or的另一半无法match。

explain发现实际执行计划:

key: IDX_SERIAL_NUMBER_3
key_len: 259
ref: const
rows: 45864
filtered:  0.95
Extra: Using where

表总数量: 13658763

or的优化技巧之一就是拆成2个可以命中索引的sql, 然后union all.

优化为union all

 explain select *
        from  serial_number_store sn
        where  company_code = '9311046897' 
          and warehouse_code = '931DCA' 
          and sku_no = '6935117818696' 
          and serial_no = '862517054251459'
		  
		  union all
		  
		  select *
        from  serial_number_store sn
        where  company_code = '9311046897' 
          and warehouse_code = '931DCA' 
          and sku_no = '6935117818696' 
          and sub_serial_no = '862517054251459';

最终explain

key:  IDX_SERIAL_NUMBER_4  IDX_SERIAL_NUMBER_5
ref: const        const
rows: 1     1
filtered: 5.0    5.0
extra: using where 

正常到这里,找到解决方案,就算完事了。但作为线上问题的处理,你得分析为啥以前没事,现在出问题了。

查询对应的链路追踪情况:

686418-20220828160735447-917122010.png

和猜测一致,短时间内批量查询。几乎每条sql2s多耗时。虽然是后台任务,但数据量太大导致cpu 100%.

定位实际的代码,mybatis是这么写:

    <sql id="servialNumberStoreEntityParams">
    	<if test="id!=null and id!=''"> and ID = #{id}</if>
        <if test="companyCode!=null and companyCode!=''"> and company_code = #{companyCode}</if>
        <if test="warehouseCode!=null and warehouseCode!=''"> and warehouse_code = #{warehouseCode}</if>
        <if test="sku!=null and sku!=''"> and sku_no = #{sku}</if>
        <if test="serialNo!=null and serialNo!=''"> and (serial_no = #{serialNo} or sub_serial_no = #{serialNo})</if>
        <if test="lotNum!=null and lotNum!=''"> and lot_num = #{lotNum}</if>
    </sql>

这个查询片段有多个sql引用了。比如

select *
from serial_number_store sn
where 1=1
<include refid="servialNumberStoreEntityParams" />

改造成union也不是不行,比如

select *
        from  serial_number_store sn
        where 1=1
        <include refid="servialNumberStoreEntityParams" />
        <if test="serialNo!=null and serialNo!=''">
            and serial_no = #{serialNo}
            union all
            select *
            from cwsp_tb_serial_number_store sn
            where 1=1
            <include refid="servialNumberStoreEntityParams" />
            and sub_serial_no = #{serialNo}
        </if>

但前面说了多个片段引用了,对应多个sql查询方法,然后这多个sql查询方法又会对应多个业务调用。那问题来了,如果改完要测的话,业务场景该怎么测?一时犹豫了,要不要再花额外的时间去搞回归测试,验证。

和运维小哥说,反正是个后台任务,先不改吧。运维看没影响到业务(没人投诉)也就不管了。

然后第二天上班又收到了告警。逃不掉了。

定位代码的时候,发现有个update

<update id="update">
        update serial_number_store
        <set>
            <if test="companyCode!=null and companyCode!=''">  COMPANY_CODE = #{companyCode},</if>
            <if test="warehouseCode!=null and warehouseCode!=''">  WAREHOUSE_CODE = #{warehouseCode},</if>
            <if test="sku!=null and sku!=''">  SKU_NO = #{sku},</if>
            <if test="serialNo!=null and serialNo!=''">  SERIAL_NO = #{serialNo},</if>
            <if test="subSerialNo!=null and subSerialNo!=''">  SUB_SERIAL_NO = #{subSerialNo},</if>
            <if test="erpno!=null and erpno!=''">  ERP_ORDER = #{erpno},</if>
            <if test="docType!=null and docType!=''">  DOCTYPE = #{docType},</if>
            <if test="editTime!=null and editTime!=''">  EDITTM = #{editTime},</if>
            <if test="editWho!=null and editWho!=''">  EDITEMP = #{editWho},</if>
           </set>
        where 1=1
        <include refid="servialNumberStoreEntityParams" />
    </update>

这种sql,假如参数没传,岂不是全表被覆盖? 当然,也能改。前提是梳理调用链路,把这些sql引用的业务场景梳理一遍,确定入参场景,然后修改,然后再模拟场景做测试。想想整个流程,1天不知道搞不搞的定,测试上线等等,还有更长的流程。

这种在设计之初就应该做好优化设计而不是出了问题再改,但当接手古老系统的时候,开发可能换了一波又一波了,这时候除了吐槽之外,只能填坑。与此同时,自己所开发的代码,在若干时间后,也许会被另外一个人吐槽(如果自己发现的坑是自己挖的,自然不会吐槽自己)

__EOF__


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK