TC: A Handy Tool for Simulating Network Conditions

 5 years ago
source link: https://mp.weixin.qq.com/s/4YjncPGre1X7jNbvlwmeDQ?amp%3Butm_medium=referral

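This article introduces TC, a tool that can simulate a variety of complex Internet transmission conditions, along with concrete ways to run those simulations.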

Previous article: Do You Understand the Nginx Request Processing Flow?

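In day-to-day production work, judging whether the network is behaving normally is a draining task, because we are often blamed by less-than-friendly colleagues for so-called "network problems" and then have to clear our name. Today we introduce a good helper for analyzing network conditions: TC.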

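Any discussion of TC has to mention Netem (Network Emulator), a network emulation module provided by Linux kernels 2.6 and later. It can be used in a well-performing LAN to simulate complex Internet transmission behavior such as low bandwidth, transmission delay, and packet loss.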

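TC is a user-space tool on Linux; the name stands for Traffic Control. TC controls how the Netem module operates, so using Netem requires at least two things: the Netem module must be enabled in the kernel, and the corresponding user-space tool TC must be available. Their relationship is comparable to that between the netfilter framework and iptables.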

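Let's look at what TC is good for (TC has many features; today we cover only network emulation). Before starting the experiments, here is what the following tc qdisc operations mean.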

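add: adds a Netem configuration to the specified network interface.
change: modifies an existing Netem configuration to new values.
replace: replaces the values of an existing Netem configuration (and creates the configuration if none exists yet).
del: deletes the Netem configuration from the network interface.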
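
A minimal sketch of how the four operations fit together in one session. The interface name eth0 and the specific delay and loss values are assumptions for illustration, and the run helper with its DRY_RUN switch is hypothetical scaffolding so the commands can be previewed without root:

```shell
#!/usr/bin/env bash
# Sketch of the add/change/replace/del lifecycle for a netem qdisc.
# DEV is an assumed interface name; DRY_RUN=1 (the default) only prints commands.
DEV="${DEV:-eth0}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it (requires root).
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

netem_lifecycle() {
  run tc qdisc add dev "$DEV" root netem delay 50ms      # attach a new netem config
  run tc qdisc show dev "$DEV"                           # inspect what is installed
  run tc qdisc change dev "$DEV" root netem delay 100ms  # modify the existing config
  run tc qdisc replace dev "$DEV" root netem loss 1%     # swap in a different config
  run tc qdisc del dev "$DEV" root                       # remove it, restoring the default qdisc
}

netem_lifecycle
```

With DRY_RUN=1 the script only prints the five commands; running it as root with DRY_RUN=0 would apply them to the interface.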

1. Simulating Transmission Delay

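Use this method to simulate the latency of long-distance transmission inside a LAN. For example, if real users see 51 ms of latency when accessing the site while a round trip in your test environment takes only 1 ms, adding 50 ms of extra delay is enough.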

[root@tj1-vm-search020 ~]# tc qdisc add dev eth0 root netem delay 50ms
[root@tj1-vm-search019 ~]# ping tj1-vm-search020.kscn
PING tj1-vm-search020.kscn (10.38.167.17) 56(84) bytes of data.
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=1 ttl=64 time=50.0 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=2 ttl=64 time=50.0 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=3 ttl=64 time=50.0 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=4 ttl=64 time=50.0 ms
^C
--- tj1-vm-search020.kscn ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3003ms
rtt min/avg/max/mdev = 50.037/50.044/50.063/0.223 ms

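If you see a perfectly stable delay on a network, there is probably a timer somewhere, because real network paths are complex and transmission times always vary. Real-world latency therefore fluctuates, and Netem accounts for this by providing extra parameters that control how the delay is distributed. The full parameter syntax is: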

DELAY := delay TIME [ JITTER [ CORRELATION ] ]
         [ distribution { uniform | normal | pareto | paretonormal } ]

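Besides the delay time TIME, there are three optional parameters:

  • JITTER: adds a random amount of time so that the delay falls within a range.

  • CORRELATION: the correlation coefficient between the delay of the next packet and that of the previous one.

  • distribution: the distribution model of the delay. Possible values are uniform, normal, pareto, and paretonormal.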

Start with JITTER: if it is set to 20ms, each packet's delay falls within 50ms ± 20ms, with the exact value chosen at random:

[root@tj1-vm-search020 ~]# tc qdisc replace dev eth0 root netem delay 50ms 20ms
[root@tj1-vm-search019 ~]# ping tj1-vm-search020.kscn
PING tj1-vm-search020.kscn (10.38.167.17) 56(84) bytes of data.
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=1 ttl=64 time=69.4 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=2 ttl=64 time=51.9 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=3 ttl=64 time=66.3 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=4 ttl=64 time=57.4 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=5 ttl=64 time=46.0 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=6 ttl=64 time=33.8 ms
^C
--- tj1-vm-search020.kscn ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 5007ms
rtt min/avg/max/mdev = 33.877/54.178/69.446/12.063 ms

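CORRELATION is the degree of correlation. Network conditions change smoothly, so over a short period the delays of adjacent packets should be similar rather than completely random. The value is a percentage: at 100% it degenerates to the fixed-delay case, and at 0% it degenerates to the fully random case.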

[root@tj1-vm-search020 ~]# tc qdisc replace dev eth0 root netem delay 50ms 20ms 30%
[root@tj1-vm-search019 ~]# ping tj1-vm-search020.kscn
PING tj1-vm-search020.kscn (10.38.167.17) 56(84) bytes of data.
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=1 ttl=64 time=47.6 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=2 ttl=64 time=58.3 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=3 ttl=64 time=47.4 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=4 ttl=64 time=33.8 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=5 ttl=64 time=61.0 ms
^C
--- tj1-vm-search020.kscn ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4005ms
rtt min/avg/max/mdev = 33.898/49.668/61.050/9.610 ms

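Like many real-world events, packet delays follow some statistical pattern, the normal distribution being the most commonly used. To get closer to reality, the distribution parameter constrains the delay model. For example, to make packet delays follow a normal distribution: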

[root@tj1-vm-search020 ~]#  tc qdisc replace dev eth0 root netem delay 50ms 20ms distribution normal
[root@tj1-vm-search019 ~]# ping tj1-vm-search020.kscn
PING tj1-vm-search020.kscn (10.38.167.17) 56(84) bytes of data.
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=1 ttl=64 time=41.7 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=2 ttl=64 time=44.3 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=3 ttl=64 time=50.7 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=4 ttl=64 time=57.2 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=5 ttl=64 time=37.6 ms
^C
--- tj1-vm-search020.kscn ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4005ms
rtt min/avg/max/mdev = 37.675/46.350/57.249/6.912 ms

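With this setting, most delays fall within a certain range around the mean, and delays near the maximum or minimum are rare.
The other distributions are uniform, pareto, and paretonormal; interested readers can explore them on their own. In most cases, a random delay within a time range is enough.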
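
The DELAY syntax above can be sketched as a small command builder. This is an illustration rather than part of tc: netem_delay_cmd is a hypothetical helper, eth0 is an assumed interface name, and the function only prints the command instead of running it:

```shell
#!/usr/bin/env bash
# Build (not run) a netem delay command from the DELAY parameters.
# Usage: netem_delay_cmd DEV TIME [JITTER [CORRELATION [DISTRIBUTION]]]
# Pass "" for CORRELATION to set a distribution without a correlation.
netem_delay_cmd() {
  local dev="$1" time="$2" jitter="${3:-}" corr="${4:-}" dist="${5:-}"
  local cmd="tc qdisc replace dev $dev root netem delay $time"
  # CORRELATION and distribution only make sense when JITTER is given.
  [ -n "$jitter" ] && cmd="$cmd $jitter"
  [ -n "$jitter" ] && [ -n "$corr" ] && cmd="$cmd $corr"
  [ -n "$jitter" ] && [ -n "$dist" ] && cmd="$cmd distribution $dist"
  echo "$cmd"
}

netem_delay_cmd eth0 50ms                  # fixed delay
netem_delay_cmd eth0 50ms 20ms             # delay with jitter
netem_delay_cmd eth0 50ms 20ms 30%         # jitter plus correlation
netem_delay_cmd eth0 50ms 20ms "" pareto   # jitter with a pareto distribution
```

Each printed line can be pasted into a root shell on the machine under test; replace is used so the command works whether or not a Netem configuration already exists.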

2. Simulating Packet Loss

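Another common network anomaly is packet loss. Lost packets trigger retransmissions, which increase both traffic and latency on the link. Netem's loss parameter simulates a packet loss rate. For example, to drop 50% of outgoing packets (I chose a large number so the effect is easy to see with ping; real-world loss rates are usually much lower, such as 0.5%):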

[root@tj1-vm-search020 ~]# tc qdisc change dev eth0 root netem loss 50%
[root@tj1-vm-search019 ~]# ping -c 10 tj1-vm-search020.kscn
PING tj1-vm-search020.kscn (10.38.167.17) 56(84) bytes of data.
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=1 ttl=64 time=0.049 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=2 ttl=64 time=0.038 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=7 ttl=64 time=0.036 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=8 ttl=64 time=0.037 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=9 ttl=64 time=0.035 ms

--- tj1-vm-search020.kscn ping statistics ---
10 packets transmitted, 5 received, 50% packet loss, time 9000ms
rtt min/avg/max/mdev = 0.035/0.039/0.049/0.005 ms

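The icmp_seq numbers show that about half of the packets were dropped. As with delay, the loss rate accepts an optional correlation, which expresses how strongly a packet's drop probability depends on that of the previous packet.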
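
As a hedged sketch of the correlated form, the snippet below builds a loss command with an optional correlation and extracts the observed loss rate from a ping summary line. Both function names are hypothetical, and eth0, 0.5%, and 25% are assumed values:

```shell
#!/usr/bin/env bash
# Build a netem loss command: PERCENT is the loss rate, CORRELATION is optional.
# Usage: netem_loss_cmd DEV PERCENT [CORRELATION]
netem_loss_cmd() {
  local dev="$1" percent="$2" corr="${3:-}"
  echo "tc qdisc change dev $dev root netem loss $percent${corr:+ $corr}"
}

# Read ping output on stdin and print the loss percentage from its summary line.
ping_loss_percent() {
  sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# 0.5% loss where each drop decision depends 25% on the previous one.
netem_loss_cmd eth0 0.5% 25%

# Summary line taken from the ping run above.
echo "10 packets transmitted, 5 received, 50% packet loss, time 9000ms" | ping_loss_percent
```

Piping a real ping run into ping_loss_percent gives a quick check that the measured loss is close to the configured rate; with small sample sizes the two will rarely match exactly.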

3. Simulating Packet Duplication

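Packet duplication takes parameters similar to packet loss: a duplication rate and a correlation. For example, to duplicate a random 50% of packets: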

[root@tj1-vm-search020 ~]# tc qdisc change dev eth0 root netem duplicate 50%
[root@tj1-vm-search019 ~]# ping -c 10 tj1-vm-search020.kscn
PING tj1-vm-search020.kscn (10.38.167.17) 56(84) bytes of data.
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=1 ttl=64 time=0.039 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=1 ttl=64 time=0.044 ms (DUP!)
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=2 ttl=64 time=0.045 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=2 ttl=64 time=0.050 ms (DUP!)
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=3 ttl=64 time=0.033 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=3 ttl=64 time=0.037 ms (DUP!)
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=4 ttl=64 time=0.033 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=4 ttl=64 time=0.038 ms (DUP!)
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=5 ttl=64 time=0.037 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=6 ttl=64 time=0.036 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=6 ttl=64 time=0.039 ms (DUP!)
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=7 ttl=64 time=0.029 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=8 ttl=64 time=0.030 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=8 ttl=64 time=0.034 ms (DUP!)
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=9 ttl=64 time=0.037 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=10 ttl=64 time=0.036 ms

--- tj1-vm-search020.kscn ping statistics ---
10 packets transmitted, 10 received, +6 duplicates, 0% packet loss, time 9001ms
rtt min/avg/max/mdev = 0.029/0.037/0.050/0.007 ms
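
A small sketch for the duplicate parameter, under the same assumptions as before (eth0 as the interface, illustrative percentages, hypothetical helper names). It builds the command with an optional correlation and counts the replies that ping marks as duplicates:

```shell
#!/usr/bin/env bash
# Build a netem duplicate command: PERCENT is the duplication rate,
# CORRELATION is optional.
# Usage: netem_duplicate_cmd DEV PERCENT [CORRELATION]
netem_duplicate_cmd() {
  local dev="$1" percent="$2" corr="${3:-}"
  echo "tc qdisc change dev $dev root netem duplicate $percent${corr:+ $corr}"
}

# Read ping output on stdin and count the replies flagged as duplicates.
count_duplicates() {
  grep -c '(DUP!)'
}

netem_duplicate_cmd eth0 1% 25%

# Three reply lines taken from the ping run above; one of them is a duplicate.
count_duplicates <<'EOF'
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=1 ttl=64 time=0.039 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=1 ttl=64 time=0.044 ms (DUP!)
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=2 ttl=64 time=0.045 ms
EOF
```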

4. Simulating Packet Corruption

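Packet corruption takes parameters similar to packet duplication. For example, to corrupt a random 2% of packets (a single-bit error is introduced at a random position in the packet):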

[root@tj1-vm-search020 ~]# tc qdisc change dev eth0 root netem corrupt 2%
[root@tj1-vm-search019 ~]# ping  tj1-vm-search020.kscn
PING tj1-vm-search020.kscn (10.38.167.17) 56(84) bytes of data.
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=1 ttl=64 time=0.043 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=2 ttl=64 time=0.040 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=3 ttl=64 time=0.033 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=4 ttl=64 time=0.034 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=5 ttl=64 time=0.033 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=6 ttl=64 time=0.043 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=8 ttl=64 time=0.039 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=10 ttl=64 time=0.056 ms
wrong data byte #39 should be 0x27 but was 0xa7
#16    10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 26 a7 28 29 2a 2b 2c 2d 2e 2f
#48    30 31 32 33 34 35 36 37
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=11 ttl=64 time=0.046 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=12 ttl=64 time=0.036 ms
Warning: time of day goes back (-4773815605012725725us), taking countermeasures.
Warning: time of day goes back (-4773815605012725708us), taking countermeasures.
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=13 ttl=64 time=0.000 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=15 ttl=64 time=0.045 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=16 ttl=64 time=0.043 ms
^C
--- tj1-vm-search020.kscn ping statistics ---
16 packets transmitted, 13 received, 18% packet loss, time 15001ms
rtt min/avg/max/mdev = 0.000/0.037/0.056/0.014 ms
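
The impairments shown so far can also be given together in a single netem configuration. The sketch below assumes eth0 and picks arbitrary small percentages; the run helper and DRY_RUN switch are hypothetical scaffolding that prints the commands instead of executing them:

```shell
#!/usr/bin/env bash
# Sketch: apply several impairments in one netem qdisc, then clean up.
# DEV and all values are assumptions; DRY_RUN=1 (the default) only prints commands.
DEV="${DEV:-eth0}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it (requires root).
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Delay with jitter, plus light loss, duplication, and corruption.
apply_flaky_link() {
  run tc qdisc replace dev "$DEV" root netem \
    delay 50ms 20ms loss 0.5% duplicate 1% corrupt 0.1%
}

# Remove the netem configuration once the test is finished.
clear_netem() {
  run tc qdisc del dev "$DEV" root
}

apply_flaky_link
clear_netem
```

Deleting the configuration after a test matters on shared machines, since the impairments otherwise stay active on the interface.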

5. Simulating Packet Reordering

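The network does not guarantee ordering. At the transport layer, TCP reassembles packets to restore order, so reordering affects applications less than the problems above.

Reordering differs from the earlier parameters. The previous impairments are independent and act on a single packet, whereas reordering involves multiple packets. Simulating reordering always relies on delay (in essence, reordering means sending some packets late), and Netem offers two ways to do it.

The first reorders at fixed packet-count intervals.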

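# With gap 3, the first 2 (gap - 1) packets are delayed by 50ms; each later packet is then
# sent immediately with 50% probability or delayed by 50ms, and the cycle restarts after a packet is sent immediately.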
[root@tj1-vm-search020 ~]# tc qdisc change dev eth0 root netem reorder 50% gap 3 delay 50ms
[root@tj1-vm-search019 ~]# ping  -i 0.01 tj1-vm-search020.kscn | more
PING tj1-vm-search020.kscn (10.38.167.17) 56(84) bytes of data.
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=3 ttl=64 time=10.5 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=1 ttl=64 time=50.0 ms
wrong data byte #21 should be 0x15 but was 0x5
#16    10 11 12 13 14 5 16 17 18 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f
#48    30 31 32 33 34 35 36 37
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=2 ttl=64 time=50.0 ms

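To observe reordering with ping, the send interval must be shorter than the 50ms packet delay, so -i 0.01 sets the interval to 10ms.
The second method reorders more randomly, selecting the packets to reorder by probability.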

[root@tj1-vm-search020 ~]# tc qdisc change dev eth0 root netem reorder 50% 15% delay 300ms
[root@tj1-vm-search019 ~]# ping  -i 0.01 tj1-vm-search020.kscn
PING tj1-vm-search020.kscn (10.38.167.17) 56(84) bytes of data.
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=6 ttl=64 time=11.5 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=4 ttl=64 time=51.5 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=3 ttl=64 time=71.5 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=1 ttl=64 time=111 ms
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=13 ttl=64 time=85.0 ms
wrong data byte #51 should be 0x33 but was 0x23
#16    10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f
#48    30 31 32 23 34 35 36 37
64 bytes from tj1-vm-search020.kscn (10.38.167.17): icmp_seq=12 ttl=64 time=105 ms

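50% of packets are sent immediately (with a correlation of 15% between consecutive decisions), and the remaining 50% are delayed by 300ms. The delay is deliberately large so that the reordering is easy to see.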
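
The two reordering styles differ only in whether gap is present, which a small builder makes explicit. This is a sketch with a hypothetical helper name; eth0 and the values are taken from or modeled on the commands above, and nothing is executed:

```shell
#!/usr/bin/env bash
# Build a netem reorder command. A delay is always included because
# reordering works by holding some packets back.
# Usage: netem_reorder_cmd DEV DELAY PERCENT [CORRELATION [GAP]]
# Pass "" for CORRELATION to set a gap without a correlation.
netem_reorder_cmd() {
  local dev="$1" delay="$2" percent="$3" corr="${4:-}" gap="${5:-}"
  echo "tc qdisc change dev $dev root netem delay $delay reorder $percent${corr:+ $corr}${gap:+ gap $gap}"
}

# Method 1: gap-based. The first gap - 1 packets are delayed, then packets
# are sent immediately with the given probability.
netem_reorder_cmd eth0 50ms 50% "" 3

# Method 2: purely probabilistic. 50% of packets skip the delay, with 15% correlation.
netem_reorder_cmd eth0 300ms 50% 15%
```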

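Conclusion

This article covered several ways to use TC for simulating network conditions. As the advanced traffic control tool that Linux provides, TC has many further capabilities and concepts, such as SHAPING (rate limiting), SCHEDULING, POLICING, DROPPING, QDISC (queueing discipline), CLASS, and FILTER. They cannot all be covered here; the hope is that this gives you a basic understanding and prompts you to explore TC further.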
