如题
简化的程序是在两个不同的 golang 程序建立一条 TCP 连接:发送端按照一定频率发送数据包(数据包最大 2048 字节,也存在少量较小的数据包);接收端接收数据伪代码如下
for {
rd := bufio.NewReaderSize(conn, bufsz)
rlen, err := rd.Read(rbuf[:])
// handle rbuf
}
建立 TCP 连接时发送端和接收端都默认禁用了 Nagle 算法(golang 默认 c.SetNoDelay(true)),但是测试中发现有时发送端的 Send-Q 堆积了很多数据,接收端接收数据很慢,大量 Read 出现 200ms 延迟,并且此时发送端输出端的 CPU 占用很低,远低于正常网络数据读写的情况。
2015-05-20 17:34:44.418 [INFO] ac03 [processPushMsg] send 39 direct msg to connector conn0101 latency: 25270.388
2015-05-20 17:34:44.418 [DEBU] ac04 [handleTCPConn] before get new bufio reader
2015-05-20 17:34:44.418 [DEBU] ac05 [handleTCPConn] before read
2015-05-20 17:34:44.621 [DEBU] ac06 [handleTCPConn] end read
2015-05-20 17:34:44.621 [INFO] ac07 [processPushMsg] send 81 direct msg to connector conn0101 latency: 25474.038
2015-05-20 17:34:44.621 [INFO] ac08 [processPushMsg] send 38 direct msg to connector conn0101 latency: 25474.143
2015-05-20 17:34:44.622 [INFO] ac09 [processPushMsg] send 39 direct msg to connector conn0101 latency: 25374.197
2015-05-20 17:34:44.622 [DEBU] ac0a [handleTCPConn] before get new bufio reader
2015-05-20 17:34:44.622 [DEBU] ac0b [handleTCPConn] before read
2015-05-20 17:34:44.622 [DEBU] ac0c [handleTCPConn] end read
2015-05-20 17:34:44.622 [DEBU] ac0d [handleTCPConn] before get new bufio reader
2015-05-20 17:34:44.622 [DEBU] ac0e [handleTCPConn] before read
2015-05-20 17:34:44.825 [DEBU] ac0f [handleTCPConn] end read
2015-05-20 17:34:44.825 [INFO] ac10 [processPushMsg] send 81 direct msg to connector conn0101 latency: 25578.005
2015-05-20 17:34:44.825 [INFO] ac11 [processPushMsg] send 39 direct msg to connector conn0101 latency: 25578.103
我的理解是发送端数据已经发送到本地网络缓冲区,等待对端读取,但是不清楚接收端为什么延迟 200ms 读取。不知从何入手解决。
1
zhicheng 2015-05-20 18:47:40 +08:00 via Android 1
你知道拥塞算法吗?
|
2
flewover OP @zhicheng 模糊的知道一些概念,但是不知道具体与哪些内核参数相关,如何调优,请问有哪些监控指标可以反映网络拥塞状况?还是需要应用层做流量控制?
|
3
simomo2010 2015-05-20 20:02:17 +08:00 1
@flewover
你这跟tcp栈没关系吧~ 看log 2015-05-20 17:34:44.418 [DEBU] ac05 [handleTCPConn] before read 2015-05-20 17:34:44.621 [DEBU] ac06 [handleTCPConn] end read 这两条记录中间应该是 rlen, err := rd.Read(rbuf[:]) 吧? 我不懂golang的这个接口, 猜想bufferReader在等数据填满它然后返回或者超时(200ms)返回。所以你发包小的时候,它总是需要超时才能返回~ 于是就有了200ms的延迟 |
4
flewover OP @simomo2010
那两条日志确实是在 rlen, err := rd.Read(rbuf[:]) 的前后,但是这个 read 接口只调用一次 reader 的 read,产生延迟应该不是需要填满 buf 引发的。 // Read reads data into p. // It returns the number of bytes read into p. // It calls Read at most once on the underlying Reader, // hence n may be less than len(p). // At EOF, the count will be zero and err will be io.EOF. func (b *Reader) Read(p []byte) (n int, err error) { ... } |
5
simomo2010 2015-05-20 21:24:40 +08:00
|
6
ryd994 2015-05-20 23:15:46 +08:00 via Android 1
不需要禁用啊,加nodelay就好
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/1.2/html/Realtime_Tuning_Guide/sect-Realtime_Tuning_Guide-Application_Tuning_and_Deployment-TCP_NODELAY_and_Small_Buffer_Writes.html wireshark截包没有用,因为无论用不用nagle,出来的都是合法的tcp包,大小不同而已。 |
8
flewover OP 用 strace 查看了系统调用,接收端 200ms 出现在 epoll_wait
[<... epoll_wait resumed> {{EPOLLIN|EPOLLOUT, {u32=1626358040, u64=140674690203928}}}, 128, -1) = 1 <0.202147> ] 接收端日志: 2015-05-20 23:40:19.629 [INFO] 54f99 [processPushMsg] send 81 direct msg to connector conn0101 latency: 697548.524 2015-05-20 23:40:19.629 [DEBU] 54f9a [handleTCPConn] before get new bufio reader 2015-05-20 23:40:19.630 [DEBU] 54f9b [handleTCPConn] before read 2015-05-20 23:40:19.834 [DEBU] 54f9c [handleTCPConn] end read rlen: 4068 2015-05-20 23:40:19.834 [INFO] 54f9d [processPushMsg] send 39 direct msg to connector conn0101 latency: 697753.906 strace 日志(sudo strace -f -v -x -tt -T -p 92324 -o strace_recv.log): 94951 23:40:19.629828 write(3, "2015-05-20 23:40:19.629 [INFO] 5"..., 115) = 115 <0.000038> 94951 23:40:19.629958 clock_gettime(CLOCK_REALTIME, {1432136419, 629994695}) = 0 <0.000031> 94951 23:40:19.630103 clock_gettime(CLOCK_REALTIME, {1432136419, 630143536}) = 0 <0.000032> 94951 23:40:19.630219 write(3, "2015-05-20 23:40:19.629 [DEBU] 5"..., 81) = 81 <0.000036> 94951 23:40:19.630337 clock_gettime(CLOCK_REALTIME, {1432136419, 630373340}) = 0 <0.000030> 94951 23:40:19.630479 clock_gettime(CLOCK_REALTIME, {1432136419, 630518521}) = 0 <0.000031> 94951 23:40:19.630600 write(3, "2015-05-20 23:40:19.630 [DEBU] 5"..., 65) = 65 <0.000036> 94951 23:40:19.630717 read(9, 0xc20805f01c, 4068) = -1 EAGAIN (Resource temporarily unavailable) <0.000033> 94951 23:40:19.630833 write(8, "\x01\x03\xf1\x00\x01\x00\x0c\x55\x5c\xa8\x4d\x62\xbb\x97\x68\xde\x01\xd1\x12\x02\x00\x10\x31\x34\x33\x32\x31\x33\x35\x37\x32\x32"..., 1013) = 1013 <0.000103> 94951 23:40:19.631027 write(8, "\x01\x01\xf9\x00\x03\x01\xd4\x55\x5c\x9c\x2a\x62\xbb\x97\x68\x8c\x00\x02\xa9\x55\x5c\x9c\x2a\x62\xbb\x97\x68\x8c\x00\x02\xfc\x55"..., 509) = 509 <0.000067> 94951 23:40:19.631187 write(8, "\x01\x03\xf1\x00\x03\x03\xcc\x55\x5c\x9c\x29\x62\xbb\x97\x68\x8c\x00\x01\x3d\x55\x5c\x9c\x2a\x62\xbb\x97\x68\x8c\x00\x02\x93\x55"..., 1013) = 1013 <0.000058> 94951 23:40:19.631332 epoll_wait(6, {}, 128, 0) = 0 <0.000035> 94951 23:40:19.631440 epoll_wait(6, <unfinished ...> 92326 23:40:19.631496 <... futex resumed> ) = 0 <0.205872> 92326 23:40:19.631554 clock_gettime(CLOCK_MONOTONIC, {1298109, 942247030}) = 0 <0.000033> 92326 23:40:19.631664 clock_gettime(CLOCK_REALTIME, {1432136419, 631698624}) = 0 <0.000031> 92326 23:40:19.631767 select(0, NULL, NULL, NULL, {0, 20} <unfinished ...> 92324 23:40:19.631845 <... futex resumed> ) = 0 <0.206158> 92324 23:40:19.631911 epoll_wait(6, {}, 128, 0) = 0 <0.000064> 92326 23:40:19.632048 <... select resumed> ) = 0 (Timeout) <0.000248> 92326 23:40:19.632112 clock_gettime(CLOCK_MONOTONIC, {1298109, 942807377}) = 0 <0.000031> 92326 23:40:19.632215 futex(0x8f4de0, FUTEX_WAIT, 0, {60, 0} <unfinished ...> 92324 23:40:19.632279 futex(0x8f5838, FUTEX_WAIT, 0, NULL <unfinished ...> 94951 23:40:19.833623 <... epoll_wait resumed> {{EPOLLIN|EPOLLOUT, {u32=1626358040, u64=140674690203928}}}, 128, -1) = 1 <0.202147> 94951 23:40:19.833780 clock_gettime(CLOCK_MONOTONIC, {1298110, 144480859}) = 0 <0.000032> 94951 23:40:19.833902 futex(0x8f5838, FUTEX_WAKE, 1) = 1 <0.000044> 94951 23:40:19.834024 futex(0x8f4de0, FUTEX_WAKE, 1) = 1 <0.000042> 94951 23:40:19.834131 read(9, "\x5c\x9c\x29\x62\xbb\x97\x68\x8c\x00\x01\x56\x63\x6f\x6e\x6e\x30\x31\x30\x31\x55\x5c\x9c\x2a\x62\xbb\x97\x68\x8c\x00\x03\x24\x63"..., 4068) = 4068 <0.000039> 94951 23:40:19.834268 clock_gettime(CLOCK_REALTIME, {1432136419, 834305215}) = 0 <0.000030> 94951 23:40:19.834476 clock_gettime(CLOCK_REALTIME, {1432136419, 834519537}) = 0 <0.000033> 94951 23:40:19.834601 write(3, "2015-05-20 23:40:19.834 [DEBU] 5"..., 73) = 73 <0.000042> 94951 23:40:19.834778 clock_gettime(CLOCK_REALTIME, {1432136419, 834819514}) = 0 <0.000032> 94951 23:40:19.834896 clock_gettime(CLOCK_REALTIME, {1432136419, 834931088}) = 0 <0.000030> 94951 23:40:19.835061 clock_gettime(CLOCK_REALTIME, {1432136419, 835102950}) = 0 <0.000033> 94951 23:40:19.835180 write(3, "2015-05-20 23:40:19.834 [INFO] 5"..., 115) = 115 <0.000037> |
9
simomo2010 2015-05-21 00:12:56 +08:00
@flewover
感觉是发送端23:40:19.833才把包发出来~ |
10
est 2015-05-21 00:21:12 +08:00 3
|
12
simomo2010 2015-05-21 10:08:13 +08:00
|
13
flewover OP |
14
benjiam 2015-05-21 21:35:05 +08:00 via Android
但是测试中发现有时发送端的 Send-Q 堆积了很多数据,接收端接收数据很慢,大量 Read 出现 200ms 延迟,
说明不是客户端问题,server读取超时,更多地问题在于你read后处理问题吧,你简单的读不处理 还能重现吗?网络丢包引起的重传都是很快就能清理掉的,你的recv q有数据也证明不是网线问题。 |