Filebeat is the log-collection agent, and a copy has to be deployed at every collection point. As a client written in Go it ships logs quite efficiently, but if you assume its resource footprint is therefore tiny, you would be badly mistaken. By resources I mean CPU and memory; network usage on an internal network is negligible.
CPU hogging
During a load test last year, one of our core gateways performed unusually badly, and host monitoring showed the load was extremely high. A busy gateway is understandable, but how high was it? Pulling up the per-process metrics was a shock: filebeat had grabbed about 50% of the CPU, while the gateway itself was left with a pitiful 20%. The story was simple: a flood of requests hit the gateway, the gateway dutifully wrote a huge volume of logs, and filebeat, seeing all that work, started harvesting at full throttle and ended up stealing the show. Fortunately this was only a load test in the test environment, and the fix was easy: add max_procs: 1 to the filebeat configuration to limit the number of CPU cores it may use.
# Close a harvester if the file has not been updated for 5 minutes.
close_inactive: 5m
# Close a harvester after 30 minutes regardless of activity.
close_timeout: 30m
# Remove registry state for files that have been inactive for this long.
# (Note: the filebeat docs require clean_inactive > ignore_older + scan_frequency,
# so this value should normally be larger than ignore_older.)
clean_inactive: 1h
# Skip files whose last modification is older than 2 hours.
ignore_older: 2h
# Limit filebeat to a single CPU core (sets GOMAXPROCS).
max_procs: 1
# Shrink the in-memory queue so fewer events are buffered at once.
queue.mem.events: 256
queue.mem.flush.min_events: 128
queue.mem.flush.timeout: 2s
filebeat.prospectors:
....
Memory hogging
That should have been the end of the story. Yet a year later, while load testing ahead of the 618 shopping festival, we once again hit a case of filebeat grabbing resources and destabilizing a node.
As the monitoring screenshot showed, the process at the very top had dutifully pushed its memory usage up to 11 GB, which put the node under memory pressure, and kubelet began force-evicting pods.
May 21 01:55:13 k8s-168-25 kubelet[828]: W0521 01:55:13.612640 828 eviction_manager.go:344] eviction manager: attempting to reclaim memory
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.612687 828 eviction_manager.go:358] eviction manager: must evict pod(s) to reclaim memory
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.612739 828 eviction_manager.go:376] eviction manager: pods ranked for eviction: filebeat-6w2n8_pro(825f41cc-9a77-11ea-a0c9-00163e10e901), nagios-p9fv8_kube-system(091440e7-9a0c-11ea-a0c9-00163e10e901), error-mail-py-b4nt5_zkt-base(098d70ba-9a0c-11ea-a0c9-00163e10e901), ticket-api-7cc55747f9-278jh_pro(f6b8eb04-990f-11ea-a0c9-00163e10e901), node-exporter-j2f6r_monitoring(ba795520-9a02-11ea-a0c9-00163e10e901), bg-api-gateway-6b949d665b-6tj4n_pro(2b6661ff-9910-11ea-a0c9-00163e10e901)
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.612881 828 kuberuntime_container.go:547] Killing container "docker://30ff3f4d027467ce7e0e479bc1f0597851a5f08084c673e27cc9db786e846cde" with 30 second grace period
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.644719 828 kubelet.go:1877] SyncLoop (DELETE, "api"): "filebeat-6w2n8_pro(825f41cc-9a77-11ea-a0c9-00163e10e901)"
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.649208 828 kubelet.go:1871] SyncLoop (REMOVE, "api"): "filebeat-6w2n8_pro(825f41cc-9a77-11ea-a0c9-00163e10e901)"
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.658878 828 kubelet_pods.go:1106] Killing unwanted pod "filebeat-6w2n8"
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.658922 828 kuberuntime_container.go:547] Killing container "docker://30ff3f4d027467ce7e0e479bc1f0597851a5f08084c673e27cc9db786e846cde" with 0 second grace period
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.672992 828 kubelet.go:1861] SyncLoop (ADD, "api"): "filebeat-r4fbc_pro(0e40b5e0-9ac3-11ea-a0c9-00163e10e901)"
May 21 01:55:13 k8s-168-25 kubelet[828]: W0521 01:55:13.673050 828 eviction_manager.go:144] Failed to admit pod filebeat-r4fbc_pro(0e40b5e0-9ac3-11ea-a0c9-00163e10e901) - node has conditions: [MemoryPressure]
May 21 01:55:13 k8s-168-25 dockerd[770]: time="2020-05-21T01:55:13+08:00" level=info msg="shim reaped" id=30ff3f4d027467ce7e0e479bc1f0597851a5f08084c673e27cc9db786e846cde module="containerd/tasks"
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.720046 828 kubelet.go:1877] SyncLoop (DELETE, "api"): "filebeat-r4fbc_pro(0e40b5e0-9ac3-11ea-a0c9-00163e10e901)"
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.720486 828 kubelet.go:1871] SyncLoop (REMOVE, "api"): "filebeat-r4fbc_pro(0e40b5e0-9ac3-11ea-a0c9-00163e10e901)"
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.720536 828 kubelet.go:2065] Failed to delete pod "filebeat-r4fbc_pro(0e40b5e0-9ac3-11ea-a0c9-00163e10e901)", err: pod not found
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.738615 828 kubelet.go:1861] SyncLoop (ADD, "api"): "filebeat-lsl46_pro(0e4b24c0-9ac3-11ea-a0c9-00163e10e901)"
May 21 01:55:13 k8s-168-25 kubelet[828]: W0521 01:55:13.738704 828 eviction_manager.go:144] Failed to admit pod filebeat-lsl46_pro(0e4b24c0-9ac3-11ea-a0c9-00163e10e901) - node has conditions: [MemoryPressure]
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.779795 828 kubelet.go:1877] SyncLoop (DELETE, "api"): "filebeat-lsl46_pro(0e4b24c0-9ac3-11ea-a0c9-00163e10e901)"
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.782575 828 kubelet.go:1871] SyncLoop (REMOVE, "api"): "filebeat-lsl46_pro(0e4b24c0-9ac3-11ea-a0c9-00163e10e901)"
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.782623 828 kubelet.go:2065] Failed to delete pod "filebeat-lsl46_pro(0e4b24c0-9ac3-11ea-a0c9-00163e10e901)", err: pod not found
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.799802 828 kubelet.go:1861] SyncLoop (ADD, "api"): "filebeat-2dh8b_pro(0e54e8a8-9ac3-11ea-a0c9-00163e10e901)"
You can see that while kubelet was evicting filebeat, the node was already flagged as being under memory pressure:
node has conditions: [MemoryPressure]
But filebeat runs on the node as a DaemonSet, so we ended up in a vicious circle straight out of the old "pole and bench" tongue twister: the evicted filebeat pod was immediately recreated and scheduled back onto the same node, only to be refused admission because of the memory pressure and deleted again. For a few minutes, any other service on the node that went even slightly over its memory limit was promptly taken out by the oom-killer.
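For reference, the MemoryPressure condition in these logs comes from kubelet's hard-eviction thresholds. A minimal KubeletConfiguration sketch with such thresholds spelled out might look like the following; the values here are illustrative assumptions, not what this cluster actually ran with (the kubelet default for memory.available is 100Mi):
# Sketch only: illustrative hard-eviction thresholds, not this cluster's real settings.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"    # below this, the node reports MemoryPressure and kubelet starts evicting pods
  nodefs.available: "10%"
evictionMinimumReclaim:
  memory.available: "200Mi"    # once eviction starts, reclaim at least this much before stopping
Once a node is in that state, the eviction manager both evicts running pods (ranked as in the log above) and refuses to admit new ones, which is exactly what happened to the recreated filebeat pods, presumably because with no requests or limits set they were BestEffort pods and therefore the first to be turned away.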
Limiting resources
Having found the problem, the fix was simple: cap the pod's memory usage. Filebeat's own configuration file has no memory-related setting, though, so we had to lean on Kubernetes for that. Which raised a question: is filebeat's max_procs: 1 the same thing as a Kubernetes CPU limit of one core?
Looking at the actual CPU usage, there is clearly a gap between the two, likely because max_procs only sets GOMAXPROCS, i.e. how many OS threads may execute Go code at the same time; the runtime and threads doing syscalls or I/O can still push total usage somewhat above one core, whereas a Kubernetes CPU limit is a hard cgroup quota on the whole process.
So while we were at it, we added a CPU limit as well:
"resources": {
"limits": {
"cpu": "1",
"memory": "2Gi"
},
"requests": {
"cpu": "100m",
"memory": "1Gi"
}
}
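Putting the two knobs together, the container section of the filebeat DaemonSet ends up looking roughly like the sketch below. This is only an outline under assumptions: the image tag, ConfigMap name and mount paths are placeholders rather than our actual manifest, and the filebeat.yml it mounts is the one shown earlier with max_procs: 1.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat                  # placeholder name
  namespace: pro
spec:
  selector:
    matchLabels:
      app: filebeat
  template:
    metadata:
      labels:
        app: filebeat
    spec:
      containers:
      - name: filebeat
        image: docker.elastic.co/beats/filebeat:6.8.0    # assumed image/tag
        args: ["-c", "/etc/filebeat.yml", "-e"]
        resources:
          requests:
            cpu: 100m
            memory: 1Gi
          limits:
            cpu: "1"              # hard cgroup CPU quota, enforced regardless of max_procs
            memory: 2Gi           # beyond this the container is OOM-killed instead of starving the node
        volumeMounts:
        - name: config
          mountPath: /etc/filebeat.yml
          subPath: filebeat.yml
      volumes:
      - name: config
        configMap:
          name: filebeat-config   # assumed ConfigMap holding the filebeat.yml above
Setting requests also lifts the pod out of the BestEffort QoS class, so it is no longer first in line for eviction or admission rejection when the node comes under memory pressure.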
We will keep an eye on log collection for a while longer, but this should largely settle it.