ELK | filebeat log collection hogging resources

Contents of this article:
  1. CPU hogging
  2. Memory hogging
  3. Limiting resources

filebeat is the log-collection agent, and a copy has to be deployed at every collection point. As a client written in Go, its collection throughput is indeed high, but if you also assume its resource footprint is tiny, you are sorely mistaken. By resources I mean CPU and memory; on an internal network, the network cost is negligible.

CPU hogging

During a load test last year, one of the core gateways performed unusually badly, and host monitoring showed the load was sky-high. A busy gateway is understandable, but how bad was it? Pulling up the per-process breakdown was eye-opening: filebeat had grabbed 50% of the CPU, while the gateway itself was left with a meager 20%. The story went like this: a flood of requests hit the gateway, the gateway dutifully logged every detail, and filebeat, seeing all that fresh work, collected at full throttle until it had crowded out the very service it was supposed to support. Luckily this was only a load test in the test environment, and the fix was simple: add max_procs: 1 to the filebeat config to limit how many CPU cores it may use.

close_inactive: 5m                # close the file handle if no new lines were read for 5 minutes
close_timeout: 30m                # force-close a harvester after 30 minutes regardless of state
clean_inactive: 1h                # drop registry state for files that have been inactive for 1 hour
ignore_older: 2h                  # skip files whose last modification is older than 2 hours
max_procs: 1                      # cap the Go runtime at 1 CPU core (GOMAXPROCS)
queue.mem.events: 256             # max events buffered in the in-memory queue
queue.mem.flush.min_events: 128   # flush to the output once this many events are buffered
queue.mem.flush.timeout: 2s       # ... or after 2s, whichever comes first
filebeat.prospectors:
....
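For reference, the harvester options above (close_inactive, close_timeout, clean_inactive, ignore_older) are normally set per prospector rather than at the top level. A minimal sketch of such an entry follows; the input type and log path are hypothetical placeholders, not the actual config of this cluster:

filebeat.prospectors:
- type: log
  paths:
    - /var/log/containers/*.log   # hypothetical path, adjust to the real log location
  close_inactive: 5m
  close_timeout: 30m
  clean_inactive: 1h
  ignore_older: 2h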

Memory hogging

That should have been the end of the story, but a year later, while load testing in preparation for the 618 sale, filebeat once again grabbed enough resources to destabilize a node.

As shown in the figure, the process at the very top had diligently pushed its memory usage to 11GB, which put the node under memory pressure, and the kubelet started force-evicting pods.

May 21 01:55:13 k8s-168-25 kubelet[828]: W0521 01:55:13.612640     828 eviction_manager.go:344] eviction manager: attempting to reclaim memory
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.612687     828 eviction_manager.go:358] eviction manager: must evict pod(s) to reclaim memory
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.612739     828 eviction_manager.go:376] eviction manager: pods ranked for eviction: filebeat-6w2n8_pro(825f41cc-9a77-11ea-a0c9-00163e10e901), nagios-p9fv8_kube-system(091440e7-9a0c-11ea-a0c9-00163e10e901), error-mail-py-b4nt5_zkt-base(098d70ba-9a0c-11ea-a0c9-00163e10e901), ticket-api-7cc55747f9-278jh_pro(f6b8eb04-990f-11ea-a0c9-00163e10e901), node-exporter-j2f6r_monitoring(ba795520-9a02-11ea-a0c9-00163e10e901), bg-api-gateway-6b949d665b-6tj4n_pro(2b6661ff-9910-11ea-a0c9-00163e10e901)
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.612881     828 kuberuntime_container.go:547] Killing container "docker://30ff3f4d027467ce7e0e479bc1f0597851a5f08084c673e27cc9db786e846cde" with 30 second grace period
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.644719     828 kubelet.go:1877] SyncLoop (DELETE, "api"): "filebeat-6w2n8_pro(825f41cc-9a77-11ea-a0c9-00163e10e901)"
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.649208     828 kubelet.go:1871] SyncLoop (REMOVE, "api"): "filebeat-6w2n8_pro(825f41cc-9a77-11ea-a0c9-00163e10e901)"
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.658878     828 kubelet_pods.go:1106] Killing unwanted pod "filebeat-6w2n8"
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.658922     828 kuberuntime_container.go:547] Killing container "docker://30ff3f4d027467ce7e0e479bc1f0597851a5f08084c673e27cc9db786e846cde" with 0 second grace period
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.672992     828 kubelet.go:1861] SyncLoop (ADD, "api"): "filebeat-r4fbc_pro(0e40b5e0-9ac3-11ea-a0c9-00163e10e901)"
May 21 01:55:13 k8s-168-25 kubelet[828]: W0521 01:55:13.673050     828 eviction_manager.go:144] Failed to admit pod filebeat-r4fbc_pro(0e40b5e0-9ac3-11ea-a0c9-00163e10e901) - node has conditions: [MemoryPressure]
May 21 01:55:13 k8s-168-25 dockerd[770]: time="2020-05-21T01:55:13+08:00" level=info msg="shim reaped" id=30ff3f4d027467ce7e0e479bc1f0597851a5f08084c673e27cc9db786e846cde module="containerd/tasks"
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.720046     828 kubelet.go:1877] SyncLoop (DELETE, "api"): "filebeat-r4fbc_pro(0e40b5e0-9ac3-11ea-a0c9-00163e10e901)"
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.720486     828 kubelet.go:1871] SyncLoop (REMOVE, "api"): "filebeat-r4fbc_pro(0e40b5e0-9ac3-11ea-a0c9-00163e10e901)"
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.720536     828 kubelet.go:2065] Failed to delete pod "filebeat-r4fbc_pro(0e40b5e0-9ac3-11ea-a0c9-00163e10e901)", err: pod not found
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.738615     828 kubelet.go:1861] SyncLoop (ADD, "api"): "filebeat-lsl46_pro(0e4b24c0-9ac3-11ea-a0c9-00163e10e901)"
May 21 01:55:13 k8s-168-25 kubelet[828]: W0521 01:55:13.738704     828 eviction_manager.go:144] Failed to admit pod filebeat-lsl46_pro(0e4b24c0-9ac3-11ea-a0c9-00163e10e901) - node has conditions: [MemoryPressure]
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.779795     828 kubelet.go:1877] SyncLoop (DELETE, "api"): "filebeat-lsl46_pro(0e4b24c0-9ac3-11ea-a0c9-00163e10e901)"
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.782575     828 kubelet.go:1871] SyncLoop (REMOVE, "api"): "filebeat-lsl46_pro(0e4b24c0-9ac3-11ea-a0c9-00163e10e901)"
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.782623     828 kubelet.go:2065] Failed to delete pod "filebeat-lsl46_pro(0e4b24c0-9ac3-11ea-a0c9-00163e10e901)", err: pod not found
May 21 01:55:13 k8s-168-25 kubelet[828]: I0521 01:55:13.799802     828 kubelet.go:1861] SyncLoop (ADD, "api"): "filebeat-2dh8b_pro(0e54e8a8-9ac3-11ea-a0c9-00163e10e901)"

You can see that while filebeat was being evicted, the node was already under memory pressure:
node has conditions: [MemoryPressure]
But filebeat runs on that node as a DaemonSet, so as soon as it was evicted it was scheduled right back, only to be refused admission because of the very same memory pressure it had caused, a classic "the pole insists on being tied to the bench, but the bench refuses to let the pole be tied to it" deadlock. For several minutes, other services on the node that went even slightly over their memory limits were also taken down by the oom-killer.
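For context, that condition can also be inspected on the node object itself (for example via kubectl describe node); the relevant status entry looks roughly like the sketch below, with field values illustrative:

status:
  conditions:
  - type: MemoryPressure
    status: "True"                 # the node reports memory pressure
    reason: KubeletHasInsufficientMemory
    message: kubelet has insufficient memory available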

Limiting resources

Once the problem was identified, the fix was straightforward: cap the pod's memory usage. filebeat's own config file has no memory-related setting, though, so we had to rely on Kubernetes. Which raises a question: is filebeat's max_procs limit of 1 core the same thing as a Kubernetes CPU limit of 1 core?

Comparing the actual CPU usage, the two clearly differ: max_procs only sets GOMAXPROCS, i.e. how many OS threads may execute Go code at once, while a Kubernetes CPU limit is enforced as a cgroup CFS quota on total CPU time.
So we added a CPU limit on the pod as well:

"resources": {
  "limits": {
    "cpu": "1",
    "memory": "2Gi"
  },
  "requests": {
    "cpu": "100m",
    "memory": "1Gi"
  }
}
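
In the filebeat DaemonSet manifest this lands on the container spec. Below is a minimal sketch, assuming the DaemonSet is named filebeat in the pro namespace (as the pod names in the kubelet log suggest) and using a placeholder image; only the resources block mirrors the values above:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat                  # assumed name
  namespace: pro                  # namespace inferred from the pod names in the log above
spec:
  selector:
    matchLabels:
      app: filebeat
  template:
    metadata:
      labels:
        app: filebeat
    spec:
      containers:
      - name: filebeat
        image: docker.elastic.co/beats/filebeat:6.8.0   # placeholder image/tag
        resources:
          requests:
            cpu: 100m
            memory: 1Gi
          limits:
            cpu: "1"
            memory: 2Gi

With the memory limit in place, an out-of-control filebeat gets OOM-killed at 2Gi inside its own cgroup instead of dragging the whole node into memory pressure.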

We will keep an eye on how log collection behaves from here on, but this should largely settle the problem.

