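In the previous post we made traces viewable from logs and used Node Graph to see the call relationships inside a single trace. This post uses Service Graph to show the call graph between multiple business services.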
Final result
Service Graph
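Installation and deployment
If you installed the Tempo and Prometheus services as described in my previous post, open Explore in Grafana and select the Tempo data source: the topology graph is already there.
Here I walk through the key configuration from scratch, so that you can also enable the Service Graph feature on services you already run.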
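1. Deployment overview
1) Trace and span
A complete call chain consists of a trace and its spans. One trace contains multiple spans, which are usually displayed together along a timeline, as in Jaeger and Tempo.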
    ––|–––––––|–––––––|–––––––|–––––––|–––––––|–––––––|–––––––|–> time

     [Span A···················································]
       [Span B··············································]
          [Span D··········································]
        [Span C········································]
             [Span E·······]        [Span F··] [Span G··] [Span H··]
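Looked at on their own, spans form a set of parent-child relationships, much like a family tree: every span descends from Span A, and each branch continues until a span has no children.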
            [Span A]  ←←←(the root span)
                |
         +------+------+
         |             |
     [Span B]      [Span C] ←←←(Span C is a `ChildOf` Span A)
         |             |
     [Span D]      +---+-------+
                   |           |
               [Span E]    [Span F] >>> [Span G] >>> [Span H]
                                           ↑
                                           ↑
                                           ↑
                             (Span G `FollowsFrom` Span F)
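With the OpenTelemetry Agent in place, trace data is converted to the OTLP protocol for us. The span fields it collects are generally sufficient, so no extra work is needed here and we can use the data as it is.
2) How trace data travels
A trace contains spans. Let's look at the route the data takes.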
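2. Prerequisites
1. Kubernetes is installed
2. Prometheus is installed (installation guide)
URL: http://prometheus-server.monitoring.svc.cluster.local
3. Grafana is installed (installation guide)
URL: http://grafana95.monitoring.svc.cluster.local
4. Promtail is installed (installation guide)
5. Loki is installed (installation guide)
URL: http://loki-distributed-gateway.logs.svc.cluster.local
6. Tempo is installed (installation guide)
OTLP URL: tempo-distributed-distributor-discovery.trace.svc.cluster.local:4317
There are several reasons for not using Jaeger or another tracing service here. First, Tempo is part of the Grafana ecosystem and integrates well with Grafana, for example through Node Graph and Service Graph. Second, it supports distributed deployment and, with a design similar to Loki's, is built to handle large volumes of trace data. Third, Tempo integrates with a range of collection tools and tracing libraries, such as OpenTelemetry, Jaeger, and Zipkin. The Grafana Tempo documentation also shows that Grafana has done a great deal of compatibility work with OpenTelemetry, which looks set to become very popular.
7. OpenTelemetry is installed (installation guide)
HTTP URL: http://otel-opentelemetry-collector.otel.svc.cluster.local:4318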
1) Attach the OpenTelemetry agent to send data
Reference: https://dongweizhen.blog.csdn.net/article/details/138793963
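As a rough sketch of what pointing an application at the collector can look like, the fragment below sets the standard OpenTelemetry SDK environment variables on a Kubernetes container. The container name, image, and service name are hypothetical. Only the collector URL comes from the prerequisites above, and the linked post remains the reference for the actual agent setup.

```yaml
# Fragment of a Deployment pod spec (hypothetical application; adapt to your workload)
containers:
  - name: demo-app                      # hypothetical container name
    image: example/demo-app:latest      # hypothetical image
    env:
      - name: OTEL_SERVICE_NAME         # name reported as service.name on the spans
        value: demo-app
      - name: OTEL_EXPORTER_OTLP_PROTOCOL
        value: http/protobuf            # OTLP over HTTP, matching port 4318
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://otel-opentelemetry-collector.otel.svc.cluster.local:4318
```

Each service should get its own `OTEL_SERVICE_NAME`, since the service graph metrics identify the two ends of a call by their service names.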
2) Configure receivers for incoming data and exporters for outgoing data
Reference: https://dongweizhen.blog.csdn.net/article/details/138793963
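For orientation, a minimal collector pipeline along these lines might look like the sketch below. It assumes the collector receives OTLP on the default ports and forwards traces to the Tempo distributor address listed in the prerequisites over plaintext gRPC. With the opentelemetry-collector Helm chart, such a block would normally sit under the `config` key of values.yaml; the linked post describes the author's actual setup.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # OTLP over gRPC
      http:
        endpoint: 0.0.0.0:4318   # OTLP over HTTP, used by the agents above

exporters:
  otlp:
    # Tempo distributor address from the prerequisites
    endpoint: tempo-distributed-distributor-discovery.trace.svc.cluster.local:4317
    tls:
      insecure: true             # assumption: no TLS inside the cluster

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```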
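3) Enable Service Graph and push data to Prometheus
Reference: https://dongweizhen.blog.csdn.net/article/details/138806269
Tempo relies on the Metrics-generator component to produce the topology graph, so that component must be installed.
The Grafana documentation describes it as follows: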
Metrics-generator leverages the data available in the ingest path in Tempo to provide additional value by generating metrics from traces.
The metrics-generator internally runs a set of processors. Each processor ingests spans and produces metrics. Every processor derives different metrics. Currently, the following processors are available:
1. Service graphs
2. Span metrics
3. Local blocks
…
Note: the official documentation also offers another way to enable Service Graph, using the Grafana Agent: "To enable service graphs when using Grafana Agent, refer to the Grafana Agent and service graphs documentation."
Tempo is installed with Helm, and all of the settings below are changed in the values.yaml file.
Enable the Metrics-generator component:
    metricsGenerator:
      # -- Specifies whether a metrics-generator should be deployed
      enabled: true  # enable metrics-generator, which produces the service graph
Configure the connection to Prometheus:
    remote_write:
      - url: http://prometheus-server.monitoring.svc.cluster.local/api/v1/write  # Prometheus remote_write API address
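Why does Tempo push data to Prometheus? Prometheus can tell you what a service's average response time is, while Tempo's Service Graph shows how that response time flows between services to produce the final result. Traces and metrics each give insight into service behavior from their own angle. In practice, the Metrics-generator derives the traces_service_graph_* metrics from spans and writes them to Prometheus, and Grafana reads them back to draw the service topology graph, so the graph only appears once the two are combined.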
Enable the OTLP protocol port:
    otlp:  # this should look familiar and will become very powerful: a protocol specification that lets OpenTelemetry write data in, which Grafana then displays
      http:
        # -- Enable Tempo to ingest Open Telemetry HTTP traces
        enabled: true
Actually turn on Metrics-generator collection:
    global_overrides:
      per_tenant_override_config: /runtime-config/overrides.yaml
      metrics_generator_processors:  # newly added (I could not work out how to configure a separate overrides.yml, so I set it here directly)
        - 'service-graphs'
        - 'span-metrics'
4) Enable remote_write in Prometheus
Reference: https://dongweizhen.blog.csdn.net/article/details/139092662
Prometheus is installed with Helm; edit its values.yaml file:
.
.
.
    remoteWrite:
      - url: http://prometheus-server.monitoring.svc.cluster.local/api/v1/write  # remote-write API endpoint
.
.
.
    extraArgs:
      web.enable-remote-write-receiver: null  # let Prometheus accept remote-write requests
.
.
.
At this point, the Service Graph metrics are visible in Prometheus:
| Metric Name | Type | Labels | Description |
|---|---|---|---|
| traces_service_graph_request_total | Counter | client, server, connection_type | Total count of requests between two nodes |
| traces_service_graph_request_failed_total | Counter | client, server, connection_type | Total count of failed requests between two nodes |
| traces_service_graph_request_server_seconds | Histogram | client, server, connection_type | Time for a request between two nodes as seen from the server |
| traces_service_graph_request_client_seconds | Histogram | client, server, connection_type | Time for a request between two nodes as seen from the client |
| traces_service_graph_unpaired_spans_total | Counter | client, server, connection_type | Total count of unpaired spans |
| traces_service_graph_dropped_spans_total | Counter | client, server, connection_type | Total count of dropped spans |
If these metrics are missing, the configuration is wrong: Service Graph will have no data, although Node Graph still will.
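One way to keep an eye on these metrics is a Prometheus recording rule that aggregates the request counter per client/server pair. The sketch below is only an illustration: the group and rule names are hypothetical, and the 5m window is an arbitrary choice. The `expr` can also be pasted into the Prometheus UI as a quick check that data is arriving.

```yaml
groups:
  - name: tempo-service-graph            # hypothetical group name
    rules:
      # Requests per second for each edge of the service graph
      - record: service_graph:request_rate5m   # hypothetical rule name
        expr: sum by (client, server) (rate(traces_service_graph_request_total[5m]))
```

An empty result for this expression points to the same configuration problem described above.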
5) Enable Service Graph on the Tempo data source
Tempo data source settings in Grafana:
1. Read the metric data from Prometheus
2. Enable Node Graph
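The two settings above can also be written as Grafana data source provisioning. This is a sketch under assumptions: the UIDs are hypothetical, and the Tempo query-frontend address is a guess based on the release name used elsewhere in this post, so check the Service name and port in your own cluster.

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus                # hypothetical UID, referenced below
    access: proxy
    url: http://prometheus-server.monitoring.svc.cluster.local
  - name: Tempo
    type: tempo
    uid: tempo                     # hypothetical UID
    access: proxy
    # Assumed query-frontend address; not taken from the original post
    url: http://tempo-distributed-query-frontend.trace.svc.cluster.local:3100
    jsonData:
      serviceMap:
        datasourceUid: prometheus  # 1. read Service Graph metrics from Prometheus
      nodeGraph:
        enabled: true              # 2. enable Node Graph
```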
Open Explore: both Service Graph and Node Graph on the Tempo data source now have data.