Prometheus + Grafana + Loki: Building a Production-Grade Observability Stack, Hands-On

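According to Datadog's 2025 annual infrastructure monitoring report, teams running a complete observability stack recover from incidents 14 times faster (measured by mean time to recovery, MTTR) than teams with no monitoring. Prometheus has become the de facto standard for cloud-native monitoring, and more than 85% of CNCF graduated projects use it as their core metrics engine. Yet many teams never get past "I can see CPU and memory" with Prometheus, which is far short of what it can do. This article walks from architecture design to production tuning and builds a complete observability stack covering metrics, logs, and alerts, with a Docker Compose file, PromQL queries, and alerting rules you can copy and adapt.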

📊 1. The Three Pillars of Observability and Architecture Design

Why observability rather than plain monitoring?

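Monitoring answers "is the system healthy?", while observability answers "why did the system break?". Traditional monitoring tools such as Zabbix and Nagios are good at traffic-light health checks (alert when CPU is high, alert when a service is down), but when your API gets slow they cannot tell you whether the database slowed down or a downstream service timed out.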

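Dimension	Traditional monitoring	Observability stack
Data types	Mostly metrics	Metrics + logs + traces
Troubleshooting	Alert → manual host-by-host inspection	Alert → correlated logs → root cause
Data correlation	Each system is a data silo	Jump between metrics and logs
Exploration	Predefined, fixed dashboards	Ad-hoc queries on any dimension
Typical tools	Zabbix, Nagios, Cacti	Prometheus + Grafana + Loki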

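💡 **Tip:** Observability does not replace monitoring; it adds explorability on top of it. A health-check endpoint plus Prometheus is the simplest form of monitoring. Being able to jump in Grafana from a chart showing a CPU spike straight to the error logs for the same time range is where observability pays off.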

Architecture overview and component responsibilities

The stack consists of five core components, each with a distinct job:

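✅ Prometheus: metrics collection and time-series database (TSDB); it periodically pulls metrics from each service
✅ Grafana: unified visualization; it connects to Prometheus, Loki, MySQL, and other data sources and builds interactive dashboards
✅ Loki: log aggregation engine developed by Grafana Labs, tightly integrated with Grafana and queried with LogQL
✅ Promtail: log collection agent that gathers logs from containers and hosts and pushes them to Loki
✅ Alertmanager: alert routing and notification hub with grouping, inhibition, silencing, and multi-channel notifications (DingTalk, Feishu, email)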

Pull vs. push: why does Prometheus pull?

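Developers new to Prometheus often ask why the application does not push its metrics and Prometheus pulls them instead. Isn't that an extra network request? In practice, the pull model has three key advantages: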

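✅ Built-in service discovery: Prometheus discovers targets through the Kubernetes API, Consul, DNS, or file-based discovery and starts scraping newly deployed services automatically, with no extra configuration on the application side
✅ Observable health: a failed scrape means the target is unreachable, which is itself an alert signal. With push, a dead service simply stops pushing, and you may never notice that it is gone
✅ Easy to develop and debug: any service that exposes a /metrics endpoint can be scraped, and curl http://localhost:8080/metrics is enough to debug locally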
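
To make the /metrics endpoint less abstract, here is a minimal sketch of the text a scrape target returns. It renders the Prometheus text exposition format by hand with no dependencies; the `renderMetrics` function and the shape of its input are hypothetical and exist only for this illustration. A real service would normally let a client library such as prom-client produce this output.

```javascript
// Escape the characters that have special meaning inside a label value.
function escapeLabelValue(value) {
  return String(value)
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n');
}

// Render an in-memory list of metrics as Prometheus text exposition format.
// The input structure is an assumption made for this sketch.
function renderMetrics(metrics) {
  const lines = [];
  for (const metric of metrics) {
    lines.push(`# HELP ${metric.name} ${metric.help}`);
    lines.push(`# TYPE ${metric.name} ${metric.type}`);
    for (const sample of metric.samples) {
      const labels = Object.entries(sample.labels || {})
        .map(([key, value]) => `${key}="${escapeLabelValue(value)}"`)
        .join(',');
      const series = labels ? `${metric.name}{${labels}}` : metric.name;
      lines.push(`${series} ${sample.value}`);
    }
  }
  return lines.join('\n') + '\n';
}

const body = renderMetrics([
  {
    name: 'http_requests_total',
    help: 'Total number of HTTP requests',
    type: 'counter',
    samples: [
      { labels: { method: 'GET', status: '200' }, value: 1027 },
      { labels: { method: 'POST', status: '500' }, value: 3 },
    ],
  },
]);

console.log(body);
```

Each distinct combination of label values becomes its own line, and therefore its own time series, which is why label design matters so much later in this article.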

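The pull model has limits, though: a short-lived batch job may exit before Prometheus ever scrapes it. Pushgateway can fill that gap as a push relay, but do not overuse it. It suits short-lived jobs, not long-running services.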

The data flow looks like this:

Application → /metrics endpoint → Prometheus scrape → TSDB storage
                                          ↓
                                    Alertmanager → DingTalk/Feishu/email
                                          ↓
Log files → Promtail/Loki Agent → Loki storage → LogQL queries
                                          ↓
                              Grafana ← unified view (metrics + logs correlated)

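⚠️ **Warning:** Do not use Prometheus as long-term storage. The local TSDB keeps 15 days of data by default; ship older history through Remote Write to Thanos, Cortex, or VictoriaMetrics. Local disk IOPS is often the main performance bottleneck for Prometheus.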
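
A rough way to check whether a retention setting fits on disk is the sizing rule from the Prometheus storage documentation: retention time in seconds × ingested samples per second × bytes per sample, where a sample typically takes 1 to 2 bytes after compression. The sketch below applies that rule; the function name and the workload numbers (100,000 active series scraped every 15s) are assumptions for illustration, not measurements.

```javascript
// Estimate local TSDB disk usage from the rule of thumb:
// retention_seconds * samples_per_second * bytes_per_sample.
function estimateTsdbBytes({
  activeSeries,
  scrapeIntervalSeconds,
  retentionDays,
  bytesPerSample = 2, // conservative end of the 1-2 byte range
}) {
  const samplesPerSecond = activeSeries / scrapeIntervalSeconds;
  const retentionSeconds = retentionDays * 24 * 3600;
  return retentionSeconds * samplesPerSecond * bytesPerSample;
}

// Hypothetical workload: 100k series, 15s scrape interval, 15 days retention.
const estimatedBytes = estimateTsdbBytes({
  activeSeries: 100000,
  scrapeIntervalSeconds: 15,
  retentionDays: 15,
});

console.log(`${(estimatedBytes / 1e9).toFixed(1)} GB`);
```

If the estimate exceeds the size limit you configure, the size limit wins and the oldest data is dropped before the retention time is reached, so it is worth running this calculation before picking both values.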

🚀 2. Deploying from Scratch: One-Command Startup with Docker Compose

Complete docker-compose configuration

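The following configuration can be used as-is in development and test environments. It contains the components needed for metrics collection, log aggregation, host monitoring, and container monitoring, and a single docker compose up -d starts the whole stack. Alertmanager is not part of this file; its configuration is covered in section 5.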

# docker-compose.yml - complete Prometheus + Grafana + Loki observability stack
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.53.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/alert-rules.yml:/etc/prometheus/alert-rules.yml:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'
      - '--storage.tsdb.retention.size=10GB'
      - '--web.enable-lifecycle'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:11.1.0
    container_name: grafana
    ports:
      - "3001:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    restart: unless-stopped

  loki:
    image: grafana/loki:3.1.0
    container_name: loki
    ports:
      - "3100:3100"
    volumes:
      - ./loki/loki-config.yml:/etc/loki/local-config.yaml:ro
      - loki_data:/loki
    command: -config.file=/etc/loki/local-config.yaml
    restart: unless-stopped

  promtail:
    image: grafana/promtail:3.1.0
    container_name: promtail
    volumes:
      - ./promtail/promtail-config.yml:/etc/promtail/config.yml:ro
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    command: -config.file=/etc/promtail/config.yml
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:v1.8.1
    container_name: node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
    restart: unless-stopped

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    container_name: cadvisor
    ports:
      - "8081:8080"  # host port 8081, so cAdvisor does not occupy host port 8080 used by the my-app scrape target
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
  loki_data:

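📌 **Remember:** Grafana listens on port 3000 by default; it is mapped to 3001 here to avoid clashing with local development servers. In production, put it behind an Nginx reverse proxy with HTTPS and replace the password with a strong one.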

Prometheus scrape configuration

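# prometheus/prometheus.yml - main Prometheus configuration file
global:
  scrape_interval: 15s      # global scrape interval; 15s is a good choice for core metrics
  evaluation_interval: 15s   # how often alerting rules are evaluated
  scrape_timeout: 10s        # timeout for a single scrape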

rule_files:
  - "alert-rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: []        # add the Alertmanager address here in production

scrape_configs:
  # Prometheus self-monitoring
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Host metrics (CPU, memory, disk, network)
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]

  # Docker container metrics (CPU, memory, network I/O)
  - job_name: "cadvisor"
    static_configs:
      - targets: ["cadvisor:8080"]

  # Example application service (Spring Boot / Express / FastAPI)
  - job_name: "my-app"
    metrics_path: "/actuator/prometheus"  # Spring Boot Actuator default path; adjust for other frameworks
    static_configs:
      - targets: ["host.docker.internal:8080"]
        labels:
          env: "production"
          service: "my-app"

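Start and verify:

# Start all services
docker compose up -d

# Check that every container is running
docker compose ps

# Verify that Prometheus can scrape its targets
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool

# Open Grafana (credentials admin/changeme, as set in docker-compose.yml); `open` is macOS-only, use xdg-open on Linux
open http://localhost:3001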

🎯 3. Instrumenting Applications: Exposing Prometheus Metrics from Your Services

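Deploying the infrastructure is not enough; real observability needs metrics from the application layer. The two mainstream stacks below show how to wire it in.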

Spring Boot integration

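Spring Boot supports Prometheus natively through Micrometer. Once the dependencies are added and the endpoint is exposed, core metrics such as JVM, HTTP requests, and database connection pools are published automatically: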

<!-- pom.xml: add the Micrometer Prometheus registry and the Actuator starter -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

# application.yml - expose the Prometheus endpoint and enable latency histograms
management:
  endpoints:
    web:
      exposure:
        include: prometheus,health,info
  metrics:
    tags:
      application: ${spring.application.name}
    distribution:
      percentiles-histogram:
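        http.server.requests: true  # publish latency histogram buckets, needed to compute P99
      slo:
        http.server.requests: 100ms,200ms,500ms,1s,2s  # custom SLO bucket boundaries (this key was named sla before Spring Boot 2.3)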

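💡 **Tip:** percentiles-histogram.http.server.requests: true is essential. Without it, the timer exposes only a count, a sum, and a maximum, so Prometheus can derive average latency but cannot compute percentiles such as P50 or P99. This is the most common beginner mistake: without histogram data, latency alerts are meaningless.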
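
To see why bucket data is required, it helps to look at how a percentile is estimated from it. The sketch below is a simplified re-implementation of the idea behind PromQL's histogram_quantile(): find the bucket that contains the target rank, then interpolate linearly inside it. The function name and the sample bucket counts are made up for illustration, and edge cases handled by the real function (such as native histograms or malformed buckets) are ignored.

```javascript
// buckets: [{ le: upperBound (Infinity for +Inf), count: cumulativeCount }]
function histogramQuantile(q, buckets) {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const i = sorted.findIndex((bucket) => bucket.count >= rank);

  // If the rank falls into the +Inf bucket, the best available answer is
  // the upper bound of the highest finite bucket.
  if (sorted[i].le === Infinity) return sorted[sorted.length - 2].le;

  const lowerBound = i === 0 ? 0 : sorted[i - 1].le;
  const countBelow = i === 0 ? 0 : sorted[i - 1].count;
  const countInBucket = sorted[i].count - countBelow;

  // Assume observations are spread evenly inside the bucket.
  return lowerBound + (sorted[i].le - lowerBound) * ((rank - countBelow) / countInBucket);
}

// Hypothetical cumulative counts for 100 requests.
const latencyBuckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 98 },
  { le: 2, count: 100 },
  { le: Infinity, count: 100 },
];

const p99 = histogramQuantile(0.99, latencyBuckets);
const p50 = histogramQuantile(0.5, latencyBuckets);
console.log({ p50, p99 });
```

The result can never be more precise than the bucket boundaries, which is why the SLO thresholds you care about should appear as explicit buckets.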

Node.js integration

In the Node.js ecosystem, prom-client is the most mature Prometheus client library. Here is a complete Express middleware example:

// node-prometheus.js - Prometheus instrumentation for an Express application
const express = require('express');
const client = require('prom-client');

const app = express();

// Register the default metrics (CPU, memory, event loop lag, and so on)
const collectDefaultMetrics = client.collectDefaultMetrics;
collectDefaultMetrics({ prefix: 'my_app_' });

// Custom Counter: total HTTP requests, labeled by method, path, and status code
const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'path', 'status'],
});

// Custom Histogram: request latency distribution
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path', 'status'],
  buckets: [0.01, 0.05, 0.1, 0.2, 0.5, 1, 2, 5],
});

// Middleware: record metrics for every request
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    const labels = {
      method: req.method,
      path: req.route ? req.baseUrl + req.route.path : 'unmatched',  // route pattern only; raw URLs (e.g. 404s) would explode cardinality
      status: res.statusCode,
    };
    httpRequestsTotal.inc(labels);
    end(labels);
  });
  next();
});

// Expose /metrics for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(3000, () => console.log('Server running on port 3000'));

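⚠️ **Warning:** Keep the number of Histogram buckets small (15 is a reasonable ceiling), and never use raw request paths as label values. /api/users/12345 and /api/users/67890 would become two separate time series and cause a cardinality explosion. Collapse dynamic path parameters into a placeholder such as /api/users/:id.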
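
When no route pattern is available (for example behind a proxy or in a framework without named routes), one fallback is to normalize the raw path before using it as a label. The sketch below is one possible approach under simple assumptions: purely numeric segments and UUIDs are treated as dynamic parameters. The `normalizePath` helper and its placeholder names are hypothetical, and real applications may need additional rules for their own ID formats.

```javascript
const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Replace dynamic-looking path segments with fixed placeholders so that
// every user or order ID maps to the same label value.
function normalizePath(rawPath) {
  return rawPath
    .split('?')[0] // drop the query string
    .split('/')
    .map((segment) => {
      if (/^\d+$/.test(segment)) return ':id';
      if (UUID_PATTERN.test(segment)) return ':uuid';
      return segment;
    })
    .join('/');
}

console.log(normalizePath('/api/users/12345'));
console.log(normalizePath('/api/users/67890?expand=orders'));
console.log(normalizePath('/api/orders/123e4567-e89b-12d3-a456-426614174000/items'));
```

Both user URLs collapse to /api/users/:id, so they share one time series per method and status code instead of creating a new one per user.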

📈 4. PromQL in Practice: From Basics to Advanced Queries

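PromQL (Prometheus Query Language) is the core capability of Prometheus; you need it to get real value out of your monitoring. The most useful query patterns are grouped by scenario below.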

Basic resource queries

# CPU utilization (everything except idle time, grouped by instance)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory utilization
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk utilization (excluding tmpfs and overlay filesystems)
(1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100

# Container CPU usage as a percentage of one core (collected by cAdvisor)
rate(container_cpu_usage_seconds_total{name!=""}[5m]) * 100

# HTTP requests per second by status code (the most useful business metric)
sum by(status) (rate(http_requests_total[5m]))

# P99 latency (requires histograms to be enabled in the application)
histogram_quantile(0.99, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))

Advanced query techniques

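# 🔥 Error-rate spike detection: 5xx ratio above 5% over the last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m])) > 0.05

# 🔥 Disk space forecast: will the filesystem fill up within 4 hours at the current trend?
predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 4*3600) < 0

# 🔥 Day-over-day comparison: current QPS relative to the same time yesterday
sum(rate(http_requests_total[5m]))
  /
sum(rate(http_requests_total[5m] offset 1d))

# 🔥 Apdex score (satisfied threshold 0.5s, tolerating threshold 2s; both must exist as bucket boundaries)
(
  sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
  + sum(rate(http_request_duration_seconds_bucket{le="2"}[5m]))
) / 2 / sum(rate(http_request_duration_seconds_count[5m]))

# 🔥 Container restart count (to spot unstable services; this metric comes from kube-state-metrics on Kubernetes)
increase(kube_pod_container_status_restarts_total[1h]) > 0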
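
The disk forecast query relies on predict_linear(), which fits a straight line through the samples in the window and extends it into the future. The following sketch shows that idea with an ordinary least-squares fit. It is a simplification: it uses the timestamp of the last sample as "now", whereas Prometheus uses the evaluation time, and the `predictLinear` name and the sample data (free space shrinking by 2 MB per second) are assumptions for illustration.

```javascript
// samples: [{ t: unixSeconds, v: value }] sorted by time.
// Returns the value the fitted line predicts horizonSeconds after the last sample.
function predictLinear(samples, horizonSeconds) {
  const now = samples[samples.length - 1].t;
  const n = samples.length;
  let sumX = 0;
  let sumY = 0;
  let sumXY = 0;
  let sumXX = 0;

  for (const { t, v } of samples) {
    const x = t - now; // shift time so that "now" is x = 0
    sumX += x;
    sumY += v;
    sumXY += x * v;
    sumXX += x * x;
  }

  const slope = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
  const intercept = (sumY - slope * sumX) / n;
  return intercept + slope * horizonSeconds;
}

// Hypothetical hour of data: 20 GB free at the start, shrinking by 2 MB/s,
// sampled every 10 minutes.
const freeBytes = [];
for (let t = 0; t <= 3600; t += 600) {
  freeBytes.push({ t, v: 20e9 - 2e6 * t });
}

const predictedIn4h = predictLinear(freeBytes, 4 * 3600);
console.log(predictedIn4h < 0 ? 'disk will fill up within 4 hours' : 'disk is fine');
```

A negative prediction means the fitted line crosses zero free bytes before the horizon, which is exactly the condition `< 0` tests in the query above.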

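💡 **Tip:** rate() applies to Counter metrics (monotonically increasing) and computes the average per-second rate over the time window. irate() uses only the last two samples to compute an instantaneous rate, so it reacts faster but is noisier. Use rate() for alerting rules and for most dashboard panels, because a brief dip in irate() can reset an alert's for timer. Reserve irate() for graphing volatile, fast-moving counters when you need to see short spikes.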
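
The difference between the two functions is easiest to see on concrete samples. The sketch below implements simplified versions of both: it handles counter resets the way Prometheus does (a drop in value is treated as a restart from zero) but skips the extrapolation to the window boundaries that the real rate() performs. The function names and sample values are invented for this illustration.

```javascript
// samples: [{ t: seconds, v: counterValue }] sorted by time.
// Sum the increases between neighbouring samples, treating any drop as a reset.
function counterIncrease(samples) {
  let total = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].v - samples[i - 1].v;
    // After a reset the counter restarted at zero, so its current value
    // is the increase since the restart.
    total += delta >= 0 ? delta : samples[i].v;
  }
  return total;
}

// Average per-second rate over the whole window (no extrapolation).
function simpleRate(samples) {
  const span = samples[samples.length - 1].t - samples[0].t;
  return counterIncrease(samples) / span;
}

// Instantaneous rate based on the last two samples only.
function simpleIrate(samples) {
  return simpleRate(samples.slice(-2));
}

// 15s scrape interval, a process restart at t=45, then a burst of traffic.
const counterSamples = [
  { t: 0, v: 100 },
  { t: 15, v: 130 },
  { t: 30, v: 160 },
  { t: 45, v: 10 }, // counter reset after a restart
  { t: 60, v: 310 }, // burst: 300 requests in 15s
];

const windowRate = simpleRate(counterSamples);
const instantRate = simpleIrate(counterSamples);
const naiveRate = (counterSamples[4].v - counterSamples[0].v) / 60;
console.log({ windowRate, instantRate, naiveRate });
```

The windowed rate smooths the burst, the instantaneous rate shows it at full height, and the naive subtraction is simply wrong because it ignores the reset.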

Alerting rule configuration

# prometheus/alert-rules.yml - production-oriented alerting rules
groups:
  - name: host-alerts
    rules:
      # CPU utilization above 85% for 5 minutes
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU utilization ({{ $labels.instance }})"
          description: "CPU utilization is {{ $value | printf \"%.1f\" }}% and has stayed above 85% for 5 minutes"

      # Memory utilization above 90%
      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High memory utilization ({{ $labels.instance }})"
          description: "Memory utilization is {{ $value | printf \"%.1f\" }}%; the OOM killer may be triggered"

      # Disk predicted to fill up within 4 hours
      - alert: DiskWillFull
        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 4*3600) < 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk space running out ({{ $labels.instance }})"
          description: "At the current trend, {{ $labels.mountpoint }} will be full within 4 hours"

  - name: app-alerts
    rules:
      # HTTP 5xx error ratio above 5%
      - alert: HighErrorRate
        expr: |
          sum by(job) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by(job) (rate(http_requests_total[5m])) > 0.05
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "High HTTP 5xx error ratio ({{ $labels.job }})"
          description: "5xx error ratio is {{ $value | humanizePercentage }}, above the 5% threshold"

      # P99 latency above 2 seconds
      - alert: HighLatency
        expr: histogram_quantile(0.99, sum by(le, job) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency ({{ $labels.job }})"
          description: "P99 latency is {{ $value | printf \"%.2f\" }}s, above the 2s threshold"
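
Every rule above carries a for duration, and its effect is worth spelling out: an alert whose expression becomes true first enters a pending state and only fires after the expression has stayed true for the whole duration. The sketch below models that lifecycle for a single alert. The `alertStates` function and the evaluation sequence are hypothetical, and details such as keep_firing_for or multiple label sets are left out.

```javascript
// evaluations: one boolean per rule evaluation (true = expression matched).
// Returns the alert state after each evaluation.
function alertStates(evaluations, forSeconds, evalIntervalSeconds) {
  const states = [];
  let activeSince = null;

  evaluations.forEach((matched, index) => {
    const now = index * evalIntervalSeconds;
    if (!matched) {
      // Any false evaluation resets the timer completely.
      activeSince = null;
      states.push('inactive');
      return;
    }
    if (activeSince === null) activeSince = now;
    states.push(now - activeSince >= forSeconds ? 'firing' : 'pending');
  });

  return states;
}

// Rule with "for: 1m" evaluated every 15s; one blip interrupts the condition.
const states = alertStates(
  [true, true, false, true, true, true, true, true],
  60,
  15
);
console.log(states.join(' → '));
```

The short dip in the third evaluation resets the timer, so the alert needs another full minute of continuous matches before it fires. That is the behavior that filters out one-off spikes.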

🔧 5. Grafana Dashboards and Log Correlation

Provisioning data sources automatically

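Grafana's provisioning mechanism registers data sources at startup, so nothing has to be configured by hand in the UI. This matters for infrastructure-as-code (IaC) workflows: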

# grafana/provisioning/datasources/datasources.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    editable: true
    jsonData:
      derivedFields:
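        # Extract the traceID from log lines to link metrics → logs → traces (assumes a Tempo data source with uid "tempo", which this stack does not deploy)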
        - datasourceUid: tempo
          matcherRegex: "traceID=(\\w+)"
          name: TraceID
          url: "$${__value.raw}"

Correlating metrics and logs

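One of Grafana's strongest features is correlating metrics with logs. A dashboard can show Prometheus metric panels and Loki log panels side by side, linked through the shared time range. When CPU suddenly spikes, select that time range on the chart and the log panel follows to the same window. That is the core value of observability.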

LogQL query examples:

# All error logs of one container (line filter first, then parse and filter by level)
{container="my-app"} |= "error" | logfmt | level="error"

# Per-second rate of error log lines over a 1-minute window (for an error trend panel in Grafana)
sum(rate({container="my-app"} |= "error" [1m])) by (container)

# Parse JSON logs, keep entries with status >= 500, and reformat the output line
{job="my-app"} | json | status >= 500 | line_format "{{.method}} {{.path}} {{.status}}"

# Count log lines per HTTP status code
sum by (status) (count_over_time({job="my-app"} | json | __error__="" [5m]))

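📌 **Remember:** Loki is designed to index labels, not log content. Do not attach too many labels to your logs, or you will hit a cardinality explosion. Keep the number of labels under 10 and handle high-cardinality fields such as user IDs and request IDs with LogQL pipeline filter expressions instead of labels.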

Alertmanager notification integration

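Once an alerting rule detects a problem, Alertmanager has to deliver it to the right people. The following configuration routes alerts to different notification channels by severity: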

# alertmanager/alertmanager.yml - alert routing and notification configuration
global:
  resolve_timeout: 5m

route:
  receiver: 'default-webhook'
  group_by: ['alertname', 'instance']
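  group_wait: 30s          # wait 30s before the first notification so alerts in the same group are batched
  group_interval: 5m       # minimum wait before notifying about new alerts in an already-notified group
  repeat_interval: 4h      # how often a still-firing alert is re-sent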
  routes:
    # critical alerts go to the urgent channel
    - match:
        severity: critical
      receiver: 'urgent-webhook'
      repeat_interval: 30m

receivers:
  - name: 'default-webhook'
    webhook_configs:
      - url: 'http://dingtalk-webhook:8060/dingtalk/ops/send'
        send_resolved: true

  - name: 'urgent-webhook'
    webhook_configs:
      - url: 'http://dingtalk-webhook:8060/dingtalk/urgent/send'
        send_resolved: true

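# Inhibition: while a critical alert fires, suppress warning alerts on the same instance.
# Only `instance` has to match; requiring the same alertname would never match here,
# because no alert in the rules file has both a warning and a critical variant.
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['instance']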

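📌 **Remember:** group_wait and group_interval are easy to confuse. group_wait is how long Alertmanager waits before sending the first notification for a new group, so that alerts arriving in the same window are merged. group_interval is the minimum wait before it notifies about new alerts added to a group that has already been notified. Values that are too short cause notification storms; values that are too long delay alerts. group_wait: 30s and group_interval: 5m are a reasonable starting point for production.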
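
A small simulation makes the interaction of the two timers easier to reason about. The sketch below is a simplified model of one alert group, not Alertmanager's actual implementation: the first notification goes out group_wait after the first alert, and later alerts are delivered on the next tick of a timer that repeats every group_interval after that first notification. repeat_interval and resolved notifications are ignored, and the function name and arrival times are made up.

```javascript
// arrivals: times (in seconds) at which new alerts join the same group.
// Returns the times at which a notification is sent for that group.
function notificationTimes(arrivals, groupWaitSeconds, groupIntervalSeconds) {
  const sorted = [...arrivals].sort((a, b) => a - b);
  if (sorted.length === 0) return [];

  const firstFlush = sorted[0] + groupWaitSeconds;
  const flushes = new Set([firstFlush]);

  for (const t of sorted) {
    if (t <= firstFlush) continue; // batched into the first notification
    // Later alerts wait for the next group_interval tick after the first flush.
    const ticks = Math.ceil((t - firstFlush) / groupIntervalSeconds);
    flushes.add(firstFlush + ticks * groupIntervalSeconds);
  }

  return [...flushes].sort((a, b) => a - b);
}

// group_wait: 30s, group_interval: 5m (300s); six alerts arrive over time.
const sentAt = notificationTimes([0, 10, 20, 100, 200, 400], 30, 300);
console.log(sentAt);
```

In this model six alerts produce three notifications: the first three are merged by group_wait, the next two share one group_interval tick, and the last one waits for the following tick.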

⚡ 6. Production Tuning and Pitfalls

Prometheus performance tuning parameters

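Parameter	Default	Recommended	Notes
`scrape_interval`	1m	15-30s	Relax to 60s for non-critical metrics to reduce scrape load
`--storage.tsdb.retention.time`	15d	15-30d	For longer history, ship data to long-term storage via Remote Write
`--storage.tsdb.retention.size`	0 (unlimited)	70% of disk	Prevents a full disk from crashing the service
`--query.max-samples`	50 million	50 million	Complex queries may need a higher limit
`--query.timeout`	2m	2m	Keeps slow queries from dragging Prometheus down
`--web.enable-lifecycle`	off	on	Allows hot-reloading the configuration through the HTTP API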

Three common pitfalls

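❌ Pitfall 1: label cardinality explosion. Using user_id, request_id, or ip_address as labels makes the number of time series blow up. 10,000 users × 10 endpoints = 100,000 time series, and Prometheus soon runs out of memory.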

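✅ Do this instead: send high-cardinality data to the logging system (Loki) and keep only aggregation dimensions in metrics. Keep the number of label-value combinations per metric under 10,000.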
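
The arithmetic behind this pitfall is a plain product: the worst-case number of series for a metric is the number of distinct values of each label multiplied together. The sketch below computes that upper bound for two label designs. The `estimateSeries` helper and the label counts are illustrative assumptions; real counts are usually lower because not every combination occurs.

```javascript
// labelCardinalities: { labelName: numberOfDistinctValues }
// Returns the worst-case number of time series for one metric.
function estimateSeries(labelCardinalities) {
  return Object.values(labelCardinalities).reduce(
    (product, distinctValues) => product * distinctValues,
    1
  );
}

// Design A: user_id as a label (the pitfall described above).
const withUserId = estimateSeries({ user_id: 10000, path: 10 });

// Design B: aggregation dimensions only.
const aggregatedOnly = estimateSeries({ path: 10, method: 4, status: 5 });

console.log({ withUserId, aggregatedOnly });
```

For a Histogram the result has to be multiplied again by the number of buckets, which is one more reason to keep both label sets and bucket lists short.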

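❌ Pitfall 2: rate() window too short. With a 15s scrape interval, rate(metric[1m]) sees only about 4 samples, so a single late or failed scrape changes the result noticeably and the chart becomes jittery.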

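✅ Do this instead: treat 4× the scrape interval as the absolute minimum for a rate() window, and prefer a wider one. With a 15s scrape interval, [5m] or longer is recommended.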

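❌ Pitfall 3: mishandling Counter resets after container restarts. A restart sets the Counter back to zero. If you subtract raw counter values yourself, or aggregate with sum() before applying rate(), the reset shows up as a bogus spike or drop and can trigger false alerts.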

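✅ Do this instead: rate() and increase() already handle Counter resets (a drop in value is treated as a reset and compensated), so apply them directly to the raw counter and aggregate afterwards, as in sum(rate(...)). Alerting rules should still use a for duration (at least 2-5 minutes) so that a single anomalous evaluation does not fire an alert.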

Loki storage and query optimization

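# Key Loki tuning settings for production
limits_config:
  max_query_length: 721h              # longest time range a single query may cover (about 30 days)
  max_query_parallelism: 4            # parallel sub-queries per query; limits query resource usage
  max_entries_limit_per_query: 5000   # maximum number of log lines returned per query
  ingestion_rate_mb: 10               # per-tenant ingestion rate limit in MB per second
  ingestion_burst_size_mb: 20         # per-tenant burst allowance in MB
  max_label_names_per_series: 30      # at most 30 labels per stream
  max_streams_per_user: 10000         # maximum number of streams per tenant (guards against cardinality explosion)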

Cost and solution comparison

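Solution	Deployment complexity	Storage cost	Query performance	Suitable scale
Single-node Prometheus	⭐ Low	Local SSD	Fast	< 500 scrape targets
Prometheus + Thanos	⭐⭐⭐ High	Object storage (S3/MinIO)	Medium	500-5000 targets
Prometheus + VictoriaMetrics	⭐⭐ Medium	Local/object storage	Fast	1000+ targets
Grafana Cloud	⭐ Low	Pay-as-you-go ($8 per 10K metrics)	Fast	Any scale, given enough budget
Managed cloud service (Alibaba Cloud ARMS)	⭐ Low	Pay-as-you-go	Fast	Teams that do not want to run it themselves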

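⚠️ **Warning:** Do not use a single Prometheus node for long-term storage in production. Query performance degrades sharply once it holds more than about 1 million time series. Beyond roughly 500 scrape targets, consider sharding or adding Thanos or VictoriaMetrics as a long-term storage backend.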

💡 7. Summary and Best Practices

The key points for building a production-grade observability stack:

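✅ Metrics, logs, and traces are the three pillars of observability; Prometheus + Loki + Tempo is a mature open-source combination
✅ The pull model is a core design strength of Prometheus; use service discovery to scrape new targets automatically
✅ Label design is the most important decision in Prometheus; never use high-cardinality fields as labels
✅ Enable histograms in application instrumentation, otherwise percentile latencies such as P50 and P99 cannot be computed
✅ Pair every alerting rule with a for duration so that a single anomaly does not cause false alarms and alert fatigue
✅ In production, always set a disk size limit (retention.size) and a log retention policy
✅ Correlating metrics and logs is the core value of observability; use Grafana's Derived Fields to link them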

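Suggested learning path: Prometheus official documentation → hands-on PromQL practice → Grafana community dashboard templates (recommended IDs: 1860 Node Exporter, 893 Docker, 12740 Spring Boot) → Loki log ingestion → Alertmanager notification integration.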