### Install kube-state-metrics

Source: https://cairry.github.io/docs/Exporter/kubernetes-resource/kubernetes.html

Kubernetes manifests for installing kube-state-metrics: a ServiceAccount, ClusterRole, ClusterRoleBinding, headless Service, and Deployment, all in the `kube-system` namespace.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups: [""]
  resources: ["nodes", "pods", "services", "resourcequotas", "replicationcontrollers", "limitranges", "persistentvolumeclaims", "persistentvolumes", "namespaces", "endpoints"]
  verbs: ["list", "watch"]
# Current clusters serve these resources from the "apps" group;
# "extensions" is kept only for clusters older than Kubernetes 1.16.
- apiGroups: ["extensions", "apps"]
  resources: ["daemonsets", "deployments", "replicasets"]
  verbs: ["list", "watch"]
- apiGroups: ["apps"]
  resources: ["statefulsets"]
  verbs: ["list", "watch"]
- apiGroups: ["batch"]
  resources: ["cronjobs", "jobs"]
  verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
  resources: ["horizontalpodautoscalers"]
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: v2.2.1
  name: kube-state-metrics
  namespace: kube-system
spec:
  clusterIP: None
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
  - name: telemetry
    port: 8081
    targetPort: telemetry
  selector:
    app.kubernetes.io/name: kube-state-metrics
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: v2.2.1
  name: kube-state-metrics
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/version: v2.2.1
    spec:
      containers:
      - image: registry.cn-shenzhen.aliyuncs.com/starsl/kube-state-metrics:v2.2.1
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
        name: kube-state-metrics
        ports:
        - containerPort: 8080
          name: http-metrics
        - containerPort: 8081
          name: telemetry
        readinessProbe:
          httpGet:
            path: /
            port: 8081
          initialDelaySeconds: 5
          timeoutSeconds: 5
      nodeSelector:
        beta.kubernetes.io/os: linux
      serviceAccountName: kube-state-metrics
```

--------------------------------

### Describe Method Implementation

Source: https://cairry.github.io/docs/Exporter/basic.html

An example implementation of the Describe method for a custom collector. It sends each metric descriptor to the channel, which tells the Prometheus registry which metrics the collector exposes.

```go
func (m Monitor) Describe(descs chan<- *prometheus.Desc) {
	descs <- m.InterfaceStatusCode
	descs <- m.SSLCertRemainingRime
}
```

--------------------------------

### Install VictoriaMetrics Operator with Helm

Source: https://cairry.github.io/docs/VictoriaMetrics/index.html

Installs the VictoriaMetrics Operator using Helm. Add and update the Helm repository before installing the chart.

```bash
helm repo add vm https://victoriametrics.github.io/helm-charts
helm repo update
helm install victoria-operator vm/victoria-metrics-operator
```

--------------------------------

### Kubernetes Service Discovery Configuration Example

Source: https://cairry.github.io/docs/Prometheus/ServiceDiscover.html

Example Prometheus configuration for Kubernetes service discovery, showing the global settings that precede the scrape configurations. It notes the use of the `endpoints` role for discovering services.
```yaml
global:
  # Scrape interval
  scrape_interval: 30s
  # Scrape timeout
  scrape_timeout: 10s
  # Separate cycle on which alerting rules are evaluated
  evaluation_interval: 30s
  # Labels attached when talking to external systems
  external_labels:
    prometheus: monitoring/k8s
    prometheus_replica: prometheus-k8s-1
# Scrape service endpoints. This whole job discovers node-exporter and
# kube-state-metrics through their Services, using the endpoints role.
```

--------------------------------

### Example Target JSON File for File-based SD

Source: https://cairry.github.io/docs/Prometheus/ServiceDiscover.html

Defines targets and their associated labels for Prometheus to discover. Each entry includes a list of targets and a labels object. File-based SD entries accept only `targets` and `labels`, so a per-target metrics path is set through the `__metrics_path__` label.

```json
[
  {
    "targets": [
      "172.16.0.96:19100"
    ],
    "labels": {
      "project_name": "项目测试20200413",
      "env_name": "开发环境",
      "soft_name": "测试应用20200413",
      "template_name": "测试模板20200413",
      "template_type": "主机模板"
    }
  },
  {
    "targets": [
      "172.16.0.96:30013",
      "172.16.0.96:30015"
    ],
    "labels": {
      "project_name": "项目测试20200413",
      "env_name": "开发环境",
      "soft_name": "测试应用20200413",
      "template_name": "测试模板20200413",
      "template_type": "主机模板"
    }
  },
  {
    "targets": [
      "172.16.0.96:9200"
    ],
    "labels": {
      "__metrics_path__": "/_prometheus/metrics",
      "project_name": "项目测试20200413",
      "env_name": "开发环境",
      "soft_name": "测试应用20200413",
      "template_name": "测试模板20200413",
      "template_type": "主机模板"
    }
  }
]
```

--------------------------------

### Collect Method Implementation

Source: https://cairry.github.io/docs/Exporter/basic.html

An example implementation of the Collect method, which gathers the actual metric data. It iterates through the configured domains, starts a goroutine to probe each one, and waits for the probes to finish.
```go
func (m Monitor) Collect(metrics chan<- prometheus.Metric) {
	for srvName, domainName := range config.DomainMap {
		// Probe the domain status in its own goroutine.
		// Assumes Gauge calls wg.Done() when it finishes.
		wg.Add(1)
		go Gauge(srvName, domainName)
	}
	// Wait once, after every probe has been started, so the probes run concurrently.
	wg.Wait()
}
```

--------------------------------

### Prometheus Data Format Example

Source: https://cairry.github.io/docs/Exporter/basic.html

Illustrates the structure of a Prometheus sample: metric name, labels, value, and an optional timestamp in milliseconds since the Unix epoch.

```text
http_requests_total{method="GET", handler="/api"} 1027 1626568200000
```

--------------------------------

### Testing Metrics Endpoint

Source: https://cairry.github.io/docs/Exporter/kubernetes-controlplane/controller-manager.html

Example command to test the Controller Manager's metrics endpoint (secure port 10257) using curl. Requires a client certificate and key for authentication.

```bash
[root@master01 manifests]# curl -s -k --cert ./client-cert.pem --key ./client-key.pem https://localhost:10257/metrics | head -n 5
# HELP apiserver_audit_event_total [ALPHA] Counter of audit events generated and sent to the audit backend.
# TYPE apiserver_audit_event_total counter
apiserver_audit_event_total 0
# HELP apiserver_audit_requests_rejected_total [ALPHA] Counter of apiserver requests rejected due to an error in audit logging backend.
# TYPE apiserver_audit_requests_rejected_total counter
```

--------------------------------

### MySQL Exporter Metric Example

Source: https://cairry.github.io/docs/Exporter/mysql.html

The `mysql_up` metric indicates whether the exporter could connect to the MySQL server and collect data. A value of 1 means collection succeeded.

```text
# HELP mysql_up Whether the MySQL server is up.
# TYPE mysql_up gauge
mysql_up 1
```

--------------------------------

### Prometheus Alerting Rule Example

Source: https://cairry.github.io/docs/Prometheus/index.html

An example Prometheus alerting rule that fires when an exporter component has been down (`up == 0`) for 2 minutes. It includes a severity label and custom summary and description annotations.

```yaml
groups:
# --- Node
- name: NodeStatus # Alert rule group name
  rules:
  - alert: Exporter Component is Down
    expr: up == 0
    for: 2m # Duration: fire only after the target has been unreachable for 2 minutes
    labels:
      severity: serious # Custom label
    annotations:
      summary: "Node: {{ $labels.instance }} Exporter" # Custom summary
      description: "Node: {{ $labels.instance }} Exporter is abnormal, please handle it promptly!" # Custom description
```

--------------------------------

### Analyze Error Log Growth Rate

Source: https://cairry.github.io/docs/Monitor/business/index.html

Analyzes the growth rate of error logs. The example matches when more than 50 error logs were written in the last 5 minutes and that count is at least 30% higher than the count for the 20-minute window that ended 15 minutes ago.

```PromQL
increase(l2m_level_info{level="ERROR"}[5m]) > 50
and
(increase(l2m_level_info{level="ERROR"}[5m]) - increase(l2m_level_info{level="ERROR"}[20m] offset 15m))
  / increase(l2m_level_info{level="ERROR"}[20m] offset 15m) * 100 >= 30
```

--------------------------------

### Kubernetes Deployment for MySQL Exporter

Source: https://cairry.github.io/docs/Exporter/mysql.html

These Kubernetes manifests define a ConfigMap holding the MySQL credentials, a Deployment for the mysqld-exporter container, and a Service that exposes the metrics endpoint so Prometheus can scrape the exporter.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mysql-exporter
  namespace: monitoring
data:
  my.cnf: |-
    [client]
    user=exporter
    password=exporter_2024
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysql-exporter
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mysql-exporter
  template:
    metadata:
      labels:
        app: mysql-exporter
    spec:
      containers:
      - name: mysql-exporter
        image: prom/mysqld-exporter:v0.16.0
        ports:
        - name: metrics
          containerPort: 9104
        command:
        - mysqld_exporter
        - --config.my-cnf=/cfg/my.cnf
        volumeMounts:
        - mountPath: /cfg/my.cnf
          name: mysql-exporter
          subPath: my.cnf
      volumes:
      - configMap:
          defaultMode: 420
          name: mysql-exporter
        name: mysql-exporter
---
apiVersion: v1
kind: Service
metadata:
  name: mysql-exporter
  namespace: monitoring
spec:
  ports:
  - port: 9104
    targetPort: 9104
    protocol: TCP
    name: metrics
  selector:
    app: mysql-exporter
```

--------------------------------

### Get or Create Metric from Hash

Source: https://cairry.github.io/docs/Exporter/start.html

Retrieves a metric by its hash and label values. If the metric does not exist, it creates a new one and adds it to the map. A read lock covers the lookup; the write lock is taken only when a metric may need to be created.

```go
func (m *metricMap) getOrCreateMetricWithLabelValues(
	hash uint64, lvs []string, curry []curriedLabelValue,
) Metric {
	m.mtx.RLock()
	// Fast path: return the metric if it already exists.
	metric, ok := m.getMetricWithHashAndLabelValues(hash, lvs, curry)
	m.mtx.RUnlock()
	if ok {
		return metric
	}

	m.mtx.Lock()
	defer m.mtx.Unlock()
	// Check again under the write lock, then create the metric if it is still missing.
	metric, ok = m.getMetricWithHashAndLabelValues(hash, lvs, curry)
	if !ok {
		inlinedLVs := inlineLabelValues(lvs, curry)
		// Create the metric, e.g. xx{aa=bb...}
		metric = m.newMetric(inlinedLVs...)
		// Index the new metric under the hash of its label values.
		m.metrics[hash] = append(m.metrics[hash], metricWithLabelValues{values: inlinedLVs, metric: metric})
	}
	return metric
}
```

--------------------------------

### Prometheus Configuration for MySQL Exporter

Source: https://cairry.github.io/docs/Exporter/mysql.html

Configure Prometheus to scrape MySQL targets through the MySQL exporter. The relabeling rules pass each target's address to the exporter as the `target` parameter, keep it as the `instance` label, and point the scrape at the exporter Service.

```yaml
- job_name: 'MySQL'
  # Multi-target scrapes go through the exporter's /probe endpoint.
  metrics_path: /probe
  static_configs:
  - targets:
    - mysql1.com:3306
    - mysql2.com:3306
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
    action: replace
  - target_label: __address__
    replacement: mysql-exporter.monitoring:9104
```

--------------------------------

### Reduce Large Scope Queries

Source: https://cairry.github.io/docs/Prometheus/optimization_memory.html

Example of a query that can consume significant memory because it touches every scraped target. Keep the scope of such queries as small as possible in Grafana dashboards.
```promql
rate(up[1m])
```

--------------------------------

### Install the Prometheus Client

Source: https://cairry.github.io/docs/Exporter/start.html

Install the Prometheus client library with `go get`.

```bash
go get -u github.com/prometheus/client_golang/prometheus/promhttp
```

--------------------------------

### Expose Default Metrics

Source: https://cairry.github.io/docs/Exporter/start.html

Register an HTTP handler with `promhttp.Handler()` to expose the default Go runtime and promhttp metrics. The server listens on port 8005.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	http.Handle("/metrics", promhttp.Handler())
	// ListenAndServe only returns on failure, so report the error.
	log.Fatal(http.ListenAndServe(":8005", nil))
}
```

--------------------------------

### Register Metrics with WithLabelValues

Source: https://cairry.github.io/docs/Exporter/start.html

Use the `WithLabelValues` method to register metrics with dynamic labels. It suits complex metrics, dynamic values, or a large number of labels, and offers more flexibility.

```go
// Assumes HTTPLatencyHistogramCollect is a *prometheus.HistogramVec.
// WithLabelValues returns an Observer, which has Observe but no Collect method.
pgc.HTTPLatencyHistogramCollect.WithLabelValues(name, strconv.Itoa(statusCode), requestPath).Observe(latency)
// Forward the vector's current metrics to the collector channel.
pgc.HTTPLatencyHistogramCollect.Collect(ch)
```

--------------------------------

### Enable Etcd Metrics Interface

Source: https://cairry.github.io/docs/Exporter/kubernetes-controlplane/etcd.html

Modify the etcd configuration to expose metrics on a dedicated port by setting the `--listen-metrics-urls` flag.

```yaml
    - etcd
    - --advertise-client-urls=https://192.168.1.176:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    ...
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --listen-metrics-urls=http://0.0.0.0:2381 # add this flag
    image: registry.aliyuncs.com/google_containers/etcd:3.4.13-0
```

--------------------------------

### Get or Create Metric with Label Values

Source: https://cairry.github.io/docs/Exporter/start.html

`WithLabelValues` returns the Gauge for the given label values, creating it if needed; the vector looks metrics up by a hash of the label values. It panics if the lookup returns an error, for example when the number of label values does not match the number of labels.

```go
func (v *GaugeVec) WithLabelValues(lvs ...string) Gauge {
	g, err := v.GetMetricWithLabelValues(lvs...)
	if err != nil {
		panic(err)
	}
	return g
}
```

--------------------------------

### Filesystem Reads per Pod in DaemonSet

Source: https://cairry.github.io/docs/Exporter/kubernetes-resource/kubernetes.html

Retrieves the filesystem read throughput (bytes per second over 5 minutes) for each container in pods created by a DaemonSet. Aids in performance optimization and capacity planning.

```PromQL
group(kube_pod_info{created_by_kind="DaemonSet",pod_ip!=""}) by (pod)
* on(pod) group_right()
max(rate(container_fs_reads_bytes_total{container!="POD", container!=""}[5m])) by (pod,container)
```

--------------------------------

### Configure WAL Segment Size and Compression

Source: https://cairry.github.io/docs/Prometheus/optimization_memory.html

Command-line arguments that reduce the WAL segment size to 64MB and enable WAL compression, which can decrease disk space usage.

```bash
prometheus --storage.tsdb.wal-segment-size=64MB --storage.tsdb.wal-compression
```

--------------------------------

### Get Last Value Over Time in PromQL

Source: https://cairry.github.io/docs/Prometheus/promql.html

Use `last_over_time` to retrieve the most recent sample of each time series within a given time window. This is useful for reading the latest state of a metric.

```PromQL
last_over_time(http_request_duration_seconds{instance="server1"}[5m])
```

--------------------------------

### Registering Metrics with MustRegister

Source: https://cairry.github.io/docs/Exporter/basic.html

Shows how `MustRegister` registers multiple collectors with a registry and panics if any registration fails. This is part of the metrics registration process for custom Exporters.

```go
// MustRegister implements Registerer.
func (r *Registry) MustRegister(cs ...Collector) {
	for _, c := range cs {
		if err := r.Register(c); err != nil {
			panic(err)
		}
	}
}
```

--------------------------------

### Collector Interface Methods

Source: https://cairry.github.io/docs/Exporter/basic.html

Shows the two methods required to implement the Prometheus Collector interface: Describe and Collect. A custom Exporter exposes its metrics through these methods.

```go
type Collector interface {
	Describe(chan<- *Desc)
	Collect(chan<- Metric)
}
```

--------------------------------

### Drop Metrics with Specific Name Pattern

Source: https://cairry.github.io/docs/Prometheus/optimization_memory.html

Prometheus relabeling configuration that drops all metrics whose names start with `temp_`. This reduces memory consumption by discarding unwanted metrics before they are stored.

```yaml
scrape_configs:
- job_name: 'example'
  static_configs:
  - targets: ['localhost:9090']
  metric_relabel_configs:
  # Drop every metric whose name starts with "temp_"
  - source_labels: [__name__]
    action: drop
    regex: 'temp_.*'
```

--------------------------------

### File-based Service Discovery Configuration

Source: https://cairry.github.io/docs/Prometheus/ServiceDiscover.html

Configure Prometheus to discover targets from JSON files. Specify the job name, file paths, and refresh interval.

```yaml
scrape_configs:
- job_name: 'file_ds'
  file_sd_configs:
  - files:
    - targets/*.json
    refresh_interval: 5m
```

--------------------------------

### Calculate Filesystem Usage Rate

Source: https://cairry.github.io/docs/Exporter/node.html

Calculates the filesystem usage percentage for ext and xfs file systems. Use it to catch disks that are close to full before they disrupt normal file system operation.
```PromQL
max(
  (node_filesystem_size_bytes{fstype=~"ext.?|xfs"} - node_filesystem_free_bytes{fstype=~"ext.?|xfs"}) * 100
  /
  (node_filesystem_avail_bytes{fstype=~"ext.?|xfs"} + (node_filesystem_size_bytes{fstype=~"ext.?|xfs"} - node_filesystem_free_bytes{fstype=~"ext.?|xfs"}))
) by (ecs_cname,instance,service_id)
```

--------------------------------

### CPU Usage per Pod in DaemonSet

Source: https://cairry.github.io/docs/Exporter/kubernetes-resource/kubernetes.html

Retrieves the CPU usage for each container in pods created by a DaemonSet. Aids in performance monitoring and optimization of DaemonSet pods.

```PromQL
group(kube_pod_info{created_by_kind="DaemonSet",pod_ip!=""}) by (pod)
* on(pod) group_right()
max(rate(container_cpu_usage_seconds_total{container!="",container!="POD"}[5m])) by (pod, container)
```

--------------------------------

### Configure Redis Exporter with Environment Variables (Single Instance)

Source: https://cairry.github.io/docs/Exporter/redis.html

This configuration uses environment variables to specify the Redis address and password for the exporter. This method is suitable for monitoring a single Redis instance.

```yaml
env:
- name: TZ
  value: "Asia/Shanghai"
- name: REDIS_ADDR
  value: "redis://redis-ztest.infra:6379"
- name: REDIS_PASSWORD
  value: "xxxx:xxxx"
```

--------------------------------

### Create Etcd Service for VM Discovery

Source: https://cairry.github.io/docs/Exporter/kubernetes-controlplane/etcd.html

Define a Kubernetes Service that exposes the etcd metrics endpoint, enabling automatic discovery by VM (VictoriaMetrics).

```yaml
kind: Service
apiVersion: v1
metadata:
  name: etcd
  namespace: kube-system
  labels:
    component: etcd
spec:
  selector:
    component: etcd
  ports:
  - name: metrics
    port: 2381
```

--------------------------------

### Ready Pods Count for DaemonSet

Source: https://cairry.github.io/docs/Exporter/kubernetes-resource/kubernetes.html

Retrieves the number of ready pods for a specified DaemonSet. Essential for monitoring the health status of DaemonSets.
```PromQL
kube_daemonset_status_number_ready{daemonset="$daemonset",namespace="$namespace"}
```

--------------------------------

### Prometheus Data Directory Structure

Source: https://cairry.github.io/docs/Prometheus/optimization_memory.html

Illustrates the directory structure for Prometheus data, differentiating between in-memory blocks with WAL files and persistent blocks with chunk and index files.

```text
./data/01BKGV7JBM69T2G1BGBGM6KB12
./data/01BKGV7JBM69T2G1BGBGM6KB12/meta.json
./data/01BKGV7JBM69T2G1BGBGM6KB12/wal/000002
./data/01BKGV7JBM69T2G1BGBGM6KB12/wal/000001
```

```text
./data/01BKGV7JC0RY8A6MACW02A2PJD
./data/01BKGV7JC0RY8A6MACW02A2PJD/meta.json
./data/01BKGV7JC0RY8A6MACW02A2PJD/index
./data/01BKGV7JC0RY8A6MACW02A2PJD/chunks
./data/01BKGV7JC0RY8A6MACW02A2PJD/chunks/000001
./data/01BKGV7JC0RY8A6MACW02A2PJD/tombstones
```

--------------------------------

### Controller Manager Monitoring Metrics (PromQL)

Source: https://cairry.github.io/docs/Exporter/kubernetes-controlplane/controller-manager.html

PromQL queries for monitoring key Controller Manager metrics, including workqueue rates, depth, latency, and API server request QPS.
```PromQL
sum(rate(workqueue_adds_total{job="kubernetes-controller-manager"}[$interval])) by (name)
```

```PromQL
sum(workqueue_depth{job="kubernetes-controller-manager"}) by (name)
```

```PromQL
histogram_quantile($quantile, sum(rate(workqueue_queue_duration_seconds_bucket{job="kubernetes-controller-manager"}[5m])) by (name, le))
```

```PromQL
sum(rate(rest_client_requests_total{job="kubernetes-controller-manager",code=~"2.."}[$interval])) by (method,code)
```

```PromQL
sum(rate(rest_client_requests_total{job="kubernetes-controller-manager",code!~"2.."}[$interval])) by (method,code)
```

--------------------------------

### Prometheus Configuration for Kubelet Monitoring

Source: https://cairry.github.io/docs/Exporter/kubernetes-controlplane/cluster.html

This Prometheus configuration discovers Kubernetes nodes and scrapes Kubelet metrics. Relabeling copies the node labels, rewrites the metrics path to the API server's node proxy, and points every scrape at the API server address.

```yaml
job_name: 'kubernetes-kubelets'
scheme: https
tls_config:
  insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
  regex: __meta_kubernetes_node_label_(.+)
- source_labels: [__meta_kubernetes_node_name]
  regex: (.+)
  target_label: __metrics_path__
  replacement: /api/v1/nodes/${1}/proxy/metrics
- target_label: __address__
  replacement: kubernetes.default.svc:443
```

--------------------------------

### Monitor Process Open Files

Source: https://cairry.github.io/docs/Exporter/node.html

Retrieves the open file count per process, excluding containerd-shim processes. Monitor this to catch processes that approach the system's open-file limits.
```PromQL
describe_node_process_openfiles_info{name!="containerd-shim", name!="containerd-shim-runc-v2"}
```

--------------------------------

### Top 10 CPU Usage by Container

Source: https://cairry.github.io/docs/Exporter/kubernetes-resource/kubernetes.html

Calculates the top 10 containers with the highest CPU usage over the past minute. Useful for identifying high-consuming workloads for optimization.

```PromQL
topk(10, sum(irate(container_cpu_usage_seconds_total{container!="",container!="POD",pod!=""}[1m]) * 100) by (container,pod,namespace) or on() vector(0))
```

--------------------------------

### Deploy Node-Process-Exporter DaemonSet

Source: https://cairry.github.io/docs/Exporter/node.html

This Kubernetes DaemonSet deploys the node-process-exporter to each node in the cluster, together with a ClusterIP Service. The container runs privileged and uses host networking and the host PID and IPC namespaces.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: node-process-exporter
  name: node-process-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-process-exporter
  template:
    metadata:
      labels:
        app: node-process-exporter
    spec:
      containers:
      - image: cairry/node-process-exporter:latest
        imagePullPolicy: IfNotPresent
        name: node-process-exporter
        ports:
        - containerPort: 9002
          hostPort: 9002
          protocol: TCP
        resources:
          limits:
            cpu: "1"
            memory: 1Gi
          requests:
            cpu: 250m
            memory: 512Mi
        securityContext:
          privileged: true
      hostIPC: true
      hostNetwork: true
      hostPID: true
      restartPolicy: Always
      tolerations:
      - effect: NoSchedule
        operator: Exists
---
apiVersion: v1
kind: Service
metadata:
  name: node-process-exporter
  namespace: monitoring
spec:
  ports:
  - port: 9002
    protocol: TCP
    targetPort: 9002
  selector:
    app: node-process-exporter
  sessionAffinity: None
  type: ClusterIP
```

--------------------------------

### Set Data Retention Time and Size

Source: https://cairry.github.io/docs/Prometheus/optimization_memory.html

Command-line arguments that configure Prometheus to retain data for 30 days and limit the total storage size to 50GB. When both are set, whichever limit is reached first applies.

```bash
prometheus --storage.tsdb.retention.time=30d --storage.tsdb.retention.size=50GB
```

--------------------------------

### Kubernetes Service Discovery Annotations

Source: https://cairry.github.io/docs/Prometheus/ServiceDiscover.html

Add annotations to Kubernetes Services or Pods to opt them in to Prometheus service discovery and to set the scrape port. The annotations take effect only when the scrape job's relabeling rules read them.

```yaml
annotations: # add the following settings
  prometheus.io/port: "3002" # port to register automatically
  prometheus.io/scrape: "true" # whether to register automatically
```

--------------------------------

### Prometheus Server Configuration

Source: https://cairry.github.io/docs/Prometheus/index.html

Global and scrape configurations for Prometheus Server. Includes scrape and evaluation intervals, external labels, Alertmanager targets, rule file paths, and a scrape job that covers Prometheus itself and a node exporter.

```yaml
global:
  scrape_interval: 5s
  evaluation_interval: 5s
  # Global label set.
  # Added to data from this instance when it is sent to external systems
  # (federation, remote write, Alertmanager).
  external_labels:
    account: "huawei-main"
    region: "beijing"
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - '172.17.84.238:9093'
rule_files:
- "/etc/prometheus/rules/first_rules.yml"
scrape_configs:
- job_name: "prometheus"
  static_configs:
  - targets: ["localhost:9090"]
  - targets: ["172.17.84.238:9100"]
```

--------------------------------

### Network Receive Bytes per Pod in DaemonSet

Source: https://cairry.github.io/docs/Exporter/kubernetes-resource/kubernetes.html

Retrieves the network receive rate (bytes per second over 5 minutes) for each pod created by a DaemonSet. Helps identify potential network bottlenecks.
```PromQL
sum(kube_pod_info{created_by_kind="DaemonSet",pod_ip!=""}) by (pod)
* on(pod) group_right()
max(rate(container_network_receive_bytes_total{}[5m])) by (pod)
```

--------------------------------

### Custom Collector Interface Declaration

Source: https://cairry.github.io/docs/Exporter/start.html

The Collector interface defines how Prometheus obtains metric descriptions and metric data from a custom collector.

```go
type Collector interface {
	// Descriptive information about the metrics, i.e. the lines that start with "#".
	// A pointer is used here because a single copy of the description is stored globally.
	Describe(chan<- *Desc)
	// The metric data, e.g. promhttp_metric_handler_errors_total{cause="gathering"} 0.
	// No pointer is used here because every collected value is independent.
	Collect(chan<- Metric)
}
```

--------------------------------

### Verify Redis Exporter Metrics

Source: https://cairry.github.io/docs/Exporter/redis.html

This command uses `curl` to fetch metrics from the Redis Exporter and `grep` to filter for the `redis_up` metric. A value of 1 indicates a successful connection and data collection.

```bash
[root@iZ2zeh5cd0wu2m4o1m5xjaZ ~]# curl 10.15.1.2:9121/metrics -s | grep redis_up
# HELP redis_up Information about the Redis instance
# TYPE redis_up gauge
redis_up 1
```

--------------------------------

### Calculate CPU Usage Rate

Source: https://cairry.github.io/docs/Exporter/node.html

Calculates the CPU usage percentage per instance as 100 minus the idle percentage; `irate` uses the last two samples inside the 5-minute window. Use this to monitor CPU load and identify high-load situations for capacity planning and performance tuning.

```PromQL
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance,service_id,ecs_cname) * 100)
```

--------------------------------

### Top 10 Socket Usage by Container

Source: https://cairry.github.io/docs/Exporter/kubernetes-resource/kubernetes.html

Calculates the socket usage for each container and identifies the top 10 containers with the highest usage. Useful for detecting potential network bottlenecks.
```PromQL
topk(10, sum(container_sockets{container!="",pod!=""}) by (container,pod,namespace) or on() vector(0))
```

--------------------------------

### Create MySQL Exporter User and Grant Permissions

Source: https://cairry.github.io/docs/Exporter/mysql.html

This SQL script creates a dedicated user for the MySQL exporter and grants it the privileges needed to read monitoring information. Run it before deploying the exporter.

```sql
CREATE USER 'exporter'@'%' IDENTIFIED BY 'exporter_2024';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';
FLUSH PRIVILEGES;
```

--------------------------------

### Controller Manager Manifest Configuration

Source: https://cairry.github.io/docs/Exporter/kubernetes-controlplane/controller-manager.html

Configuration snippet from the kube-controller-manager manifest, showing its command-line arguments. Adjust `--bind-address` if necessary so that the metrics endpoint is reachable.

```yaml
  - command:
    - kube-controller-manager
    - --allocate-node-cidrs=true
    - --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --bind-address=0.0.0.0 # adjust this setting
    ...
    - --use-service-account-credentials=true
    image: registry.aliyuncs.com/google_containers/kube-controller-manager:v1.20.4
```

--------------------------------

### Prometheus Server Docker Compose Configuration

Source: https://cairry.github.io/docs/Prometheus/index.html

This configuration sets up a Prometheus Server using Docker Compose. It defines volumes for data and rules, exposes the Prometheus port, and configures retention policies and command-line arguments.
```yaml version: "3" services: prometheus: container_name: prometheus image: registry.js.design/prometheus/prometheus:v2.32.1 ports: - 9090:9090 volumes: - /opt/apps/monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml - /opt/apps/monitoring/prometheus/data:/prometheus - /opt/apps/monitoring/prometheus/rules/:/etc/prometheus/rules/ - /etc/localtime:/etc/localtime:ro restart: always command: - '--config.file=/etc/prometheus/prometheus.yml' - '--web.enable-admin-api' - '--web.enable-lifecycle' - '--storage.tsdb.retention=15d' - '--storage.tsdb.path=/prometheus' networks: - monitor networks: monitor: driver: bridge ``` -------------------------------- ### Deploy VictoriaMetrics Single-Node Source: https://cairry.github.io/docs/VictoriaMetrics/index.html This Kubernetes deployment configuration sets up a single instance of VictoriaMetrics, including persistent storage and service exposure. ```yaml apiVersion: v1 kind: PersistentVolumeClaim metadata: name: victoria-metrics-data namespace: observability spec: storageClassName: "local-path" accessModes: - ReadWriteOnce resources: requests: storage: 100Gi --- apiVersion: apps/v1 kind: Deployment metadata: name: victoria-metrics namespace: observability spec: selector: matchLabels: app: victoria-metrics template: metadata: labels: app: victoria-metrics spec: containers: - name: vm image: victoriametrics/victoria-metrics:v1.79.8 imagePullPolicy: IfNotPresent args: - -storageDataPath=/var/lib/victoria-metrics-data - -retentionPeriod=2w - -promscrape.config=/etc/prometheus/prometheus.yaml ports: - containerPort: 8428 name: http volumeMounts: - mountPath: /var/lib/victoria-metrics-data name: storage - mountPath: "/etc/prometheus/" name: prometheus-config readOnly: true volumes: - name: prometheus-config secret: secretName: vm-agent-target items: - key: "prometheus.yaml" path: "prometheus.yaml" - name: storage persistentVolumeClaim: claimName: victoria-metrics-data --- apiVersion: v1 kind: Service metadata: name: 
victoria-metrics namespace: observability spec: type: NodePort ports: - port: 8428 selector: app: victoria-metrics ``` -------------------------------- ### 实现自定义Collector Source: https://cairry.github.io/docs/Exporter/start.html 通过实现`Collector`接口来创建自定义指标收集器。`NewMonitorMetrics`用于初始化指标规范,`Describe`提供指标描述,`Collect`提供指标数据。默认监听在8001端口。 ```go import ( "net/http" "github.com/prometheus/client_golang/prometheus" "github.com/prometheus/client_golang/prometheus/promhttp" ) // EmptyRegistry 空指标注册表 var ( EmptyRegistry = prometheus.NewRegistry() ) // Monitor 创建采集器 type Monitor struct { InterfaceStatusCode *prometheus.Desc } // NewMonitorMetrics 创建采集器指标注册规范 func NewMonitorMetrics() *Monitor { return &Monitor{ InterfaceStatusCode: prometheus.NewDesc( "url_interface_status_code", // 指标名称 "url 接口状态码", // 描述信息 []string{"app", "url"}, // 动态指标 nil, // 静态指标 ), } } // Describe 收集描述信息 func (m Monitor) Describe(desc chan<- *prometheus.Desc) { desc <- m.InterfaceStatusCode } // Collect 收集指标数据 func (m Monitor) Collect(metrics chan<- prometheus.Metric) { metrics <- prometheus.MustNewConstMetric( m.InterfaceStatusCode, prometheus.GaugeValue, float64(100), "test", "http://url", ) } func TestRegisterer() { // 注册采集器 EmptyRegistry.MustRegister(NewMonitorMetrics()) http.HandleFunc("/metrics", func(writer http.ResponseWriter, request *http.Request) { promhttp.HandlerFor(EmptyRegistry, promhttp.HandlerOpts{ErrorHandling: promhttp.ContinueOnError}).ServeHTTP(writer, request) }) _ = http.ListenAndServe(":8001", nil) } func main() { TestRegisterer() } ``` -------------------------------- ### Monitor Disk Write Rate Source: https://cairry.github.io/docs/Exporter/node.html Monitors the number of completed disk writes per second. Use this to identify disk performance bottlenecks. ```PromQL avg(rate(node_disk_writes_completed_total{}[1m])) ``` -------------------------------- ### Calculate Memory Usage Rate Source: https://cairry.github.io/docs/Exporter/node.html Calculates the memory usage rate. 
Use this to monitor system memory usage and identify potential memory shortage issues.

```PromQL
100 - (node_memory_MemAvailable_bytes{} / node_memory_MemTotal_bytes{} * 100)
```

--------------------------------

### Register Metrics with MustNewConstMetric

Source: https://cairry.github.io/docs/Exporter/start.html

Use `MustNewConstMetric` to register metrics; a `Desc` must be defined first. This approach suits simple metrics, fixed values, and a small number of constant labels. It is more efficient to create and easier to read.

```go
// Fragment: requires the "os", "strconv", and
// "github.com/prometheus/client_golang/prometheus" imports.
pid := strconv.Itoa(os.Getpid())
cmdline := os.Args[0]
user := os.Getenv("USER")
cpuUsage := 50.0 // Percentage

desc := prometheus.NewDesc(
	"cpu_usage",
	"CPU usage of process",
	[]string{"pid", "cmdline", "user"},
	nil,
)

metric := prometheus.MustNewConstMetric(
	desc,
	prometheus.GaugeValue,
	cpuUsage,
	pid, cmdline, user,
)

// ch is the chan<- prometheus.Metric passed to the collector's Collect method.
ch <- metric
```

--------------------------------

### Prometheus Configuration for Nginx Exporter

Source: https://cairry.github.io/docs/Exporter/nginx.html

This Prometheus configuration uses Kubernetes service discovery to find and scrape metrics from the nginx-vts-exporter. It restricts discovery to the 'monitor' namespace, copies the service name into a `service_name` label, and keeps only targets whose address ends in port 9913.

```yaml
- job_name: 'nginx-vts-exporter'
  kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names:
          - monitor
  relabel_configs:
    - source_labels: [__meta_kubernetes_service_name]
      target_label: service_name
      action: replace
    # Keep only targets listening on port 9913.
    - source_labels: [__address__]
      regex: '(.*):9913'
      target_label: __address__
      action: keep
```

--------------------------------

### Calculate P95 Interface Latency

Source: https://cairry.github.io/docs/Monitor/business/index.html

Calculates the 95th percentile of interface response times per route over a 5-minute interval. Useful for identifying tail latency issues.

```PromQL
sort_desc(histogram_quantile(0.95, sum(rate(http_server_duration_milliseconds_bucket{job=~"$app", http_route!=""}[5m])) by (le, http_route)))
```

--------------------------------

### Calculate Average Interface Latency

Source: https://cairry.github.io/docs/Monitor/business/index.html

Calculates the average response time for interface requests over a 5-minute interval by dividing the rate of the duration sum by the rate of the request count. Use this to analyze overall system response trends.

```PromQL
sort_desc(rate(http_server_duration_milliseconds_sum{job=~"$app", http_route=~"$route", http_status_code!=""}[5m]) / rate(http_server_duration_milliseconds_count{job=~"$app", http_route=~"$route", http_status_code!=""}[5m]))
```

--------------------------------

### Configure Consul Service Discovery in Prometheus

Source: https://cairry.github.io/docs/Prometheus/ServiceDiscover.html

Use `consul_sd_configs` to enable Consul service discovery. Specify the Consul server address and optionally filter services; an empty `services` list discovers all services.

```yaml
...
- job_name: 'consul-prometheus'
  consul_sd_configs:
    - server: '192.168.1.177:8500'
      services: []
```

--------------------------------

### Memory Usage per Pod in DaemonSet

Source: https://cairry.github.io/docs/Exporter/kubernetes-resource/kubernetes.html

Retrieves the memory usage of each container in every pod created by a DaemonSet. Useful for resource management and optimization of DaemonSet pods.

```PromQL
group(kube_pod_info{created_by_kind="DaemonSet"}) by (pod) * on(pod) group_right() max(container_memory_usage_bytes{container!="POD",container!=""}) by (container, pod)
```

--------------------------------

### Analyze Error Log Percentage

Source: https://cairry.github.io/docs/Monitor/business/index.html

Calculates the percentage of error logs relative to all log entries for each service over a 10-minute interval, and returns only services where that share exceeds 1%. Use this to identify services with a high proportion of errors.

```PromQL
sum(increase(l2m_level_info{level="ERROR"}[10m])) by (service) / sum(increase(l2m_level_info[10m])) by (service) * 100 > 1
```

--------------------------------

### Calculate Overall Request Success Rate (1xx-3xx)

Source: https://cairry.github.io/docs/Monitor/business/index.html

Calculates the ratio of requests with a 1xx, 2xx, or 3xx status code to all requests for the service. The result is a fraction between 0 and 1 computed from the cumulative counters, and it reflects overall service quality.

```PromQL
sum(http_server_duration_milliseconds_count{job=~"service",http_status_code=~"2.*|1.*|3.*"}) / sum(http_server_duration_milliseconds_count{job=~"service"})
```

--------------------------------

### Configure Redis Exporter with Environment Variable for Password (Multiple Instances)

Source: https://cairry.github.io/docs/Exporter/redis.html

When monitoring multiple Redis instances via Prometheus targets, inject the password into the Redis Exporter through an environment variable. A password embedded in the target URL has no effect.

```yaml
- name: REDIS_PASSWORD
  value: "xxx:xxx"
```

--------------------------------

### Deploy DCGM Exporter DaemonSet

Source: https://cairry.github.io/docs/Exporter/gpu.html

Deploys the DCGM exporter as a DaemonSet in Kubernetes. The exporter listens on port 9400 and exposes GPU metrics there for Prometheus to scrape. The pods use the host network and the `nvidia` runtime class.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  namespace: monitor
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9400"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.4.0"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.4.0"
      name: "dcgm-exporter"
    spec:
      runtimeClassName: nvidia
      hostNetwork: true
      containers:
        - image: "nvcr.io/nvidia/k8s/dcgm-exporter:2.2.9-2.4.0-ubuntu18.04"
          env:
            - name: "DCGM_EXPORTER_LISTEN"
              value: ":9400"
            - name: "DCGM_EXPORTER_KUBERNETES"
              value: "true"
          name: "dcgm-exporter"
          ports:
            - name: "metrics"
              containerPort: 9400
```

--------------------------------

### Deploy Fping-Exporter

Source: https://cairry.github.io/docs/Exporter/node.html

This Kubernetes manifest deploys fping-exporter as a single-replica Deployment and exposes it inside the cluster through a ClusterIP Service on port 9605.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fping-exporter
  namespace: monitoring
  labels:
    app: fping-exporter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fping-exporter
  template:
    metadata:
      labels:
        app: fping-exporter
    spec:
      containers:
        - name: fping-exporter
          image: joaorua/fping-exporter
          ports:
            - containerPort: 9605
---
apiVersion: v1
kind: Service
metadata:
  name: fping-exporter
  namespace: monitoring
  labels:
    app: fping-exporter
spec:
  type: ClusterIP
  ports:
    - port: 9605
      targetPort: 9605
  selector:
    app: fping-exporter
```
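
Once the exporter is running, Prometheus still needs a scrape job that tells it which hosts to ping. The fragment below is a minimal sketch, not a confirmed configuration: it assumes the exporter follows the blackbox-style pattern of probing a host passed as `/probe?target=<host>`. The job name and the probed hosts are illustrative placeholders, and the exporter address is derived from the Service name, namespace, and port in the manifest above.

```yaml
scrape_configs:
  - job_name: 'fping'              # hypothetical job name
    metrics_path: /probe           # assumption: blackbox-style probe endpoint
    static_configs:
      - targets:                   # hosts to ping (placeholders)
          - 8.8.8.8
          - example.com
    relabel_configs:
      # Pass the listed host to the exporter as the ?target= parameter.
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed host as the instance label.
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape request to the exporter Service.
      - target_label: __address__
        replacement: fping-exporter.monitoring.svc:9605
```

With this relabeling, each entry under `targets` becomes one probe, and the resulting series are labeled by the probed host rather than by the exporter's own address.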