### Install kube-state-metrics

Source: https://cairry.github.io/docs/Exporter/kubernetes-resource/kubernetes.html

Kubernetes manifests for installing kube-state-metrics, including ServiceAccount, ClusterRole, ClusterRoleBinding, Service, and Deployment.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups: [""]
  resources: ["nodes", "pods", "services", "resourcequotas", "replicationcontrollers", "limitranges", "persistentvolumeclaims", "persistentvolumes", "namespaces", "endpoints"]
  verbs: ["list", "watch"]
- apiGroups: ["extensions", "apps"]
  resources: ["daemonsets", "deployments", "replicasets"]
  verbs: ["list", "watch"]
- apiGroups: ["apps"]
  resources: ["statefulsets"]
  verbs: ["list", "watch"]
- apiGroups: ["batch"]
  resources: ["cronjobs", "jobs"]
  verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
  resources: ["horizontalpodautoscalers"]
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system

---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: v2.2.1
  name: kube-state-metrics
  namespace: kube-system
spec:
  clusterIP: None
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
  - name: telemetry
    port: 8081
    targetPort: telemetry
  selector:
    app.kubernetes.io/name: kube-state-metrics

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: v2.2.1
  name: kube-state-metrics
  namespace: kube-system
  
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
      
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/version: v2.2.1
        
    spec:
      containers:
      - image: registry.cn-shenzhen.aliyuncs.com/starsl/kube-state-metrics:v2.2.1
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
        name: kube-state-metrics
        ports:
        - containerPort: 8080
          name: http-metrics
        - containerPort: 8081
          name: telemetry
        readinessProbe:
          httpGet:
            path: /
            port: 8081
          initialDelaySeconds: 5
          timeoutSeconds: 5
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccountName: kube-state-metrics

```

--------------------------------

### Describe Method Implementation

Source: https://cairry.github.io/docs/Exporter/basic.html

An example implementation of the Describe method for a custom metric, which sends metric descriptors to a channel. This informs Prometheus about the available metrics.

```go
func (m Monitor) Describe(descs chan<- *prometheus.Desc) {
	descs <- m.InterfaceStatusCode
	descs <- m.SSLCertRemainingRime
}
```

--------------------------------

### Install VictoriaMetrics Operator with Helm

Source: https://cairry.github.io/docs/VictoriaMetrics/index.html

Installs the VictoriaMetrics Operator using Helm. Ensure the Helm repository is added and updated before installation.

```bash
helm repo add vm https://victoriametrics.github.io/helm-charts
helm repo update
helm install victoria-operator vm/victoria-metrics-operator
```

--------------------------------

### Kubernetes Service Discovery Configuration Example

Source: https://cairry.github.io/docs/Prometheus/ServiceDiscover.html

Example Prometheus configuration for Kubernetes service discovery, demonstrating global settings and scrape configurations. It highlights the use of roles like 'endpoints' for discovering services.

```yaml
global:
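  # Scrape interval
  scrape_interval: 30s
  # Scrape timeout
  scrape_timeout: 10s
  # Separate evaluation cycle: how often alerting rules are evaluated
  evaluation_interval: 30s
  # Labels for external systems
  external_labels:
    prometheus: monitoring/k8s
    prometheus_replica: prometheus-k8s-1

# Scrape service endpoints. This whole job exists to discover node-exporter and kube-state-metrics-service; it uses the endpoints role, so both are discovered through their Services.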
```

--------------------------------

### Example Target JSON File for File-based SD

Source: https://cairry.github.io/docs/Prometheus/ServiceDiscover.html

Defines targets and their associated labels for Prometheus to discover. Each entry includes a list of targets and a labels object.

```json
[
   {
      "targets": [
         "172.16.0.96:19100"
      ],
      "labels": {
         "project_name": "test-project-20200413",
         "env_name": "development",
         "soft_name": "test-app-20200413",
         "template_name": "test-template-20200413",
         "template_type": "host-template"
      }
   },
   {
      "targets": [
         "172.16.0.96:30013",
         "172.16.0.96:30015"
      ],
      "labels": {
         "project_name": "test-project-20200413",
         "env_name": "development",
         "soft_name": "test-app-20200413",
         "template_name": "test-template-20200413",
         "template_type": "host-template"
      }
   },
   {
      "targets": [
         "172.16.0.96:9200"
      ],
      "labels": {
         "__metrics_path__": "/_prometheus/metrics",
         "project_name": "test-project-20200413",
         "env_name": "development",
         "soft_name": "test-app-20200413",
         "template_name": "test-template-20200413",
         "template_type": "host-template"
      }
   }
]
```
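
Target files like the one above are often generated by a script rather than written by hand. The following is a minimal sketch of that idea using only the Go standard library; the `targetGroup` type and the `renderTargets` and `parseTargets` helpers are hypothetical names introduced for illustration, not part of Prometheus.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// targetGroup mirrors one entry of a file-based SD JSON file: a list of
// host:port targets plus labels attached to every target in the group.
// Hypothetical type for illustration.
type targetGroup struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels,omitempty"`
}

// renderTargets serializes target groups into the JSON array layout that
// file_sd_configs reads.
func renderTargets(groups []targetGroup) (string, error) {
	out, err := json.MarshalIndent(groups, "", "  ")
	if err != nil {
		return "", err
	}
	return string(out), nil
}

// parseTargets reads the same layout back, which is handy for validating a
// generated file before Prometheus picks it up.
func parseTargets(doc string) ([]targetGroup, error) {
	var groups []targetGroup
	if err := json.Unmarshal([]byte(doc), &groups); err != nil {
		return nil, err
	}
	return groups, nil
}

func main() {
	groups := []targetGroup{
		{
			Targets: []string{"172.16.0.96:19100"},
			Labels:  map[string]string{"env_name": "development"},
		},
	}
	doc, err := renderTargets(groups)
	if err != nil {
		panic(err)
	}
	fmt.Println(doc)
}
```

Writing the result to a temporary file and renaming it into the watched directory is a common way to avoid Prometheus reading a half-written file.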

--------------------------------

### Collect Method Implementation

Source: https://cairry.github.io/docs/Exporter/basic.html

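An example implementation of the Collect method, which gathers the actual metric data. It iterates over the configured domains, starts one goroutine per domain to probe it, and then waits for all probes to finish.

```go
func (m Monitor) Collect(metrics chan<- prometheus.Metric) {
	for srvName, domainName := range config.DomainMap {
		// Probe the domain status in its own goroutine.
		// Gauge is expected to call wg.Done() when it returns.
		wg.Add(1)
		go Gauge(srvName, domainName)
	}
	// Wait once, after all probes have been started; calling wg.Wait()
	// inside the loop would run the probes one at a time.
	wg.Wait()
}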
```

--------------------------------

### Prometheus Data Format Example

Source: https://cairry.github.io/docs/Exporter/basic.html

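Illustrates the structure of a Prometheus sample: metric name, labels, value, and an optional timestamp. In the text exposition format the timestamp is expressed in milliseconds since the Unix epoch.

```text
http_requests_total{method="GET", handler="/api"} 1027 1626568200000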
```
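
To make the layout of such a line concrete, here is a small sketch that assembles one with the Go standard library. The `formatSample` helper is hypothetical (it is not part of client_golang), it sorts labels only to get deterministic output, and it relies on Go's `%q` quoting, which is close to but not identical to the escaping rules of the exposition format.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// formatSample renders one sample as: name{label="value",...} value [timestamp_ms]
// Hypothetical helper for illustration only.
func formatSample(name string, labels map[string]string, value float64, tsMillis int64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic label order

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}

	line := name
	if len(pairs) > 0 {
		line += "{" + strings.Join(pairs, ",") + "}"
	}
	line += " " + strconv.FormatFloat(value, 'g', -1, 64)
	// The timestamp is optional; it is omitted when tsMillis is zero or negative.
	if tsMillis > 0 {
		line += " " + strconv.FormatInt(tsMillis, 10)
	}
	return line
}

func main() {
	fmt.Println(formatSample(
		"http_requests_total",
		map[string]string{"method": "GET", "handler": "/api"},
		1027, 1626568200000,
	))
}
```

In a real exporter the client library produces this output, so a helper like this is mainly useful for tests or for understanding scrape responses.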

--------------------------------

### Testing Metrics Endpoint

Source: https://cairry.github.io/docs/Exporter/kubernetes-controlplane/controller-manager.html

Example command to test the Controller Manager's metrics endpoint using curl. Requires client certificates for authentication.

```bash
[root@master01 manifests]# curl -s -k --cert ./client-cert.pem --key ./client-key.pem https://localhost:10257/metrics | head -n 5
# HELP apiserver_audit_event_total [ALPHA] Counter of audit events generated and sent to the audit backend.
# TYPE apiserver_audit_event_total counter
apiserver_audit_event_total 0
# HELP apiserver_audit_requests_rejected_total [ALPHA] Counter of apiserver requests rejected due to an error in audit logging backend.
# TYPE apiserver_audit_requests_rejected_total counter
```

--------------------------------

### MySQL Exporter Metric Example

Source: https://cairry.github.io/docs/Exporter/mysql.html

The `mysql_up` metric indicates the success of connecting to and collecting data from a MySQL server. A value of 1 signifies successful data collection.

```text
# HELP mysql_up Whether the MySQL server is up.
# TYPE mysql_up gauge
mysql_up 1
```

--------------------------------

### Prometheus Alerting Rule Example

Source: https://cairry.github.io/docs/Prometheus/index.html

An example Prometheus alerting rule that triggers when an exporter component is down (up == 0) for 2 minutes. It includes severity labels and custom summary/description annotations.

```yaml
groups:
# --- Node
- name: NodeStatus   # alerting rule group name
  rules:
  - alert: Exporter Component is Down
    expr: up == 0
    for: 2m  # duration: the alert fires only after the condition has held for 2 minutes
    labels:
      severity: serious  # custom label
    annotations:
      summary: "Node {{ $labels.instance }}: exporter is down" # custom summary
      description: "Node {{ $labels.instance }}: the exporter is not responding, please handle it promptly." # custom description
```

--------------------------------

### Analyze Error Log Growth Rate

Source: https://cairry.github.io/docs/Monitor/business/index.html

Analyzes the growth rate of error logs. The expression matches when more than 50 ERROR logs were added in the last 5 minutes and that increase is at least 30% higher than the increase over the 20-minute window that ended 15 minutes ago.

```PromQL
increase(l2m_level_info{level="ERROR"}[5m]) > 50 and (increase(l2m_level_info{level="ERROR"}[5m]) - increase(l2m_level_info{level="ERROR"}[20m] offset 15m)) / increase(l2m_level_info{level="ERROR"}[20m] offset 15m) * 100 >= 30
```
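
The arithmetic behind this condition can be checked outside Prometheus. The sketch below replays it for a single series in Go; `errorSpike` and its parameter names are hypothetical, and the two inputs stand for the two `increase(...)` results.

```go
package main

import "fmt"

// errorSpike mirrors the alert condition for one series.
// cur5m is the ERROR increase over the last 5 minutes; prev20m is the
// increase over the 20-minute window that ended 15 minutes ago.
func errorSpike(cur5m, prev20m float64) bool {
	growthPct := (cur5m - prev20m) / prev20m * 100
	return cur5m > 50 && growthPct >= 30
}

func main() {
	fmt.Println(errorSpike(60, 40)) // 50% growth and more than 50 errors
	fmt.Println(errorSpike(60, 50)) // only 20% growth
	fmt.Println(errorSpike(40, 10)) // large growth, but not more than 50 errors
}
```

With a baseline of zero the ratio becomes +Inf under floating-point arithmetic, which passes the 30% threshold, so a quiet baseline followed by a burst of more than 50 errors still counts as a spike in this sketch.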

--------------------------------

### Kubernetes Deployment for MySQL Exporter

Source: https://cairry.github.io/docs/Exporter/mysql.html

These Kubernetes manifests define a ConfigMap for MySQL credentials, a Deployment for the mysqld-exporter container, and a Service to expose the metrics endpoint. This setup allows Prometheus to scrape metrics from the exporter.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mysql-exporter
  namespace: monitoring
data:
  my.cnf: |-
    [client]
    user=exporter
    password=exporter_2024

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysql-exporter
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mysql-exporter
  template:
    metadata:
      labels:
        app: mysql-exporter
    spec:
      containers:
        - name: mysql-exporter
          image: prom/mysqld-exporter:v0.16.0
          ports:
            - name: metrics
              containerPort: 9104
          command:
            - mysqld_exporter
            - --config.my-cnf=/cfg/my.cnf
          volumeMounts:
            - mountPath: /cfg/my.cnf
              name: mysql-exporter
              subPath: my.cnf
      volumes:
        - configMap:
            defaultMode: 420
            name: mysql-exporter
          name: mysql-exporter

---
apiVersion: v1
kind: Service
metadata:
  name: mysql-exporter
  namespace: monitoring
spec:
  ports:
    - port: 9104
      targetPort: 9104
      protocol: TCP
      name: metrics
  selector:
    app: mysql-exporter
```

--------------------------------

### Get or Create Metric from Hash

Source: https://cairry.github.io/docs/Exporter/start.html

Retrieves a metric using its hash and label values. If the metric does not exist, it creates a new one and adds it to the map. Uses read-write locks for concurrent access.

```go
func (m *metricMap) getOrCreateMetricWithLabelValues(
	hash uint64, lvs []string, curry []curriedLabelValue,
) Metric {
	m.mtx.RLock()
	// Fast path: return the metric right away if it already exists.
	metric, ok := m.getMetricWithHashAndLabelValues(hash, lvs, curry)
	m.mtx.RUnlock()
	if ok {
		return metric
	}

	m.mtx.Lock()
	defer m.mtx.Unlock()
	// Not found: look it up again under the write lock, then create it.
	metric, ok = m.getMetricWithHashAndLabelValues(hash, lvs, curry)
	if !ok {
		inlinedLVs := inlineLabelValues(lvs, curry)
		// Create the metric, e.g. xx{aa="bb",...}.
		metric = m.newMetric(inlinedLVs...)
		// Index the new metric under the hash of its label values.
		m.metrics[hash] = append(m.metrics[hash], metricWithLabelValues{values: inlinedLVs, metric: metric})
	}
	return metric
}
```
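
The read-lock lookup followed by a write-lock re-check is a general pattern, and it can be tried in isolation. The sketch below reproduces it with the standard library only; `counterMap` and its fields are hypothetical stand-ins for `metricMap`, keyed by a plain string instead of a label hash.

```go
package main

import (
	"fmt"
	"sync"
)

// counterMap lazily creates one counter per key and is safe for concurrent
// use. Hypothetical type for illustration.
type counterMap struct {
	mtx      sync.RWMutex
	counters map[string]*int64
	created  int // number of counters actually allocated
}

func newCounterMap() *counterMap {
	return &counterMap{counters: make(map[string]*int64)}
}

// getOrCreate does a cheap read-locked lookup first, then re-checks under
// the write lock before creating the entry.
func (m *counterMap) getOrCreate(key string) *int64 {
	m.mtx.RLock()
	c, ok := m.counters[key]
	m.mtx.RUnlock()
	if ok {
		return c
	}

	m.mtx.Lock()
	defer m.mtx.Unlock()
	// Re-check: another goroutine may have created the entry between
	// RUnlock and Lock.
	if c, ok = m.counters[key]; !ok {
		c = new(int64)
		m.counters[key] = c
		m.created++
	}
	return c
}

func main() {
	m := newCounterMap()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			m.getOrCreate("requests")
		}()
	}
	wg.Wait()
	fmt.Println("counters created:", m.created)
}
```

Without the second lookup, two goroutines that both miss on the read path would each create an entry for the same key.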

--------------------------------

### Prometheus Configuration for MySQL Exporter

Source: https://cairry.github.io/docs/Exporter/mysql.html

Configure Prometheus to scrape MySQL targets using the MySQL exporter. This setup includes relabeling rules to correctly map addresses and parameters.

```yaml
- job_name: 'MySQL'
  static_configs:
    - targets:
      - mysql1.com:3306
      - mysql2.com:3306
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
      action: replace
    - target_label: __address__
      replacement: mysql-exporter.monitoring:9104
```
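
The effect of these relabeling rules is easier to see when replayed on a single target. The sketch below does that with a plain map in Go; `applyExporterRelabel` is a hypothetical helper that imitates only these three rules, not Prometheus's relabeling engine.

```go
package main

import "fmt"

// applyExporterRelabel replays the three relabel steps on one target's
// label set. exporterAddr is the address where the exporter itself listens.
func applyExporterRelabel(labels map[string]string, exporterAddr string) map[string]string {
	out := make(map[string]string, len(labels)+2)
	for k, v := range labels {
		out[k] = v
	}
	// Step 1: pass the original address to the exporter as the "target" URL parameter.
	out["__param_target"] = out["__address__"]
	// Step 2: keep the MySQL address visible as the instance label.
	out["instance"] = out["__param_target"]
	// Step 3: scrape the exporter instead of the MySQL server.
	out["__address__"] = exporterAddr
	return out
}

func main() {
	got := applyExporterRelabel(
		map[string]string{"__address__": "mysql1.com:3306"},
		"mysql-exporter.monitoring:9104",
	)
	fmt.Println(got["instance"], "->", got["__address__"])
}
```

The order matters: the original address has to be copied into `__param_target` and `instance` before `__address__` is overwritten with the exporter's address.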

--------------------------------

### Reduce Large Scope Queries

Source: https://cairry.github.io/docs/Prometheus/optimization_memory.html

Example of a query that might consume significant memory. It's recommended to minimize the scope of such queries in Grafana dashboards.

```promql
rate(up[1m])
```

--------------------------------

### Install the Prometheus Client

Source: https://cairry.github.io/docs/Exporter/start.html

Install the Prometheus client library with `go get`.

```bash
go get -u github.com/prometheus/client_golang/prometheus/promhttp
```

--------------------------------

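### Expose Default Metrics

Source: https://cairry.github.io/docs/Exporter/start.html

Register an HTTP handler with `promhttp.Handler()` to expose the default Go runtime and promhttp metrics. The example listens on port 8005.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Serve the default registry's metrics at /metrics.
	http.Handle("/metrics", promhttp.Handler())
	// ListenAndServe only returns on failure, so log the error and exit.
	log.Fatal(http.ListenAndServe(":8005", nil))
}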
```

--------------------------------

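### Register Metrics with WithLabelValues

Source: https://cairry.github.io/docs/Exporter/start.html

Use the `WithLabelValues` method to record metrics with dynamic label values. It suits complex metrics, dynamic values, or a large number of labels, and offers more flexibility.

```go
// Look up (or create) the child histogram for this label combination.
metric := pgc.HTTPLatencyHistogramCollect.WithLabelValues(name, strconv.Itoa(statusCode), requestPath)
metric.Observe(latency)

// Collect is called on the vector, since the Observer returned above only
// offers Observe (assuming HTTPLatencyHistogramCollect is a *prometheus.HistogramVec).
pgc.HTTPLatencyHistogramCollect.Collect(ch)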
```

--------------------------------

### Enable Etcd Metrics Interface

Source: https://cairry.github.io/docs/Exporter/kubernetes-controlplane/etcd.html

Modify etcd configuration to expose metrics on a specific port. Ensure the --listen-metrics-urls flag is set.

```yaml
- etcd
    - --advertise-client-urls=https://192.168.1.176:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    ...
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --listen-metrics-urls=http://0.0.0.0:2381         ### add this flag
    image: registry.aliyuncs.com/google_containers/etcd:3.4.13-0
```

--------------------------------

### Get or Create Metric with Label Values

Source: https://cairry.github.io/docs/Exporter/start.html

`WithLabelValues` returns the Gauge for the given label values, creating it on first use; the label values are hashed so the child metric can be stored and retrieved efficiently. It panics if `GetMetricWithLabelValues` returns an error.

```go
func (v *GaugeVec) WithLabelValues(lvs ...string) Gauge {
	g, err := v.GetMetricWithLabelValues(lvs...)
	if err != nil {
		panic(err)
	}
	return g
}
```

--------------------------------

### Filesystem Reads per Pod in DaemonSet

Source: https://cairry.github.io/docs/Exporter/kubernetes-resource/kubernetes.html

Retrieves the filesystem read activity for each pod within a DaemonSet. Aids in performance optimization and capacity planning.

```PromQL
group(kube_pod_info{created_by_kind="DaemonSet",pod_ip!=""}) by (pod)* on(pod) group_right() max(rate(container_fs_reads_bytes_total{container!="POD", container!=""}[5m])) by (pod,container)
```

--------------------------------

### Configure WAL Segment Size and Compression

Source: https://cairry.github.io/docs/Prometheus/optimization_memory.html

Command-line arguments to reduce the WAL segment size to 64MB and enable WAL compression, which can decrease disk space usage.

```bash
prometheus --storage.tsdb.wal-segment-size=64MB --storage.tsdb.wal-compression
```

--------------------------------

### Get Last Value Over Time in PromQL

Source: https://cairry.github.io/docs/Prometheus/promql.html

Use last_over_time to retrieve the most recent value from a time series within a given time window. This is ideal for monitoring the latest state of a metric.

```PromQL
last_over_time(http_request_duration_seconds{instance="server1"}[5m])
```

--------------------------------

### Registering Metrics with MustRegister

Source: https://cairry.github.io/docs/Exporter/basic.html

Demonstrates how to register multiple collectors with a registry, panicking if any registration fails. This is part of the metrics registration process for custom Exporters.

```go
// MustRegister implements Registerer.
func (r *Registry) MustRegister(cs ...Collector) {
	for _, c := range cs {
		if err := r.Register(c); err != nil {
			panic(err)
		}
	}
}
```

--------------------------------

### Collector Interface Methods

Source: https://cairry.github.io/docs/Exporter/basic.html

Shows the essential methods required for implementing the Prometheus Collector interface: Describe and Collect. These methods are crucial for exposing custom metrics.

```go
type Collector interface {
	// ...
	Describe(chan<- *Desc)
	// ...
	Collect(chan<- Metric)
}
```

--------------------------------

### Drop Metrics with Specific Name Pattern

Source: https://cairry.github.io/docs/Prometheus/optimization_memory.html

Prometheus relabeling configuration to drop all metrics whose names start with 'temp_'. This is useful for reducing memory consumption by eliminating unwanted metrics.

```yaml
scrape_configs:
  - job_name: 'example'
    static_configs:
      - targets: ['localhost:9090']

    metric_relabel_configs:
      # Drop every metric whose name starts with "temp_"
      - source_labels: [__name__]
        action: drop
        regex: 'temp_.*'
```
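
Relabel regexes in Prometheus are anchored at both ends, so `temp_.*` matches names that start with `temp_` but not names that merely contain it. The sketch below imitates that behavior with Go's `regexp` package; `dropByName` is a hypothetical helper, and the metric names are made up for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// dropByName mimics a metric_relabel_configs rule with action "drop" on
// __name__. The pattern is wrapped in ^(?:...)$ to reproduce the full
// anchoring that Prometheus applies to relabel regexes.
func dropByName(pattern string, names []string) (kept []string) {
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	for _, n := range names {
		if !re.MatchString(n) {
			kept = append(kept, n)
		}
	}
	return kept
}

func main() {
	names := []string{"temp_cpu", "node_temp_celsius", "up", "temp_"}
	fmt.Println(dropByName("temp_.*", names))
}
```

Here `node_temp_celsius` survives because the pattern must match the whole name, which is worth checking before deploying a drop rule against production metrics.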

--------------------------------

### File-based Service Discovery Configuration

Source: https://cairry.github.io/docs/Prometheus/ServiceDiscover.html

Configure Prometheus to discover targets from JSON files. Specify the job name, file paths, and refresh interval.

```yaml
scrape_configs:
  - job_name: 'file_ds'
    file_sd_configs:
      - files:
        - targets/*.json
        refresh_interval: 5m
```

--------------------------------

### Calculate Filesystem Usage Rate

Source: https://cairry.github.io/docs/Exporter/node.html

Calculates the filesystem usage rate for ext and xfs file systems. This helps prevent disk full issues and ensures normal file system operation.

```PromQL
max((node_filesystem_size_bytes{fstype=~"ext.?|xfs"}-node_filesystem_free_bytes{fstype=~"ext.?|xfs"}) *100/(node_filesystem_avail_bytes {fstype=~"ext.?|xfs"}+(node_filesystem_size_bytes{fstype=~"ext.?|xfs"}-node_filesystem_free_bytes{fstype=~"ext.?|xfs"})))by(ecs_cname,instance,service_id)
```
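
The query divides used space by used plus available space rather than by total size, because `node_filesystem_avail_bytes` excludes blocks reserved for root. A worked version of that arithmetic, as a sketch in Go with hypothetical names and made-up numbers:

```go
package main

import "fmt"

// usagePercent applies the same formula as the query:
// used = size - free; usage = used * 100 / (avail + used).
// All arguments are byte counts for one filesystem.
func usagePercent(size, free, avail float64) float64 {
	used := size - free
	denom := avail + used
	if denom == 0 {
		return 0 // avoid dividing by zero for an empty or pseudo filesystem
	}
	return used * 100 / denom
}

func main() {
	// 100 units total, 20 free, of which only 15 are available to non-root users.
	fmt.Printf("%.2f\n", usagePercent(100, 20, 15))
}
```

With these numbers the result is higher than the naive 80% (used over total), which matches what non-root processes experience as the disk fills up.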

--------------------------------

### CPU Usage per Pod in DaemonSet

Source: https://cairry.github.io/docs/Exporter/kubernetes-resource/kubernetes.html

Retrieves the CPU usage for each pod within a DaemonSet. Aids in performance monitoring and optimization of DaemonSet pods.

```PromQL
group(kube_pod_info{created_by_kind="DaemonSet",pod_ip!=""}) by (pod)* on(pod) group_right() max(rate(container_cpu_usage_seconds_total{container!="",container!="POD"}[5m])) by (pod, container)
```

--------------------------------

### Configure Redis Exporter with Environment Variables (Single Instance)

Source: https://cairry.github.io/docs/Exporter/redis.html

This configuration uses environment variables to specify the Redis address and password for the exporter. This method is suitable for monitoring a single Redis instance.

```yaml
env:
- name: TZ
  value: "Asia/Shanghai"
- name: REDIS_ADDR
  value: "redis://redis-ztest.infra:6379"
- name: REDIS_PASSWORD
  value: "xxxx:xxxx"
```

--------------------------------

### Create Etcd Service for VM Discovery

Source: https://cairry.github.io/docs/Exporter/kubernetes-controlplane/etcd.html

Define a Kubernetes Service that exposes the etcd metrics endpoint so that VictoriaMetrics (VM) can discover it automatically.

```yaml
kind: Service
apiVersion: v1
metadata:
  name: etcd
  namespace: kube-system
  labels:
    component: etcd
spec:
  selector:
    component: etcd
  ports:
  - name: metrics
    port: 2381
```

--------------------------------

### Ready Pods Count for DaemonSet

Source: https://cairry.github.io/docs/Exporter/kubernetes-resource/kubernetes.html

Retrieves the number of ready pods for a specified DaemonSet. Essential for monitoring the health status of DaemonSets.

```PromQL
kube_daemonset_status_number_ready{daemonset="$daemonset",namespace="$namespace"}
```

--------------------------------

### Prometheus Data Directory Structure

Source: https://cairry.github.io/docs/Prometheus/optimization_memory.html

Illustrates the directory structure for Prometheus data, differentiating between in-memory blocks with WAL files and persistent blocks with chunk and index files.

```text
./data/01BKGV7JBM69T2G1BGBGM6KB12
./data/01BKGV7JBM69T2G1BGBGM6KB12/meta.json
./data/01BKGV7JBM69T2G1BGBGM6KB12/wal/000002
./data/01BKGV7JBM69T2G1BGBGM6KB12/wal/000001
```

```text
./data/01BKGV7JC0RY8A6MACW02A2PJD
./data/01BKGV7JC0RY8A6MACW02A2PJD/meta.json
./data/01BKGV7JC0RY8A6MACW02A2PJD/index
./data/01BKGV7JC0RY8A6MACW02A2PJD/chunks
./data/01BKGV7JC0RY8A6MACW02A2PJD/chunks/000001
./data/01BKGV7JC0RY8A6MACW02A2PJD/tombstones
```

--------------------------------

### Controller Manager Monitoring Metrics (PromQL)

Source: https://cairry.github.io/docs/Exporter/kubernetes-controlplane/controller-manager.html

PromQL queries for monitoring key Controller Manager metrics, including workqueue rates, depth, latency, and API server request QPS.

```PromQL
sum(rate(workqueue_adds_total{job="kubernetes-controller-manager"}[$interval])) by (name)
```

```PromQL
sum(workqueue_depth{job="kubernetes-controller-manager"}) by (name)
```

```PromQL
histogram_quantile($quantile, sum(rate(workqueue_queue_duration_seconds_bucket{job="ack-cloud-controller-manager"}[5m])) by (name, le))
```

```PromQL
sum(rate(rest_client_requests_total{job="kubernetes-controller-manager",code=~"2.."}[$interval])) by (method,code)
```

```PromQL
sum(rate(rest_client_requests_total{job="kubernetes-controller-manager",code!~"2.."}[$interval])) by (method,code)
```

--------------------------------

### Prometheus Configuration for Kubelet Monitoring

Source: https://cairry.github.io/docs/Exporter/kubernetes-controlplane/cluster.html

This Prometheus configuration targets Kubernetes nodes to scrape Kubelet metrics. It uses service discovery and relabeling to dynamically discover nodes and set the correct metrics path.

```yaml
job_name: 'kubernetes-kubelets'
scheme: https
tls_config:
  insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
kubernetes_sd_configs:
  - role: node
relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics
  - target_label: __address__
    replacement: kubernetes.default.svc:443

```

--------------------------------

### Monitor Process Open Files

Source: https://cairry.github.io/docs/Exporter/node.html

Retrieves process-level open file count information, excluding containerd-shim processes. Monitor this to prevent issues caused by exceeding system limits.

```PromQL
describe_node_process_openfiles_info{name!="containerd-shim", name!="containerd-shim-runc-v2"}
```

--------------------------------

### Top 10 CPU Usage by Container

Source: https://cairry.github.io/docs/Exporter/kubernetes-resource/kubernetes.html

Calculates the top 10 containers with the highest CPU usage over the past minute. Useful for identifying high-consuming workloads for optimization.

```PromQL
topk(10, sum(irate(container_cpu_usage_seconds_total{container!="",container!="POD",pod!=""}[1m]) * 100) by (container,pod,namespace)or on() vector(0))
```

--------------------------------

### Deploy Node-Process-Exporter DaemonSet

Source: https://cairry.github.io/docs/Exporter/node.html

This Kubernetes DaemonSet configuration deploys the node-process-exporter to each node in the cluster. It requires privileged access and host networking.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: node-process-exporter
  name: node-process-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-process-exporter
  template:
    metadata:
      labels:
        app: node-process-exporter
    spec:
      containers:
      - image: cairry/node-process-exporter:latest
        imagePullPolicy: IfNotPresent
        name: node-process-exporter
        ports:
        - containerPort: 9002
          hostPort: 9002
          protocol: TCP
        resources:
          limits:
            cpu: "1"
            memory: 1Gi
          requests:
            cpu: 250m
            memory: 512Mi
        securityContext:
          privileged: true

      hostIPC: true
      hostNetwork: true
      hostPID: true
      restartPolicy: Always
      tolerations:
      - effect: NoSchedule
        operator: Exists

---
apiVersion: v1
kind: Service
metadata:
  name: node-process-exporter
  namespace: monitoring
spec:
  ports:
  - port: 9002
    protocol: TCP
    targetPort: 9002
  selector:
    app: node-process-exporter
  sessionAffinity: None
  type: ClusterIP
```

--------------------------------

### Set Data Retention Time and Size

Source: https://cairry.github.io/docs/Prometheus/optimization_memory.html

Command-line arguments to configure Prometheus to retain data for 30 days and limit the total storage size to 50GB.

```bash
prometheus --storage.tsdb.retention.time=30d --storage.tsdb.retention.size=50GB
```

--------------------------------

### Kubernetes Service Discovery Annotations

Source: https://cairry.github.io/docs/Prometheus/ServiceDiscover.html

Add annotations to Kubernetes Services or Pods to enable Prometheus service discovery. Configure the scrape port and enable scraping.

```yaml
annotations:
    # Add the following settings
    prometheus.io/port: "3002"    # port to scrape for auto-registered targets
    prometheus.io/scrape: "true"  # whether to auto-register this target
```

--------------------------------

### Prometheus Server Configuration

Source: https://cairry.github.io/docs/Prometheus/index.html

Global and scrape configurations for Prometheus Server. Includes scrape intervals, evaluation intervals, global labels, Alertmanager targets, rule file paths, and scrape jobs for Prometheus itself and node exporters.

```yaml
global:
  scrape_interval: 5s 
  evaluation_interval: 5s
  # Global label set
  # Every series collected by this instance gets the labels below attached
  external_labels:
    account: "huawei-main"
    region: "beijing"

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - '172.17.84.238:9093'

rule_files:
  - "/etc/prometheus/rules/first_rules.yml"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
      - targets: ["172.17.84.238:9100"]
```

--------------------------------

### Network Receive Bytes per Pod in DaemonSet

Source: https://cairry.github.io/docs/Exporter/kubernetes-resource/kubernetes.html

Retrieves the network receive traffic for each pod within a DaemonSet. Helps identify potential network bottlenecks.

```PromQL
sum(kube_pod_info{created_by_kind="DaemonSet",pod_ip!=""}) by (pod)* on(pod) group_right() max(rate(container_network_receive_bytes_total{}[5m])) by (pod)
```

--------------------------------

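### Custom Collector Interface Declaration

Source: https://cairry.github.io/docs/Exporter/start.html

The Collector interface defines how Prometheus obtains metric descriptions and data from a custom collector.

```go
type Collector interface {
	// Descriptive information about the metrics, i.e. the part marked with "#".
	// Note that a pointer is used here, because one globally stored copy of the description is enough.
	Describe(chan<- *Desc)
	// The metric data, e.g. promhttp_metric_handler_errors_total{cause="gathering"} 0
	// No pointer is used here, because each collected value is independent.
	Collect(chan<- Metric)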
}
```

--------------------------------

### Verify Redis Exporter Metrics

Source: https://cairry.github.io/docs/Exporter/redis.html

This command uses `curl` to fetch metrics from the Redis Exporter and `grep` to filter for the `redis_up` metric. A value of '1' indicates a successful connection and data collection.

```bash
[root@iZ2zeh5cd0wu2m4o1m5xjaZ ~]# curl 10.15.1.2:9121/metrics -s | grep redis_up
# HELP redis_up Information about the Redis instance
# TYPE redis_up gauge
redis_up 1
```

--------------------------------

### Calculate CPU Usage Rate

Source: https://cairry.github.io/docs/Exporter/node.html

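Calculates the CPU usage percentage per instance from the idle CPU time. Because `irate` uses only the last two samples inside the 5-minute window, the result reflects near-instantaneous load. Use this to monitor CPU load and identify high-load situations for capacity planning and performance tuning.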

```PromQL
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance,service_id,ecs_cname) * 100)
```
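
The query averages the per-core idle rate and subtracts it from 100. The same arithmetic as a small Go sketch, where `cpuUsagePercent` is a hypothetical helper and each input value stands for one core's idle rate (idle seconds per second):

```go
package main

import "fmt"

// cpuUsagePercent takes per-core idle rates, each between 0 and 1, and
// returns the overall busy percentage: 100 - avg(idle) * 100.
func cpuUsagePercent(idleRates []float64) float64 {
	if len(idleRates) == 0 {
		return 0
	}
	var sum float64
	for _, r := range idleRates {
		sum += r
	}
	avgIdle := sum / float64(len(idleRates))
	return 100 - avgIdle*100
}

func main() {
	// Two cores: one idle half of the time, one idle three quarters of the time.
	fmt.Println(cpuUsagePercent([]float64{0.5, 0.75}))
}
```

Averaging across cores means a single saturated core on a large machine barely moves the result, so a per-core breakdown may be needed when hunting single-threaded bottlenecks.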

--------------------------------

### Top 10 Socket Usage by Container

Source: https://cairry.github.io/docs/Exporter/kubernetes-resource/kubernetes.html

Calculates the socket usage for each container and identifies the top 10 containers with the highest usage. Useful for detecting potential network bottlenecks.

```PromQL
topk(10, sum(container_sockets{container!="",pod!=""}) by (container,pod,namespace)or on() vector(0))
```

--------------------------------

### Create MySQL Exporter User and Grant Permissions

Source: https://cairry.github.io/docs/Exporter/mysql.html

This SQL script creates a dedicated user for the MySQL exporter and grants it the necessary privileges to access monitoring information. Ensure this user has appropriate permissions before deploying the exporter.

```sql
CREATE USER 'exporter'@'%' IDENTIFIED BY 'exporter_2024';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';
FLUSH PRIVILEGES;
```

--------------------------------

### Controller Manager Manifest Configuration

Source: https://cairry.github.io/docs/Exporter/kubernetes-controlplane/controller-manager.html

Configuration snippet for the kube-controller-manager manifest, showing command-line arguments. Adjust the bind-address if necessary.

```yaml
- command:
    - kube-controller-manager
    - --allocate-node-cidrs=true
    - --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --bind-address=0.0.0.0            ### adjust this setting
    ...
    - --use-service-account-credentials=true
  image: registry.aliyuncs.com/google_containers/kube-controller-manager:v1.20.4
```

--------------------------------

### Prometheus Server Docker Compose Configuration

Source: https://cairry.github.io/docs/Prometheus/index.html

This configuration sets up a Prometheus Server using Docker Compose. It defines volumes for data and rules, exposes the Prometheus port, and configures retention policies and command-line arguments.

```yaml
version: "3"
services:
  prometheus:
    container_name: prometheus
    image: registry.js.design/prometheus/prometheus:v2.32.1
    ports:
      - 9090:9090
    volumes:
      - /opt/apps/monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - /opt/apps/monitoring/prometheus/data:/prometheus
      - /opt/apps/monitoring/prometheus/rules/:/etc/prometheus/rules/
      - /etc/localtime:/etc/localtime:ro
    restart: always
    command: 
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--web.enable-admin-api'
      - '--web.enable-lifecycle'
      - '--storage.tsdb.retention.time=15d'
      - '--storage.tsdb.path=/prometheus'
    networks:
      - monitor

networks:
  monitor:
    driver: bridge
```

--------------------------------

### Deploy VictoriaMetrics Single-Node

Source: https://cairry.github.io/docs/VictoriaMetrics/index.html

This Kubernetes deployment configuration sets up a single instance of VictoriaMetrics, including persistent storage and service exposure.

```yaml
apiVersion: v1 
kind: PersistentVolumeClaim 
metadata: 
  name: victoria-metrics-data
  namespace: observability
spec: 
  storageClassName: "local-path" 
  accessModes: 
    - ReadWriteOnce
  resources: 
    requests: 
      storage: 100Gi 
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: victoria-metrics
  namespace: observability
spec:
  selector:
    matchLabels:
      app: victoria-metrics
  template:
    metadata:
      labels:
        app: victoria-metrics
    spec:
      containers:
        - name: vm
          image: victoriametrics/victoria-metrics:v1.79.8
          imagePullPolicy: IfNotPresent
          args:
            - -storageDataPath=/var/lib/victoria-metrics-data
            - -retentionPeriod=2w
            - -promscrape.config=/etc/prometheus/prometheus.yaml
          ports:
            - containerPort: 8428
              name: http
          volumeMounts:
            - mountPath: /var/lib/victoria-metrics-data
              name: storage
            - mountPath: "/etc/prometheus/"
              name: prometheus-config
              readOnly: true
      volumes:
        - name: prometheus-config
          secret:
            secretName: vm-agent-target
            items:
            - key: "prometheus.yaml"
              path: "prometheus.yaml"
        - name: storage
          persistentVolumeClaim:
            claimName: victoria-metrics-data
---
apiVersion: v1
kind: Service
metadata:
  name: victoria-metrics
  namespace: observability
spec:
  type: NodePort
  ports:
    - port: 8428
  selector:
    app: victoria-metrics

```

--------------------------------

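### Implement a Custom Collector

Source: https://cairry.github.io/docs/Exporter/start.html

Create a custom metrics collector by implementing the `Collector` interface. `NewMonitorMetrics` initializes the metric descriptors, `Describe` sends the metric descriptions, and `Collect` sends the metric data. The server listens on port 8001 by default.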

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// EmptyRegistry is an empty registry without the default Go and process collectors.
var EmptyRegistry = prometheus.NewRegistry()

// Monitor is the custom collector.
type Monitor struct {
	InterfaceStatusCode *prometheus.Desc
}

// NewMonitorMetrics builds the collector and its metric descriptors.
func NewMonitorMetrics() *Monitor {
	return &Monitor{
		InterfaceStatusCode: prometheus.NewDesc(
			"url_interface_status_code", // metric name
			"URL interface status code", // help text
			[]string{"app", "url"},      // variable labels
			nil,                         // constant labels
		),
	}
}

// Describe sends the metric descriptors.
func (m Monitor) Describe(desc chan<- *prometheus.Desc) {
	desc <- m.InterfaceStatusCode
}

// Collect sends the metric data.
func (m Monitor) Collect(metrics chan<- prometheus.Metric) {
	metrics <- prometheus.MustNewConstMetric(
		m.InterfaceStatusCode,
		prometheus.GaugeValue,
		float64(100),
		"test",
		"http://url",
	)
}

func TestRegisterer() {
	// Register the collector and expose it on /metrics.
	EmptyRegistry.MustRegister(NewMonitorMetrics())
	http.Handle("/metrics", promhttp.HandlerFor(EmptyRegistry,
		promhttp.HandlerOpts{ErrorHandling: promhttp.ContinueOnError}))
	// ListenAndServe only returns on failure, so surface the error.
	log.Fatal(http.ListenAndServe(":8001", nil))
}

func main() {
	TestRegisterer()
}
```

--------------------------------

### Monitor Disk Write Rate

Source: https://cairry.github.io/docs/Exporter/node.html

Monitors the number of completed disk writes per second, averaged across all matching devices and instances over a 1-minute window. Use this to identify disk performance bottlenecks.

```PromQL
avg(rate(node_disk_writes_completed_total{}[1m]))
```

--------------------------------

### Calculate Memory Usage Rate

Source: https://cairry.github.io/docs/Exporter/node.html

Calculates memory usage as a percentage of total memory, based on `MemAvailable`. Use this to monitor system memory usage and identify potential memory shortages.

```PromQL
100 - (node_memory_MemAvailable_bytes{} / node_memory_MemTotal_bytes{} * 100)
```
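
The arithmetic behind this expression can be sketched in Go as follows. The function name `memoryUsagePercent` and the sample byte values are hypothetical; only the formula comes from the query above.

```go
package main

import "fmt"

// memoryUsagePercent mirrors the PromQL expression
// 100 - (MemAvailable / MemTotal * 100).
func memoryUsagePercent(availableBytes, totalBytes float64) float64 {
	return 100 - (availableBytes/totalBytes)*100
}

func main() {
	// Hypothetical node: 4 GiB available out of 16 GiB total.
	const gib = 1024 * 1024 * 1024
	fmt.Printf("%.1f%%\n", memoryUsagePercent(4*gib, 16*gib)) // prints 75.0%
}
```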

--------------------------------

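### Register Metrics with MustNewConstMetric

Source: https://cairry.github.io/docs/Exporter/start.html

To register metrics with `MustNewConstMetric`, first define a `Desc`. This approach suits simple metrics with fixed values and a small number of constant labels; such metrics are cheaper to create and the code is easier to read.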

```go
pid := strconv.Itoa(os.Getpid())
cmdline := os.Args[0]
user := os.Getenv("USER")
cpuUsage := 50.0 // Percentage

desc := prometheus.NewDesc(
    "cpu_usage",
    "CPU usage of process",
    []string{"pid", "cmdline", "user"},
    nil,
)

metric := prometheus.MustNewConstMetric(
    desc,
    prometheus.GaugeValue,
    cpuUsage,
    pid,
    cmdline,
    user,
)

// ch is the chan<- prometheus.Metric passed to the collector's Collect method.
ch <- metric
```

--------------------------------

### Prometheus Configuration for Nginx Exporter

Source: https://cairry.github.io/docs/Exporter/nginx.html

This Prometheus configuration uses Kubernetes service discovery to find and scrape metrics from the nginx-vts-exporter. It restricts discovery to the `monitor` namespace, copies the Kubernetes service name into a `service_name` label, and keeps only targets whose address ends in port 9913.

```yaml
- job_name: 'nginx-vts-exporter'
  kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names:
        - monitor
  relabel_configs:
    - source_labels: [__meta_kubernetes_service_name]
      target_label: service_name
      action: replace
    # Keep only targets listening on port 9913; `keep` does not use target_label.
    - source_labels: [__address__]
      regex: '(.*):9913'
      action: keep
```
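
Prometheus anchors relabel regexes at both ends, so `(.*):9913` has to match the whole `__address__` value. The Go sketch below approximates that `keep` decision with the standard `regexp` package; the function name `keepTarget` and the sample addresses are hypothetical, and this is an illustration rather than Prometheus's own relabeling code.

```go
package main

import (
	"fmt"
	"regexp"
)

// keepRegex wraps the pattern in ^(?:...)$ to approximate the full-string
// anchoring that Prometheus applies to relabel regexes.
var keepRegex = regexp.MustCompile(`^(?:(.*):9913)$`)

// keepTarget reports whether a target with this address survives the keep rule.
func keepTarget(address string) bool {
	return keepRegex.MatchString(address)
}

func main() {
	for _, addr := range []string{"10.0.0.5:9913", "10.0.0.5:8080", "10.0.0.6:99130"} {
		fmt.Printf("%s keep=%v\n", addr, keepTarget(addr))
	}
}
```

Only the first address is kept: the second uses a different port, and the third fails because the anchored pattern must end exactly at `:9913`.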

--------------------------------

### Calculate P95 Interface Latency

Source: https://cairry.github.io/docs/Monitor/business/index.html

Calculates the 95th percentile of interface response times over a 5-minute interval. Useful for identifying tail latency issues.

```PromQL
sort_desc(histogram_quantile(0.95, sum(rate(http_server_duration_milliseconds_bucket{job=~"$app", http_route!=""}[5m]))by (le, http_route)))
```

--------------------------------

### Calculate Average Interface Latency

Source: https://cairry.github.io/docs/Monitor/business/index.html

Calculates the average response time for interface requests over a 5-minute interval. Use this to analyze overall system response trends.

```PromQL
sort_desc(rate(http_server_duration_milliseconds_sum{job=~"$app", http_route=~"$route", http_status_code!=""}[5m]) / rate(http_server_duration_milliseconds_count{job=~"$app", http_route=~"$route", http_status_code!=""}[5m]))
```
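
Since both rates cover the same 5-minute window, the window length cancels out and the ratio reduces to the growth of `_sum` divided by the growth of `_count`. A minimal Go sketch of that arithmetic follows; the type and function names and the sample readings are hypothetical, and counter resets and extrapolation, which `rate()` handles, are ignored here.

```go
package main

import "fmt"

// counterSample is a hypothetical pair of readings taken at one instant from
// the _sum and _count series of a single histogram.
type counterSample struct {
	sumMs float64
	count float64
}

// averageLatencyMs divides the growth of _sum by the growth of _count.
// ok is false when no requests arrived, a case where PromQL would yield NaN.
func averageLatencyMs(start, end counterSample) (avg float64, ok bool) {
	requests := end.count - start.count
	if requests <= 0 {
		return 0, false
	}
	return (end.sumMs - start.sumMs) / requests, true
}

func main() {
	start := counterSample{sumMs: 12000, count: 100}
	end := counterSample{sumMs: 15000, count: 120}
	if avg, ok := averageLatencyMs(start, end); ok {
		// 3000 ms spread over 20 requests.
		fmt.Printf("average latency: %.0f ms\n", avg)
	}
}
```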

--------------------------------

### Configure Consul Service Discovery in Prometheus

Source: https://cairry.github.io/docs/Prometheus/ServiceDiscover.html

Use `consul_sd_configs` to enable Consul service discovery. Specify the Consul server address and optionally filter services.

```yaml
...
- job_name: 'consul-prometheus'
  consul_sd_configs:
  - server: '192.168.1.177:8500'
    services: []

```

--------------------------------

### Memory Usage per Pod in DaemonSet

Source: https://cairry.github.io/docs/Exporter/kubernetes-resource/kubernetes.html

Retrieves the memory usage of each container in pods created by a DaemonSet, labeled by pod and container. Useful for resource management and optimization of DaemonSet pods.

```PromQL
group(kube_pod_info{created_by_kind="DaemonSet"}) by (pod)* on(pod) group_right() max(container_memory_usage_bytes{container!="POD",container!=""}) by(container, pod)
```

--------------------------------

### Analyze Error Log Percentage

Source: https://cairry.github.io/docs/Monitor/business/index.html

Calculates `ERROR`-level log entries as a percentage of all log entries counted by `l2m_level_info` for each service over a 10-minute window, and returns only services above 1%. Use this to identify services with a high proportion of errors.

```PromQL
sum(increase(l2m_level_info{level="ERROR"}[10m])) by (service) / sum(increase(l2m_level_info[10m])) by (service) * 100 > 1
```
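
The per-service division and the `> 1` filter can be sketched in Go as below. The `logIncrease` type, the function name, and the sample services are hypothetical; the sketch only mirrors the arithmetic of the query, with the 10-minute increases supplied as plain numbers.

```go
package main

import (
	"fmt"
	"sort"
)

// logIncrease holds hypothetical 10-minute increases for one service.
type logIncrease struct {
	errorLines float64 // increase of the level="ERROR" series
	allLines   float64 // increase summed over all levels
}

// errorPercentAbove returns the error percentage of every service whose
// value exceeds thresholdPercent, like the trailing "> 1" in the query.
func errorPercentAbove(increases map[string]logIncrease, thresholdPercent float64) map[string]float64 {
	result := make(map[string]float64)
	for service, inc := range increases {
		if inc.allLines == 0 {
			continue // no log lines in the window, nothing to report
		}
		percent := inc.errorLines / inc.allLines * 100
		if percent > thresholdPercent {
			result[service] = percent
		}
	}
	return result
}

func main() {
	increases := map[string]logIncrease{
		"checkout": {errorLines: 30, allLines: 1000},
		"search":   {errorLines: 2, allLines: 1000},
	}
	flagged := errorPercentAbove(increases, 1)
	names := make([]string, 0, len(flagged))
	for name := range flagged {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%s: %.1f%%\n", name, flagged[name])
	}
}
```

Here only `checkout` (3.0%) passes the threshold; `search` sits at 0.2% and is filtered out.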

--------------------------------

### Calculate Overall Request Success Rate (2xx)

Source: https://cairry.github.io/docs/Monitor/business/index.html

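Calculates the share of requests that did not fail out of all requests to services. The status-code matcher accepts 1xx and 3xx responses in addition to 2xx, and the result is a ratio between 0 and 1 (multiply by 100 for a percentage). Because the query sums raw counters rather than a `rate()` or `increase()`, it reflects the whole lifetime of the series. This metric reflects overall service quality.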

```PromQL
sum(http_server_duration_milliseconds_count{job=~"service",http_status_code=~"2.*|1.*|3.*"}) / sum(http_server_duration_milliseconds_count{job=~"service"})
```
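
The following Go sketch shows which status codes the matcher `2.*|1.*|3.*` counts as successful and how the ratio is formed. PromQL label regexes are anchored at both ends, which the sketch approximates with `^(?:...)$`; the function name and the sample counts are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// nonErrorStatus approximates the anchored label matcher
// http_status_code=~"2.*|1.*|3.*".
var nonErrorStatus = regexp.MustCompile(`^(?:2.*|1.*|3.*)$`)

// successRatio divides the requests with a matching status code by all requests.
func successRatio(countsByStatus map[string]float64) float64 {
	var matched, total float64
	for status, count := range countsByStatus {
		total += count
		if nonErrorStatus.MatchString(status) {
			matched += count
		}
	}
	return matched / total
}

func main() {
	// Hypothetical request counts per status code.
	counts := map[string]float64{"200": 900, "301": 50, "404": 30, "500": 20}
	fmt.Printf("%.2f\n", successRatio(counts)) // 950 of 1000 requests, prints 0.95
}
```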

--------------------------------

### Configure Redis Exporter with Environment Variable for Password (Multiple Instances)

Source: https://cairry.github.io/docs/Exporter/redis.html

When monitoring multiple Redis instances via Prometheus targets, the password must be injected into the Redis Exporter through an environment variable. A password embedded in the target URL has no effect.

```yaml
- name: REDIS_PASSWORD
  value: "xxx:xxx"
```

--------------------------------

### Deploy DCGM Exporter DaemonSet

Source: https://cairry.github.io/docs/Exporter/gpu.html

Deploys the DCGM exporter as a DaemonSet in Kubernetes. The exporter listens on port 9400, where Prometheus can scrape its metrics, and the pods use the host network and the `nvidia` RuntimeClass.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  namespace: monitor
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9400"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.4.0"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.4.0"
      name: "dcgm-exporter"
    spec:
      runtimeClassName: nvidia
      hostNetwork: true
      containers:
      - image: "nvcr.io/nvidia/k8s/dcgm-exporter:2.2.9-2.4.0-ubuntu18.04"
        env:
        - name: "DCGM_EXPORTER_LISTEN"
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"
          value: "true"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400

```

--------------------------------

### Deploy Fping-Exporter Deployment

Source: https://cairry.github.io/docs/Exporter/node.html

This Kubernetes configuration deploys the fping-exporter as a single-replica Deployment and exposes its metrics on port 9605 through a ClusterIP Service.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fping-exporter
  namespace: monitoring
  labels:
    app: fping-exporter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fping-exporter
  template:
    metadata:
      labels:
        app: fping-exporter
    spec:
      containers:
      - name: fping-exporter
        image: joaorua/fping-exporter
        ports:
        - containerPort: 9605
---
apiVersion: v1
kind: Service
metadata:
  name: fping-exporter
  namespace: monitoring
  labels:
    app: fping-exporter
spec:
  type: ClusterIP
  ports:
  - port: 9605
    targetPort: 9605
  selector:
    app: fping-exporter
```