Building a Comprehensive Container Monitoring Infrastructure in 2025
Effective monitoring is critical in containerized environments where systems are distributed, dynamic, and ephemeral. This guide walks through building a complete monitoring infrastructure using modern tools and techniques to provide full visibility into your container ecosystem.
The Observability Challenge in Container Environments
Containerized infrastructures present unique monitoring challenges:
- Ephemeral Workloads: Containers start, stop, and reschedule frequently
- Dynamic Addressing: IP addresses and ports change constantly
- Distributed Systems: Applications are spread across multiple nodes
- Resource Sharing: Multiple containers share underlying resources
- Layered Dependencies: Issues can originate at the container, host, or orchestration layer
Modern observability requires capturing three core types of telemetry:
- Metrics: Numerical time-series data (CPU, memory, request rates, etc.)
- Logs: Textual records of events and application output
- Traces: The flow of requests through distributed services
In this guide, we’ll build a complete monitoring stack that addresses all three pillars of observability.
Architecture Overview
Our monitoring infrastructure consists of these components:

Metrics Collection and Analysis
- Prometheus: Time-series database that scrapes and stores metrics
- Grafana: Dashboard and visualization platform
- Node Exporter: Collects host-level metrics
- cAdvisor: Gathers container metrics
- Alertmanager: Handles alerting based on metric thresholds
Log Collection and Analysis
- Elasticsearch: Distributed search and analytics engine
- Fluentd: Log collection, processing, and forwarding
- Kibana: Log visualization and exploration
Distributed Tracing
- OpenTelemetry Collector: Collects and processes trace data
- Jaeger: Distributed tracing platform
- Tempo: Grafana’s trace backend for unified observability
Implementation Guide
Prerequisites
- Docker and Docker Compose (or Kubernetes cluster)
- At least 8GB of RAM available for the monitoring stack
- Basic familiarity with containerization concepts
Step 1: Create the Project Structure
Start by creating a directory structure for our monitoring infrastructure:
mkdir container-monitoring
cd container-monitoring
mkdir -p prometheus/config grafana/provisioning/{datasources,dashboards} \
elasticsearch/config kibana/config fluentd/config \
jaeger opentelemetry/config sample-app
Create a base docker-compose.yml file:
touch docker-compose.yml
touch .env
Step 2: Set Up Prometheus and Node Exporter
First, let’s create the Prometheus configuration:
cat > prometheus/config/prometheus.yml << EOF
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "rules/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'sample-app'
    static_configs:
      - targets: ['sample-app:8080']
EOF
Add Prometheus and Node Exporter to the Docker Compose file:
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.49.0
    volumes:
      - ./prometheus/config:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    restart: unless-stopped
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v1.7.0
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    restart: unless-stopped
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.2
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    ports:
      - "8080:8080"
    restart: unless-stopped
    privileged: true
    networks:
      - monitoring

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
Step 3: Configure Grafana
Create Grafana datasource configuration:
cat > grafana/provisioning/datasources/datasource.yml << EOF
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

  - name: Elasticsearch
    type: elasticsearch
    access: proxy
    # Index pattern matches the Fluentd output (logstash_prefix fluentd, dateformat %Y%m%d)
    database: "[fluentd-]YYYYMMDD"
    url: http://elasticsearch:9200
    jsonData:
      esVersion: 8.0.0
      timeField: "@timestamp"
    editable: false

  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    editable: false
EOF
Create a Grafana dashboard configuration:
cat > grafana/provisioning/dashboards/dashboard.yml << EOF
apiVersion: 1

providers:
  - name: 'Default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /etc/grafana/provisioning/dashboards
EOF
Add Grafana to the Docker Compose file (this and the later service snippets nest under the existing services: key, while named volumes are appended to the top-level volumes: section):
  grafana:
    image: grafana/grafana:10.2.0
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - "3000:3000"
    restart: unless-stopped
    networks:
      - monitoring

volumes:
  grafana_data:
Step 4: Set Up Alertmanager
Create an Alertmanager configuration:
mkdir -p prometheus/config/alertmanager
cat > prometheus/config/alertmanager/alertmanager.yml << EOF
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'

receivers:
  - name: 'default-receiver'
    webhook_configs:
      # Placeholder: point this at a real webhook receiver in your environment
      - url: 'http://alertmanager-webhook:9095'
EOF
Add Alertmanager to the Docker Compose file:
  alertmanager:
    image: prom/alertmanager:v0.26.0
    volumes:
      - ./prometheus/config/alertmanager:/etc/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    ports:
      - "9093:9093"
    restart: unless-stopped
    networks:
      - monitoring
Create alert rules:
mkdir -p prometheus/config/rules
cat > prometheus/config/rules/alerts.yml << EOF
groups:
  - name: example
    rules:
      - alert: HighCPULoad
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load (instance {{ \$labels.instance }})"
          description: "CPU load is > 80%\n VALUE = {{ \$value }}\n LABELS: {{ \$labels }}"

      - alert: MemoryUsageHigh
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage (instance {{ \$labels.instance }})"
          description: "Memory usage is > 80%\n VALUE = {{ \$value }}\n LABELS: {{ \$labels }}"
EOF
Step 5: Set Up Elasticsearch, Fluentd, and Kibana (EFK Stack)
Create Fluentd configuration:
cat > fluentd/config/fluent.conf << EOF
<source>
  @type forward
  port 24224
  bind 0.0.0.0
</source>

<match *.**>
  @type copy

  <store>
    @type elasticsearch
    host elasticsearch
    port 9200
    logstash_format true
    logstash_prefix fluentd
    logstash_dateformat %Y%m%d
    include_tag_key true
    type_name access_log
    tag_key @log_name
    flush_interval 1s
  </store>

  <store>
    @type stdout
  </store>
</match>
EOF
Create a Fluentd Dockerfile:
cat > fluentd/Dockerfile << EOF
FROM fluent/fluentd:v1.16-1
USER root
RUN apk add --no-cache --update --virtual .build-deps \
sudo build-base ruby-dev \
&& sudo gem install fluent-plugin-elasticsearch \
&& sudo gem sources --clear-all \
&& apk del .build-deps \
&& rm -rf /tmp/* /var/tmp/* /usr/lib/ruby/gems/*/cache/*.gem
USER fluent
EOF
Add EFK stack to the Docker Compose file:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    environment:
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - xpack.security.enabled=false
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - es_data:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"
    networks:
      - monitoring

  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch
    networks:
      - monitoring

  fluentd:
    build: ./fluentd
    volumes:
      - ./fluentd/config:/fluentd/etc
    ports:
      - "24224:24224"
      - "24224:24224/udp"
    depends_on:
      - elasticsearch
    networks:
      - monitoring

volumes:
  es_data:
Step 6: Set Up Distributed Tracing with OpenTelemetry, Jaeger, and Tempo
Create OpenTelemetry Collector configuration:
cat > opentelemetry/config/otel-collector-config.yaml << EOF
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  # Jaeger (with COLLECTOR_OTLP_ENABLED) and Tempo both ingest OTLP directly,
  # so we export to each over OTLP gRPC; the dedicated jaeger exporter was
  # removed from recent collector-contrib releases.
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger, otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
EOF
Add tracing components to the Docker Compose file:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.91.0
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./opentelemetry/config/otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "8889:8889"   # Prometheus metrics
    networks:
      - monitoring

  jaeger:
    image: jaegertracing/all-in-one:1.47
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - "16686:16686" # Jaeger UI; traces arrive via OTLP over the internal monitoring network
    networks:
      - monitoring

  tempo:
    image: grafana/tempo:2.3.0
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo/tempo.yaml:/etc/tempo.yaml
      - tempo_data:/tmp/tempo
    ports:
      - "3200:3200"   # Tempo HTTP API (queried by Grafana); OTLP stays on the internal network to avoid clashing with the collector's published 4317
    networks:
      - monitoring

volumes:
  tempo_data:
Create a simple Tempo configuration file:
mkdir -p tempo
cat > tempo/tempo.yaml << EOF
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:

storage:
  trace:
    backend: local
    local:
      path: /tmp/tempo/traces
EOF
Step 7: Create a Sample Application with Instrumentation
Create a simple Go application with metrics, logging, and tracing:
cat > sample-app/main.go << EOF
package main

import (
	"context"
	"fmt"
	"log"
	"math/rand"
	"net/http"
	"time"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Define Prometheus metrics
var (
	requestCounter = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "sample_app_requests_total",
			Help: "Total number of requests",
		},
		[]string{"path", "status"},
	)
	requestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "sample_app_request_duration_seconds",
			Help:    "Duration of HTTP requests",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"path"},
	)
)

func initTracer() func() {
	ctx := context.Background()

	res, err := resource.New(ctx,
		resource.WithAttributes(
			semconv.ServiceNameKey.String("sample-app"),
		),
	)
	if err != nil {
		log.Fatalf("Failed to create resource: %v", err)
	}

	// Create an OTLP trace exporter that sends spans to the collector over gRPC
	traceExporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatalf("Failed to create trace exporter: %v", err)
	}

	// Register the trace exporter with a TracerProvider
	bsp := sdktrace.NewBatchSpanProcessor(traceExporter)
	tracerProvider := sdktrace.NewTracerProvider(
		sdktrace.WithSampler(sdktrace.AlwaysSample()),
		sdktrace.WithResource(res),
		sdktrace.WithSpanProcessor(bsp),
	)
	otel.SetTracerProvider(tracerProvider)

	return func() {
		// Shutdown flushes any remaining spans; use a fresh context with its own timeout
		shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		if err := tracerProvider.Shutdown(shutdownCtx); err != nil {
			log.Fatalf("Failed to shutdown TracerProvider: %v", err)
		}
	}
}

func handler(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	path := r.URL.Path

	// Add artificial delay to simulate processing time
	delay := time.Duration(rand.Intn(100)) * time.Millisecond
	time.Sleep(delay)

	status := http.StatusOK
	if rand.Intn(10) == 0 {
		status = http.StatusInternalServerError
		w.WriteHeader(status)
		fmt.Fprintf(w, "Error occurred")
		log.Printf("ERROR: Request to %s failed with status %d", path, status)
	} else {
		fmt.Fprintf(w, "Hello, World!")
		log.Printf("INFO: Request to %s completed successfully", path)
	}

	// Update metrics
	requestCounter.WithLabelValues(path, fmt.Sprintf("%d", status)).Inc()
	requestDuration.WithLabelValues(path).Observe(time.Since(start).Seconds())
}

func main() {
	// Initialize tracing
	shutdown := initTracer()
	defer shutdown()

	// Set up HTTP server with OpenTelemetry instrumentation
	http.Handle("/", otelhttp.NewHandler(http.HandlerFunc(handler), "hello"))
	http.Handle("/metrics", promhttp.Handler())

	log.Println("Starting server on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}
EOF
Create a Dockerfile for the sample application:
cat > sample-app/Dockerfile << EOF
FROM golang:1.21 AS builder

WORKDIR /app
COPY go.mod ./
COPY main.go ./

# go mod tidy resolves all transitive dependencies and writes go.sum inside the build
RUN go mod tidy && CGO_ENABLED=0 GOOS=linux go build -o app .

FROM alpine:latest
RUN apk --no-cache add ca-certificates
WORKDIR /app
COPY --from=builder /app/app .

EXPOSE 8080
CMD ["./app"]
EOF
Create a go.mod file:
cat > sample-app/go.mod << EOF
module sample-app

go 1.21

require (
	github.com/prometheus/client_golang v1.17.0
	go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.45.0
	go.opentelemetry.io/otel v1.19.0
	go.opentelemetry.io/otel/exporters/otlp/otlptrace v1.19.0
	go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.19.0
	go.opentelemetry.io/otel/sdk v1.19.0
)
EOF
There is no need to create a go.sum file by hand: the go mod tidy step in the Dockerfile generates it inside the image build.
Add the sample app to the Docker Compose file:
  sample-app:
    build: ./sample-app
    ports:
      - "8081:8080"
    logging:
      driver: "fluentd"
      options:
        fluentd-address: localhost:24224
        tag: sample-app
    depends_on:
      - fluentd
      - otel-collector
    networks:
      - monitoring
Step 8: Launch the Monitoring Infrastructure
Create an environment file:
cat > .env << EOF
# Grafana settings
GF_SECURITY_ADMIN_USER=admin
GF_SECURITY_ADMIN_PASSWORD=admin
GF_USERS_ALLOW_SIGN_UP=false
# Elasticsearch settings
ES_JAVA_OPTS=-Xms512m -Xmx512m
ELASTIC_PASSWORD=changeme
# Sample app settings
OTEL_EXPORTER_OTLP_ENDPOINT=otel-collector:4317
OTEL_RESOURCE_ATTRIBUTES=service.name=sample-app
EOF
Start the entire stack:
docker-compose up -d
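Once the containers are running, a quick smoke test confirms the core services are healthy; these are the standard health endpoints exposed by each tool (Elasticsearch can take a minute or two to report yellow or green on first start):
docker-compose ps
curl -s http://localhost:9090/-/healthy
curl -s http://localhost:3000/api/health
curl -s "http://localhost:9200/_cluster/health?pretty"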
Step 9: Access and Configure the Monitoring Tools
Once everything is up and running, you can access:
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin/admin)
- Kibana: http://localhost:5601
- Jaeger UI: http://localhost:16686
- Sample App: http://localhost:8081
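To see data flowing through all three pillars, generate some traffic against the sample application; a simple loop such as the following is enough:
for i in $(seq 1 200); do curl -s http://localhost:8081/ > /dev/null; done
After a minute or so you should see sample_app_requests_total in Prometheus, the corresponding log lines in Kibana, and traces for the hello handler in Jaeger.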
Creating a Complete Dashboard in Grafana
Let’s create a comprehensive dashboard that includes metrics, logs, and traces:
Log in to Grafana (http://localhost:3000) with admin/admin
Navigate to Dashboards → New → New Dashboard
Click “Add visualization”
Select the Prometheus data source
Add panels for:
- Container CPU Usage:
  sum by (name) (rate(container_cpu_usage_seconds_total{image!=""}[5m]))
- Container Memory Usage:
  sum by (name) (container_memory_usage_bytes{image!=""})
- Application Request Rate:
  rate(sample_app_requests_total[5m])
- Application Error Rate:
  sum(rate(sample_app_requests_total{status=~"5.."}[5m])) / sum(rate(sample_app_requests_total[5m]))
- Application Latency:
  histogram_quantile(0.95, sum(rate(sample_app_request_duration_seconds_bucket[5m])) by (le))
Add a Logs panel connecting to Elasticsearch with the query:
{ "query": { "match_all": {} }, "sort": [ { "@timestamp": { "order": "desc" } } ] }
Add a Traces panel connecting to Tempo.
Best Practices for Production Deployments
When transitioning this monitoring stack to production, consider these best practices:
High Availability
- Deploy Prometheus with high availability using Thanos or Prometheus Operator (see the replica-label sketch after this list)
- Use an Elasticsearch cluster with at least 3 nodes
- Deploy Fluentd as a DaemonSet in Kubernetes
- Consider managed solutions for reduced operational overhead
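As a small illustration of the first point, the usual pattern is to run two identical Prometheus replicas and distinguish them only by an external label, which is what allows Thanos to deduplicate their data at query time. The label names below are conventional, not mandated:
# prometheus.yml on each replica: identical scrape configs, distinct replica label
global:
  external_labels:
    cluster: production
    replica: prometheus-a   # use prometheus-b on the second replica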
Security
- Enable TLS for all services
- Use proper authentication and authorization (see the example after this list)
- Implement network segmentation
- Rotate credentials regularly
- Apply the principle of least privilege
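As one concrete example of the authentication point, Prometheus can sit behind basic auth using a web configuration file passed with --web.config.file. This is a minimal sketch; the bcrypt hash is a placeholder you would generate yourself (for example with htpasswd -nBC 10):
# prometheus/config/web-config.yml, mounted into the container and referenced with
# --web.config.file=/etc/prometheus/web-config.yml in the prometheus command section
basic_auth_users:
  admin: <bcrypt-hash-of-your-password>   # placeholder, never commit real credentials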
Performance Tuning
- Optimize Prometheus storage retention and compaction (see the retention flags after this list)
- Configure appropriate JVM heap size for Elasticsearch
- Use efficient log sampling and filtering
- Implement metric cardinality controls
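For the retention point, Prometheus retention is controlled by command-line flags; the values below are illustrative and should be sized to your disk capacity and query horizon:
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'   # drop samples older than 15 days
      - '--storage.tsdb.retention.size=20GB'  # or cap total TSDB size, whichever limit is hit first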
Resource Management
- Set resource limits and requests for all containers (example after this list)
- Implement horizontal pod autoscaling
- Monitor the monitoring infrastructure itself
- Consider storage requirements and implement appropriate retention policies
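In Compose, per-service limits look like the following; deploy.resources is honored by Swarm and by recent Docker Compose releases, and the numbers here are illustrative rather than tuned recommendations:
  prometheus:
    deploy:
      resources:
        limits:
          cpus: '1.0'
          memory: 2G
        reservations:
          memory: 1G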
Alerting Strategy
- Define meaningful alerts with clear thresholds
- Implement alert aggregation and deduplication (see the inhibition rule after this list)
- Create escalation policies
- Document alert meaning and resolution steps
- Avoid alert fatigue through proper tuning
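As an example of deduplication, Alertmanager inhibition rules suppress lower-severity alerts while a related critical alert is already firing; a minimal sketch for alertmanager.yml:
inhibit_rules:
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: ['alertname', 'instance']   # only inhibit when these labels match on both alerts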
Advanced Monitoring Techniques
Service Level Objectives (SLOs) and Error Budgets
Implement SLOs to define reliability targets:
# Prometheus recording rules for SLOs
groups:
  - name: slo_rules
    rules:
      - record: slo:request_availability:ratio
        expr: sum(rate(sample_app_requests_total{status!~"5.."}[5m])) / sum(rate(sample_app_requests_total[5m]))
      - record: slo:request_latency:ratio
        expr: sum(rate(sample_app_request_duration_seconds_bucket{le="0.1"}[5m])) / sum(rate(sample_app_request_duration_seconds_count[5m]))
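Building on these recording rules, you can alert on error-budget burn rather than on raw error counts. The rule below is a simplified single-window sketch for a 99.9% availability objective; production setups typically use multi-window, multi-burn-rate alerts:
groups:
  - name: slo_alerts
    rules:
      - alert: ErrorBudgetBurn
        # Fires when the error ratio over the last hour exceeds the 0.1% budget allowed by a 99.9% SLO
        expr: (sum(rate(sample_app_requests_total{status=~"5.."}[1h])) / sum(rate(sample_app_requests_total[1h]))) > (1 - 0.999)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget for sample-app is burning faster than the 99.9% SLO allows"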
Anomaly Detection
Use Prometheus alerting rules for simple statistical anomaly detection:
groups:
  - name: anomaly_detection
    rules:
      - alert: AbnormalRequestPattern
        expr: abs(rate(sample_app_requests_total[5m]) - avg_over_time(rate(sample_app_requests_total[5m])[1h:5m])) > 3 * stddev_over_time(rate(sample_app_requests_total[5m])[1h:5m])
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Abnormal request pattern detected"
          description: "Current request rate deviates significantly from historical patterns"
Synthetic Monitoring
Add synthetic monitoring to test end-to-end functionality:
- Install the Prometheus Blackbox Exporter
- Configure HTTP checks:
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      method: GET
- Add the synthetic monitoring to Prometheus configuration:
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://sample-app:8080/
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
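The scrape config above assumes a blackbox-exporter service is reachable on the monitoring network. A minimal sketch of that service for the Compose file, assuming the module definitions are saved as blackbox/config/config.yml so they land at the image's default configuration path:
  blackbox-exporter:
    image: prom/blackbox-exporter:v0.24.0
    volumes:
      - ./blackbox/config:/etc/blackbox_exporter
    ports:
      - "9115:9115"
    networks:
      - monitoring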
Conclusion
A well-designed monitoring infrastructure provides essential visibility into containerized applications. By combining metrics, logs, and traces into a unified observability platform, you gain the ability to:
- Detect issues quickly before they impact users
- Diagnose root causes across complex distributed systems
- Optimize application performance based on real-world data
- Plan capacity with accurate usage patterns
- Demonstrate compliance with service level objectives
This monitoring stack serves as a foundation that can be extended and customized to meet your specific needs. As your containerized infrastructure grows, your monitoring capabilities can evolve with it, providing continuous insight into your systems’ health and performance.
Remember that monitoring is not just about collecting data—it’s about deriving actionable insights that help you build more reliable, efficient, and user-friendly applications.
The complete code for this monitoring infrastructure is available on GitHub.