Monitoring is a Pain (and We're All Doing it Wrong)

And we’re all doing it wrong (including me).

Monitoring is supposed to make life easier for developers and operators, but it often does the opposite. Despite our best intentions, observability tools frequently fall short, leaving us with brittle systems, ballooning costs, and frustration.

The Problem with Monitoring

Monitoring starts with simplicity: print statements turned into logs, basic metrics, and maybe some traces. But as systems scale, cracks begin to show:

Logs: Endless streams of unstructured data with questionable value.
Metrics: Short-term solutions that don’t scale without significant investment.
Tracing: A promising tool that no one seems to use effectively.

Logs: A Love-Hate Relationship

Logs should provide clarity but often become a source of chaos.

Common Issues

Log Levels Mean Nothing
Different systems define levels inconsistently: Python’s logging module has five standard levels, syslog has eight severities, and Go’s original log package has none at all.
Inconsistent Formats
JSON, Common Event Format, Nginx, and GELF logs all compete with no clear winner.
Logs as a Catch-All Tool
One pipeline ends up serving debugging, business intelligence, customer support, and auditing, which leads to bloated, brittle systems.
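
To make the level mismatch concrete, here is a minimal sketch that maps Python's standard logging levels onto syslog severities (RFC 5424). The mapping table and the helper name to_syslog_severity are my own illustrative assumptions, not something either system prescribes. The point is that the two scales run in opposite directions and don't line up one-to-one.

```python
import logging

# Syslog severities (RFC 5424): a LOWER number means MORE severe.
SYSLOG_SEVERITY = {
    "emerg": 0, "alert": 1, "crit": 2, "err": 3,
    "warning": 4, "notice": 5, "info": 6, "debug": 7,
}

# Hypothetical mapping. Python's levels grow with severity (DEBUG=10 ... CRITICAL=50)
# and there are only five of them, so "emerg", "alert", and "notice" go unused.
PYTHON_TO_SYSLOG = {
    logging.DEBUG: "debug",
    logging.INFO: "info",
    logging.WARNING: "warning",
    logging.ERROR: "err",
    logging.CRITICAL: "crit",
}

def to_syslog_severity(levelno: int) -> int:
    """Map a Python level number to the syslog severity of the nearest
    standard level at or below it (custom levels such as 35 are allowed)."""
    known = [lvl for lvl in sorted(PYTHON_TO_SYSLOG) if lvl <= levelno]
    name = PYTHON_TO_SYSLOG[known[-1]] if known else "debug"
    return SYSLOG_SEVERITY[name]

# Severities that this mapping can never produce.
unmapped = set(SYSLOG_SEVERITY) - set(PYTHON_TO_SYSLOG.values())
```

Any pipeline that merges logs from both worlds has to pick a table like this one, and every team tends to pick a slightly different one.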

Suggestions

Separate Critical Logs: Compliance and audit logs shouldn’t live in the same pipeline as 200-OK responses.
Set a Realistic SLA: If logs aren’t critical, agree on an SLA that reflects that reality (e.g., 99% uptime allows for about 7.2 hours of downtime in a 30-day month).
Use Sampling: The OpenTelemetry Collector can sample log records (for example, with its probabilistic sampler processor), so you can thin out low-priority logs before they overload your system.
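
The last two suggestions fit in a few lines of Python. This is a rough sketch using only the standard library: downtime_budget_hours and SamplingFilter are hypothetical names of mine, and the filter is a plain-Python stand-in for the idea of sampling, not OpenTelemetry's own mechanism.

```python
import logging
import random

def downtime_budget_hours(availability: float, days: int = 30) -> float:
    """Hours of downtime an availability target allows over `days` days."""
    return (1.0 - availability) * days * 24

class SamplingFilter(logging.Filter):
    """Keep every record at or above `always_keep`; keep only a random
    fraction (`rate`, between 0.0 and 1.0) of the records below it."""

    def __init__(self, rate: float, always_keep: int = logging.WARNING, rng=None):
        super().__init__()
        self.rate = rate
        self.always_keep = always_keep
        # An injectable random source keeps the filter testable.
        self.rng = rng or random.Random()

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= self.always_keep:
            return True  # never drop warnings and errors
        return self.rng.random() < self.rate

# Usage: keep roughly 10% of DEBUG/INFO records on this handler.
handler = logging.StreamHandler()
handler.addFilter(SamplingFilter(rate=0.10))
```

With these numbers, downtime_budget_hours(0.99) comes out at about 7.2 hours for a 30-day month. Whether 10% is the right sampling rate depends entirely on how much you would miss the other 90%.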

Metrics: Simple Until They’re Not

Metrics start simple but often grow out of control.

Scaling Challenges

Prometheus Limitations
Prometheus on its own isn’t built for high-cardinality data, long-term storage, or large federated setups.
Business Use Cases
Metrics become critical for everything from customer behavior insights to debugging production issues.
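
Cardinality is easier to respect once you multiply it out. The sketch below computes the worst-case number of time series for a single metric as the product of distinct values per label. The label names and counts are made up for illustration, and real systems rarely hit the full product, so treat the result as an upper bound.

```python
from math import prod

def max_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on time series for one metric:
    the product of the number of distinct values of each label."""
    return prod(label_value_counts.values())

# Hypothetical HTTP request counter.
labels = {"method": 5, "status": 8, "endpoint": 50, "pod": 40}
baseline = max_series(labels)                            # 5 * 8 * 50 * 40

# The same metric after someone adds a per-user label "for the business".
with_user = max_series({**labels, "user_id": 10_000})

print(baseline, with_user)
```

One innocent-looking label turns 80,000 possible series into 800 million, which is how a business use case quietly becomes a scaling problem.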

Solutions

Start with Thanos or Cortex: Avoid re-engineering your system later.
- Thanos: Modular, simpler setup for long-term storage.
- Cortex: Better for high-volume, high-cardinality environments.
Cap Retention Periods: Define strict retention policies upfront.
Control Costs: Monitor ingestion rates and cardinality to prevent runaway expenses.
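
Retention and ingestion rate translate directly into disk. As a back-of-the-envelope sketch, the Prometheus storage documentation gives the rule of thumb disk = retention seconds x samples per second x bytes per sample, with roughly 1 to 2 bytes per sample. The series count and scrape interval below are assumptions chosen for illustration.

```python
SECONDS_PER_DAY = 86_400

def samples_per_second(active_series: int, scrape_interval_s: float) -> float:
    """Ingestion rate if every active series is scraped once per interval."""
    return active_series / scrape_interval_s

def disk_bytes(retention_days: float, ingest_rate: float,
               bytes_per_sample: float = 2.0) -> float:
    """Rough disk needed: retention time * ingestion rate * bytes per sample."""
    return retention_days * SECONDS_PER_DAY * ingest_rate * bytes_per_sample

# Hypothetical setup: 1M active series, 15 s scrape interval, 15 days retention.
rate = samples_per_second(active_series=1_000_000, scrape_interval_s=15)
needed_gb = disk_bytes(retention_days=15, ingest_rate=rate) / 1e9
print(f"{needed_gb:.1f} GB")
```

Under these assumptions the estimate lands at about 172.8 GB. Doubling either retention or cardinality doubles it, which is why both deserve a cap before the bill arrives.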

Tracing: The Underrated Hero

Tracing bridges the gap between logs and metrics, offering detailed insights into distributed systems. Yet, it remains underutilized.

Why Tracing Works

Sampling: Most tracing SDKs, OpenTelemetry included, ship with samplers, so you can keep a fraction of traces instead of storing every request.
End-to-End Visibility: Follow requests through load balancers, services, and databases.
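
Both properties come from a small piece of context passed along with every request. Here is a simplified sketch of the W3C Trace Context traceparent header, which has the form version-traceid-spanid-flags. The function names are hypothetical and a real SDK does far more, but it shows how one trace ID and one sampling decision travel from hop to hop.

```python
import secrets

def start_trace(sampled: bool) -> str:
    """Create a traceparent value for the first span of a request.
    The sampling decision is made once, here, and stored in the flags."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"

def next_hop(traceparent: str) -> str:
    """Header a service sends downstream: same trace ID and flags,
    but a fresh span ID for the new unit of work."""
    version, trace_id, _parent_span_id, flags = traceparent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

def is_sampled(traceparent: str) -> bool:
    """True if the 'sampled' bit is set in the trace flags."""
    return (int(traceparent.split("-")[3], 16) & 0x01) == 1

# A request passing through load balancer -> service -> database.
lb = start_trace(sampled=True)
svc = next_hop(lb)
db = next_hop(svc)
```

All three hops share the same trace ID, so a backend can stitch them into one request, and a hop that sees the sampled bit unset can skip recording without any coordination.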

The Challenge

Despite its potential, tracing tools like OpenTelemetry and Cloud Trace often see low adoption among developers.

Practical Suggestions for Better Monitoring

Define Ownership
Assign monitoring to a dedicated team or individual.
Set Realistic Expectations
Monitoring isn’t “set and forget.” Plan for ongoing maintenance.
Separate Use Cases
Logs, metrics, and traces serve different purposes—don’t conflate them.
Invest Early
Start with scalable solutions like Thanos or Cortex to avoid future headaches.

Conclusion

Monitoring is essential but often treated as an afterthought. By acknowledging its challenges and investing in better tools and practices, we can build systems that work for us—not against us.