Why Every Project We Ship Includes Monitoring From Day One
There is a moment in every project when someone asks: "Do we really need monitoring right now? We only have 50 users."
The answer is yes. Not because 50 users will break your system, but because you need to understand how your system behaves under normal conditions before you can recognize when something is wrong.
We learned this the hard way.
The incident that changed our process
Early in our history, we deployed an e-commerce platform without proper monitoring. The client had maybe 200 active users. We figured we would add monitoring later, when it "mattered."
Three weeks after launch, the database connection pool started leaking. Not fast — one connection every few hours. It took four days to exhaust the pool. On day four, at peak traffic time, every request started timing out. Users saw blank pages. The client found out from their customers before we found out from our systems.
We spent hours diagnosing what was essentially a one-line fix: a missing finally block meant connections were never returned to the pool. The hours were not spent on the fix itself. They were spent figuring out what was wrong, because we had no metrics, no dashboards, no alerts. We were flying blind.
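The failure mode is easy to reproduce in miniature. The sketch below is not the client's code: a counter stands in for the real connection pool, and the class and method names are invented for illustration. It shows why the leaky version loses exactly one connection per failed query, and why moving the release into a finally block fixes it.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ConnectionLeakDemo {
    // Stand-in for a database connection pool: counts checked-out connections.
    public static final AtomicInteger inUse = new AtomicInteger(0);

    public static void acquire() { inUse.incrementAndGet(); }
    public static void release() { inUse.decrementAndGet(); }

    // Leaky version: if doWork() throws, release() is never reached and the
    // connection stays checked out. A few of these per day exhausts the pool.
    public static void leakyQuery(boolean fail) {
        acquire();
        doWork(fail);
        release();
    }

    // Fixed version: release() runs whether or not doWork() throws.
    public static void safeQuery(boolean fail) {
        acquire();
        try {
            doWork(fail);
        } finally {
            release();
        }
    }

    static void doWork(boolean fail) {
        if (fail) throw new RuntimeException("query timed out");
    }

    public static void main(String[] args) {
        try { leakyQuery(true); } catch (RuntimeException ignored) {}
        System.out.println("after leaky failure, in use: " + inUse.get()); // 1: leaked

        inUse.set(0);
        try { safeQuery(true); } catch (RuntimeException ignored) {}
        System.out.println("after safe failure, in use: " + inUse.get());  // 0: released
    }
}
```

With monitoring in place, the pool-utilization metric would have shown the slow climb days before the pool ran dry.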
After that, monitoring became non-negotiable from the first sprint.
What we actually deploy
Our monitoring stack is not exotic. We use what works and what we can deploy consistently across projects:
Application metrics with Prometheus
Every Spring Boot service we deploy exposes metrics via Actuator endpoints. Prometheus scrapes these every 15 seconds. Out of the box, we get:
- Request count, latency distribution, and error rate per endpoint
- JVM heap usage, thread counts, garbage collection frequency
- Database connection pool usage (active, idle, waiting)
- Custom business metrics (orders processed, users logged in, background jobs completed)
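The wiring for this is a few lines of Prometheus configuration. The fragment below is illustrative rather than our exact config: the job name and target host are placeholders, and it assumes the service has the Micrometer Prometheus registry on its classpath so Actuator serves metrics at /actuator/prometheus.

```yaml
# Illustrative Prometheus scrape config for a Spring Boot service.
scrape_configs:
  - job_name: "orders-service"          # placeholder job name
    metrics_path: "/actuator/prometheus" # Actuator's Prometheus endpoint
    scrape_interval: 15s
    static_configs:
      - targets: ["orders-service:8080"] # placeholder host:port
```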
The last one matters more than people think. Technical metrics tell you the system is running. Business metrics tell you the system is doing what it is supposed to do. A system that returns 200 OK to every health check but processes zero orders is broken in a way that technical metrics will not catch.
Dashboards in Grafana
We build a standard dashboard for every project:
- Overview panel: request rate, error rate, P95 latency. Three numbers that tell you if the system is healthy at a glance.
- Resource panel: CPU, memory, disk, database connections. Shows if you are approaching a limit.
- Business panel: key business metrics. Different for every project, but always present.
The overview panel is what goes on the TV in the office. The other panels are for debugging.
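The P95 number on the overview panel comes from a histogram query. A typical version, assuming Micrometer's default http_server_requests metric name and that histogram buckets are enabled for it, looks like:

```promql
# P95 request latency over the last 5 minutes, across all endpoints.
histogram_quantile(
  0.95,
  sum(rate(http_server_requests_seconds_bucket[5m])) by (le)
)
```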
Alerting
We alert on symptoms, not causes. This is a distinction that took us a while to get right.
Bad alert: "CPU usage is above 80%." This fires during every deployment, every peak traffic hour, and every time the garbage collector runs. It trains people to ignore alerts.
Good alert: "Error rate has exceeded 1% for 5 minutes." This means something is actually broken for users. It fires rarely enough that when it does fire, people pay attention.
Our standard alerts:
- Error rate above threshold for sustained period
- P95 latency above threshold for sustained period
- Database connection pool utilization above 80%
- Disk usage above 85%
- Health check failures
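As a sketch, the first alert on that list can be expressed as a Prometheus alerting rule. Metric and label names below follow Micrometer's Spring Boot defaults, and the dashboard URL is a placeholder, not a real instance:

```yaml
# Illustrative symptom alert: error rate above 1% for 5 minutes.
groups:
  - name: symptom-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
            / sum(rate(http_server_requests_seconds_count[5m])) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 1% for 5 minutes"
          dashboard: "https://grafana.example.com/d/overview" # placeholder link
```

The `for: 5m` clause is what makes this a symptom alert rather than a noise generator: a brief error spike during a deployment never fires.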
Each alert includes a link to the relevant Grafana dashboard so the person responding can see the context immediately.
Structured logging
We log in JSON format with consistent fields: timestamp, level, request ID, user ID (if authenticated), and a human-readable message. Every log entry for a given request shares a correlation ID, so you can trace a single request across services.
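A single entry in that format looks something like this. Field names and values are illustrative, not our exact schema:

```json
{
  "timestamp": "2024-03-14T23:07:41.312Z",
  "level": "ERROR",
  "requestId": "7f3c2a9e-51d4-4b0a-9c1e-2f8d6b3e4a10",
  "correlationId": "7f3c2a9e-51d4-4b0a-9c1e-2f8d6b3e4a10",
  "userId": "u-482910",
  "message": "Timed out waiting for a database connection"
}
```

Filter your log store on the correlation ID and you get every line that request produced, in order, across every service it touched.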
This is not optional. When you are debugging a production issue at 11 PM, the difference between structured logs with correlation IDs and unstructured text output is the difference between a 10-minute diagnosis and a 2-hour search.
Why "later" never comes
Teams that plan to add monitoring later almost never do. There are three reasons:
First, once the system is running, all development effort goes to new features. Monitoring is never the most urgent feature request. It sits in the backlog forever.
Second, retrofitting monitoring is harder than building it in. Adding Prometheus metrics to a running service means touching every endpoint, every service class, every background job. It is a large, boring refactor that delivers no visible user-facing value. Nobody wants to do it.
Third, you do not know what normal looks like. If you add monitoring after six months of running without it, you have no baseline. Is 200ms P95 latency good or bad for this service? You do not know, because you never measured it.
What it costs
Adding monitoring to a project from day one adds roughly one to two days of work. That is the time to configure Actuator endpoints, set up Prometheus and Grafana, write the standard dashboard, and configure basic alerts.
Diagnosing a production issue without monitoring costs anywhere from hours to days. A single incident costs more than the entire monitoring setup.
The math is not close.
The principle
We treat monitoring the same way we treat testing and security: it is not a feature that gets prioritized against other features. It is part of what "done" means. A service without monitoring is not deployed — it is abandoned.
Every system we ship has dashboards, alerts, and structured logging on day one. Not because we expect problems on day one. Because when problems come — and they always come — we want to find them before our clients' customers do.
Need help building something like this?
We build production-grade systems. Let's talk about your project.
Start a Conversation →