Production Monitoring: Measuring What Actually Hurts
Part 7 explains how this Next.js 16 blog approaches production monitoring with health checks, Redis-backed product metrics, LLM cost tracking, bundle analysis, and clear limits around synthetic metrics.
Monitoring is where performance writing gets dishonest fast.
It is too easy to show a dashboard, point at a green graph, and pretend the system is understood. I have done that. Most engineers with a few years in production have done some version of it. The graph feels comforting right up until the incident channel explains what you forgot to measure.
For this blog, I wanted the clean story: Core Web Vitals, performance budgets, error tracking, maybe a satisfying "we caught a bad bundle before it shipped" anecdote.
The repo is not that neat.
It has real monitoring, but not the perfect kind. Some of it watches CMS health. Some of it watches Redis-backed product behavior. Some of it watches LLM cost, cache hit rate, routing, latency, moderation, and tool execution. Some of it is still unfinished and should not be dressed up as more mature than it is.
That became the actual lesson.
Production monitoring is not "collect all the metrics."
It is asking a colder question:
What can hurt this system, and would I notice before a user tells me?
Real Situation
This project is not just a static blog anymore.
The article pages are static-first, but the surrounding product has moving parts: Strapi content, Next.js rendering, Redis-backed quotas and metrics, Postgres-backed history, Qdrant-backed search, LLM routing, quizzes, comments, and NoteSensei interactions.
The failure modes are not all frontend failure modes.
A pure Core Web Vitals dashboard would miss a lot of what matters here.
The simplest monitoring boundary is the health check:
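The shape, in sketch form. The healthData/publicHealthData names come from the route itself; the helper functions and exact fields are illustrative, not copied from the repo:

```ts
// app/api/health/route.ts -- a sketch of the shape, not the repo's exact code.
import { NextResponse } from 'next/server';
// Hypothetical helpers, sketched under request-time monitoring below.
import { checkCmsHealth, hasCacheConfig } from '@/lib/health';

export async function GET() {
  const cmsUp = await checkCmsHealth();
  const cacheConfigured = hasCacheConfig();

  const healthData = {
    status: cmsUp ? 'healthy' : 'unhealthy',
    cms: cmsUp ? 'up' : 'down',
    cache: cacheConfigured ? 'configured' : 'degraded',
    database: 'up', // optimistic: this endpoint never actually pings Postgres
    timestamp: new Date().toISOString(),
  };

  // Outside the building: "healthy" or "unhealthy", nothing else.
  const publicHealthData = { status: healthData.status };

  return NextResponse.json(
    process.env.NODE_ENV === 'production' ? publicHealthData : healthData,
    {
      status: cmsUp ? 200 : 503,
      headers: { 'Cache-Control': 'no-store' }, // health must never be cached
    },
  );
}
```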
The split between healthData and publicHealthData matters. In production, the endpoint returns a minimal public payload. In non-production, it can expose more diagnostic detail.
That is the right instinct.
One caveat: database: 'up' is optimistic for the wider product because other routes do use Postgres. I would not treat this endpoint as full dependency health. It is a CMS/cache smoke check with a safe public surface.
Health endpoints are operational tools, but they are also public attack surface unless you lock them down. I want "healthy" or "unhealthy" outside the building, not a guided tour of internals.
The repo also wires in bundle analysis. That does not mean there is a mature CI performance gate. It means the project has the tool it needs to inspect bundle changes when something smells wrong.
I would rather say that plainly than invent a story about a 50 KB import being caught by automation.
Tension
The hardest tension is precision.
Engineers like precise numbers. Product people like precise numbers. Dashboards reward precise numbers.
But a precise-looking number can still be fake.
This part of the admin metrics route is a good example:
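In sketch form. The field names match the route; the data sources, the multiplier, and the flat constant are illustrative stand-ins for the synthetic parts:

```ts
// A sketch of the synthetic-precision problem, not the route verbatim.
const totalLatencyMs = await getTrackedLatencyTotal(); // hypothetical Redis total
const totalResponses = await countResponses();         // hypothetical DB count

const avgLatencyMs =
  totalResponses > 0 ? Math.round(totalLatencyMs / totalResponses) : 0;

return {
  totalResponses,  // real: counted from the database
  avgLatencyMs,    // real: grounded in tracked totals
  p95LatencyMs: Math.round(avgLatencyMs * 1.5), // fake: a multiplier, not a percentile
  errorRate: 0.01,                              // fake: a guess, not a measurement
};
```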
avgLatencyMs is grounded in tracked totals. totalResponses is counted from the database. But p95LatencyMs is not a real percentile. errorRate is not a measured error rate.
That is not a reason to delete the whole route. It is a reason to be honest about what the route can and cannot support.
I would use this for a rough admin overview.
I would not page someone from that p95LatencyMs.
Mistake
The mistake was assuming performance monitoring starts with browser metrics.
For a static-first blog, browser metrics matter. But for this system, the nastier regressions are often operational:
LLM routing silently moves too much traffic to an expensive provider
cache misses climb and every question becomes a fresh generation
Redis is down and quota/metrics behavior changes
CMS health is degraded but the public route hides it until the next publish
moderation starts blocking legitimate questions
Those are not Lighthouse-only problems.
They are product behavior problems.
Insight
The useful monitoring model here has three loops.
First: build-time monitoring.
Bundle analyzer is not glamorous, but it gives you a way to inspect what changed when a page suddenly feels heavier. The important part is not running it every day. The important part is having a cheap, repeatable command when suspicion appears.
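The wiring is the standard @next/bundle-analyzer setup; this sketch is assumed to match the repo in spirit, not copied from it:

```ts
// next.config.ts -- standard @next/bundle-analyzer wiring.
import bundleAnalyzer from '@next/bundle-analyzer';

const withBundleAnalyzer = bundleAnalyzer({
  enabled: process.env.ANALYZE === 'true', // off unless explicitly requested
});

export default withBundleAnalyzer({
  /* the rest of the Next.js config */
});
```

Then `ANALYZE=true next build` when suspicion appears. The point is that the command exists and is cheap, not that it runs on every commit.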
Second: request-time monitoring.
The health check asks whether the system can reach the CMS and whether cache support is configured:
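Roughly like this, assuming the hypothetical helper module from the route sketch above. The env variable names are assumptions; Strapi does ship a /_health endpoint that responds when the CMS is up:

```ts
// lib/health.ts -- hypothetical helpers behind the health route.
export async function checkCmsHealth(): Promise<boolean> {
  if (!process.env.STRAPI_URL) return false; // missing CMS config counts as down
  try {
    const res = await fetch(`${process.env.STRAPI_URL}/_health`, {
      cache: 'no-store',
    });
    return res.ok;
  } catch {
    return false; // network failure: the CMS is unreachable
  }
}

export function hasCacheConfig(): boolean {
  // Degraded cache is visible in the payload but never forces a 503 on its own.
  return Boolean(process.env.REDIS_URL);
}
```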
It is intentionally simple. If the CMS is down, the system is unhealthy. If cache is degraded, that is visible but not always fatal. That distinction matters because not every dependency deserves the same blast radius.
Third: product-time monitoring.
This is where NoteSensei metrics matter more than generic web performance metrics:
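A minimal sketch of the pattern, assuming an ioredis client; the key name and the five-minute window are illustrative, not the repo's values:

```ts
// lib/metrics/activity.ts -- a sketch assuming ioredis.
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL!);
const KEY = 'notesensei:active';  // hypothetical key
const WINDOW_MS = 5 * 60 * 1000;  // "active" = seen in the last five minutes

export async function trackActivity(userId?: string): Promise<void> {
  if (!userId) return; // no identifier, no write: tracking is a no-op
  try {
    // Score members by timestamp so stale entries can be trimmed by age.
    await redis.zadd(KEY, Date.now(), userId);
  } catch (err) {
    // Metrics writes must never break the user flow: log and swallow.
    console.error('activity tracking failed', err);
  }
}

export async function getActiveCount(): Promise<number> {
  const cutoff = Date.now() - WINDOW_MS;
  await redis.zremrangebyscore(KEY, 0, cutoff); // clean stale entries before reading
  return redis.zcard(KEY);
}
```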
That is not fancy observability. It is a small, correct mechanism for a small, specific question: how active is the system right now?
Learning Moment
The learning moment was that monitoring code needs tests as much as business logic does.
The tests do not prove the dashboard is beautiful. They prove operational behavior:
health returns 503 when CMS is missing or down
production health payload stays minimal
HEAD uses the health check path
realtime metrics clean stale Redis entries before reading
activity tracking no-ops without an identifier
tracking failures are logged but swallowed
admin metrics aggregate usage, cache, routing, leaderboard, and latency fields
admin metrics return 401 for non-admin sessions
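As one concrete example, the 503 case in sketch form, assuming Vitest; the mock targets the hypothetical helper module from the sketches above:

```ts
// A sketch of the 503 test, assuming Vitest.
import { describe, expect, it, vi } from 'vitest';

vi.mock('@/lib/health', () => ({
  checkCmsHealth: vi.fn().mockResolvedValue(false), // CMS is down
  hasCacheConfig: vi.fn().mockReturnValue(true),
}));

import { GET } from '@/app/api/health/route';

describe('GET /api/health', () => {
  it('returns 503 when the CMS is down', async () => {
    const res = await GET();
    expect(res.status).toBe(503);
  });
});
```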
That is the right shape of confidence.
Not "we have observability."
"The few monitoring paths we do have are hard to accidentally break."
Principle
My rule now is:
Measure the pain path, not the fashionable metric.
Core Web Vitals still matter. RUM still matters. Sentry or another proper error tracker still matters. A real p95 needs real latency samples, not a multiplier. A performance budget should eventually fail CI, not live as a note in a document.
Those are gaps.
But gaps are easier to fix when the current system is described honestly.
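One of those gaps is mechanical to close. Once real latency samples exist, a real p95 is a few lines; a nearest-rank sketch:

```ts
// Nearest-rank p95 over real samples -- the fix for the multiplier, in sketch form.
function p95(samples: number[]): number {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil(0.95 * sorted.length); // 1-based nearest rank
  return sorted[rank - 1];
}
```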
For this repo, the monitoring priorities are:
keep health checks small, uncached, and safe to expose
use bundle analysis as an inspection tool before pretending there is a full performance gate
let metrics writes fail without breaking user flows
redact logs so observability does not become data leakage
never put synthetic precision on a chart and call it production truth
The last point is the one I care about most.
A bad metric is worse than no metric because it buys confidence you did not earn.
I would rather have five honest counters than a polished dashboard full of guesses.
That is the uncomfortable, useful version of production monitoring: not the biggest screen in the room, just the shortest path from "something feels wrong" to "this is what changed."