Production Monitoring: Measuring What Actually Hurts
Part 7 explains how this Next.js 16 blog approaches production monitoring with health checks, Redis-backed product metrics, LLM cost tracking, bundle analysis, and clear limits around synthetic metrics.
Monitoring is where performance writing gets dishonest fast.
It is too easy to show a dashboard, point at a green graph, and pretend the system is understood. I have done that. Most engineers with a few years in production have done some version of it. The graph feels comforting right up until the incident channel explains what you forgot to measure.
For this blog, I wanted the clean story: Core Web Vitals, performance budgets, error tracking, maybe a satisfying "we caught a bad bundle before it shipped" anecdote.
The repo is not that neat.
It has real monitoring, but not the perfect kind. Some of it watches CMS health. Some of it watches Redis-backed product behavior. Some of it watches LLM cost, cache hit rate, routing, latency, moderation, and tool execution. Some of it is still unfinished and should not be dressed up as more mature than it is.
That became the actual lesson.
Production monitoring is not "collect all the metrics."
It is asking a colder question:
What can hurt this system, and would I notice before a user tells me?
Real Situation
This project is not just a static blog anymore.
The article pages are static-first, but the surrounding product has moving parts: Strapi content, Next.js rendering, Redis-backed quotas and metrics, Postgres-backed history, Qdrant-backed search, LLM routing, quizzes, comments, and NoteSensei interactions.
The failure modes are not all frontend failure modes.
A pure Core Web Vitals dashboard would miss a lot of what matters here.
The simplest monitoring boundary is the health check:
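The shape, in sketch form. The healthData/publicHealthData names come from the route itself; the helper functions and exact fields are illustrative, not copied from the repo:

```ts
// app/api/health/route.ts -- a sketch of the shape, not the repo's exact code.
import { NextResponse } from 'next/server';
// Hypothetical helpers, sketched under request-time monitoring below.
import { checkCmsHealth, hasCacheConfig } from '@/lib/health';

export async function GET() {
  const cmsUp = await checkCmsHealth();
  const cacheConfigured = hasCacheConfig();

  const healthData = {
    status: cmsUp ? 'healthy' : 'unhealthy',
    cms: cmsUp ? 'up' : 'down',
    cache: cacheConfigured ? 'configured' : 'degraded',
    database: 'up', // optimistic: this endpoint never actually pings Postgres
    timestamp: new Date().toISOString(),
  };

  // Outside the building: "healthy" or "unhealthy", nothing else.
  const publicHealthData = { status: healthData.status };

  return NextResponse.json(
    process.env.NODE_ENV === 'production' ? publicHealthData : healthData,
    {
      status: cmsUp ? 200 : 503,
      headers: { 'Cache-Control': 'no-store' }, // health must never be cached
    },
  );
}
```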
The split between healthData and publicHealthData matters. In production, the endpoint returns a minimal public payload. In non-production, it can expose more diagnostic detail.
That is the right instinct.
One caveat: database: 'up' is optimistic for the wider product because other routes do use Postgres. I would not treat this endpoint as full dependency health. It is a CMS/cache smoke check with a safe public surface.
Health endpoints are operational tools, but they are also public attack surface unless you lock them down. I want "healthy" or "unhealthy" outside the building, not a guided tour of internals.
The repo also wires in bundle analysis. That does not mean there is a mature CI performance gate. It means the project has the tool it needs to inspect bundle changes when something smells wrong.
I would rather say that plainly than invent a story about a 50 KB import being caught by automation.
Tension
The hardest tension is precision.
Engineers like precise numbers. Product people like precise numbers. Dashboards reward precise numbers.
But a precise-looking number can still be fake.
This part of the admin metrics route is a good example:
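In sketch form. The field names match the route; the data sources, the multiplier, and the flat constant are illustrative stand-ins for the synthetic parts:

```ts
// A sketch of the synthetic-precision problem, not the route verbatim.
const totalLatencyMs = await getTrackedLatencyTotal(); // hypothetical Redis total
const totalResponses = await countResponses();         // hypothetical DB count

const avgLatencyMs =
  totalResponses > 0 ? Math.round(totalLatencyMs / totalResponses) : 0;

return {
  totalResponses,  // real: counted from the database
  avgLatencyMs,    // real: grounded in tracked totals
  p95LatencyMs: Math.round(avgLatencyMs * 1.5), // fake: a multiplier, not a percentile
  errorRate: 0.01,                              // fake: a guess, not a measurement
};
```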
avgLatencyMs is grounded in tracked totals. totalResponses is counted from the database. But p95LatencyMs is not a real percentile. errorRate is not a measured error rate.
That is not a reason to delete the whole route. It is a reason to be honest about what the route can and cannot support.
I would use this for a rough admin overview.
I would not page someone from that p95LatencyMs.
Mistake
The mistake was assuming performance monitoring starts with browser metrics.
For a static-first blog, browser metrics matter. But for this system, the nastier regressions are often operational:
LLM routing silently moves too much traffic to an expensive provider
cache misses climb and every question becomes a fresh generation
Redis is down and quota/metrics behavior changes
CMS health is degraded but the public route hides it until the next publish
moderation starts blocking legitimate questions
Those are not Lighthouse-only problems.
They are product behavior problems.
Insight
The useful monitoring model here has three loops.
First: build-time monitoring.
Bundle analyzer is not glamorous, but it gives you a way to inspect what changed when a page suddenly feels heavier. The important part is not running it every day. The important part is having a cheap, repeatable command when suspicion appears.
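The wiring is the standard @next/bundle-analyzer setup; this sketch is assumed to match the repo in spirit, not copied from it:

```ts
// next.config.ts -- standard @next/bundle-analyzer wiring.
import bundleAnalyzer from '@next/bundle-analyzer';

const withBundleAnalyzer = bundleAnalyzer({
  enabled: process.env.ANALYZE === 'true', // off unless explicitly requested
});

export default withBundleAnalyzer({
  /* the rest of the Next.js config */
});
```

Then `ANALYZE=true next build` when suspicion appears. The point is that the command exists and is cheap, not that it runs on every commit.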
Second: request-time monitoring.
The health check asks whether the system can reach the CMS and whether cache support is configured:
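Roughly like this, assuming the hypothetical helper module from the route sketch above. The env variable names are assumptions; Strapi does ship a /_health endpoint that responds when the CMS is up:

```ts
// lib/health.ts -- hypothetical helpers behind the health route.
export async function checkCmsHealth(): Promise<boolean> {
  if (!process.env.STRAPI_URL) return false; // missing CMS config counts as down
  try {
    const res = await fetch(`${process.env.STRAPI_URL}/_health`, {
      cache: 'no-store',
    });
    return res.ok;
  } catch {
    return false; // network failure: the CMS is unreachable
  }
}

export function hasCacheConfig(): boolean {
  // Degraded cache is visible in the payload but never forces a 503 on its own.
  return Boolean(process.env.REDIS_URL);
}
```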
It is intentionally simple. If the CMS is down, the system is unhealthy. If cache is degraded, that is visible but not always fatal. That distinction matters because not every dependency deserves the same blast radius.
Third: product-time monitoring.
This is where NoteSensei metrics matter more than generic web performance metrics:
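A minimal sketch of the pattern, assuming an ioredis client; the key name and the five-minute window are illustrative, not the repo's values:

```ts
// lib/metrics/activity.ts -- a sketch assuming ioredis.
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL!);
const KEY = 'notesensei:active';  // hypothetical key
const WINDOW_MS = 5 * 60 * 1000;  // "active" = seen in the last five minutes

export async function trackActivity(userId?: string): Promise<void> {
  if (!userId) return; // no identifier, no write: tracking is a no-op
  try {
    // Score members by timestamp so stale entries can be trimmed by age.
    await redis.zadd(KEY, Date.now(), userId);
  } catch (err) {
    // Metrics writes must never break the user flow: log and swallow.
    console.error('activity tracking failed', err);
  }
}

export async function getActiveCount(): Promise<number> {
  const cutoff = Date.now() - WINDOW_MS;
  await redis.zremrangebyscore(KEY, 0, cutoff); // clean stale entries before reading
  return redis.zcard(KEY);
}
```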
That is not fancy observability. It is a small, correct mechanism for a small, specific question: how active is the system right now?
Learning Moment
The learning moment was that monitoring code needs tests as much as business logic does.
The tests do not prove the dashboard is beautiful. They prove operational behavior:
health returns 503 when CMS is missing or down
production health payload stays minimal
HEAD uses the health check path
realtime metrics clean stale Redis entries before reading
activity tracking no-ops without an identifier
tracking failures are logged but swallowed
admin metrics aggregate usage, cache, routing, leaderboard, and latency fields
admin metrics return 401 for non-admin sessions
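As one concrete example, the 503 case in sketch form, assuming Vitest; the mock targets the hypothetical helper module from the sketches above:

```ts
// A sketch of the 503 test, assuming Vitest.
import { describe, expect, it, vi } from 'vitest';

vi.mock('@/lib/health', () => ({
  checkCmsHealth: vi.fn().mockResolvedValue(false), // CMS is down
  hasCacheConfig: vi.fn().mockReturnValue(true),
}));

import { GET } from '@/app/api/health/route';

describe('GET /api/health', () => {
  it('returns 503 when the CMS is down', async () => {
    const res = await GET();
    expect(res.status).toBe(503);
  });
});
```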
That is the right shape of confidence.
Not "we have observability."
"The few monitoring paths we do have are hard to accidentally break."
Principle
My rule now is:
Measure the pain path, not the fashionable metric.
Core Web Vitals still matter. RUM still matters. Sentry or another proper error tracker still matters. A real p95 needs real latency samples, not a multiplier. A performance budget should eventually fail CI, not live as a note in a document.
Those are gaps.
But gaps are easier to fix when the current system is described honestly.
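One of those gaps is mechanical to close. Once real latency samples exist, a real p95 is a few lines; a nearest-rank sketch:

```ts
// Nearest-rank p95 over real samples -- the fix for the multiplier, in sketch form.
function p95(samples: number[]): number {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil(0.95 * sorted.length); // 1-based nearest rank
  return sorted[rank - 1];
}
```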
For this repo, the monitoring priorities are:
keep health checks small, uncached, and safe to expose
use bundle analysis as an inspection tool before pretending there is a full performance gate
let metrics writes fail without breaking user flows
redact logs so observability does not become data leakage
never put synthetic precision on a chart and call it production truth
The last point is the one I care about most.
A bad metric is worse than no metric because it buys confidence you did not earn.
I would rather have five honest counters than a polished dashboard full of guesses.
That is the uncomfortable, useful version of production monitoring: not the biggest screen in the room, just the shortest path from "something feels wrong" to "this is what changed."