Prometheus collects metrics from your application by scraping a /metrics HTTP endpoint. Grafana visualizes those metrics in dashboards and fires alerts when thresholds are crossed. Together they are the most common open-source monitoring stack for production applications, and understanding them will make you a better operator of any backend system.
What You Are Trying to Monitor
Before setting up tools, know what you are measuring. The RED method defines the three signals that matter most for every service:
Rate: how many requests per second is your service handling? A sudden drop is as alarming as a sudden spike.
Errors: what percentage of requests are failing? Track 4xx and 5xx separately: 4xx usually point to client mistakes, while 5xx are failures on your side.
Duration: how long are requests taking? Track percentiles, not averages. P50 (median), P95, and P99 tell you what most users experience and what the worst-case experience is.
For infrastructure-level monitoring (not covered by RED), track CPU utilization, memory usage, disk I/O, and network throughput.
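To make the three RED signals concrete, here is a minimal sketch that computes them from an in-memory list of request samples. The names (RequestSample, summarizeRed) and the sample data are hypothetical, and the percentile uses the simple nearest-rank method; Prometheus itself estimates percentiles from histogram buckets rather than raw samples.

```typescript
// One observed request: its HTTP status and how long it took.
interface RequestSample {
  statusCode: number
  durationSeconds: number
}

interface RedSummary {
  ratePerSecond: number
  clientErrorRatio: number // share of 4xx responses
  serverErrorRatio: number // share of 5xx responses
  p50: number
  p95: number
  p99: number
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) return NaN
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.min(sorted.length, Math.max(1, rank)) - 1]
}

function summarizeRed(samples: RequestSample[], windowSeconds: number): RedSummary {
  const total = samples.length
  const countInRange = (low: number, high: number) =>
    samples.filter((s) => s.statusCode >= low && s.statusCode < high).length
  const durations = samples.map((s) => s.durationSeconds).sort((a, b) => a - b)
  return {
    ratePerSecond: total / windowSeconds,
    clientErrorRatio: total === 0 ? 0 : countInRange(400, 500) / total,
    serverErrorRatio: total === 0 ? 0 : countInRange(500, 600) / total,
    p50: percentile(durations, 50),
    p95: percentile(durations, 95),
    p99: percentile(durations, 99),
  }
}

// Hypothetical data: ten requests observed over a five-second window.
const samples: RequestSample[] = [
  ...Array.from({ length: 8 }, () => ({ statusCode: 200, durationSeconds: 0.05 })),
  { statusCode: 404, durationSeconds: 0.1 },
  { statusCode: 500, durationSeconds: 1.5 },
]
const summary = summarizeRed(samples, 5)
```

With this data the median is 0.05 s while P95 is 1.5 s; the mean (0.2 s) describes neither the typical nor the worst-case request, which is why percentiles are the better signal.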
What Prometheus Is
Prometheus is a time-series database and metric collection system. It works on a pull model: you configure Prometheus with a list of targets (your app instances' /metrics endpoints), and Prometheus scrapes those endpoints on a regular interval (typically every 15-30 seconds) and stores the metrics.
This pull model has an important implication: your application does not need to know about your monitoring system. It exposes a /metrics endpoint, and Prometheus scrapes it once the instance is listed as a target. Adding a new metric to your app does not require any coordination with the Prometheus server.
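What Prometheus pulls from /metrics is plain text. The sketch below renders a single counter by hand in that text format, only to show what travels over the wire; in a real application a client library generates this for you. The function names and the sample values are hypothetical.

```typescript
type Labels = Record<string, string>

interface CounterSample {
  labels: Labels
  value: number
}

// Label values must escape backslash, double quote, and newline.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n')
}

// Renders one counter family: a HELP line, a TYPE line, then one line per series.
function renderCounter(name: string, help: string, samples: CounterSample[]): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`]
  for (const { labels, value } of samples) {
    const pairs = Object.entries(labels)
      .map(([key, labelValue]) => `${key}="${escapeLabelValue(labelValue)}"`)
      .join(',')
    lines.push(pairs ? `${name}{${pairs}} ${value}` : `${name} ${value}`)
  }
  return lines.join('\n') + '\n'
}

const payload = renderCounter('http_requests_total', 'Total number of HTTP requests', [
  { labels: { method: 'GET', route: '/users', status_code: '200' }, value: 42 },
])
// payload is:
// # HELP http_requests_total Total number of HTTP requests
// # TYPE http_requests_total counter
// http_requests_total{method="GET",route="/users",status_code="200"} 42
```

Because the payload is just the current value of each series, the application keeps no history; Prometheus builds the time series by scraping the same endpoint repeatedly.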
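Prometheus stores data in its own time-series database on local disk. Its data model is multi-dimensional: every unique combination of metric name and label values is a separate time series, and the database is optimized for fast aggregation queries across those series over time ranges. Keep label values bounded, though. Labels with unbounded values (user IDs, request IDs, raw URLs) create a new series for every value, and very high cardinality is a common cause of memory and performance problems in Prometheus.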
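Since every unique combination of label values is stored as its own time series, the number of series a metric can produce grows multiplicatively with its labels. A rough sketch of that arithmetic follows; the function name and the figures are hypothetical, and the result is an upper bound because not every combination occurs in practice.

```typescript
// Upper bound on series for one metric: the product of the number of
// distinct values each label can take.
function worstCaseSeriesCount(labelValueCounts: Record<string, number>): number {
  return Object.values(labelValueCounts).reduce((product, n) => product * n, 1)
}

// Bounded labels: 5 * 20 * 10 = 1,000 series at most.
const bounded = worstCaseSeriesCount({ method: 5, route: 20, status_code: 10 })

// Adding a label with 50,000 possible values multiplies that to 50,000,000.
const withUserId = worstCaseSeriesCount({
  method: 5,
  route: 20,
  status_code: 10,
  user_id: 50000,
})
```

This is why labels such as method, route, and status code work well, while per-user or per-request identifiers belong in logs or traces instead.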
Instrumenting a Node.js Application
The prom-client library is the standard Prometheus client for Node.js:
pnpm add prom-client
Set up default metrics (Node.js process metrics: CPU, memory, event loop lag, garbage collection), a request-duration histogram, and a request counter, then expose them on /metrics:
import express from 'express'
import { collectDefaultMetrics, Counter, Histogram, register } from 'prom-client'

// Assumes an Express app; reuse your existing instance if you have one
const app = express()

// Node.js process metrics: CPU, memory, event loop lag, garbage collection
collectDefaultMetrics()

// Request latency histogram; the buckets are upper bounds in seconds
export const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
})

// Monotonically increasing count of handled requests
export const httpRequestTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
})

// Expose the /metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType)
  res.send(await register.metrics())
})
Instrument each request in middleware, registered before your route handlers so that every request passes through it:
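app.use((req, res, next) => {
  // Start timing as soon as the request enters the middleware chain
  const end = httpRequestDuration.startTimer()
  res.on('finish', () => {
    const labels = {
      method: req.method,
      // Use the matched route pattern (e.g. /users/:id), never the raw path:
      // raw paths from unmatched requests would create unbounded label values
      route: req.route?.path ?? 'unmatched',
      status_code: res.statusCode,
    }
    end(labels) // records the elapsed seconds in the histogram
    httpRequestTotal.inc(labels)
  })
  next()
})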
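To show what the histogram above records, here is a minimal sketch of a cumulative-bucket histogram. It is a simplified stand-in written for illustration, not prom-client's implementation; SketchHistogram and its methods are hypothetical names.

```typescript
class SketchHistogram {
  private readonly upperBounds: number[]
  private readonly bucketCounts: number[]
  private sum = 0
  private count = 0

  // upperBounds must be sorted ascending, like the buckets array above.
  constructor(upperBounds: number[]) {
    this.upperBounds = upperBounds
    this.bucketCounts = upperBounds.map(() => 0)
  }

  observe(seconds: number): void {
    // Buckets are cumulative: an observation is counted in every bucket
    // whose upper bound ("le") is greater than or equal to the value.
    this.upperBounds.forEach((le, i) => {
      if (seconds <= le) this.bucketCounts[i] += 1
    })
    this.sum += seconds
    this.count += 1
  }

  // Returns a function that records the elapsed time when it is called.
  startTimer(): () => number {
    const start = performance.now()
    return () => {
      const seconds = (performance.now() - start) / 1000
      this.observe(seconds)
      return seconds
    }
  }

  snapshot() {
    return {
      buckets: [
        ...this.upperBounds.map((le, i) => ({ le: String(le), count: this.bucketCounts[i] })),
        { le: '+Inf', count: this.count }, // the implicit catch-all bucket
      ],
      sum: this.sum,
      count: this.count,
    }
  }
}

const latency = new SketchHistogram([0.1, 0.5, 1])
latency.observe(0.05)
latency.observe(0.3)
latency.observe(2)
const latencySnapshot = latency.snapshot()
```

After these three observations the cumulative counts are 1 (le 0.1), 2 (le 0.5), 2 (le 1), and 3 (+Inf). Prometheus only ever sees these counts, not the individual durations, so choose bucket bounds that bracket the latencies you care about.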
Instrumenting a Next.js Application
Next.js does not have a traditional Express middleware layer, but you can add Prometheus metrics to Next.js API routes using the same prom-client library. Create a /api/metrics route that returns the Prometheus exposition format, and add timing logic to individual routes or to a shared wrapper function.
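For App Router, the instrumentation.ts file (Next.js's built-in instrumentation hook) is the right place to initialize Prometheus collectors: its exported register function runs once when the server starts. prom-client relies on Node.js APIs, so the routes that use it need the Node.js runtime rather than the Edge runtime.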
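The shared wrapper mentioned above might look like the following sketch. It uses only the standard Request and Response types, and takes the recording function as a parameter so the example stays self-contained; withMetrics, RecordRequest, and the route path are hypothetical names. In a real app the recorder would call observe and inc on your prom-client metrics, and GET would be exported from the route file.

```typescript
interface RequestLabels {
  method: string
  route: string
  status_code: number
}

// Hypothetical recorder: in a real app this would call
// histogram.observe(labels, seconds) and counter.inc(labels).
type RecordRequest = (labels: RequestLabels, seconds: number) => void

type RouteHandler = (request: Request) => Promise<Response>

function withMetrics(route: string, handler: RouteHandler, record: RecordRequest): RouteHandler {
  return async (request) => {
    const start = performance.now()
    let statusCode = 500 // assumed status if the handler throws
    try {
      const response = await handler(request)
      statusCode = response.status
      return response
    } finally {
      // Runs on success and on failure, so every request is recorded once.
      const seconds = (performance.now() - start) / 1000
      record({ method: request.method, route, status_code: statusCode }, seconds)
    }
  }
}

// Usage, with an in-memory array standing in for real metrics.
const recorded: RequestLabels[] = []
const GET = withMetrics(
  '/api/users',
  async () =>
    new Response(JSON.stringify({ users: [] }), {
      status: 200,
      headers: { 'Content-Type': 'application/json' },
    }),
  (labels) => {
    recorded.push(labels)
  },
)
```

Passing the route as a fixed string keeps the label value bounded, which a raw request URL would not.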
What Grafana Is
Grafana is a visualization platform that connects to data sources (Prometheus, Loki for logs, Tempo for traces, and many others) and lets you build dashboards. Dashboards consist of panels: graphs, stat displays, gauges, tables, and more.
Grafana queries Prometheus using PromQL, Prometheus's own query language. Example: the total per-second request rate, averaged over the last 5 minutes:
sum(rate(http_requests_total[5m]))
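P95 latency by route (the buckets are summed per route first, keeping the le label that histogram_quantile needs):
histogram_quantile(0.95, sum by (le, route) (rate(http_request_duration_seconds_bucket[5m])))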
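Error rate (the share of requests that returned a 5xx; both sides are summed so the division compares totals rather than matching series label by label):
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))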
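The P95 query above works from bucket counts, not raw durations, so the result is an estimate. The sketch below follows the general approach of histogram_quantile, linear interpolation inside the bucket that contains the target rank, but it is a simplified illustration rather than Prometheus's exact implementation. The function name and the bucket data are hypothetical.

```typescript
interface Bucket {
  le: number // upper bound in seconds; Infinity for the +Inf bucket
  count: number // cumulative count of observations <= le
}

function estimateQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0 || q > 1) return NaN
  const sorted = [...buckets].sort((a, b) => a.le - b.le)
  const total = sorted[sorted.length - 1]?.count ?? 0
  if (sorted.length < 2 || total === 0) return NaN

  const rank = q * total
  const i = sorted.findIndex((b) => b.count >= rank)
  // The rank falls in the +Inf bucket: the best available answer is the
  // highest finite upper bound.
  if (i === sorted.length - 1) return sorted[sorted.length - 2].le

  const lower = i === 0 ? 0 : sorted[i - 1].le
  const below = i === 0 ? 0 : sorted[i - 1].count
  const inBucket = sorted[i].count - below
  if (inBucket === 0) return lower
  // Assume observations are spread evenly inside the bucket.
  return lower + (sorted[i].le - lower) * ((rank - below) / inBucket)
}

// Hypothetical cumulative counts for 100 requests.
const requestBuckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 98 },
  { le: Infinity, count: 100 },
]
const p95Estimate = estimateQuantile(0.95, requestBuckets)
```

Here rank 95 lands in the (0.5, 1] bucket, which holds 8 observations, so the estimate is 0.5 + 0.5 × 5/8 = 0.8125 seconds. The true value could be anywhere in that bucket, which is why bucket boundaries determine how precise your percentiles can be.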
Grafana's dashboard builder is visual: you write PromQL queries in the panel editor and see the graph render live. Dashboards can be exported as JSON and committed to version control.
Alerting
Grafana Alerting lets you define rules that fire when a metric crosses a threshold. The rule evaluates a PromQL query on a schedule and sends notifications via Slack, email, PagerDuty, or webhooks.
Principles for good alerting:
Alert on symptoms, not causes. Alert when error rate is high (symptom), not when CPU is high (cause). High CPU does not always mean users are affected. High error rate always means users are affected.
Set meaningful thresholds. "Error rate > 1% for 5 minutes" is a meaningful alert. "Any error ever" is noise. "CPU > 80%" by itself is noise.
Alert on what you would wake up for. If an alert fires and you look at it and decide nothing needs to be done, the alert should not exist. Alert fatigue kills monitoring programs.
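A threshold such as "error rate > 1% for 5 minutes" has two parts: the condition and how long it must hold before anyone is notified. The sketch below models that as a small state machine. It illustrates the idea and is not Grafana's implementation; the class name is hypothetical, and the state names are borrowed from Prometheus alerting rules.

```typescript
type AlertState = 'inactive' | 'pending' | 'firing'

interface AlertRule {
  threshold: number // e.g. 0.01 for "error rate > 1%"
  forSeconds: number // e.g. 300 for "for 5 minutes"
}

class AlertEvaluator {
  private readonly rule: AlertRule
  private breachedSince: number | null = null

  constructor(rule: AlertRule) {
    this.rule = rule
  }

  // Call once per evaluation interval with the latest query result.
  evaluate(value: number, nowSeconds: number): AlertState {
    if (value <= this.rule.threshold) {
      // Condition cleared: reset, so a later breach starts a fresh timer.
      this.breachedSince = null
      return 'inactive'
    }
    if (this.breachedSince === null) this.breachedSince = nowSeconds
    const breachedFor = nowSeconds - this.breachedSince
    return breachedFor >= this.rule.forSeconds ? 'firing' : 'pending'
  }
}

const errorRateAlert = new AlertEvaluator({ threshold: 0.01, forSeconds: 300 })
const states = [
  errorRateAlert.evaluate(0.02, 0), // breach starts
  errorRateAlert.evaluate(0.03, 60), // still breaching, not long enough
  errorRateAlert.evaluate(0.005, 120), // recovered, timer resets
  errorRateAlert.evaluate(0.02, 180), // new breach starts
  errorRateAlert.evaluate(0.02, 480), // breached for 300 s
]
```

Only the last evaluation is 'firing'. The waiting period is what filters out short blips, so a brief spike that recovers on its own never pages anyone.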
Hosted Monitoring Alternatives
Grafana Cloud: hosted Prometheus + Grafana, free tier (10,000 metric series, 50GB logs, 14 days retention). The easiest way to run this stack without self-hosting.
Datadog: the most comprehensive commercial monitoring platform. APM, metrics, logs, traces, synthetics, security — all in one. Expensive ($15+/host/month) but the best-in-class experience for organizations that can afford it.
New Relic: similar to Datadog, competitive on pricing for certain tiers.
Better Uptime / UptimeRobot: simpler uptime monitoring (HTTP ping checks, status pages). Not a replacement for Prometheus but solves the "is my site up?" problem for $0.
When Self-Hosted Monitoring Makes Sense
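Self-hosted Prometheus + Grafana on the same VPS as your application adds no subscription cost and gives you full control over retention and data privacy. For small teams on a budget, this is the pragmatic choice. The trade-off is that an outage on that VPS takes your monitoring down with it, so pair it with an external uptime check.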
Managed monitoring (Grafana Cloud, Datadog) makes sense when: your team does not want to manage infrastructure, you need long-term metric retention, or the time saved on operations is worth the monthly cost.