Distributed monitoring system
This contains notes from research on Prometheus. Google Monarch has a completely different design; see this article for more details.
Architecture
Prometheus server
- Runs on a single node; in K8S it is deployed with replicas=1.
- No native support for HA. (Ref) But you can run two independent instances with the same configuration to achieve HA; AlertManager will deduplicate the duplicate alerts.
- Pull model: scrapes metrics at a configurable interval (see the sketch after this list).
- Supports federation: instance A can scrape metrics from instance B.
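A minimal sketch of the pull model from the application side, using the Go client library (client_golang): the application only exposes a /metrics endpoint, and the Prometheus server scrapes it on its own schedule. The port :2112 and the metric name myapp_requests_total are arbitrary choices for illustration, not anything from the original notes.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// A counter the application increments; Prometheus pulls its current value
// on every scrape instead of the application pushing samples anywhere.
var requestsTotal = promauto.NewCounter(prometheus.CounterOpts{
	Name: "myapp_requests_total", // hypothetical metric name
	Help: "Total number of handled requests.",
})

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		requestsTotal.Inc()
		w.Write([]byte("ok"))
	})
	// Expose /metrics for the Prometheus server to scrape at its
	// configured scrape_interval.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil)) // port is an arbitrary choice
}
```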
PushGateway
- Metrics from short-lived jobs can be pushed to the PushGateway (see the sketch after this list).
- Metrics are stored on disk. In K8S, we could enable a PV for persistence. (Ref)
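A minimal sketch of pushing from a short-lived batch job using client_golang's push package. The PushGateway address http://pushgateway:9091, the job name db_backup, and the metric name are assumptions for illustration.

```go
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	// Record when the batch job finished, then push once before exiting;
	// the Prometheus server later scrapes this value from the PushGateway.
	completionTime := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "db_backup_last_completion_timestamp_seconds", // hypothetical metric
		Help: "Timestamp of the last successful DB backup.",
	})
	completionTime.SetToCurrentTime()

	// Address and job name are assumed values for this sketch.
	if err := push.New("http://pushgateway:9091", "db_backup").
		Collector(completionTime).
		Push(); err != nil {
		log.Fatal("could not push to PushGateway: ", err)
	}
}
```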
AlertManager
- The Prometheus server pushes alerts to AlertManager (see the sketch after this list).
- Supports HA with native clustering. (Ref)
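To illustrate the push direction only: normally the Prometheus server builds alerts from its alerting rules and posts them to AlertManager's HTTP API itself. This sketch posts a hand-built alert to the v2 alerts endpoint; the address http://alertmanager:9093 and the alert labels are assumptions.

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// postableAlert mirrors the shape of an alert accepted by AlertManager's
// v2 API; normally the Prometheus server constructs these, not user code.
type postableAlert struct {
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
	StartsAt    time.Time         `json:"startsAt"`
}

func main() {
	alerts := []postableAlert{{
		Labels:      map[string]string{"alertname": "HighErrorRate", "severity": "page"},
		Annotations: map[string]string{"summary": "error rate above threshold"},
		StartsAt:    time.Now(),
	}}
	body, err := json.Marshal(alerts)
	if err != nil {
		log.Fatal(err)
	}

	// "http://alertmanager:9093" is an assumed address for illustration.
	resp, err := http.Post("http://alertmanager:9093/api/v2/alerts",
		"application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("AlertManager responded with", resp.Status)
}
```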
TSDB within Prometheus Server
The following is based on the V3 design. (Prometheus TSDB from scratch)
b-000001 is a block (2h time window). (Ref)
chunks/000001 holds the raw data for various series:
```
  series
    ^
    | {__name__="request_total", method="GET",  ["t1:v1","t2:v2"]}
    | {__name__="request_total", method="POST", ["t1:v1","t2:v2"]}
    | {__name__="error_total",   method="GET",  ["t1:v1","t2:v2"]}
    | {__name__="error_total",   method="POST", ["t1:v1","t2:v2"]}
    -----------------------------------------------------time---->
```
- Each block has an inverted index file (a toy sketch of the idea follows this list).
- Each block has a metadata file.
- The most recent block is held in memory, with a write-ahead log (WAL) to protect against data loss on failure.
- Old blocks can be compacted by a background process.
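A toy sketch (not the actual TSDB code) of what the inverted index conceptually does: it maps each label pair to a sorted list of series IDs, and a query with several label matchers intersects those postings lists. The series IDs and label values are made up for illustration.

```go
package main

import "fmt"

// Toy inverted index: label pair -> sorted list of series IDs.
// The real index file adds string interning and compressed postings;
// this only illustrates the idea.
type index map[string][]int

func key(name, value string) string { return name + "=" + value }

// intersect merges two sorted postings lists, keeping common IDs.
func intersect(a, b []int) []int {
	var out []int
	for i, j := 0, 0; i < len(a) && j < len(b); {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

func main() {
	idx := index{
		key("__name__", "request_total"): {1, 2},
		key("__name__", "error_total"):   {3, 4},
		key("method", "GET"):             {1, 3},
		key("method", "POST"):            {2, 4},
	}
	// Series matching {__name__="request_total", method="GET"}:
	fmt.Println(intersect(idx[key("__name__", "request_total")], idx[key("method", "GET")])) // [1]
}
```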
Good to know
- Although SSDs are known for fast random writes, they actually can't modify individual bytes; they can only write in pages of 4KiB or more. This means writing a 16-byte sample costs a full 4KiB page write (a 256x overhead in this case). This behavior is part of what is known as write amplification.
- mmap(2) is a system call that allows us to transparently back a virtual memory region with file contents. This means we can treat all contents of our database as if they were in memory without occupying any physical RAM up front: the operating system lazily loads pages from disk only when we access particular byte ranges of the database files.
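A minimal, Unix-only sketch of the idea in Go using syscall.Mmap, which is a real wrapper around mmap(2). The path chunks/000001 is just borrowed from the block layout above as an arbitrary example; the returned byte slice can be indexed like in-memory data while the OS pages the file in lazily.

```go
//go:build unix

package main

import (
	"fmt"
	"log"
	"os"
	"syscall"
)

func main() {
	f, err := os.Open("chunks/000001") // arbitrary example path
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		log.Fatal(err)
	}

	// Map the whole file read-only; pages are loaded lazily on first access.
	data, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()),
		syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		log.Fatal(err)
	}
	defer syscall.Munmap(data)

	// Index the slice as if the whole file were already in memory.
	fmt.Printf("first byte: 0x%02x, size: %d bytes\n", data[0], len(data))
}
```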
Comparing Prometheus with others
This section is a summary of this doc.
VS Graphite
VS Graphite Scope
Prometheus actively scrapes data, while Graphite passively waits for data to be sent to it. (Ref)
VS Graphite Data model
Graphite uses StatsD-aggregated data, with dot-separated components in the metric name:
stats.api-server.tracks.post.500 -> 93
Prometheus uses key-value labels and preserves the instance as a dimension (see the sketch below these samples):
api_server_http_requests_total{method="POST",handler="/tracks",status="500",instance="<sample1>"} -> 34
api_server_http_requests_total{method="POST",handler="/tracks",status="500",instance="<sample2>"} -> 28
api_server_http_requests_total{method="POST",handler="/tracks",status="500",instance="<sample3>"} -> 31
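A sketch of how such labeled series come about on the instrumentation side with client_golang: one CounterVec with method/handler/status label dimensions produces a separate time series per label combination, and the instance label is attached by the Prometheus server at scrape time. The metric and label names mirror the samples above; everything else is illustrative.

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// One metric, three label dimensions; each distinct label combination
// becomes its own series, e.g.
// api_server_http_requests_total{method="POST",handler="/tracks",status="500"}.
var httpRequestsTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "api_server_http_requests_total",
		Help: "Total HTTP requests handled by the API server.",
	},
	[]string{"method", "handler", "status"},
)

func main() {
	httpRequestsTotal.WithLabelValues("POST", "/tracks", "500").Inc()
	httpRequestsTotal.WithLabelValues("GET", "/tracks", "200").Inc()
}
```

In a real service these counters would be exposed through the same /metrics endpoint shown in the earlier pull-model sketch.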
VS Graphite Storage
Graphite stores time series data on local disk in the Whisper format, an RRD-style database that expects samples to arrive at regular intervals.
Prometheus uses the storage mechanism described above.
VS InfluxDB
VS InfluxDB Scope
Kapacitor is considered together with InfluxDB, as in combination they address the same problem space as Prometheus and the Alertmanager. Kapacitor's scope is a combination of Prometheus recording rules, alerting rules, and the Alertmanager's notification functionality.
VS InfluxDB Data model
InfluxDB has a second level of labels called fields, which are more limited in use. InfluxDB supports timestamps with up to nanosecond resolution, and float64, int64, bool, and string data types. Prometheus, by contrast, supports the float64 data type with limited support for strings, and millisecond resolution timestamps.
VS InfluxDB Storage
InfluxDB uses a variant of a log-structured merge tree for storage with a write ahead log, sharded by time.
Prometheus uses an append-only file per time series.
VS InfluxDB Architecture
Prometheus servers run independently of each other and only rely on their local storage for their core functionality: scraping, rule processing, and alerting. The open source version of InfluxDB is similar.
The commercial InfluxDB offering is, by design, a distributed storage cluster with storage and queries being handled by many nodes at once.
VS InfluxDB Summary
Where InfluxDB is better:
- If you’re doing event logging.
- Commercial option offers clustering for InfluxDB, which is also better for long term data storage.
- Eventually consistent view of data between replicas.
Where Prometheus is better:
- If you’re primarily doing metrics.
- More powerful query language, alerting, and notification functionality.
- Higher availability and uptime for graphing and alerting.
VS OpenTSDB
VS OpenTSDB Scope
The same scope differences as in the Graphite comparison above apply here.
VS OpenTSDB Data model
- Almost identical data model: OpenTSDB tags are Prometheus labels
- There are minor differences though: Prometheus allows arbitrary characters in label values, while OpenTSDB is more restrictive.
- OpenTSDB also lacks a full query language, only allowing simple aggregation and math via its API.
VS OpenTSDB Storage
OpenTSDB’s storage is implemented on top of Hadoop and HBase. This means that it is easy to scale OpenTSDB horizontally, but you have to accept the overall complexity of running a Hadoop/HBase cluster from the beginning.
Prometheus will be simpler to run initially, but will require explicit sharding once the capacity of a single node is exceeded.