
> Either way it beats the days of Ganglia or Nagios static screenshots of some half-baked Python metric. Or even Kibana, where you'd have to run a gigantic ELK stack to get any data out of it.

Holy hell, if you'd mentioned Munin also, I'd swear we worked for the same company. Spent ages making Munin play nice with Nagios.

Ganglia was okay for monitoring Spark jobs running in a YARN cluster, I guess, back in the early days when the GangliaSink shipped by default, and it was far better than trying to scrape every worker instance via JMX...
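For anyone who never ran that setup: wiring Spark metrics to Ganglia was a few lines in `conf/metrics.properties`. A rough sketch (host and port are placeholders; in later Spark versions the sink moved out of the default build into the separate `spark-ganglia-lgpl` module for licensing reasons):

```
# Send all Spark metrics (driver, executors, master, workers) to a gmond endpoint
*.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
*.sink.ganglia.host=gmond.internal.example   # placeholder gmond host
*.sink.ganglia.port=8649                     # default gmond port
*.sink.ganglia.period=10
*.sink.ganglia.unit=seconds
```

Compared to attaching a JMX scraper to every executor JVM, having each process push its own metrics was a lot less moving parts.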

And then, yeah, we moved to ELK for metrics, and spent more time fixing up indexing and ingestion issues than actually looking at the data.

Then I introduced Prometheus + Grafana. It was far simpler to run (at least until we got big enough to need Thanos...) and far simpler to build dashboards and write queries for.



Oh man, Munin... I had forgotten about that!

We're currently running a Databricks cluster, and we avoid the static-ish Ganglia screenshots by injecting Prometheus exporters to get data into Grafana. Honestly, I wish the entire data space were more open in that regard.
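The exporter approach boils down to running something like `node_exporter` on each worker (e.g. via a cluster init script) and pointing Prometheus at it. A minimal sketch of the scrape config, assuming the default node_exporter port and placeholder worker addresses:

```yaml
# prometheus.yml (fragment) -- job name and targets are hypothetical
scrape_configs:
  - job_name: 'databricks-workers'
    scrape_interval: 15s
    static_configs:
      - targets:
          - '10.0.0.12:9100'   # placeholder worker IPs; 9100 is node_exporter's default
          - '10.0.0.13:9100'
```

In practice you'd want some form of service discovery rather than static targets, since Databricks workers come and go with autoscaling.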

Really glad things are evolving. It's not perfect, but it's sure as hell better than it was!



