etsy/statsd

Monitoring at Spotify: The Story So Far | Labs

This is the first in a two-part series about Monitoring at Spotify. In this post, I’ll be discussing our history, the challenges we faced, and how they were approached.

Operational monitoring at Spotify started its life as a combination of two systems: Zabbix and a homegrown RRD-backed graphing system named “sitemon”, which used Munin for collection. In late 2013, we were starting to put more emphasis on self-service and distributed operational responsibility. We tried to bandage up what we could: our Chief Architect hacked together an in-memory sitemon replacement that could hold roughly one month’s worth of metrics under the current load.

Alerting as a service

Alerting was the first problem we took a stab at. We considered developing Zabbix further. We found inspiration from attending Monitorama EU, where we stumbled upon Riemann. We built a library on top of Riemann called Lyceum.

Graphing

We went a few rounds here. The difficulties in sharding and rebalancing Graphite became prohibitive.
Monitoring at Spotify: Introducing Heroic | Labs

This is the second part in a series about Monitoring at Spotify. In the previous post I discussed our history of operational monitoring. In this part I’ll be presenting Heroic, our scalable time series database, which is now free software.

Heroic is our in-house time series database. We are aware Elasticsearch has a bad reputation for data safety, so we guard against total failures by having the ability to completely rebuild the index rapidly from our data pipeline or Cassandra.

A key feature of Heroic is global federation.

Every host in our infrastructure is running ffwd, which is an agent responsible for receiving and forwarding metrics. This setup allows us to rapidly experiment with our service topology. In the backend everything is stored exactly as it was provided to the agent.

In using Heroic, we’ve been able to build custom dashboards and alerting systems that make use of the same interface. All parts of Heroic are now free software; feel free to grab the code on GitHub.
Linux Performance Analysis in 60,000 Milliseconds

You log in to a Linux server with a performance issue: what do you check in the first minute?

At Netflix we have a massive EC2 Linux cloud, and numerous performance analysis tools to monitor and investigate its performance. These include Atlas for cloud-wide monitoring, and Vector for on-demand instance analysis. While those tools help us solve most issues, we sometimes need to log in to an instance and run some standard Linux performance tools.

In this post, the Netflix Performance Engineering team will show you the first 60 seconds of an optimized performance investigation at the command line, using standard Linux tools you should have available. In 60 seconds you can get a high-level idea of system resource usage and running processes by running the following ten commands.

uptime
dmesg | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
top

Some of these commands require the sysstat package to be installed.
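The ten-command checklist above can be wrapped in a small script so that it runs unattended. What follows is only a sketch under stated assumptions, not the authors' tooling: the function names `first_minute_commands` and `run_checklist` are hypothetical, and the sample count of 5 and the `top -b -n 1 | head -n 20` form are additions made here so that each command exits on its own instead of running until interrupted.

```shell
#!/bin/sh
# Sketch: run the first-minute checklist non-interactively.
# Assumption: a count of 5 is appended to the interval-based tools, and top
# runs once in batch mode; the original list runs these interactively.

# Print the ten commands, one per line, in the order given in the post.
first_minute_commands() {
  cat <<'EOF'
uptime
dmesg | tail
vmstat 1 5
mpstat -P ALL 1 5
pidstat 1 5
iostat -xz 1 5
free -m
sar -n DEV 1 5
sar -n TCP,ETCP 1 5
top -b -n 1 | head -n 20
EOF
}

# Run each command in turn, skipping any whose tool is not installed.
run_checklist() {
  first_minute_commands | while IFS= read -r cmd; do
    tool=${cmd%% *}    # the first word is the executable name
    printf '\n=== %s ===\n' "$cmd"
    if command -v "$tool" >/dev/null 2>&1; then
      # stdin is redirected so a command cannot consume the command list
      sh -c "$cmd" </dev/null || echo "(exited with status $?)"
    else
      echo "(skipped: $tool not found; mpstat, pidstat, iostat and sar ship with sysstat)"
    fi
  done
}
```

After sourcing the file, calling `run_checklist` prints one labelled section per command. With the assumed counts the whole pass should take about half a minute, which leaves the rest of the minute for reading the output.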