Ubuntu Enterprise Cloud Monitoring and Graphing
If you're running a production UEC, you're probably curious what's actually going on in your Cloud!
You might want an integrated, instantaneous view of your Cloud's usage, from a cpu/memory/
The current mechanisms for viewing this sort of data in UEC are really primitive, if they exist at all. There are a handful of euca2ools commands (euca-describe-*, euca_conf --list-*, etc.) that can tell you a bit about what your cloud is doing and where, and byobu can tell you what services are running on the current machine.
But perhaps a real SNMP monitoring system would be useful. Tools like Munin and Nagios are also often found in Linux data centers.
What do we think about integrating something like SNMP, OpenNMS, Munin, or Nagios into UEC? Would this require Eucalyptus changes? What do you want to see, as a UEC administrator?
Blueprint information
- Status: Not started
- Approver: Jos Boumans
- Priority: Medium
- Drafter: Dustin Kirkland
- Direction: Approved
- Assignee: Dave Walker
- Definition: Approved
- Series goal: Accepted for maverick
- Implementation: Deferred
- Milestone target: ubuntu-10.10-beta
- Started by
- Completed by
Related branches
- lp://staging/~clint-fewbar/ubuntu/maverick/rrdtool/merge-1.4.3-1
- lp://staging/~clint-fewbar/ubuntu/maverick/libdbi-drivers/merge-upstream-release-0.8.3-1
- lp://staging/~clint-fewbar/ubuntu/maverick/libdbi/latest-upstream-fixes-rrdtool
- lp://staging/~clint-fewbar/ubuntu/maverick/eucalyptus/uec-monitoring
Whiteboard
Status:
Plan defined for alpha3. collectd MIR stalled awaiting guidance from ubuntu-devel. Munin plugin submitted as merge proposal for eucalyptus. collectd postponed, unlikely for maverick given size and complexity of MIR and proximity to feature freeze.
Complexity:
maverick-alpha-3: 4
Work items for maverick-alpha-2:
[kirkland] Fix Bug #595588, package/install eucalyptus extras/* scripts: DONE
Work items for maverick-alpha-3:
integrate said scripts with nagios/
create data abstraction for web view in UEC part 1: POSTPONED
Integrate data into UEC: POSTPONED
[ivanka] update UEC frontend theme to new Ubuntu aubergine branding (find someone in Design team ?): POSTPONED
[clint-fewbar] merge rrdtool >= 1.4 merged, FTBFS until libdbi0 clears MIR and is moved to main (LP: #605871): DONE
[clint-fewbar] MIR libdbi0 (new dependency of rrdtool) (LP: #608552) and (LP: #608556): DONE
[clint-fewbar] update libdbi to latest version to resolve issues with rrdtool 1.4's dbi support: DONE
[clint-fewbar] MIR for collectd: POSTPONED
[clint-fewbar] adapt ganglia script for collectd and/or munin: DONE
[clint-fewbar] update seeds with collectd/libdbi in main: POSTPONED
Work items for ubuntu-10.10-beta:
[clint-fewbar] produce custom templates for eucalyptus plugins (Not doing this, not worth running two copies of munin): DONE
[clint-fewbar] test automatic configuration of plugins on installation (README.Debian file added rather than re-writing plugin script): DONE
20100806: The following work items must be re-evaluated at UDS-N (clint-fewbar):
[clint-fewbar] MIR for collectd: INPROGRESS
[clint-fewbar] update seeds with collectd/libdbi in main: TODO
view 20100602:
* Targets should include:
* packaging the provided monitoring/logging scripts from eucalyptus
* Integrate with our nagios/
* Nice to have: frontend available on cloud/cluster controller
Monitoring UEC instances vs monitoring applications running in instances? -- mathiaz
== Cloud monitoring and graphing ==
Use case: optimize the number of instances on a cloud
=== Data collection ===
* Node controller:
* number of instances running
* resources used by each instance: number of cores, disk available, memory
* generic stats: network io, disk io, power consumption
* statistics about each instance: kvm information, cpu load
* ksm
* disk io per instances
* Cluster controller:
* network throughput:
- In, Out:
by NC, by security groups, by instance?
* latency: delay added by the CC.
* Storage controller:
* disk io
* network io
* Cloud controller:
* number of instances started/stopped (Counter)
* number of instances by user, by security group.
* ebs usage
* reserved ips
all the resources that a user can create/request.
* Walrus:
Error messages on each component.
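As a rough illustration of the "instances by user" counter above, the sketch below tallies instances per owner from `euca-describe-instances` text output. It assumes each `RESERVATION` line carries the owner in its third whitespace-separated field and is followed by one `INSTANCE` line per instance; that layout is an assumption and should be checked against the installed euca2ools. The function name `count_instances_by_owner` is hypothetical.

```python
from collections import Counter


def count_instances_by_owner(describe_output: str) -> Counter:
    """Tally INSTANCE lines under each RESERVATION line's owner.

    Assumed layout (verify against your euca2ools version):
        RESERVATION <reservation-id> <owner> <group>
        INSTANCE    <instance-id> ...
    """
    counts = Counter()
    owner = None
    for line in describe_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "RESERVATION":
            # Owner is assumed to be the third field; ignore short lines.
            owner = fields[2] if len(fields) >= 3 else None
        elif fields[0] == "INSTANCE" and owner is not None:
            counts[owner] += 1
    return counts
```

The same loop could be extended to group by security group once the position of that field is confirmed.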
Package/ship scripts
* extras/ganglia.sh
* extras/nagios.sh
=== Collection/
extras/: nagios & ganglia scripts.
* nagios script is a statically configured set of passive checks
* might be better as a series of active checks ("active" in the sense of
nagios terminology) which are pulling information about which resources to
check from an authoritative source (at worst the eucalyptus config files,
at best the running eucalyptus itself)
* basically just doing a wget on the various web service components to check
the status, so it's pretty cheap and easy to perform.
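A minimal sketch of such an active check, assuming each component exposes an HTTP endpoint that answers when the service is up. The service name and URL passed on the command line are placeholders, not real Eucalyptus endpoints; in practice they would be read from the eucalyptus configuration or the running cloud. The exit codes follow the standard Nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```python
import sys
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def http_status(url, timeout=5.0):
    """Fetch the URL and return its HTTP status code (raises on failure)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def check_service(name, url, fetch=http_status):
    """Return (exit_code, message) for one web service component.

    `fetch` is injectable so the check can be exercised without a network.
    """
    try:
        status = fetch(url)
    except Exception as exc:  # connection refused, timeout, HTTP error...
        return CRITICAL, f"{name} CRITICAL - {exc}"
    if 200 <= status < 400:
        return OK, f"{name} OK - HTTP {status}"
    return CRITICAL, f"{name} CRITICAL - HTTP {status}"


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("UNKNOWN - usage: check_euca.py <name> <url>")
        sys.exit(UNKNOWN)
    code, message = check_service(sys.argv[1], sys.argv[2])
    print(message)
    sys.exit(code)
```

Nagios would run this once per component (CC, SC, Walrus, NC), with the list of endpoints generated from the authoritative source rather than configured statically.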
Plot against users.
* Instantaneous views
* Views over time
* Error conditions
* with links to documentation about said errors
Graphing:
by users, by security groups
critical resources that can be given out or performance (operated):
- free ip vs allocated ip
- S3/walrus, ebs module
- load, latency
- capacity of instances used vs available instances
-> graphical view of euca-describe-
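The "free ip vs allocated ip" graph could be fed by a Munin plugin along the following lines. This is only a sketch: it assumes `euca-describe-addresses` prints one `ADDRESS <ip> <owner>` line per address, with `nobody` (or no third field) marking a free address, which should be verified on a real cloud. The function names are hypothetical. The `config` output and `<field>.value` lines follow the usual Munin plugin protocol.

```python
def count_addresses(describe_output):
    """Return (free, allocated) counts from euca-describe-addresses text.

    Assumed line format: ADDRESS <ip> <instance-id or 'nobody'>
    """
    free = allocated = 0
    for line in describe_output.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0] != "ADDRESS":
            continue
        if len(fields) == 2 or fields[2] == "nobody":
            free += 1
        else:
            allocated += 1
    return free, allocated


def munin_config():
    """Lines Munin expects when the plugin is called with 'config'."""
    return "\n".join([
        "graph_title UEC public IP addresses",
        "graph_vlabel addresses",
        "graph_category cloud",
        "free.label free",
        "allocated.label allocated",
    ])


def munin_values(describe_output):
    """Lines Munin expects on a normal (fetch) run."""
    free, allocated = count_addresses(describe_output)
    return f"free.value {free}\nallocated.value {allocated}"
```

A wrapper script would run `euca-describe-addresses`, pass its output to `munin_values`, and print `munin_config()` instead when invoked with the `config` argument.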
Alerting:
- passive checks:
Java services (wget) (CC, SC, Walrus, NC)
- active checks:
Cluster controllers are still running.
Powersave Scheduler Stats
* which systems are powered on/off
* total time each system has spent on/off, used/unused
* power utilization on running nodes
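The "total time each system has spent on/off" figure can be derived from a log of power state changes. The sketch below assumes a hypothetical per-node event list of `(timestamp, state)` pairs, with `state` being `"on"` or `"off"`; how the powersave scheduler would actually record these events is not defined here.

```python
def time_in_states(events, end):
    """Sum the time spent in each power state.

    events: list of (timestamp, state) tuples, state in {"on", "off"};
            each event marks the moment the node entered that state.
    end:    timestamp at which the accounting period closes.
    """
    totals = {"on": 0.0, "off": 0.0}
    ordered = sorted(events)
    # Pair each event with the next one (or with `end` for the last event)
    # and credit the elapsed interval to the state that was active.
    following = ordered[1:] + [(end, None)]
    for (start, state), (stop, _) in zip(ordered, following):
        totals[state] += stop - start
    return totals
```

For example, a node that powers on at t=0, off at t=10, and on again at t=25 has spent 15 time units on and 15 off by t=30.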
Location: cloud controller as a tab for the management panel?