System Monitor - Cluster and Infrastructure Monitoring

This spec describes the StarlingX System Monitor, which allows for the persistence and analysis of cluster and infrastructure monitoring data.

https://storyboard.openstack.org/#!/story/2005733

Problem description

The monitoring service needs to be Kubernetes cluster-aware so that it can monitor both the infrastructure and the pods/containers.

The logs and metrics of the containerized system are not persisted when a pod is deleted.

Analysis of logs and metrics requires a tool that supports searching, filtering and aggregation to mine information.

Use Cases

A system administrator wants:

Real-time collection, monitoring and analysis across the infrastructure, cluster and application
Persistent and scalable storage of both structured (metrics) and unstructured (logs) data
Metadata enrichment of collected data to ensure context is not lost (e.g. container and host information)
Query and visualization for both real-time and historical data analysis
On-box and off-box deployment options. In the off-box deployment, the collection components would still need to be deployed on the system, but the storage and visualization components would be optional.
Automatic system deployment configuration for different deployment configurations (AIO-SX, AIO-DX, Standard) - e.g. system Helm overrides
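
The metadata enrichment use case above can be illustrated with a minimal Python sketch. The function name `enrich_event` and the field names (`host`, `kubernetes`, `pod`, `namespace`) are assumptions chosen for illustration and are not defined by this spec; in a deployed stack this step would typically be handled by the collection components rather than by custom code.

```python
import copy


def enrich_event(event, host_meta, container_meta=None):
    """Return a copy of `event` with host and container context attached.

    All field names here are hypothetical; they only illustrate the idea of
    keeping context alongside each collected log line or metric sample.
    """
    enriched = copy.deepcopy(event)
    # Host context is always available, even for non-containerized sources.
    enriched["host"] = dict(host_meta)
    # Container context is attached only when the event came from a pod.
    if container_meta is not None:
        enriched["kubernetes"] = dict(container_meta)
    return enriched


raw_event = {"message": "connection refused", "@timestamp": "2019-06-01T12:00:00Z"}
enriched_event = enrich_event(
    raw_event,
    host_meta={"name": "controller-0"},
    container_meta={"pod": "example-pod", "namespace": "default"},
)
```

Because the enriched record carries its own host and pod identity, it remains interpretable after the originating pod has been deleted.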

Proposed change

The Elastic (www.elastic.co) set of software components will be deployed in containers as an optional application to achieve full-stack monitoring.

The Elastic 7.x components are deployed from the “oss” (Apache-2.0 licensed) images so that all included components are under the Apache-2.0 license.

The 7.x release brings improvements over Elasticsearch 6.x, including configurable index lifecycle management and security features.

The existing collectd custom metrics and monitoring data will be integrated with Elasticsearch. This would replace the current storage in InfluxDB.

The following minimum components are required:

Elasticsearch - https://www.elastic.co/products/elasticsearch
Filebeat - https://www.elastic.co/products/filebeat
Metricbeat - https://www.elastic.co/products/metricbeat
Kibana - https://www.elastic.co/products/kibana
Logstash - https://www.elastic.co/products/logstash (provides off-box data streams and the integration of collectd metrics)

Alternatives

Logs can be captured on demand, rather than in real time, via the ‘collect’ Log Collection Tool. However, if a pod is deleted before its data is captured, the data is lost. This alternative also does not include correlated Kubernetes events such as pod lifecycle events.

Other technologies were considered for metric collection:

fluentd
collectd
prometheus
influxdata

Elastic was chosen because it covers the full observability spectrum in a single, unified solution.

Elastic has a large and active open source community that provides many integrations and contributions, making it well suited for a flexible monitoring solution.

Data model impact

The sysinv application and system Helm overrides framework is used to deploy the stx-monitor application. Helm plugins for stx-monitor are added, subclassed from the base Helm class.

REST API impact

The deployed Kibana and Elasticsearch containers expose REST API services.

These Elastic stable APIs are documented with each release. Breaking changes to them should occur only in major versions and are documented.

Security impact

Elasticsearch data is exposed via the Kibana ingress port. Because the logs of the system are collected, a security layer at ingress restricts access to the admin.

The Elastic 7.1+ security features are leveraged for Role Based Access Control.

Other end user impact

The admin may interact with System Monitoring via the Kibana GUI.
The admin may interact with System Monitoring via the Elasticsearch API.
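
As an illustration of interacting through the Elasticsearch API, the sketch below builds a search request body in the Elasticsearch Query DSL. The helper name `build_pod_log_query` and the field names `kubernetes.pod.name` and `@timestamp` are assumptions about how collected documents are labeled; the actual index names and fields depend on the deployed Beats configuration. The body would be sent to a `_search` endpoint of the Elasticsearch REST API.

```python
import json


def build_pod_log_query(pod_name, since="now-15m", size=50):
    """Build a search body for recent log entries of one pod.

    The field names are assumed, not taken from the spec.
    """
    return {
        "size": size,
        # Newest entries first.
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                # Filter clauses match documents without affecting scoring.
                "filter": [
                    {"term": {"kubernetes.pod.name": pod_name}},
                    {"range": {"@timestamp": {"gte": since}}},
                ]
            }
        },
    }


query_body = build_pod_log_query("example-pod")
query_json = json.dumps(query_body, indent=2)
```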

Performance Impact

Elastic System Monitoring runs only when the stx-monitor application is applied; it is inert otherwise.

The stx-monitor Helm chart configuration is provided by a sysinv Helm plugin that supports system overrides.

There will be an impact on the management network and management cores due to the overhead of periodically collecting system resources. Additionally, Elasticsearch indexing and searching will increase memory and CPU usage on the controller nodes (or any other nodes labeled to serve Elastic components).

The stx-monitor application is configured by the system to engineered defaults that depend on the system configuration, such as the available Elastic master nodes (controllers) and the storage available on Elastic data nodes. The defaults can be overridden via user Helm overrides during stx-monitor deployment.
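
The relationship between engineered defaults and user overrides can be sketched as a recursive merge in which user values take precedence. This is only a sketch under stated assumptions: the function names, the override keys (`master`, `data`, `replicas`, `storage`) and the sizing rule are hypothetical and do not reflect the actual sysinv plugin or chart values.

```python
import copy


def engineered_defaults(controller_count, data_volume_size):
    """Derive hypothetical system defaults from the system configuration."""
    return {
        # Assumed rule: one Elastic master per available controller.
        "master": {"replicas": controller_count},
        "data": {"replicas": controller_count, "storage": data_volume_size},
    }


def merge_overrides(defaults, overrides):
    """Recursively merge `overrides` into a copy of `defaults`.

    Values in `overrides` win; nested dictionaries are merged key by key.
    """
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


system_values = engineered_defaults(controller_count=1, data_volume_size="100Gi")
user_values = {"data": {"storage": "200Gi"}}
effective_values = merge_overrides(system_values, user_values)
```

In this sketch a user override replaces only the keys it names, so the remaining engineered defaults stay in effect.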

Other deployer impact

The versioned stx-monitor application is built as part of the StarlingX build and is available for download on a CENGN server.

The OAM network must be enabled and must allow access to:

docker.elastic.co
k8s.gcr.io

The application is applied via ‘system application-apply’. The nodes on which the application runs are controlled by ‘system host-label-assign’.
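
The order of operations described above (label the hosts, then apply the application) can be sketched as a small command planner. The helper name `build_deploy_commands`, the host name and the label key are hypothetical placeholders; the argument layout of the two `system` subcommands is an assumption and should be checked against the StarlingX CLI help. The sketch only builds the command lines and does not execute them.

```python
import shlex


def build_deploy_commands(app_name, host_labels):
    """Return argv lists: label each host first, then apply the application.

    `host_labels` maps a host name to a dict of label key/value pairs.
    """
    commands = []
    for host, labels in host_labels.items():
        for key, value in labels.items():
            # Assumed syntax: system host-label-assign <host> <key>=<value>
            commands.append(["system", "host-label-assign", host, f"{key}={value}"])
    # Assumed syntax: system application-apply <application name>
    commands.append(["system", "application-apply", app_name])
    return commands


deploy_plan = build_deploy_commands(
    "stx-monitor",
    {"controller-0": {"example-label": "enabled"}},  # placeholder label
)
for argv in deploy_plan:
    print(" ".join(shlex.quote(arg) for arg in argv))
```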

Optionally, the administrator may use ‘system helm-override-update’ to customize the Helm application, e.g. the log and metrics filters and the index lifecycle policies.
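
To make the index lifecycle policy customization more concrete, the sketch below builds a policy body in the shape used by the Elasticsearch 7.x index lifecycle management API: a hot phase that rolls the index over, and a delete phase that removes old indices. The helper name `build_ilm_policy` and the example thresholds are assumptions for illustration; how such a policy is carried in the stx-monitor Helm overrides is not specified here.

```python
def build_ilm_policy(rollover_max_age, rollover_max_size, delete_after):
    """Build an index lifecycle policy body with hot and delete phases."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        # Start a new index once either limit is reached.
                        "rollover": {
                            "max_age": rollover_max_age,
                            "max_size": rollover_max_size,
                        }
                    }
                },
                "delete": {
                    # Age is measured from rollover for rolled-over indices.
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


# Example thresholds are placeholders, not engineered defaults.
example_policy = build_ilm_policy("1d", "10gb", "7d")
```

Shorter retention reduces the storage needed on the data nodes at the cost of less historical data being available for analysis.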

Developer impact

Developers may apply the stx-monitor application to gain insight into their applications through logs and metrics.

Developers may create user overrides to customize the Helm chart configuration.

Upgrade impact

Not applicable, as this is the initial release of System Monitoring.

stx-monitor is a versioned application tarball which deploys the Elastic Monitoring service via Armada Helm charts as Kubernetes containers.

Implementation

Assignee(s)

Primary assignees: john.kung@windriver.com, kevin.smith@windriver.com
Architect/contributors: matt.peters@windriver.com

Repos Impacted

The following StarlingX repositories are impacted by this spec:

starlingx/config
starlingx/tools
starlingx/upstream
starlingx/docs

Work Items

Add helm-charts for Elastic components to mirror and build tarball-dl.lst
Add Armada manifest to deploy Elastic components
Add sysinv application handling for Elastic components
Add sysinv application system overrides for Elastic components
Set system engineered defaults for System Monitoring

Dependencies

Elasticsearch 7.x stable Helm charts. This feature can be based on updates to the current stable Helm charts (6.7.0) at https://github.com/helm/charts
Kube-State-Metrics - https://k8s.gcr.io/kube-state-metrics

Testing

Verify cluster pod logs and lifecycle events are persisted.

Verify infrastructure logs and metrics are persisted.

Verify that the data index lifecycle policy is sufficient to persist data.

Verify system engineering of the configured System Monitoring components.

Verify system configuration for AIO-SX, AIO-DX and Standard deployments.

Documentation Impact

This story affects the StarlingX installation, configuration and system engineering documentation.

References

StarlingX Metrics https://storyboard.openstack.org/#!/board/145
Elasticsearch documentation https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
Filebeat https://www.elastic.co/guide/en/beats/filebeat/current/index.html
Kibana https://www.elastic.co/guide/en/kibana/current/index.html
Logstash https://www.elastic.co/guide/en/logstash/current/index.html
Metricbeat https://www.elastic.co/guide/en/beats/metricbeat/current/index.html
Licensing https://www.elastic.co/subscriptions (the StarlingX stx-monitor application, when optionally applied, deploys the OSS (Apache-2.0) version)

History

Revisions
Release Name	Description
stx-3.0	Introduced