System Monitor - Cluster and Infrastructure Monitoring

This spec describes the StarlingX System Monitor, which allows for the persistence and analysis of cluster and infrastructure monitoring data.

https://storyboard.openstack.org/#!/story/2005733

Problem description

The monitoring service needs to be Kubernetes Cluster aware so that it can provide monitoring of both the infrastructure and pods/containers.

The logs and metrics of the containerized system are not persisted when a pod is deleted.

Analysis of logs and metrics requires a tool to aid in searching, filtering and aggregations to mine information.

Use Cases

A system administrator wants:

  • Realtime collection, monitoring and analysis across the infrastructure, cluster and application

  • Persistent and scalable storage of both structured (metrics) and unstructured (logs) data

  • Meta-data enrichment of collected data to ensure context is not lost (i.e. container and host information)

  • Query and visualization for both realtime and historical data analysis

  • On-box and off-box deployment options. In the off-box deployment, the collection components would still need to be deployed on the system, but the storage and visualization components would be optional.

  • Automatic system deployment configuration for different deployment configurations (AIO-SX, AIO-DX, Standard) - e.g. system Helm overrides

Proposed change

The Elastic (www.elastic.co) set of software components will be deployed as an optional application in containers to achieve the full stack monitoring.

The Elastic 7.x components are deployed from the “oss” (Apache-2.0 licensed) images so that all included components are under the Apache-2.0 license.

The 7.x release brings improvements over Elasticsearch 6.x, including configurable index lifecycle management and security features.

The existing collectd custom metrics and monitoring data will be integrated with Elasticsearch. This would supplant the current storage in InfluxDB.
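As an illustration of what integrating collectd data with Elasticsearch could involve, the following minimal sketch converts collectd-style samples into the newline-delimited body expected by the Elasticsearch bulk API. The index name `collectd-metrics`, the document field layout and the sample values are assumptions made for this example only; the spec does not define them.

```python
import json
from datetime import datetime, timezone


def collectd_to_bulk(samples, index="collectd-metrics"):
    """Render collectd-style samples as an Elasticsearch _bulk body (NDJSON).

    The index name and document fields are illustrative assumptions.
    """
    lines = []
    for sample in samples:
        # Emit one document per data source so each value is searchable.
        for name, value in zip(sample["dsnames"], sample["values"]):
            doc = {
                "@timestamp": datetime.fromtimestamp(
                    sample["time"], tz=timezone.utc
                ).isoformat(),
                "host": sample["host"],
                "plugin": sample["plugin"],
                "type": sample["type"],
                "dsname": name,
                "value": value,
            }
            # Each document is preceded by an action line naming the index.
            lines.append(json.dumps({"index": {"_index": index}}))
            lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical sample in the shape collectd uses for JSON output.
sample = {
    "host": "controller-0",
    "plugin": "memory",
    "type": "percent",
    "time": 1561000000,
    "dsnames": ["value"],
    "values": [42.5],
}
body = collectd_to_bulk([sample])
```

The resulting `body` would be sent as the payload of a POST to the `_bulk` endpoint; the transport is left out so the sketch stays independent of any particular deployment.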

The following minimum components are required:

Alternatives

The ‘collect’ Log Collection Tool can capture logs on demand rather than in real time. However, if a pod is deleted before its data is captured, the data is lost. This alternative also does not include correlated Kubernetes events such as pod lifecycle events.

Other technologies were considered for metric collection:

  • fluentd

  • collectd

  • prometheus

  • influxdata

Elastic was chosen because it covers the full observability spectrum in a single, unified solution.

Elastic has a large, actively supported open source community that provides many integrations and contributions, making it well suited for a flexible monitoring solution.

Data model impact

The sysinv application and system helm-overrides framework is leveraged to allow deployment of the stx-monitor application. The Helm plugins for stx-monitor are added and subclassed from the base Helm class.
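The subclassing pattern described above can be sketched as follows. This is not the sysinv code: `BaseHelm` here is a hypothetical stand-in for the base Helm class, and the chart name, namespace and override keys are assumptions chosen only to show the shape of a per-chart plugin.

```python
class BaseHelm:
    """Hypothetical stand-in for the sysinv base Helm plugin class."""

    CHART = None
    SUPPORTED_NAMESPACES = []

    def get_overrides(self, namespace=None):
        raise NotImplementedError


class ElasticsearchDataHelm(BaseHelm):
    """Hypothetical plugin producing system overrides for one chart."""

    CHART = "elasticsearch-data"        # assumed chart name
    SUPPORTED_NAMESPACES = ["monitor"]  # assumed namespace

    def __init__(self, data_node_count):
        self.data_node_count = data_node_count

    def get_overrides(self, namespace=None):
        # System overrides are keyed by namespace, then by chart value.
        overrides = {
            "monitor": {"replicas": self.data_node_count},
        }
        if namespace is None:
            return overrides
        if namespace not in self.SUPPORTED_NAMESPACES:
            raise ValueError("unsupported namespace: %s" % namespace)
        return overrides[namespace]


plugin = ElasticsearchDataHelm(data_node_count=2)
monitor_overrides = plugin.get_overrides("monitor")
```

Keeping one plugin class per chart lets each chart derive its values from the system configuration independently of the others.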

REST API impact

The deployed Kibana and Elasticsearch containers expose REST API services.

These stable Elastic APIs are documented with each release; breaking changes should only occur in major versions and are documented.
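As a rough sketch of how a client might use the Elasticsearch REST API, the following builds (but does not send) a `_search` request for the recent logs of one pod. The base URL, the `filebeat-*` index pattern and the `kubernetes.pod.name` field are assumptions for illustration; the actual values depend on how the collection components are configured.

```python
import json
import urllib.request


def build_log_search(base_url, index_pattern, pod_name, minutes=15, size=50):
    """Build an Elasticsearch _search request for one pod's recent logs.

    The request is only constructed here, not sent.
    """
    query = {
        "size": size,
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                "filter": [
                    # Field name is an assumption about the log schema.
                    {"term": {"kubernetes.pod.name": pod_name}},
                    {"range": {"@timestamp": {"gte": "now-%dm" % minutes}}},
                ]
            }
        },
    }
    return urllib.request.Request(
        url="%s/%s/_search" % (base_url.rstrip("/"), index_pattern),
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint, index pattern and pod name.
request = build_log_search(
    "http://elasticsearch.example:9200", "filebeat-*", "my-app-0"
)
```

Passing `request` to `urllib.request.urlopen` would execute the query against a reachable cluster; any authentication enforced at the ingress is outside the scope of this sketch.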

Security impact

Elasticsearch data is exposed via the Kibana ingress port. Because the system logs are collected, a security layer is provided at the ingress to restrict access to the admin.

The security features available in Elastic 7.1+ are leveraged for Role Based Access Control.

Other end user impact

  • The admin may interact with the System Monitoring via the Kibana GUI

  • The admin may interact with the System Monitoring via the Elasticsearch API

Performance Impact

stx-monitor must be applied in order to perform Elastic System Monitoring, so it is inert when the application is not applied.

The stx-monitor Helm chart configuration is provided by sysinv Helm plugins, which support system overrides.

There will be an impact to the management network and management cores due to the overhead in periodically collecting system resources. Additionally, the Elasticsearch indexing and searching will increase memory and CPU usage of the control nodes (or any other nodes labeled to serve Elastic components).

The stx-monitor application is configured by the system to engineered defaults that depend on the system configuration, such as the available Elastic master nodes (controllers) and the storage available on Elastic data nodes. The defaults can be overridden via user Helm overrides during stx-monitor deployment.
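The relationship between engineered defaults and user overrides can be sketched as a recursive merge in which user values win. The override keys, the sizing rule and the function names below are hypothetical and only illustrate the layering; they are not the actual stx-monitor values.

```python
import copy


def merge_overrides(system_defaults, user_overrides):
    """Return system defaults with user overrides applied.

    Nested dictionaries are merged key by key; any other value from the
    user overrides replaces the default.
    """
    merged = copy.deepcopy(system_defaults)
    for key, value in user_overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def engineered_defaults(controller_count, data_volume_gib):
    """Hypothetical defaults derived from the system configuration."""
    return {
        "master": {"replicas": controller_count},
        "data": {
            "replicas": controller_count,
            "storage": "%dGi" % data_volume_gib,
        },
    }


defaults = engineered_defaults(controller_count=2, data_volume_gib=100)
effective = merge_overrides(defaults, {"data": {"storage": "200Gi"}})
```

In this sketch only the overridden key changes, so a user who adjusts the data storage size still inherits the system-derived replica counts.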

Other deployer impact

The versioned stx-monitor application is built as part of StarlingX build and is available for download on a CENGN server.

The OAM network must be enabled and allow access to:

  • docker.elastic.co

  • k8s.gcr.io

The application is applied via ‘system application-apply’. The nodes on which the application runs are controlled by ‘system host-label-assign’.

Optionally, the administrator may use ‘system helm-override-update’ to customize the Helm application, e.g. the log and metrics filters and the index lifecycle policies.
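The deployment steps above could be scripted roughly as follows. This sketch only assembles the command lines and does not run them. The host names, label names, chart name, namespace and values file are assumptions for illustration, as is the argument order shown for each subcommand; the sketch also assumes the application has already been uploaded to the system.

```python
def deployment_commands(app, labels_by_host, override=None):
    """Assemble the 'system' commands for deploying an application.

    labels_by_host maps a host name to a list of "label=value" strings.
    override is an optional (chart, namespace, values_file) tuple.
    Commands are returned as argument lists and are not executed.
    """
    commands = []
    # Label the hosts first so that pods are scheduled on the right nodes.
    for host, labels in labels_by_host.items():
        commands.append(["system", "host-label-assign", host] + list(labels))
    # Optional user overrides are set before the application is applied.
    if override is not None:
        chart, namespace, values_file = override
        commands.append(
            ["system", "helm-override-update", app, chart, namespace,
             "--values", values_file]
        )
    commands.append(["system", "application-apply", app])
    return commands


# Hypothetical hosts, labels, chart, namespace and values file.
commands = deployment_commands(
    "stx-monitor",
    {
        "controller-0": ["elastic-controller=enabled"],
        "controller-1": ["elastic-controller=enabled"],
    },
    override=("filebeat", "monitor", "filebeat-overrides.yaml"),
)
for argv in commands:
    print(" ".join(argv))
```

Returning argument lists rather than shell strings keeps the sketch safe to pass to `subprocess.run` without shell quoting concerns, should it be adapted for real use.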

Developer impact

Developers may apply the stx-monitor application to gain insights via logs and metrics of their developed application.

Developers may create user overrides to customize the Helm charts configuration.

Upgrade impact

Not applicable, as this is the initial release of System Monitoring.

stx-monitor is a versioned application tarball which deploys the Elastic Monitoring service via Armada Helm charts as Kubernetes containers.

Implementation

Assignee(s)

Primary assignee:

john.kung@windriver.com

kevin.smith@windriver.com

Architect/contributors:

matt.peters@windriver.com

Repos Impacted

The following StarlingX repositories are impacted by this spec:

  • starlingx/config

  • starlingx/tools

  • starlingx/upstream

  • starlingx/docs

Work Items

  • Add helm-charts for Elastic components to mirror and build tarball-dl.lst

  • Add Armada manifest to deploy Elastic components

  • Add sysinv application handling for Elastic components

  • Add sysinv application system overrides for Elastic components

  • Set system engineered defaults for System Monitoring

Dependencies

Testing

Verify cluster pod logs and lifecycle events are persisted.

Verify infrastructure logs and metrics are persisted.

Verify that the data index lifecycle policy is sufficient to persist data.

Verify system engineering of the configured System Monitoring components.

Verify system configuration for AIO-SX, AIO-DX and Standard deployments.

Documentation Impact

This story affects the StarlingX installation, configuration and system engineering documentation.

References

History

Revisions

Release Name    Description
stx-3.0         Introduced