StarlingX Platform Single-Core Tuning

Storyboard: #2010087

The objective of this spec is to identify and make the changes required to enable the StarlingX Platform to operate on a single processor core.

Problem description

The StarlingX platform's resource usage is intensive on systems with multiple cores and processors. Reducing StarlingX resource consumption to just one core allows the system to use the remaining resources for a larger workload, increasing the availability of resources for end-user applications.

To identify the changes required to support a single-core platform, we performed a proof-of-concept with minimal changes. To characterize the system behavior and to identify required product changes, detailed system profiling was performed for key services. The objective was not only to measure the individual services, but also to identify potential system bottlenecks or performance changes caused by competition for CPU resources.

Below is a brief analysis of critical CPU-consuming services and their impact on the system's steady-state operation when running on a single platform core. The objective of this spec is to address the issues identified by implementing the changes described in the Proposed change and Work Items sections.

Top CPU Consumers

kube-apiserver

kube-apiserver health checks show a high number of readyz requests (a type of Kubernetes API endpoint), indicating that some pods could be taking a long time to respond to requests or to terminate.

From the investigation of kube-apiserver, most requests come from the cert-manager injector, which is at a legacy version (v0.15). The requests are due to the leader election process, which remains enabled even when only a single replica is running.

sm-watchdog

We executed different test scenarios to analyze the process behavior when pods were created and deleted. During all the tests, we observed periodic CPU spikes every 10 seconds. This periodic task is governed by the SM_WATCHDOG_NFS_CHECK_IN_MS parameter, which defines the interval at which NFS is verified and recovered in case of any anomaly. The high CPU consumption is caused by the mechanism sm-watchdog uses to check NFS: to find all the nfsd threads, the watchdog code inspects every process in the proc file system, scanning every numbered process directory and reading its stat file to determine whether it belongs to nfsd.
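
Below is a simplified Python sketch of the scan pattern described above. It is illustrative only (the actual sm-watchdog is not written in Python, and the function name is hypothetical); its purpose is to show why the audit cost grows with the total number of processes on the host.

    import os

    def find_nfsd_pids():
        """Walk every numbered /proc entry and read its stat file."""
        pids = []
        for entry in os.listdir("/proc"):
            if not entry.isdigit():
                continue
            try:
                with open("/proc/%s/stat" % entry) as f:
                    # the second field of /proc/<pid>/stat is the command name
                    comm = f.read().split()[1]
            except (IOError, OSError):
                continue  # the process exited while we were scanning
            if comm == "(nfsd)":
                pids.append(int(entry))
        return pids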

beam.smp

During a test that created and deleted some pods, the beam.smp process (from RabbitMQ) ran with constant CPU usage and some spikes. Three pairs of messages are repeated throughout the logs, with only the IDs of the "publish" and "deliver" routes changing. The system logs related to RabbitMQ contain AMQP calls that sync the request/reply messages generated by the sysinv-conductor. The behavior observed from RabbitMQ indicates it is serving as an RPC service for sysinv.

sysinv-agent

The behavior presented by the sysinv-agent process reflects the sysinv logs and aligns with what was expected. Every minute, the sysinv-agent wakes up to verify whether the system configuration needs to be modified, e.g., memory or storage. In short, the sysinv-agent does not represent a significant concern for overall system performance. One possible optimization to be evaluated relates to the periodic task and its interval: increasing the time between requests, optimizing the periodic operations, or converting it to an on-demand task may bring some benefit in CPU time.

sysinv-conductor

In the sysinv-conductor process test, we observed two scenarios. In the first scenario, which was the most frequently observed, the process showed typical daemon behavior with continuous low CPU usage. In the second scenario, the process showed CPU spikes every 60 seconds. Skimming the source code, we found some periodic task definitions controlled by the audit interval. The overall impact of sysinv-conductor on CPU load is low, but optimizing the code could decrease the spikes during the system's steady-state operation. One option, whenever possible, is to convert the periodic tasks to on-demand tasks. Where that is not possible, the interval of the periodic tasks can be increased, after evaluating that doing so does not impact system stability.

Use Cases

As an end user, I want to improve system performance by enabling StarlingX to run within the compute resources of a single CPU core, leaving the remaining cores for my application workload.

Proposed change

Platform Core Adjustments

The following set of changes must be applied to reduce physical core usage from 2 cores to 1.

System Inventory

Changes to sysinv cpu_utils.py and to the stx-puppet platform params manifest file are required to allow the platform to be configured with only a single physical core via the "system host-cpu-modify" command.
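
A minimal sketch of the kind of validation change involved, assuming the check simply lowers the minimum allowed platform core count from 2 to 1; the function and constant names below are hypothetical and do not correspond to the actual sysinv code.

    # Hypothetical names; illustrates relaxing the minimum platform core check.
    MIN_PLATFORM_CORES = 1  # previously the platform assumed a minimum of 2

    def check_platform_core_allocation(num_platform_cores):
        if num_platform_cores < MIN_PLATFORM_CORES:
            raise ValueError(
                "at least %d platform core(s) required" % MIN_PLATFORM_CORES)
        return num_platform_cores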

Scale Down Services

Many platform services have a number of threads or worker processes that scales in direct proportion to the number of platform cores configured. However, many also have a minimum number of threads under the assumption that they must support a minimum scale. Changes to the worker-count logic in the stx-puppet platform params manifest file are required to allow these services to use only a single core. This change will also reduce the amount of memory allocated to each service.

The scale down takes place when a single core is allocated, respecting the existing worker allocation rules. For small footprints (AIO), the system defines the number of workers based on the number of platform cores, with a maximum of 2 for AIO-SX and 3 for AIO-DX (linear scaling with the number of platform cores). The proposed changes do not alter this rule; with the minimum limit relaxed from 2 to 1, the system scales the number of threads down to the minimum.
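
As a sketch, assuming the rules described above (maximum of 2 workers for AIO-SX, 3 for AIO-DX, minimum relaxed from 2 to 1), the worker count could be expressed as follows; the actual logic lives in the stx-puppet platform params manifest and the names here are illustrative.

    def platform_workers(platform_cores, system_type):
        # Illustrative only; mirrors the rule described above.
        max_workers = 2 if system_type == "AIO-SX" else 3  # AIO-DX
        min_workers = 1  # relaxed from the previous minimum of 2
        return max(min_workers, min(platform_cores, max_workers))

    # With a single platform core every scaled service drops to one worker:
    # platform_workers(1, "AIO-SX") == 1; platform_workers(2, "AIO-SX") == 2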

The following services shall be impacted:

Impacted Services

Service               Description
-------------------   -------------------------------------
postgres              Object-Relational database management
etcd                  Distributed key-value store
containerd            Container Runtime
memcached             Distributed Memory Object Cache
armada                Armada Application Management
keystone              Identity Management
barbican              Secret Management
docker-registry       Docker Container Registry
docker-token-server   Docker Token Server
kube-apiserver        Kubernetes API Server
kubelet               Kubernetes Node Agent

Kubernetes Tuning

These changes adjust some Kubernetes and etcd parameters and increase the number of parallel requests Kubernetes can handle based on the platform cores allocated. Additional tests may be required to define the best tuning values; an illustrative sketch follows the parameter list below.

  • kube-apiserver:
    • max-requests-inflight: Limits the number of API calls processed in parallel, which is an important control point for kube-apiserver memory consumption. The API server can be very CPU intensive when processing many requests in parallel.

  • kube-controller-manager, kube-scheduler, kubelet, kube-proxy:
    • kube-api-burst/kube-api-qps: These two flags set the sustained and burst rates at which these components talk to kube-apiserver.

  • etcd:
    • heartbeat-interval: This is the frequency with which the leader notifies followers that it is still the leader.

    • election-timeout: The election timeout should be set based on the heartbeat interval and average round-trip time between members.

    • snapshot-count: etcd appends every key change to a log, which would otherwise grow forever as a complete linear history of every change made to the keys; snapshot-count defines the number of committed transactions after which a snapshot is taken and the log can be compacted.
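
A sketch of how this tuning could be expressed as a function of the allocated platform cores is shown below. The flag names are the Kubernetes and etcd parameters listed above, but every value is a placeholder assumed only for illustration; the concrete values would come out of the additional testing mentioned earlier.

    def kubernetes_tuning(platform_cores):
        # Placeholder values only; real values to be defined by testing.
        # kube-scheduler, kubelet and kube-proxy would be scaled similarly.
        return {
            "kube-apiserver": {
                "max-requests-inflight": 200 * platform_cores,
            },
            "kube-controller-manager": {
                "kube-api-qps": 10 * platform_cores,
                "kube-api-burst": 20 * platform_cores,
            },
            "etcd": {
                "heartbeat-interval": 100,   # milliseconds
                "election-timeout": 1000,    # milliseconds
                "snapshot-count": 10000,
            },
        }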

Postgres Tuning

During our analysis, we identified many parameters related to parallel workers and the vacuum process as potential tuning targets for Postgres. This change adjusts the overall parameters based on the platform cores allocated. Additional tests may be required to define the best tuning values.
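
A sketch of the same idea for Postgres follows. The parameter names are standard Postgres settings, but the values are placeholders assumed for illustration, and the availability of some parameters depends on the Postgres version in use.

    def postgres_tuning(platform_cores):
        # Placeholder values only; real values to be defined by testing.
        return {
            "max_worker_processes": max(1, platform_cores),
            "max_parallel_workers_per_gather": 0 if platform_cores == 1 else 2,
            "autovacuum_max_workers": max(1, platform_cores),
            "autovacuum_naptime": "5min",
        }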

Service Management Watchdog

Enhance the sm-watchdog process on two different fronts:

  • Restrict its use to the required scenarios (avoid sm-watchdog on AIO-SX configuration).

  • Optimize the NFS monitoring to avoid the overhead on the proc file system while looking for NFS.

System Inventory

Periodic and Runtime Tasks

Currently, sysinv-conductor and sysinv-agent have many periodic tasks that should be reviewed and, if possible, redesigned. The main focus is to reduce the regular sysinv CPU spikes (an illustrative sketch follows the list below) by:

  • Refactoring legacy code;

  • Increasing time intervals when possible;

  • Converting periodic tasks to on-demand tasks, when possible.
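
As a sketch of the last two options, assuming an oslo_service-style periodic_task decorator (sysinv may use its own equivalent module); the task names and intervals below are illustrative, not actual sysinv code.

    from oslo_service import periodic_task

    class ConductorManager(periodic_task.PeriodicTasks):

        # Option: widen the audit interval of a task that must stay periodic.
        @periodic_task.periodic_task(spacing=600)  # e.g. was 60 seconds
        def _periodic_audit_example(self, context):
            ...

        # Option: drop the periodic decorator and run the same work only when
        # an explicit API/RPC event makes it necessary (on demand).
        def refresh_example(self, context):
            ...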

Remote Procedure Calls

System Inventory Remote Procedure Calls (RPCs) are performed using RabbitMQ as the communication transport layer between the different processes. The target is to convert the internal System Inventory RPC calls from RabbitMQ to ZeroMQ, a brokerless solution that does not require a separate message broker process.
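
A minimal brokerless request/reply sketch with pyzmq is shown below, assuming a simple REQ/REP pattern; it only illustrates the transport idea (the two processes talk directly, with no broker in between) and is not the sysinv RPC implementation. The endpoint and method names are hypothetical.

    import zmq

    def reply_server(endpoint="tcp://127.0.0.1:5555"):
        # Plays the role of e.g. sysinv-conductor answering one RPC request.
        sock = zmq.Context.instance().socket(zmq.REP)
        sock.bind(endpoint)
        request = sock.recv_json()
        sock.send_json({"result": "ok", "echo": request})

    def rpc_call(endpoint="tcp://127.0.0.1:5555"):
        # Plays the role of e.g. sysinv-api issuing one RPC request.
        sock = zmq.Context.instance().socket(zmq.REQ)
        sock.connect(endpoint)
        sock.send_json({"method": "example_method", "args": {}})
        return sock.recv_json()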

Affected sysinv modules:

  • agent

  • api

  • conductor

  • cmd

  • fpga_agent

  • helm

  • scripts/manage-partitions

Alternatives

An alternative is to use gRPC instead of ZeroMQ. This option should be analyzed further if the proposed ZeroMQ solution proves unusable.

Data model impact

None

REST API impact

None

Security impact

None

Other end-user impact

The default configuration for platform cores will be changed to 1 core, and the system recommendations will be adjusted to reflect the minimum required platform cores for each processor/use case. The end user must be aware of the hardware requirements and limitations and configure the system according to their workload scenario.

Performance Impact

To maintain system stability while operating with fewer compute resources, it may be necessary to adjust the priority of critical system and platform processes during the execution of this spec. If process starvation occurs, the system may reboot or declare specific services failed and attempt recovery; in that case, the priority of the starved process will need to be increased.
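
As a minimal sketch of what raising a process's priority amounts to, assuming a plain nice-value adjustment from Python (the actual mechanism, e.g. service unit settings or real-time scheduling classes, would be decided during implementation):

    import os

    def raise_priority(pid, nice_value=-10):
        # A lower nice value gives the process more CPU time under contention.
        # Lowering the nice value of another process requires root privileges.
        os.setpriority(os.PRIO_PROCESS, pid, nice_value)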

Potential Service Impacts

Service     Description
---------   ------------------------
hostwd      Host Watchdog
pmond       Process Monitor
sm          Service Manager
kubelet     Kubernetes Node Agent
hbsAgent    Heartbeat Service Agent
hbsClient   Heartbeat Service Client
mtcAgent    Maintenance Agent
mtcClient   Maintenance Client

In a distributed cloud scenario, some timing impact is expected on subcloud operations due to the resource limitation, but no impact on scalability is expected.

Other deployer impact

Automated deployment technologies should be aware of the new ZeroMQ library dependency.

Developer impact

We assume that there is no visible developer impact.

Upgrade impact

The new ZeroMQ message queue library being added to sysinv could impact backup and restore, upgrade, and rollback. Tests should be performed to validate both the new and the old behavior.

Implementation

Assignee(s)

Primary assignee:
  • Guilherme Batista Leite (guilhermebatista)

Other contributors:
  • Alexandre Horst (ahorst)

  • Alyson Deives Pereira (adeivesp)

  • Bruno Costa (bdacosta)

  • Caio Cesar Ferreira (ccesarfe)

  • Davi Frossard (dbarrosf)

  • Eduardo Alberti (ealberti)

  • Guilherme Alberici de Santi (galberic)

  • Isac Sacchi e Souza (isouza)

  • Marcos Paulo Oliveira Silva (mpaulool)

  • Romão Martines (rmartine)

  • Thiago Antonio Miranda (tamiranda)

Repos Impacted

List repositories in StarlingX that are impacted by this spec:
  • starlingx/ansible-playbooks

  • starlingx/config

  • starlingx/config-files

  • starlingx/integ

  • starlingx/stx-puppet

  • starlingx/docs

Work Items

Scale Down Services

  • Adjust the following platform services to account for the minimum number of threads/processes based on the system configuration and the number of platform cores: barbican, containerd, docker-registry, docker-token-server, keystone, kube-apiserver, kubelet, memcached, postgres.

System Inventory

  • Adjust sysinv check to allow 1 platform core utilization

  • Change default behavior to 1 platform core utilization

  • Legacy code refactoring

  • Review existing periodic tasks converting them to on-demand if possible

  • Adjust periodic tasks’ timing interval based on each task’s needs

  • Refactor sysinv-fpga-agent so that it is launched only when it is required

  • Cleanup/review of the existing RPCs to adopt a more consistent RPC usage model and to reduce the number of different calls that need to be supported.

  • Convert internal RPC calls from RabbitMQ to the brokerless solution ZeroMQ.

Kubernetes

  • Adjust overall Kubernetes configuration parameters based on the platform cores allocated

  • Investigate/enhance the number of parallel requests Kubernetes can handle based on the platform cores allocated.

etcd

  • Adjust etcd configuration parameters based on the platform cores allocated

Postgres

  • Adjust overall Postgres configuration parameters based on the platform cores allocated

  • Evaluate and tweak the vacuum process

Service Management Watchdog

  • Evaluate whether the NFS audit condition is still present in the system and whether this audit is still required, before optimizing the solution

  • Restrict its use to the required scenarios (avoid sm-watchdog on the AIO-SX configuration) or remove it entirely if its audit is unnecessary

  • Optimize the NFS monitoring (if it is still required) to avoid the overhead on the proc file system while looking for nfsd threads

Overall Performance Evaluation

  • After all proposed changes are implemented, evaluate the minimum hardware requirements (processor frequency, cache size and number of cores) and the workload scenarios that enable StarlingX operation on a single platform core

  • Verify if process starvation is occurring. If that is the case, adjust the priority of critical system and platform processes, as mentioned in Performance Impact

  • Update the documentation with the minimum hardware requirements.

Dependencies

  • Postgres should be up-versioned to 9.4.X or higher

Testing

System Configurations

The system configurations that we are assuming for testing are:

  • Standalone - AIO-SX

  • Standalone - AIO-DX

  • Distributed Cloud

Test Scenarios

We selected some tests that should be defined or changed to cover this spec:

  • The usual unit testing in the impacted code areas

  • Full system regression of all StarlingX applications functionality (system application commands, lifecycle actions, etc)

  • Performance testing to identify and address any performance impacts.

  • Backup and restore tests

  • Upgrade and rollback tests

  • Sysinv RPC communication tests

  • Distributed Cloud evaluation of scalability and parallel operations

  • In addition, this spec changes the way a StarlingX system is installed and configured, which will require changes in existing automated installation and testing tools.

Documentation Impact

The end-user documentation will need to be updated to indicate the minimum hardware requirements (number of cores, frequency and cache sizes) and workload scenarios when using a single platform core for StarlingX. For instance, since the more pods are running, the more CPU processing is needed for their management (processes such as kubelet and containerd-shim), the documentation should be reviewed to state the minimum number of platform CPU cores based on the number of pods.

Documentation should also be reviewed to cover the replacement of RabbitMQ with ZeroMQ for RPC communication between the sysinv processes (sysinv-agent, sysinv-api and sysinv-conductor).

If any new limitations, recommendations, or requirement updates are identified during the development of the proposed changes in this spec, they shall be included in the documentation as well.

References

  1. Armada

  2. FluxCD

  3. RabbitMQ

  4. Firehose

  5. ZeroMQ

  6. gRPC

History

Revisions

Release Name   Description
------------   -----------
stx-8.0        Introduced