Alarm ExpiringSoon and Expired Certificates on StarlingX

Storyboard: https://storyboard.openstack.org/#!/story/2008946

This feature introduces alarms, raised through the existing Fault Management (FM) framework, for certificates that have expired or are about to expire.

Problem Description

Expired certificates prevent the proper operation of the platform. The platform currently supports various certificates that are created manually off-platform and then installed and updated by the user, while other certificates are managed and auto-renewed by cert-manager.

When certificates are installed and managed manually, they have to be closely monitored to avoid expiry. Certificates managed by cert-manager will auto-renew, but renewal may fail because of errors or a failure to communicate with external CAs. The user needs an appropriate warning mechanism in both cases.

Example Use Cases

  • The Docker registry certificate is expiring soon and the user may be unaware of the approaching expiry date.

  • The ssl certificate has expired and did not auto-renew as expected. The user is unable to communicate securely via HTTPS.

In the use cases described above, the end user would have been unaware of potential problems on the platform. With the proposed feature, the user is forewarned so that corrective action can be taken.

Proposed change

A new service called ‘cert-alarm’ will be introduced on the platform to audit certificate expiry dates and to communicate with the fault management system to raise and clear alarms. The service will run as a controller service (managed by service manager (sm) as active-standby) on all active controllers.

Certificate management on the platform currently supports the following three methods:

  • Using cert-manager (a k8s resource)

  • Using a k8s TLS secret that is not managed by cert-manager (a k8s resource)

  • Using the ‘system certificate-install’ command, where certificates reside as PEM files on the filesystem and in the sysinv database (not a k8s resource)

The default will be to alarm on all certificate entities, with the user able to opt out if the certificate is a k8s resource. This control will be provided via Kubernetes annotations. Kubernetes annotations can also customize some of the alarm settings, as listed below.

The configurable options will include the ability to:

  • Enable/disable the alarm (default=enabled)

  • Change the alarm-before number of days (default=15d)

  • Change the alarm severity (default=depends on the alarm type)

  • Set custom alarm text (default=None)
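The options above could be read from a resource's annotations roughly as follows. This is a minimal sketch, not the cert-alarm implementation: the helper name, the `AlarmSettings` type, the accepted severity values, the `disabled` value, and the choice to fall back to the default on an unparsable value are all assumptions made for illustration. Only the annotation keys and the defaults come from this spec.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

PREFIX = "starlingx.io/"  # prefix used by the annotations proposed in this spec
KNOWN_SEVERITIES = {"critical", "major", "minor", "warning"}  # assumed accepted values


@dataclass
class AlarmSettings:
    enabled: bool = True             # default=enabled
    alarm_before_days: int = 15      # default=15d
    severity: Optional[str] = None   # None: severity depends on the alarm type
    text: Optional[str] = None       # default=None


def parse_alarm_annotations(annotations: Mapping[str, str]) -> AlarmSettings:
    """Hypothetical helper: derive alarm settings from k8s annotations."""
    settings = AlarmSettings()

    # Assumption: only the literal value "disabled" turns the alarm off.
    if annotations.get(PREFIX + "alarm", "").strip().lower() == "disabled":
        settings.enabled = False

    # Assumption: the value is a day count with a trailing "d", e.g. "30d".
    before = annotations.get(PREFIX + "alarm-before", "").strip().lower()
    if before.endswith("d") and before[:-1].isdigit():
        settings.alarm_before_days = int(before[:-1])

    # Unknown severities are ignored so the alarm-type default still applies.
    severity = annotations.get(PREFIX + "alarm-severity", "").strip().lower()
    if severity in KNOWN_SEVERITIES:
        settings.severity = severity

    text = annotations.get(PREFIX + "alarm-text")
    if text:
        settings.text = text

    return settings
```

Falling back to defaults on malformed values keeps a mistyped annotation from silently disabling monitoring, which seems the safer failure mode for an alarming feature.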

Customization of alarms for certificates managed by ‘system certificate-install’ (PEM file) will not be allowed/supported. StarlingX intends to move all certificates to be configured via k8s resources in future releases, so customization effort for non-k8s certificate configuration will not be included as part of this feature.

Audit CertExpiry

A full audit on all certificate resources will be performed on service startup, restart and periodically. The periodic timer will run every 24 hours.

  • The auditing mechanism will iterate over the cert-manager managed certificates (and their associated k8s TLS secrets), and will only raise an alarm if the user-configured renew-before of the certificate is past, i.e., cert-manager has attempted renewing the certificate, but failed.

  • cert-alarm will then iterate over all k8s TLS secrets that are not managed by cert-manager.

  • cert-alarm will then process the rest of the non-k8s resources that reside in sysinv DB (stored as PEM files on filesystem).

While alarms are active, an additional audit will run every hour on the alarmed entities only.
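The per-certificate decision made during an audit can be sketched as below. The function name and return values are hypothetical. The sketch also assumes that, for cert-manager managed certificates, both the alarm-before window and the renew-before window must have opened before an ExpiringSoon alarm is raised; the spec only states that renew-before must have passed.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def classify(
    not_after: datetime,
    now: datetime,
    alarm_before_days: int = 15,
    renew_before: Optional[timedelta] = None,
) -> Optional[str]:
    """Return "expired", "expiring_soon", or None when no alarm is needed.

    renew_before is set only for cert-manager managed certificates.
    """
    if now >= not_after:
        return "expired"

    in_alarm_window = now >= not_after - timedelta(days=alarm_before_days)

    # For cert-manager certificates, stay quiet until the renewal point has
    # passed: before that, cert-manager has not yet had a chance to renew.
    renewal_overdue = renew_before is None or now >= not_after - renew_before

    if in_alarm_window and renewal_overdue:
        return "expiring_soon"
    return None


if __name__ == "__main__":
    expiry = datetime(2030, 1, 31, tzinfo=timezone.utc)
    print(classify(expiry, datetime(2030, 1, 20, tzinfo=timezone.utc)))
```

Because the function returns a single state, a certificate entity can never be classified as both ExpiringSoon and Expired at the same time.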

Alternatives

Part of the solution could be implemented with cronjobs and/or Kubernetes CronJobs that audit the expiry dates, but these do not provide enough control or room for custom code.

Another alternative to introducing the cert-alarm service is to introduce a new k8s application as a k8s deployment to perform the monitoring. This is not being pursued here since containerization of the StarlingX flock is a feature on its own.

For maintaining the list of certificates to monitor, updating the database entries and extending the tables for customization was also considered. Because the user can have new applications that are unknown to the platform, and those certificates should also be monitored, the proposed solution uses the existing Certificate and TLS Secret resources in the k8s etcd database, with annotations for customizing the certificate alarming behaviour. Another model would be to let the user pass config parameters at runtime, which is not suitable for the platform.

Data model impact

New alarm types for expiringSoon and expired certificates will be defined in the fault management system to support this feature.

A couple of examples of alarm details follow.

Alarm raised after the SSL certificate has expired:
Alarm ID: 255.001
Reason Text: "Certificate 'system certificate-install -mode ssl' expired"
Entity ID: system-certificate=ssl
Severity: Critical
Timestamp: <Timestamp value when alarm is raised>

Alarm raised to warn that the docker registry certificate is expiring soon:
Alarm ID: 260.012
Reason Text: "Certificate 'docker-registry' expiring soon in <X> days, on <date>"
Entity ID: system-certificate=docker.registry
Severity: Major
Timestamp: <Timestamp value when alarm is raised>
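The reason texts in the examples above follow a fixed pattern, which could be rendered as in this sketch. The function names are hypothetical, and the date format (YYYY-MM-DD) and the use of whole days rounded down are assumptions, since the spec shows only the `<X>` and `<date>` placeholders.

```python
from datetime import datetime, timezone


def expiring_soon_text(name: str, not_after: datetime, now: datetime) -> str:
    """Render the ExpiringSoon reason text for a certificate."""
    days_left = (not_after - now).days  # whole days remaining, rounded down
    # Assumption: the expiry date is rendered as YYYY-MM-DD.
    return (f"Certificate '{name}' expiring soon in {days_left} days, "
            f"on {not_after:%Y-%m-%d}")


def expired_text(name: str) -> str:
    """Render the Expired reason text for a certificate."""
    return f"Certificate '{name}' expired"


if __name__ == "__main__":
    now = datetime(2030, 1, 1, tzinfo=timezone.utc)
    expiry = datetime(2030, 1, 11, tzinfo=timezone.utc)
    print(expiring_soon_text("docker-registry", expiry, now))
    print(expired_text("ssl"))
```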

REST API impact

None.

Security impact

None. The feature will access certificates on the platform in order to check their expiry dates.

Other end user impact

The user will see alarms in the ‘fm alarm-list’ output as certificates approach their expiry dates. Certificates that have expired will raise a higher-severity alarm to alert the user.

The user will need to update annotations in order to change the default behavior. An example is shown below; the newly supported annotations are marked with a comment in the following k8s resource.

Name:         system-restapi-gui-certificate
Namespace:    deployment
Kind:         Secret
Type:         kubernetes.io/tls
Annotations:  cert-manager.io/alt-names:
              cert-manager.io/certificate-name: system-restapi-gui-certificate
              cert-manager.io/common-name: 10.10.10.3
              cert-manager.io/ip-sans: 10.10.10.3
              cert-manager.io/issuer-kind: Issuer
              cert-manager.io/issuer-name: my-ica-cert-and-key-issuer
              cert-manager.io/uri-sans:
              starlingx.io/alarm: enabled           # New annotation
              starlingx.io/alarm-before: 30d        # New annotation
              starlingx.io/alarm-severity: critical # New annotation
              starlingx.io/alarm-text: "foobar"     # New annotation
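One way a user or script might apply such annotations is with a JSON merge patch on the resource's metadata. The sketch below only builds the patch body; the helper name is hypothetical, and the `disabled` value is an assumption (the spec shows only `enabled`). Annotation values in Kubernetes are always strings, which is why the day count is rendered as text.

```python
import json
from typing import Optional


def annotation_patch(
    alarm_before_days: int,
    severity: Optional[str] = None,
    enabled: bool = True,
    text: Optional[str] = None,
) -> str:
    """Build a JSON merge patch that sets the cert-alarm annotations."""
    annotations = {
        # Assumption: "disabled" is the opt-out value.
        "starlingx.io/alarm": "enabled" if enabled else "disabled",
        "starlingx.io/alarm-before": f"{alarm_before_days}d",
    }
    # Optional settings are omitted so the service defaults keep applying.
    if severity is not None:
        annotations["starlingx.io/alarm-severity"] = severity
    if text is not None:
        annotations["starlingx.io/alarm-text"] = text
    return json.dumps({"metadata": {"annotations": annotations}})


if __name__ == "__main__":
    print(annotation_patch(30, severity="critical", text="foobar"))
```

The resulting string has the shape accepted by a merge patch, for example via `kubectl patch` with `--type merge`, against the Secret or Certificate in question.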

Performance Impact

In large Distributed Cloud systems, there can be thousands of certificates and TLS secrets (there is a unique DcAdminEpIntermediateCA for each subcloud). In order to scale, the cert-alarm audit algorithm will skip the DcAdminEpIntermediateCA Certificates/Secrets for the subclouds that are present on the SystemController. Since these DcAdminEpIntermediateCA secrets are also available on each subcloud, they will be audited and alarmed on the subcloud.

The full certificate alarm audit is run once every 24 hours, and the optional hourly certificate alarm audit only runs when a certificate alarm is active and only audits alarmed certificates. The frequency of checks is thus low, and not expected to have a performance impact.
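The two audit tiers and the SystemController skip rule can be sketched as a single selection function. Everything named here is hypothetical: the function, the hour-based tick, and in particular the `skip` predicate, which the caller supplies because this spec does not define how subcloud DcAdminEpIntermediateCA resources are identified.

```python
from typing import Callable, Iterable, List, Set

FULL_AUDIT_PERIOD_HOURS = 24  # full audit interval from this spec


def entities_to_audit(
    hours_since_startup: int,
    all_entities: Iterable[str],
    alarmed: Set[str],
    skip: Callable[[str], bool] = lambda name: False,
) -> List[str]:
    """Select the certificate entities to audit on an hourly tick."""
    # Entities matched by skip (e.g. per-subcloud CAs on the
    # SystemController) are never audited here.
    candidates = [e for e in all_entities if not skip(e)]

    # Hour 0 is service startup; every 24th hour is the periodic full audit.
    if hours_since_startup % FULL_AUDIT_PERIOD_HOURS == 0:
        return candidates

    # Hourly pass: only entities that currently have an active alarm.
    return [e for e in candidates if e in alarmed]


if __name__ == "__main__":
    certs = ["ssl", "docker_registry", "subcloud1-adminep-ca"]
    is_subcloud_ca = lambda name: name.startswith("subcloud")  # illustrative only
    print(entities_to_audit(0, certs, set(), is_subcloud_ca))
    print(entities_to_audit(5, certs, {"ssl"}, is_subcloud_ca))
```

With no active alarms the hourly pass selects nothing, so the steady-state cost is one full pass per day.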

Other deployer impact

None.

Developer impact

None.

Upgrade impact

If an alarm is marked as managementAffecting, it will impact upgrades. The intention of this feature is to mark only expired platform certificates as managementAffecting. In such a case, the user will be unable to perform and complete upgrades until the certificate is updated.

Only platform certificates such as ‘Kubernetes-RootCA’, ‘ssl’, ‘docker_registry’ will be managementAffecting (and will carry a Critical severity).

Implementation

Assignee(s)

Primary assignee:

  • Sabeel Ansari

Repos Impacted

  • config

  • fault

  • ha

Work Items

  • Create new service cert-alarm

  • Framework code to support new alarms

  • FM integration to discover existing alarms + publish alarms

  • Unit tests

Dependencies

None

Testing

  • New code introduced must include unit test cases

  • Alarms should be raised as certificates approach expiry dates

  • Alarms should be raised when certificates are expired. The alarm should have higher severity than ExpiringSoon

  • Only an ExpiringSoon or Expired alarm should exist (never both) for a certificate entity

  • Testing should include all platform managed certificates

  • Testing should include all certificate configurations: cert-manager managed certificates, certificates in k8s TLS secrets, and manually installed certificates via ‘system certificate-install’.

  • Testing should be performed on all configurations: AIO-SX, AIO-DX, Standard, Distributed Cloud, etc.

  • Alarms should persist across reboots, upgrades, etc. In addition, cert-alarm should be able to monitor any certificate entities newly introduced in the N+1 release.

Documentation Impact

End user documentation needs to be updated with all new alarm codes and their respective details and impact. In addition, the documentation should recommend the corrective action that the user needs to take to address each alarm.

The documentation will also capture details for customizing certificate alarming behavior using k8s annotations.

References

History