News

11.10.2023

Franziska Feldmann

ALASCA Newsletter #3

ALASCA Newsletter Q3/2023

Foreword

Dear members, dear interested parties,

We look back on one year of the association's existence and are very proud of how ALASCA has developed since its founding on 29 September 2022.

For example, we are very pleased with the positive response ALASCA is receiving in the open source community. In addition, we have already grown and are happy to welcome new members to our ranks. Our ALASCA Tech Talks also continue to grow and we have exciting talks in the pipeline.

Furthermore, we are in very concrete discussions about the inclusion of further open source projects, which will soon be promoted under the ALASCA flag alongside Yaook.

In the future, in addition to our own formats, we will also support exciting projects from within ALASCA and will co-host and sponsor the SCS Summit at the beginning of November. You can look forward to this and other topics in the third issue of our newsletter.

Have fun reading!

Marius Feldmann
Board of Directors ALASCA e.V.


New members and projects

ALASCA is growing, and we are pleased to announce that the ALASCA community has officially grown by two more members. In addition to the 7 founding members, 23 Technologies GmbH and Daiteap GmbH & Co KG, as well as two private members, are now among the sponsors of our association. We are pleased to have gained further Kubernetes experts in these two companies, who are helping to bring the topics of digital sovereignty and open source to the public.

Furthermore, we are looking forward to the current developments around the onboarding of new projects. As announced in our last ALASCA Tech Talk with Christian Berendt, we are currently examining whether the open source project YAKE meets the requirements for admission to our association.

ALASCA invites you to the General Assembly

On 17 November, we invite you to the 1st General Assembly in Dresden and look forward to welcoming all members and their representatives there. We will use our get-together to review the past year, dive into the planning of the coming association year, define new goals and consider how we want to shape our association life and activities to strengthen the digital sovereignty of cloud infrastructures. In a small workshop, we also want to address current topics regarding the structure of the association, such as the admission and onboarding process for new projects and the composition of the Technical Steering Committee. In doing so, we will build on the results of the workshop held in May this year.

ALASCA co-hosts SCS Hackathon

The 3rd SCS Hackathon will take place on 8 November 2023. We are very pleased to welcome the SCS community to the premises of Cloud&Heat Technologies in Dresden and to support this exciting event as co-organisers. In addition to topics around the (further) development of the Sovereign Cloud Stack, the future cooperation between ALASCA and the SCS will also be a topic.

The evening before, we will meet for a relaxed social event with a guided tour of the city, followed by food and drink, to which our ALASCA members are also cordially invited.

We are very much looking forward to welcoming the SCS and its community to Dresden and to an exciting exchange.


ALASCA Tech Talks

Maybe you have already seen it. If not, we would like to point you to our new ALASCA Tech Talks landing page, which has been part of our website for about a month. There you can find all information about the ALASCA Tech Talks condensed on one page, easily access the recordings of past Tech Talks and download the corresponding calendar entries and access details. Have a look at our landing page.

New date

Please note that the date of the ALASCA Tech Talks has changed. In future, they will take place on the last Friday of every month at 11 a.m. There's no better way to start the weekend, is there?

In order for us and the open source community to continue to grow, we greatly appreciate your commitment and your diligent sharing of upcoming Tech Talks with your network. If you would like to give a Tech Talk yourself, please feel free to contact us with suggestions.

October Tech Talk

On 27 October 2023, you can look forward to a talk by Matthias Haag, CEO and founder of UhuruTec AG, on the topic of "A virtual environment to develop baremetal provisioning". You can find more about the Tech Talk here: https://alasca.cloud/alasca-tech-talk-10/

Update on Yaook open source project

As usual, in the third issue of our newsletter we would like to highlight the most important current features, bug fixes and updates around our open source project Yaook and give you an outlook on the future direction of the project.

Yaook is an open source project that consists of three sub-projects: Yaook/K8s, Yaook/Operator and Yaook/Baremetal. For a better overview, we have broken down all innovations according to this subdivision below.

Features:

Yaook/K8s:

  • Helmify Thanos: The Bitnami Helm chart is now used to set up Thanos. This allows the JSONnet code currently used to be removed and the code base to be simplified. Helm chart version 12.13.3 is used as the default for Thanos. This Helm chart variant will replace the previously used Thanos_v1 in the near future.

  • Thanos_v2: Metrics were enabled and the obsolete ServiceMonitor was removed.

  • Terraform now uses "Ubuntu 22.04 LTS" as the default image.

  • New: support for Kubernetes v1.26. The versions of the components assigned to the Kubernetes role have been updated accordingly. Furthermore, the migration to the Calico Tigera operator is enforced for upgrades to Kubernetes v1.26. This previously announced step is necessary to eventually get rid of the manifest-based installation method, at the latest when support for Kubernetes v1.25 is discontinued.

  • snapshot-controller: A basis for supporting future versions of the Volume Snapshot Controller has been added. The current version is mapped to the configured Kubernetes version in the k8s-config role and is also updated during Kubernetes upgrades.

  • A NodeSelector was added to the Tigera operator. Since the Tigera operator is a system component, it is ensured that it is scheduled on the control plane so that it cannot be displaced by worker node pressure or similar. This is especially necessary because the Helm chart does not currently allow a PriorityClass to be configured.

  • From now on, we use "towncrier" for semi-automatic documentation and announcement of release notes. It also prevents merge conflicts when manually editing the CHANGELOG. If you are a developer, feel free to check out the Coding Guide for further information.

  • Nix package manager: Nix is a declarative package manager that supports NixOS but can also be installed as an additional package manager on any other distribution. A flake.nix file has been added to the project that references all necessary dependencies tied to specific versions.

  • flake.nix" and "poetry" get pre-commit hooks that are in addition to "direnv" have been added for an autoinstall/setup. Thus, the .envrc file has the ability to automatically load an isolated Python environment (venv) when entering and updating the cluster repository (yk8s; Yaook-Kubernetes). In addition, the requirements.txt file (python) was replaced by poetry (pyproject.toml) to "pin" 3rd party library versions. Poetry allows to set Python dependencies declaratively and thus to pin versions. This ensures that inconsistencies between individual development environments are reduced. The dependencies are stored in the pyproject.toml file and "pinned" via a poetry.lock file. 

  • The current 'nix_direnv' is loaded (sourced) when "flake" (Nix package manager) is used. This makes entering and loading the environment much faster.

  • Rook v1.8.10 is now supported. The update is started by setting the version in config.toml to "version=1.8.10" and then executing 'MANAGED_K8S_RELEASE_THE_KRAKEN=true AFLAGS="--diff --tags mk8s-sl/rook" managed-k8s/actions/apply-stage4.sh' on the command line (see the Rook sketch at the end of this list).

  • passwordstore (pass) has been replaced by HashiCorp Vault. In particular, Vault's PKI functions are used to manage certificates properly rather than just publishing the private key somewhere. All known existing uses of secrets or credentials have been moved to Vault. A migration script for existing applications is provided, as well as documentation about Vault in general and this use of Vault in particular. Compatibility with the Metal Controller is only partial; changes are required.

  • Wireguard link MTU (Maximum Transmission Unit): In some cases, the MTU for the Wireguard link must be set separately for the tunnel to function properly. There is therefore now an environment variable for the MTU of the Wireguard link.

  • rook_v2: Added support for bare metal configurations.

    • default Helm chart version is now v1.11.8

    • no marko is used for resource limits/requests

    • small pauses before upgrade checks give the operator time to start an upgrade.

    • Create Prometheus rules only if monitoring is activated

    • Rook chart and Ceph versions are now independent; the code for each new Rook Helm chart no longer needs to be adapted. However, it is recommended to read the release notes beforehand.

  • New .envrc template: The .envrc template was changed to include only cluster-specific variables and to load user-specific variables from different sources.
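
To illustrate the development-environment tooling mentioned above (Nix flake, direnv and Poetry), here is a minimal sketch of how entering the cluster repository and setting up the pinned Python environment might look. It assumes that Nix (with flakes enabled), direnv and Poetry are installed locally; the repository path is only a placeholder.

    # Sketch only; adjust the path to your own checkout of the cluster repository.
    cd yk8s/            # placeholder path for the cluster repository (yk8s)
    direnv allow        # lets .envrc load the isolated Python environment automatically
    nix develop         # optional: enter the development shell defined in flake.nix (if one is defined)
    poetry install      # installs the dependencies pinned in pyproject.toml / poetry.lock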
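
The Rook update from the bullet on Rook v1.8.10 above comes down to a config change plus one action script. The following sketch restates the documented steps as shell commands; it assumes the double dashes in the AFLAGS options were lost in rendering, so check the release notes and your own setup before running it.

    # Sketch of the documented Rook update steps.
    # 1. In config.toml, set the Rook version as described above, e.g. version = "1.8.10"
    # 2. Trigger the upgrade via the stage-4 action script:
    MANAGED_K8S_RELEASE_THE_KRAKEN=true AFLAGS="--diff --tags mk8s-sl/rook" managed-k8s/actions/apply-stage4.sh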

Yaook/Operator:

  • Switch to new licence/dependency scans: The old licence scan from GitLab is outdated and its job fails. The new dependency scan uses Python 3.10, the latest supported Python version.

  • Persistent libvirt/qemu logs are now supported.

  • OpenStack releases "zed" and "antelope" for Glance

  • Implementation of the "Eviction Manager" to minimise the impact of a defective hypervisor. A Kubernetes pod is created that monitors whether all resources are still available for Nova Compute. If a node fails, the pod initiates an eviction and removes (evicts) the running tasks. It then puts the affected node into 'Ironic Off' mode.

  • New state in NovaDeployment: A new state is implemented in the Nova deployment. This state implements an "Eviction Manager", which monitors the Nova API to retrieve a list of hypervisors or compute services. If the Nova API marks a hypervisor as failed, the "Eviction Manager" triggers an eviction process. To prevent eviction of nodes that are actually healthy, for example due to a network partition or control plane failure, a threshold prevents eviction if too many hypervisors appear to have failed at the same time.

Yaook/Baremetal:

  • A third deployment option was added to allow generic creation of base.sh. The options are:

    1. base.sh -> Standard from yaook/k8s

    2. "none" -> base.sh is empty

    3. "custom base.sh" -> based on a path that points to a file. Currently, it is only possible to deploy nodes with Kubernetes. This makes it possible to specify in Netbox on which node no K8s should be rolled out.

  • A solution was implemented for cases in which a device does not have IPMI (Intelligent Platform Management Interface) functionality. Where a node has no in-band IPMI functionality, Yaook needs to be able to handle the situation; in this scenario, Yaook uses the BMC (Baseboard Management Controller) address registered via Dynamic Domain Name System (DDNS) together with the node's serial number to compensate for the lack of IPMI.

Fixes:

Yaook/K8s:

  • Object clean-up during Thanos migration: Introduction of an additional migration backup task that checks whether the migration has already taken place.

  • IPsec: Provide the notification script with the correct file extension. This fixes the problem that the Keepalived-Notifier.sh script looks for .sh files while Yaook provides the template 10-swanctl-notify.sh.j2 with the .j2 extension, which causes failures when the Keepalived instance on the primary gateway fails.

  • thanos_v2: Bug fix for scheduling key templating

  • monitoring: Bug fix for missing monitoring_scheduling_key variable

  • containerd: Bug fix for missing node_has_gpu variable

  • Bug fix: In non-GPU clusters, the 'containerd' and 'kubeadm-join' roles would fail due to the variable 'has_gpu' not being defined. This fix changes the order of the condition so that 'has_gpu' is only checked if GPU support is enabled for the cluster. This change can be seen as a kind of workaround for a bug in Ansible: normally, the variable 'has_gpu' would be set in a dependency of both roles, but Ansible skips dependencies if they have already been skipped earlier in the play.

  • Bug fix in the vault.sh script: The script previously terminated if the file 'config.hcl' already existed. This fix became necessary due to a change in the behaviour of the '--no-clobber' option in Coreutils version 9.2.

  • By default, the VRRP (Virtual Router Redundancy Protocol) priorities are now spaced further apart. This adjustment reduces the likelihood of race conditions when two nodes simultaneously attempt to take over the tasks of a failed primary node. Such simultaneous takeovers could lead to service interruptions in certain edge cases.

  • Volume Snapshot CRDs have been created: The creation of volume snapshots requires volume snapshot CRDs and a snapshot controller, which were previously missing but have now been provided.

  • Removing Docker remnants: Support for Docker was removed along with Kubernetes versions below 1.24. However, some parts of the code still depended on the unnecessary variable "container_runtime". This change removes the use of this variable in all code and also deletes the documentation on migrating from Docker to Containerd.

  • Vault fix for clusters without Prometheus: Previously, the Vault role always tried to create ServiceMonitors that required the CRD of Prometheus. With this commit, the creation of ServiceMonitors is now made dependent on whether monitoring is enabled or not.

  • Rook fix for OpenStack differentiation: Default values now only apply to clusters running on OpenStack and using Cinder CSI as storage classes. This configuration is made optional for clusters on bare metal systems.

  • IPsec: The passwordstore role is only inserted when IPsec is enabled. IPsec has not yet been fully migrated to Vault and is still dependent on the passwordstore role. If IPsec is not used, initialisation of a passwordstore is not required. This commit solves this problem by including the role via include_role instead of it being a dependency.

  • Rook: A correction is made when iterating common monitoring labels. The default format must be a dictionary, not an array. The previous condition caused errors when iterating over the list of common monitoring labels when using the default format because elements were not available for arrays.

Yaook/Operator:

  • Fix: Removal of the ClusterIP for OVN vSwitchd: The ClusterIP of OVS-vswitchd caused environment variables to be generated in all pods. In large environments, this slowed down bash and thus affected liveness and readiness probes. The problem is described in detail in this Kubernetes issue.

  • Fix for the OVSDB postStart hook: The OVSDB postStart hook previously had a time limit of 10 seconds for starting the OVSDB. This limit was aligned with the value of the liveness probe to give the OVSDB more time to initialise; to be on the safe side, the time limit was increased to 2 minutes.

  • Keystone cache clearing fix: This fix addresses an issue where an exception was thrown when a parallel execution process attempted to clear an entry that had already been removed.

  • Memory leak fix backport from upstream source: This is a backport or monkey patch for a memory leak fix (https://review.opendev.org/c/openstack/openstacksdk/+/890781). Since OpenStackSDK rarely releases new versions, this patch was applied to fix the memory leak.

  • Increase of the threshold for the liveness probe of the OVSDB for agents: A change causes the OVSDB to be deleted after it is restarted. If containers are restarted due to liveness probes, this would break connections to running virtual machines (VMs). To prevent this, a high threshold is set so that the container is only terminated if there is an actual problem.

  • Fix: Slow upload of Glance images: Due to an update to Python 3.8 and Debian, the location of the trust store changed, which affected the upload speed of Glance images. This fix updates the inclusion of CA certificates to resolve the issue.

Yaook/Baremetal:

  • Fix: Remove line breaks from the last line of comments: This fix addresses an issue where line breaks were present in the last line of comments. These line breaks, when sent to Netbox, could result in endless loops where the same comment was rewritten in each iteration. Netbox would remove the line breaks anyway, resulting in unnecessary API calls.

  • Correctly revoke the Vault authentication token: This change ensures that the Vault authentication token is correctly revoked. Previously, the code caused exceptions that interfered with access to Vault for the Metal controller, especially after upgrading the 'hvac' library to version 0.11.2. The relevant documentation for the 'Token.revoke_self' method can be found here.

Updates:

Documentation improvements: We have further improved the project documentation, making it more comprehensive, accessible and user-friendly. In addition, the documentation now includes a detailed User Guide which sheds more light on the concepts of Yaook.

Yaook/K8s:

  • Increase of the Prometheus stack version to 48.1.1

  • Increasing the Prometheus adapter version to 4.2.0

  • Adapting vault.sh for use with rootless Docker/Podman: The 'vault.sh' script is designed to run containers under the current user to ensure that file ownership in the './vault/' directory matches the current user. However, this approach does not work in rootless environments, where user IDs (UIDs) are mapped to a sub-UID range within the container. This commit introduces a new user-specific environment variable, 'VAULT_IN_DOCKER_USE_ROOTLESS'. When set to 'true', 'vault.sh' will run a 'chown' job to adjust file permissions before starting the Vault container. A default value for 'VAULT_IN_DOCKER_USE_ROOTLESS' is also added to prevent 'vault.sh' from failing if the variable is not explicitly set. (A brief usage sketch follows at the end of this list.)

  • Monitoring: Adding support for upgrades to > v46: This change adds support for upgrades to versions higher than v46 to the monitoring facility. Version v46 introduced new custom resource definitions (CRDs) that need to be considered during upgrades. The commit ensures that CRDs are applied to all major version steps during upgrades to simplify the process.

  • Rook: Adding support for arbitrary versions and bare-metal environments: Support for arbitrary versions and bare-metal environments is added to Rook, a storage orchestration platform.

  • Restarting nvdp pods during Kubernetes upgrades: The 'nvdp' (NVIDIA device plugin) marks a GPU as non-functional during 'systemctl' restarts and subsequent 'kubelet' restarts. This is considered an error by 'nvdp'. The pods hosting 'nvdp' do not fail, but the affected GPU becomes unusable. This issue occurs during Kubernetes upgrades, so it is necessary to restart any 'nvdp' pod on Kubernetes nodes with GPUs before removing the cordon restriction. This ensures that the impact on workloads is minimal.

  • Removed deprecated monitoring_v1 role: The monitoring_v1 role did Prometheus stack deployment via JSONnet magic. This was the default deployment method in the past. Now we have Helm, which does a far better job.
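
As a small illustration of the rootless option described above, the snippet below shows how the variable might be set before running the script. The path to vault.sh is an assumption and depends on where the script is located in your cluster repository.

    # In a rootless Docker/Podman setup, opt in to the chown step before the Vault container starts:
    export VAULT_IN_DOCKER_USE_ROOTLESS=true
    ./managed-k8s/actions/vault.sh    # assumed path; use the actual location of vault.sh in your repository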

Yaook/Operator:

  • A new section in the "User Guide" documentation: https://docs.yaook.cloud/handbook/user-guide.html

  • Caching of Keystone credentials: Keystone credentials are used intensively in the Keystone Resource Operator. In most cases, they are in fact always the same credentials, as they refer to the same Keystone deployment. To increase efficiency, these credentials are now cached optimistically; in the event of an exception, the caches are cleared. This caching strategy is intended to improve reconciliation times, but may result in a small number of Keystone resources having a "BackingOff" state when the credentials change.

  • Optimisation of _labels in ForeignResourceDependency: This makes it possible to avoid calling the Kubernetes API.

  • ceilometer-agent-compute is now optional: if it is not needed, it is no longer necessary to iterate over all nodes. This change is particularly beneficial when there are many compute nodes, as it reduces processing time without limiting functionality.

Yaook/Baremetal:

  • Adding an adjustment interval configuration parameter: The default adjustment interval can put a lot of load on the Netbox in larger environments. This addition offers the possibility to configure the interval via an environment variable.

Outlook:

Looking ahead, the Yaook project aims to address several key challenges and implement exciting new features.

The upcoming core split in the yaook/k8s project continues to be prepared and the first implementations have already started.

  • Upgrade path for newer OpenStack/OVN releases: We are actively working on developing a seamless upgrade path for the latest OpenStack and OVN releases to ensure that users can effortlessly stay up to date with the latest developments.

  • Stability of OVN: We are committed to further improving the stability of the OVN component, focusing on eliminating potential problems and improving overall performance.

  • Node lifecycle operator for updates: To enable smoother updates and maintenance, we are looking into developing a node lifecycle operator. This operator would automate the process of reinstalling nodes during updates, optimising the process for administrators.

  • Release management: We are currently working to establish a robust routine, workflow and tools for release management of the project. This includes implementing automated release notes and documentation updates. As Yaook/K8s is relatively new in this respect, we will use it as a "playground" to experiment and find a suitable workflow that can later be applied to other project components.

  • Further development of the Yaook/K8s subproject to improve integration with other Kubernetes features and tools.

  • Extension of Yaook/Operator functionality to enable automation of routine tasks and improvement of system performance.

  • Continue work on the Yaook/Baremetal project to extend support for different hardware and infrastructure platforms.

  • Improve the documentation and usability of the entire Yaook project to facilitate implementation and use for developers and administrators.

We are pleased with the progress made in the Yaook open source project and continue to work on innovative features.

You will receive further updates and news about Yaook in future newsletters! Subscribe below to make sure you don't miss any updates.

Thank you for your support and interest in Yaook!

Subscribe for newsletter

Do you have any questions or comments about our news? Then please contact us via hello@alasca.cloud. We look forward to hearing from you.

If you would like to receive the newsletter quarterly directly in your mailbox, you are welcome to sign up for the newsletter distribution list using the contact form below.

Until next time, we wish you all the best.

Would you like to learn more?

Do you have any questions regarding the event or article above or would you like to get in touch? Feel free to contact us. We look forward to hearing from you.

Become part of the community.

Join ALASCA and develop exciting projects within the community.
