Notes

Notes - notes.io

POC: VALIDATE AMI ROTATION ON DEV (SURGE STRATEGY + 5-NODE SIZE: MEDIUM SCALING)

OBJECTIVE

Validate EKS managed AMI rotation as a replacement for manual SSM-based node patching. Two tests on DEV:

- Test 1 — AMI rotation on the existing 3-node / size: small cluster. Confirm EKS brings up a new node before draining the old one. Measure pod disruption and portal availability.
- Test 2 — Scale DEV to 5 nodes, switch to size: medium, repeat AMI rotation. Confirm the procedure still works on a larger cluster and measure whether expanded HA coverage reduces user-facing impact.

ACCEPTANCE CRITERIA — STATUS

- Both tests completed on DEV — DONE.
- Runbook documented and reusable on KONS and PROD — IN PROGRESS
- Cost estimate for 5 nodes + size: medium — DONE

TEST 1: AMI ROTATION ON 3 NODES, SIZE: SMALL
Date: 2026-04-16 (re-run with event monitoring; original run 2026-04-13 completed without logging)

SETUP

- 6 multi-replica components on size: small: itom-ingress-controller (2), itom-xruntime-gateway (2), itom-xruntime-platform (2), itom-xruntime-serviceportal (2), itom-xruntime-ui (2), infra-rabbitmq (3)
- PodDisruptionBudgets created (minAvailable: 1) for all 6 components. matchLabels were verified against the live cluster before applying.
- maxSize bumped 3 → 4 to allow 1 surge node. updateConfig set to maxUnavailable: 1.
- Rotation triggered: aws eks update-nodegroup-version --release-version 1.32.12-20260304
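
The PDB setup described above could be scripted roughly as follows. This is a minimal sketch, not the manifests that were actually applied: the function name pdb_manifest, the "-pdb" name suffix, the namespace, and the label key/value are placeholders, because the report does not record the real matchLabels. Check the labels against the live cluster before applying anything.

```shell
# Print a PodDisruptionBudget manifest with minAvailable: 1.
# Arguments: <component> <namespace> <label-key> <label-value>
# The "-pdb" suffix and all argument values are hypothetical placeholders.
pdb_manifest() {
  cat <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: $1-pdb
  namespace: $2
spec:
  minAvailable: 1
  selector:
    matchLabels:
      $3: $4
EOF
}

# Usage (review the output first, then pipe it to kubectl):
#   pdb_manifest itom-ingress-controller my-namespace app itom-ingress-controller \
#     | kubectl apply -f -
```

Generating the manifest as text keeps the review step separate from the apply step, which is convenient when the same runbook is reused on other environments.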

TIMELINE (from rotation_events_test1.log)

16:30:01 Monitoring started — 3 nodes Ready (baseline)
16:31:20 3 new nodes Ready simultaneously (total 6, despite maxSize=4 and maxUnavailable=1)
16:32:37 ALL 3 old nodes cordoned simultaneously
16:32:37 Pod disruption begins — 21 non-Running pods
16:35:16 Node ip-10-0-10-83 replaced (~2m 39s after first cordon)
16:35:16 Second disruption wave — 20 non-Running pods
16:36:35 PORTAL DOWN (HTTP 502)
16:37:56 PORTAL UP — ingress restored (1m 21s outage)
16:40:35 Node ip-10-0-30-221 replaced (~7m 58s after first cordon)
16:54:23 Node ip-10-0-20-7 replaced (~21m 46s after first cordon)
16:54:23 All pods Running — rotation complete

Total rotation time: 24 min 22 sec (16:30:01 → 16:54:23)
Total pod disruption time: 21 min 46 sec (16:32:37 → 16:54:23)
Portal (HTTP ingress) downtime: 1 min 21 sec (16:36:35 → 16:37:56)
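
The three durations above follow directly from the timeline timestamps. As a quick cross-check, here is a small POSIX shell sketch; the helper names to_seconds and elapsed are illustrative, and it assumes both timestamps fall on the same day.

```shell
# Convert HH:MM:SS to seconds since midnight. One leading zero is stripped
# from each field so that values such as "08" are not read as octal.
to_seconds() (
  IFS=:
  set -- $1
  h=${1#0}; m=${2#0}; s=${3#0}
  echo $(( h * 3600 + m * 60 + s ))
)

# Print the elapsed time between two same-day timestamps as "Xm Ys".
elapsed() (
  d=$(( $(to_seconds "$2") - $(to_seconds "$1") ))
  echo "$(( d / 60 ))m $(( d % 60 ))s"
)

elapsed 16:30:01 16:54:23   # total rotation time   -> 24m 22s
elapsed 16:32:37 16:54:23   # pod disruption time   -> 21m 46s
elapsed 16:36:35 16:37:56   # portal downtime       -> 1m 21s
```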

PORTAL AVAILABILITY

- Portal reachable (HTTP): degraded — 1m 21s complete outage (HTTP 502)
- Full functionality: degraded for ~22 minutes (singleton services unavailable)
- Safe to perform user work: no
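
An HTTP reachability probe of the kind behind the "Portal reachable" figure could look like the sketch below. It is not the monitoring script used for the tests: the function names, the polling interval, the URL, and the rule that any non-2xx/3xx status counts as DOWN are all assumptions.

```shell
# Map an HTTP status code to a state label. Treating every non-2xx/3xx
# code (including curl's 000 on connection failure) as DOWN is an
# assumption of this sketch.
portal_state() {
  case "$1" in
    2??|3??) echo "PORTAL UP" ;;
    *)       echo "PORTAL DOWN (HTTP $1)" ;;
  esac
}

# Poll the portal and log only state changes, prefixed with the time.
# Arguments: <url> [interval-seconds]; runs until interrupted.
watch_portal() {
  url=$1; interval=${2:-5}; last=""
  while :; do
    code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$url")
    state=$(portal_state "$code")
    if [ "$state" != "$last" ]; then
      echo "$(date +%H:%M:%S) $state"
      last=$state
    fi
    sleep "$interval"
  done
}

# Usage (placeholder URL):
#   watch_portal "https://portal.example.invalid/" 5 >> rotation_events.log
```

A probe like this only measures HTTP reachability of the ingress; it says nothing about whether downstream singleton services are working.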

COMPONENT DOWNTIME

- PDB-protected components (6): maintained at least 1 running replica for most of the rotation. Exception: itom-ingress-controller — despite its PDB, both replicas were in flight (evicted and not yet serving on a new node) at the same time for ~1m 21s, due to back-to-back node drains on a 3-node cluster.
- Singleton components (Recreate, no PDB): individual downtime not measured.

TEST 2: AMI ROTATION ON 5 NODES, SIZE: MEDIUM
Date: 2026-04-15

SETUP

- Step 1 — Scale nodegroup 3 → 5 nodes (m5.2xlarge).
- Step 2 — Helm upgrade size: small → size: medium. The profile change modifies immutable StatefulSet fields on 7 objects, so a standard helm upgrade fails with "Forbidden: updates to statefulset spec". Solution: delete the 7 affected StatefulSets with --cascade=false before running helm upgrade; this orphans the pods (they keep running) and leaves the PVCs untouched. Helm then recreated the StatefulSets with the medium spec and reattached the existing PVCs. Note: the boolean form --cascade=false is deprecated in current kubectl; the equivalent is --cascade=orphan. Affected: smarta-saw-con, smarta-saw-con-a, smarta-sawarc-con, smarta-sawarc-con-a, smarta-sawmeta-con, smarta-sawmeta-con-a, infra-rabbitmq.
- Step 3 — HA workload count after upgrade: 12 (was 6 on small). Newly promoted to multi-replica: idm (2), itom-bo-ats (2), itom-bo-facade (2), itom-bo-login (2), itom-bo-user (2), smarta-search (3). Already HA, replicas increased: itom-xruntime-platform (2 → 3).
- Step 4 — PDBs created for 6 newly HA components (minAvailable: 1). itom-xruntime-platform already covered by existing PDB from Test 1. Total PDBs before rotation: 12.
- Step 5 — maxSize bumped 5 → 6. Rotation triggered: aws eks update-nodegroup-version --release-version 1.32.7-20250813. The only available rotation target was the previous AMI, as no newer safe AMI exists beyond 1.32.12-20260304 (established in the prior AMI compatibility POC). Despite maxSize=6, EKS launched 4 surge nodes (reaching 9 total) — the same maxSize-exceeding behavior observed in Test 1 (maxSize=4, reached 6).
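
The StatefulSet workaround from Step 2 could be wrapped as shown below. This is a sketch under assumptions: the StatefulSet names come from the report, but NAMESPACE, RELEASE, CHART, and VALUES_FILE are hypothetical placeholders, and the real helm upgrade for this product may need additional arguments. Commands are only printed unless DRY_RUN=0 is set.

```shell
# Print each command instead of executing it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# StatefulSets affected by the small -> medium profile change (from the report).
STATEFULSETS="smarta-saw-con smarta-saw-con-a smarta-sawarc-con smarta-sawarc-con-a smarta-sawmeta-con smarta-sawmeta-con-a infra-rabbitmq"

recreate_statefulsets() {
  # Abort early if a required placeholder variable is not set.
  : "${NAMESPACE:?}" "${RELEASE:?}" "${CHART:?}" "${VALUES_FILE:?}"
  for sts in $STATEFULSETS; do
    # --cascade=orphan deletes only the StatefulSet object; its pods keep
    # running and its PVCs are left in place.
    run kubectl -n "$NAMESPACE" delete statefulset "$sts" --cascade=orphan
  done
  # Helm recreates the StatefulSets with the new spec.
  run helm upgrade "$RELEASE" "$CHART" -n "$NAMESPACE" -f "$VALUES_FILE"
}

# Usage (dry run by default):
#   NAMESPACE=... RELEASE=... CHART=... VALUES_FILE=... recreate_statefulsets
```

Defaulting to a dry run lets the operator review the exact command sequence before running it with DRY_RUN=0.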

TIMELINE (from rotation_events_test2.log)

18:21:53 Monitoring started — 5 nodes Ready (baseline)
18:24:29 NEW NODE READY — total 7 (2 surge nodes launched simultaneously)
18:25:34 NEW NODE READY — total 8 (3rd surge node)
18:26:53 NEW NODE READY — total 9 (4th surge node)
18:27:07 ALL 5 old nodes cordoned simultaneously
18:28:09 Pod disruption begins — 20 non-Running pods
18:30:49 Node ip-10-0-10-28 replaced (~3m 42s after first cordon)
18:32:13 Second disruption wave — 14 non-Running pods
18:34:48 Node ip-10-0-10-214 replaced (~7m 41s)
18:37:31 Node ip-10-0-30-162 replaced (~10m 24s)
18:38:54 All pods Running (brief window between node drains)
18:40:12 Third disruption wave — 15 non-Running pods
18:40:12 Node ip-10-0-30-210 replaced (~13m 5s)
18:47:01 Node ip-10-0-20-38 replaced (~19m 54s)
18:49:42 Node ip-10-0-30-211 replaced (~22m 35s)
18:49:57 All pods Running — rotation complete

Total rotation time: 28 min 4 sec (18:21:53 → 18:49:57)
Total pod disruption time: 21 min 48 sec (18:28:09 → 18:49:57)

PORTAL AVAILABILITY

The distinction between "portal reachable" and "portal functional" is critical for interpreting these results.

SMAX has 9 components with Recreate strategy and no PDB — they run as single instances and go down completely while being rescheduled. During the ~22-minute disruption window, any user action depending on one of these components would have failed.

Example: a user fills out a service request form and clicks Submit. The request reaches the gateway (HA, available), but downstream processing touches LCM (singleton, restarting). The submission fails.

- Portal reachable (HTTP): 100% of rotation time
- Full functionality: degraded for ~22 minutes
- Safe to perform user work: no
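
To see which workloads fall into the singleton category, a filter like the following could be used. It is a rough sketch: the function name and NAMESPACE are placeholders, and it only covers Deployments (single-replica StatefulSets would need a similar check).

```shell
# Read rows of "NAME STRATEGY REPLICAS" on stdin and print the names of
# Deployments that use the Recreate strategy with a single replica.
list_singletons() {
  awk '$2 == "Recreate" && $3 == 1 { print $1 }'
}

# Usage against a live cluster (NAMESPACE is a placeholder):
#   kubectl -n "$NAMESPACE" get deployments --no-headers \
#     -o custom-columns=NAME:.metadata.name,STRATEGY:.spec.strategy.type,REPLICAS:.spec.replicas \
#     | list_singletons
```

Timing each listed component during a rotation would fill the "individual downtime not measured" gap in the component downtime figures.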

COMPONENT DOWNTIME

- PDB-protected components (12): all maintained at least 1 running replica throughout. No complete service outage for any of these.
- Singleton components (Recreate, no PDB): individual downtime not measured.

NOTE ON UPDATE STRATEGY: DEFAULT VS SURGE

The task specified a SURGE strategy (maxSurge=1, maxUnavailable=0) configured via updateConfig. Test 2 and the original Test 1 were run without setting updateConfig; the Test 1 re-run set updateConfig explicitly to maxUnavailable: 1.

Even with maxUnavailable: 1 set explicitly, EKS:
- Launched all 3 replacement nodes simultaneously (total reached 6, exceeding maxSize=4)
- Cordoned all 3 old nodes at the same time
- Drained them sequentially, one at a time

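Additionally, in both tests EKS exceeded the configured maxSize during rotation: Test 1 reached 6 nodes with maxSize=4, and Test 2 reached 9 nodes with maxSize=6. In every observed run, the managed node group update provisioned all replacement nodes before any drain, and this provisioning was not capped by maxSize during the rotation window. This is consistent with the scale-up phase described in the EKS managed node group update documentation, in which EKS temporarily raises the Auto Scaling group's maximum and desired size.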

- DEFAULT (what ran in all three rotations): EKS provisions all replacement nodes upfront, cordons all old nodes simultaneously, then drains them sequentially.
- True one-at-a-time behavior: not achievable through updateConfig alone — would require a custom rotation procedure outside of managed node group updates.
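
For reference, setting and inspecting updateConfig before a rotation could look like the sketch below. CLUSTER and NODEGROUP are hypothetical placeholders (the report does not name them), and commands are only printed unless DRY_RUN=0 is set.

```shell
# Print each command instead of executing it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Set maxUnavailable explicitly, as in the Test 1 re-run.
set_update_config() {
  : "${CLUSTER:?}" "${NODEGROUP:?}"
  run aws eks update-nodegroup-config \
    --cluster-name "$CLUSTER" --nodegroup-name "$NODEGROUP" \
    --update-config maxUnavailable=1
}

# Show the effective update and scaling settings before rotating.
show_update_config() {
  : "${CLUSTER:?}" "${NODEGROUP:?}"
  run aws eks describe-nodegroup \
    --cluster-name "$CLUSTER" --nodegroup-name "$NODEGROUP" \
    --query 'nodegroup.[updateConfig,scalingConfig]' --output json
}

# Start the AMI rotation to the given release version.
start_rotation() {
  : "${CLUSTER:?}" "${NODEGROUP:?}"
  run aws eks update-nodegroup-version \
    --cluster-name "$CLUSTER" --nodegroup-name "$NODEGROUP" \
    --release-version "$1"
}

# Usage (dry run by default):
#   CLUSTER=... NODEGROUP=... set_update_config
#   CLUSTER=... NODEGROUP=... start_rotation 1.32.12-20260304
```

Recording the output of show_update_config in the rotation log makes it clear which settings were in force for a given run.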

This does not affect the main conclusion of the POC. The fundamental finding, that SMAX is not fully functional during AMI rotation because of its singleton components, holds for any update strategy. A true one-at-a-time strategy would narrow the disruption window per node, but singleton services would still go down while being rescheduled, however the drain is orchestrated.

KEY FINDINGS

1. PDBS REDUCE BUT DO NOT ELIMINATE PORTAL OUTAGE RISK — PDBs kept at least one replica available for most components through most of the rotation. However, Test 1 recorded a complete 1m 21s HTTP 502 outage on the ingress layer despite a PDB on itom-ingress-controller (minAvailable: 1): with only 2 ingress replicas on a 3-node cluster where all nodes drain in rapid succession, both replicas were in flight at the same time during back-to-back drains. Test 2 (5 nodes, 12 HA components) saw no ingress downtime, which suggests that spreading replicas across more nodes reduces, but does not eliminate, this risk. Without PDBs, the portal would likely have been completely down for the full ~22-minute disruption window in both tests.

2. SMALL → MEDIUM PROFILE CHANGE REQUIRES STATEFULSET RECREATION — helm upgrade fails on the small → medium transition due to immutable field changes on 7 StatefulSets. Workaround: kubectl delete statefulset --cascade=false (--cascade=orphan in current kubectl, where the boolean form is deprecated), then helm upgrade. The PVCs survive, the pods are orphaned (not killed), and Helm recreates the StatefulSets and reattaches the existing storage. Tested successfully on DEV.

3. EKS MANAGED ROTATION IS A VIABLE REPLACEMENT FOR MANUAL PATCHING — Both tests completed without cluster-level failures. Despite the partial loss of functionality during rotation, the procedure is stable and predictable. Compared to manual SSM-based patching:
- Reproducibility: the same runbook runs the same way each time, with no per-node manual steps and no SSH access required
- Reduced error surface: node drain, replacement, and readiness checks are handled by EKS, which removes the per-node manual steps where operator errors occur
- Time: the 3-node rotation completed in a single automated run (24 min 22 sec); manual patching required an individual SSM session per node

COST ESTIMATE: 5 NODES + SIZE: MEDIUM

Node type: m5.2xlarge (8 vCPU, 32 GiB), region: eu-central-1 (Frankfurt)
Source: AWS pricing calculator (2026-04-16), shared tenancy
Figures reflect the delta for 2 additional nodes only (3-node baseline unchanged).

On-Demand ($0.460/hr per node):
2 nodes x $0.460 x 730h = $671.60/month

Compute Savings Plans — 3yr, No Upfront ($0.278/hr per node):
2 nodes x $0.278 x 730h = $405.88/month
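
The two monthly figures can be reproduced with a one-line calculation. The sketch below uses the hourly rates quoted above and the same 730-hour month; the function name monthly_cost is illustrative.

```shell
# Monthly cost = nodes x hourly rate x hours per month (default 730 h).
# LC_ALL=C keeps the decimal separator a dot regardless of locale.
monthly_cost() {
  LC_ALL=C awk -v n="$1" -v rate="$2" -v hours="${3:-730}" \
    'BEGIN { printf "%.2f\n", n * rate * hours }'
}

monthly_cost 2 0.460   # On-Demand, 2 extra nodes               -> 671.60
monthly_cost 2 0.278   # Compute Savings Plan, 2 extra nodes    -> 405.88
```

Passing a different node count or rate makes it easy to re-run the estimate if the pricing inputs change.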

CURRENT DEV STATE

After Test 1 re-run (2026-04-16), DEV is in the following state:
- Helm: size: small
- Nodegroup: 3 nodes, minSize=3, maxSize=4, desiredSize=3 ← revert maxSize to 3
- AMI: 1.32.12-20260304 (containerd 2.1.5)
- PDBs: 6 present ← delete after revert

NEXT STEPS

1. Create production-grade runbooks for KONS and PROD.

2. Await OpenText's response on the containerd 2.2.x compatibility roadmap and on a supported procedure for near-zero-downtime SMAX upgrades.
