SDDC Manager – ‘SSO Ring Topology’ Error when rebuilding additional VCF 4.x Workload Domains

I’ve recently been working on a large VCF rollout with Dell, which comprised a management cluster and multiple workload domains. For those unfamiliar with VCF, each management and workload cluster has its own vCenter appliance, all created and managed through SDDC Manager. To make things more complicated, the vCenters are configured in a single SSO domain (i.e. vsphere.local) and joined in linked mode to save you logging into multiple vCenter interfaces.

Whilst VCF is a neat deployment option, it isn’t perfect, and things often go wrong mid-deployment, which requires the workload domain to be flattened and started again. Removing a problematic workload domain stuck in the ‘Activating’ state involves killing the tasks and removing the job from Postgres on the SDDC Manager. I will briefly go through the process of killing the domain before moving on to my SSO topology problem.

Remove vCenter and WLD from Postgres

 -- Connect to postgres from bash on the SDDC Manager
 psql --host=localhost -U postgres -d platform
 
 -- Run the following queries and identify the problematic vCenter - grab the id. 
 -- In this example, workload vcenter 03 requires removing (wvc03)
 select id,status,type,vm_hostname from vcenter;
                  id                  | status |    type    |                vm_hostname
--------------------------------------+--------+------------+--------------------------------------------
 0ffe1f7d-6e3d-490f-8c0f-acfc65dc59b6 | ACTIVE | MANAGEMENT | tgdevmvc01.tg.local
 da93939e-96fc-4971-8214-949229c4936b | ACTIVE | VI         | tgdevwvc01.tg.local
 5fccdeb7-12cf-4f2f-98d8-eee78795e113 | ACTIVE | VI         | tgdevwvc03.tg.local
 3bd30ad1-c5f3-4a04-8d7b-6916b2e68546 | ACTIVE | VI         | tgdevwvc02.tg.local
 7081da2d-3d7f-4f34-893f-6228d7257c43 | ACTIVE | VI         | tgdevwvc04.tg.local
(5 rows)

platform=# select * from vm_and_vm_type_and_domain;
 id  |              domain_id               |                vm_id                 |         vm_type
-----+--------------------------------------+--------------------------------------+-------------------------
   1 | ae4e39d1-641b-49c4-b9fa-d2d05c91b3ad | f1c28cc9-3f96-4706-bda9-c04cad9cc41e | SDDC_MANAGER_CONTROLLER
   2 | ae4e39d1-641b-49c4-b9fa-d2d05c91b3ad | d52ca14a-9c14-4b88-aa28-f81e4a052d56 | PSC
   3 | ae4e39d1-641b-49c4-b9fa-d2d05c91b3ad | 0ffe1f7d-6e3d-490f-8c0f-acfc65dc59b6 | VCENTER
   6 | ae4e39d1-641b-49c4-b9fa-d2d05c91b3ad | 342be33c-8855-4a5e-bca9-731342428114 | NSXT_CLUSTER
   7 | ae4e39d1-641b-49c4-b9fa-d2d05c91b3ad | fa35772a-d8d4-409c-ad58-9c1ebbfe7fc0 | VXMANAGER
  12 | ae4e39d1-641b-49c4-b9fa-d2d05c91b3ad | 57872f33-28f4-4c7a-8d35-1fea8b91361b | VRSLCM
  93 | 1982d879-a129-4042-87f6-83117317262e | da93939e-96fc-4971-8214-949229c4936b | VCENTER
  94 | f38c4dbd-9d67-44b8-aa7a-fad5b6e11650 | 3bd30ad1-c5f3-4a04-8d7b-6916b2e68546 | VCENTER
  97 | 1982d879-a129-4042-87f6-83117317262e | e260f603-169d-4f35-9e2e-eb3afdbd6935 | VXMANAGER
 118 | 1982d879-a129-4042-87f6-83117317262e | 22e46d2d-13fe-4b0c-9350-a3fa02238962 | NSXT_CLUSTER
 123 | 1e52a225-70a0-4de1-9aee-8baeee7385ff | 5fccdeb7-12cf-4f2f-98d8-eee78795e113 | VCENTER
 126 | f38c4dbd-9d67-44b8-aa7a-fad5b6e11650 | 1fa39e34-c1c7-41f7-9bc8-0c6c987a9136 | VXMANAGER
 147 | f38c4dbd-9d67-44b8-aa7a-fad5b6e11650 | 22e46d2d-13fe-4b0c-9350-a3fa02238962 | NSXT_CLUSTER
 148 | a48c4525-9ceb-4226-a0aa-744b28c02a7e | 7081da2d-3d7f-4f34-893f-6228d7257c43 | VCENTER
(14 rows)

platform=# select * from vcenter_and_psc;
              vcenter_id              |                psc_id
--------------------------------------+--------------------------------------
 0ffe1f7d-6e3d-490f-8c0f-acfc65dc59b6 | d52ca14a-9c14-4b88-aa28-f81e4a052d56
 da93939e-96fc-4971-8214-949229c4936b | d52ca14a-9c14-4b88-aa28-f81e4a052d56
 3bd30ad1-c5f3-4a04-8d7b-6916b2e68546 | d52ca14a-9c14-4b88-aa28-f81e4a052d56
 5fccdeb7-12cf-4f2f-98d8-eee78795e113 | d52ca14a-9c14-4b88-aa28-f81e4a052d56
 7081da2d-3d7f-4f34-893f-6228d7257c43 | d52ca14a-9c14-4b88-aa28-f81e4a052d56
(5 rows)

-- Once the vcenter IDs have been found in the previous 3 tables they can be deleted

platform=# delete from vcenter where id = '5fccdeb7-12cf-4f2f-98d8-eee78795e113';
DELETE 1
platform=# delete from vm_and_vm_type_and_domain where vm_id = '5fccdeb7-12cf-4f2f-98d8-eee78795e113';
DELETE 1
platform=# delete from vcenter_and_psc where vcenter_id ='5fccdeb7-12cf-4f2f-98d8-eee78795e113';
DELETE 1

-- Now run the following select query to find the workload domain ID - in this example WLD3 requires removal
platform=# select id,creation_time,name,status,type from domain;
                  id                  | creation_time | name |   status   |    type
--------------------------------------+---------------+------+------------+------------
 ae4e39d1-641b-49c4-b9fa-d2d05c91b3ad | 1619439518503 | MGMT | ACTIVE     | MANAGEMENT
 1982d879-a129-4042-87f6-83117317262e | 1619702192513 | WLD1 | ACTIVE     | VI
 1e52a225-70a0-4de1-9aee-8baeee7385ff | 1620222985945 | WLD3 | ACTIVATING | VI
 f38c4dbd-9d67-44b8-aa7a-fad5b6e11650 | 1619770686665 | WLD2 | ACTIVE     | VI
 a48c4525-9ceb-4226-a0aa-744b28c02a7e | 1620298875717 | WLD4 | ACTIVATING | VI
(5 rows)

-- Delete the problem workload domain
platform=# delete from domain where id ='1e52a225-70a0-4de1-9aee-8baeee7385ff';
DELETE 1

-- Re-run the select query to ensure it has now been removed
platform=# select id,creation_time,name,status,type from domain;
                  id                  | creation_time | name |   status   |    type
--------------------------------------+---------------+------+------------+------------
 ae4e39d1-641b-49c4-b9fa-d2d05c91b3ad | 1619439518503 | MGMT | ACTIVE     | MANAGEMENT
 1982d879-a129-4042-87f6-83117317262e | 1619702192513 | WLD1 | ACTIVE     | VI
 f38c4dbd-9d67-44b8-aa7a-fad5b6e11650 | 1619770686665 | WLD2 | ACTIVE     | VI
 a48c4525-9ceb-4226-a0aa-744b28c02a7e | 1620298875717 | WLD4 | ACTIVATING | VI
(4 rows)
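
The four deletes above were run one at a time, so a typo in any ID would leave the tables half cleaned. As a rough sketch of the alternative, the snippet below runs the same statements inside a single transaction and rolls everything back unless each one removes exactly one row. It uses Python's built-in sqlite3 module with an in-memory database purely so that it runs anywhere; the cut-down table definitions and the helper names (build_demo_db, remove_failed_domain, count_rows) are my own assumptions and are not the real SDDC Manager schema or tooling.

```python
import sqlite3

# IDs copied from the psql session above.
VCENTER_ID = "5fccdeb7-12cf-4f2f-98d8-eee78795e113"
DOMAIN_ID = "1e52a225-70a0-4de1-9aee-8baeee7385ff"
PSC_ID = "d52ca14a-9c14-4b88-aa28-f81e4a052d56"


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the four tables (hypothetical, cut-down schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE vcenter (id TEXT PRIMARY KEY, vm_hostname TEXT);
        CREATE TABLE vm_and_vm_type_and_domain
            (id INTEGER, domain_id TEXT, vm_id TEXT, vm_type TEXT);
        CREATE TABLE vcenter_and_psc (vcenter_id TEXT, psc_id TEXT);
        CREATE TABLE domain (id TEXT PRIMARY KEY, name TEXT, status TEXT);
        """
    )
    conn.execute("INSERT INTO vcenter VALUES (?, ?)",
                 (VCENTER_ID, "tgdevwvc03.tg.local"))
    conn.execute("INSERT INTO vm_and_vm_type_and_domain VALUES (?, ?, ?, ?)",
                 (123, DOMAIN_ID, VCENTER_ID, "VCENTER"))
    conn.execute("INSERT INTO vcenter_and_psc VALUES (?, ?)", (VCENTER_ID, PSC_ID))
    conn.execute("INSERT INTO domain VALUES (?, ?, ?)",
                 (DOMAIN_ID, "WLD3", "ACTIVATING"))
    conn.commit()
    return conn


def remove_failed_domain(conn: sqlite3.Connection, vcenter_id: str, domain_id: str) -> None:
    """Run the four deletes as one unit; roll back if any does not hit exactly one row."""
    steps = [
        ("DELETE FROM vcenter WHERE id = ?", vcenter_id),
        ("DELETE FROM vm_and_vm_type_and_domain WHERE vm_id = ?", vcenter_id),
        ("DELETE FROM vcenter_and_psc WHERE vcenter_id = ?", vcenter_id),
        ("DELETE FROM domain WHERE id = ?", domain_id),
    ]
    # "with conn" commits if the block succeeds and rolls back if it raises.
    with conn:
        for sql, key in steps:
            deleted = conn.execute(sql, (key,)).rowcount
            if deleted != 1:
                raise RuntimeError(f"expected 1 row, deleted {deleted}: {sql}")


def count_rows(conn: sqlite3.Connection, table: str) -> int:
    """Return the number of rows in one of the demo tables."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

In psql itself the same idea would be a BEGIN before the deletes and a COMMIT (or ROLLBACK) after checking each DELETE count by eye.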

Deregister vCenter from SSO topology prior to removal

Once this is completed, the following command needs running from one of your active vCenter appliances to deregister the problematic vCenter appliance from SSO. SSH to the appliance and run the following:

root@tgdevmvc01 [ ~ ]# /bin/cmsso-util unregister --node-pnid tgdevwvc03.tg.local --username administrator@vsphere.local --passwd VxR@il123!
Solution users, computer account and service endpoints will be unregistered
2021-05-06T17:18:52.833Z  Running command: ['/usr/lib/vmware-vmafd/bin/dir-cli', 'service', 'list', '--login', 'administrator@vsphere.local']
2021-05-06T17:18:52.863Z  Done running command
2021-05-06T17:18:53.150Z  RC = 1
Stopping all the services ...
All services stopped.
Starting all the services ...
Started all the services.
Success

Reboot the SDDC Manager and the workload domain will be removed, allowing you to delete the vCenter appliance and then begin the process of adding a new workload domain from SDDC Manager.

SSO Ring Topology Validation Fault

I’m not going to cover creating a new VI workload domain in SDDC Manager as it’s a fairly basic wizard to follow; once completed, it begins a validation process prior to deploying a new vCenter for the VI domain. This validation phase is where I hit problems…

SDDC Manager – VI Build Validation – SSO Ring Topology Error

This caused me some headaches; however, I then came across this VMware KB article, which pointed me in the right direction.

https://kb.vmware.com/s/article/2127057

To try and determine the problem with SSO I ran the following command on each vCenter server to identify the SSO replication partners that each vCenter server was configured with.

Management Domain vCenter: MVC01

root@tgdevmvc01 [ /usr/lib/vmware-vmdir/bin ]# ./vdcrepadmin -f showpartners -h tgdevmvc01.tg.local -u administrator
password:
ldap://tgdevwvc04.tg.local

Workload vCenter 1: WVC01

root@tgdevwvc01 [ /usr/lib/vmware-vmdir/bin ]# ./vdcrepadmin -f showpartners -h tgdevwvc01.tg.local -u administrator
password:
ldap://tgdevwvc02.tg.local
ldap://tgdevwvc04.tg.local

Workload vCenter 2: WVC02

root@tgdevwvc02 [ /usr/lib/vmware-vmdir/bin ]# ./vdcrepadmin -f showpartners -h tgdevwvc02.tg.local -u administrator
password:
ldap://tgdevwvc01.tg.local

Workload vCenter 4: WVC04

root@tgdevwvc04 [ /usr/lib/vmware-vmdir/bin ]# ./vdcrepadmin -f showpartners -h tgdevwvc04.tg.local -u administrator
password:
ldap://tgdevmvc01.tg.local
ldap://tgdevwvc01.tg.local

Taking the above outputs into account, I drew the results out in the simple diagram below. The partner count of 1 for both MVC01 and WVC02 aligns with the error message from SDDC Manager, which reported that the PSCs don’t all have matching partner counts. My only conclusion is that when I deregistered the problematic vCenter (WVC03), it ripped out its partner agreements with MVC01 and WVC02, leaving the topology in this unequal state.

SSO Topology Partner configuration
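
The same check I did on paper can be sketched in a few lines of Python. This is only an illustration: it takes the showpartners output pasted above, counts the partners per node and flags any node with fewer than two, on the assumption (mine, not taken from the SDDC Manager validation code) that every node in a healthy ring has exactly two partners. The function names are hypothetical.

```python
def parse_partners(output: str) -> list[str]:
    """Extract hostnames from showpartners output lines such as ldap://host."""
    prefix = "ldap://"
    return [
        line.strip()[len(prefix):]
        for line in output.splitlines()
        if line.strip().startswith(prefix)
    ]


def partner_counts(topology: dict[str, list[str]]) -> dict[str, int]:
    """Number of replication partners reported by each node."""
    return {node: len(partners) for node, partners in topology.items()}


def under_connected(topology: dict[str, list[str]], expected: int = 2) -> list[str]:
    """Nodes with fewer partners than a ring member should have (assumed: 2)."""
    return sorted(n for n, partners in topology.items() if len(partners) < expected)


def one_way_agreements(topology: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Pairs (a, b) where a lists b as a partner but b does not list a."""
    return sorted(
        (a, b)
        for a, partners in topology.items()
        for b in partners
        if a not in topology.get(b, [])
    )


# showpartners output as captured above, keyed by the node it was run against.
raw_output = {
    "tgdevmvc01.tg.local": "password:\nldap://tgdevwvc04.tg.local\n",
    "tgdevwvc01.tg.local": "password:\nldap://tgdevwvc02.tg.local\nldap://tgdevwvc04.tg.local\n",
    "tgdevwvc02.tg.local": "password:\nldap://tgdevwvc01.tg.local\n",
    "tgdevwvc04.tg.local": "password:\nldap://tgdevmvc01.tg.local\nldap://tgdevwvc01.tg.local\n",
}

topology = {node: parse_partners(text) for node, text in raw_output.items()}

for node, count in partner_counts(topology).items():
    print(f"{node}: {count} partner(s)")
print("Fewer than two partners:", under_connected(topology))
print("One-way agreements:", one_way_agreements(topology))
```

With the outputs above, the two nodes flagged are MVC01 and WVC02, which are the two ends of the broken ring.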

At this point I decided to manually create a new replication agreement between the two problem nodes in the hope this would push the validation process onwards.

root@tgdevmvc01 [ /usr/lib/vmware-vmdir/bin ]# ./vdcrepadmin -f createagreement -2 -h tgdevmvc01.tg.local -H tgdevwvc02.tg.local -u administrator
password:
# Check the agreement has been created
root@tgdevmvc01 [ /usr/lib/vmware-vmdir/bin ]# ./vdcrepadmin -f showpartners -h tgdevmvc01.tg.local -u administrator
password:
ldap://tgdevwvc04.tg.local
ldap://tgdevwvc02.tg.local

Re-running the create VI task within SDDC Manager now moves through the SSO Topology Check, proving this was the problem after all!

Hopefully this helps someone out.
