I’ve recently been working on a large VCF rollout with Dell which comprised a management cluster and multiple workload domains. For those unfamiliar with VCF, each management and workload cluster has its own vCenter appliance, all created and managed through SDDC Manager. To make things more complicated, the vCenters are configured in a single SSO domain (i.e. vsphere.local) and configured with linked mode to save you logging into multiple vCenter interfaces.
Whilst VCF is a neat deployment option, it isn’t perfect, and things often go wrong mid-deployment, which requires a workload domain to be flattened and started again. The removal process for a problematic workload domain in the ‘Activating’ state involves killing the tasks and removing the job from Postgres on the SDDC Manager. I will briefly go through the process of killing the domain before moving on to my SSO topology problem.
Remove vCenter and WLD from Postgres
-- Connect to postgres from bash on the SDDC Manager
psql --host=localhost -U postgres -d platform

-- Run the following queries and identify the problematic vCenter - grab the id.
-- In this example, workload vcenter 03 requires removing (wvc03)
select id,status,type,vm_hostname from vcenter;
                  id                  | status |    type    |     vm_hostname
--------------------------------------+--------+------------+---------------------
 0ffe1f7d-6e3d-490f-8c0f-acfc65dc59b6 | ACTIVE | MANAGEMENT | tgdevmvc01.tg.local
 da93939e-96fc-4971-8214-949229c4936b | ACTIVE | VI         | tgdevwvc01.tg.local
 5fccdeb7-12cf-4f2f-98d8-eee78795e113 | ACTIVE | VI         | tgdevwvc03.tg.local
 3bd30ad1-c5f3-4a04-8d7b-6916b2e68546 | ACTIVE | VI         | tgdevwvc02.tg.local
 7081da2d-3d7f-4f34-893f-6228d7257c43 | ACTIVE | VI         | tgdevwvc04.tg.local
(5 rows)

platform=# select * from vm_and_vm_type_and_domain;
 id  |              domain_id               |                vm_id                 |         vm_type
-----+--------------------------------------+--------------------------------------+-------------------------
   1 | ae4e39d1-641b-49c4-b9fa-d2d05c91b3ad | f1c28cc9-3f96-4706-bda9-c04cad9cc41e | SDDC_MANAGER_CONTROLLER
   2 | ae4e39d1-641b-49c4-b9fa-d2d05c91b3ad | d52ca14a-9c14-4b88-aa28-f81e4a052d56 | PSC
   3 | ae4e39d1-641b-49c4-b9fa-d2d05c91b3ad | 0ffe1f7d-6e3d-490f-8c0f-acfc65dc59b6 | VCENTER
   6 | ae4e39d1-641b-49c4-b9fa-d2d05c91b3ad | 342be33c-8855-4a5e-bca9-731342428114 | NSXT_CLUSTER
   7 | ae4e39d1-641b-49c4-b9fa-d2d05c91b3ad | fa35772a-d8d4-409c-ad58-9c1ebbfe7fc0 | VXMANAGER
  12 | ae4e39d1-641b-49c4-b9fa-d2d05c91b3ad | 57872f33-28f4-4c7a-8d35-1fea8b91361b | VRSLCM
  93 | 1982d879-a129-4042-87f6-83117317262e | da93939e-96fc-4971-8214-949229c4936b | VCENTER
  94 | f38c4dbd-9d67-44b8-aa7a-fad5b6e11650 | 3bd30ad1-c5f3-4a04-8d7b-6916b2e68546 | VCENTER
  97 | 1982d879-a129-4042-87f6-83117317262e | e260f603-169d-4f35-9e2e-eb3afdbd6935 | VXMANAGER
 118 | 1982d879-a129-4042-87f6-83117317262e | 22e46d2d-13fe-4b0c-9350-a3fa02238962 | NSXT_CLUSTER
 123 | 1e52a225-70a0-4de1-9aee-8baeee7385ff | 5fccdeb7-12cf-4f2f-98d8-eee78795e113 | VCENTER
 126 | f38c4dbd-9d67-44b8-aa7a-fad5b6e11650 | 1fa39e34-c1c7-41f7-9bc8-0c6c987a9136 | VXMANAGER
 147 | f38c4dbd-9d67-44b8-aa7a-fad5b6e11650 | 22e46d2d-13fe-4b0c-9350-a3fa02238962 | NSXT_CLUSTER
 148 | a48c4525-9ceb-4226-a0aa-744b28c02a7e | 7081da2d-3d7f-4f34-893f-6228d7257c43 | VCENTER
(14 rows)

platform=# select * from vcenter_and_psc;
              vcenter_id              |                psc_id
--------------------------------------+--------------------------------------
 0ffe1f7d-6e3d-490f-8c0f-acfc65dc59b6 | d52ca14a-9c14-4b88-aa28-f81e4a052d56
 da93939e-96fc-4971-8214-949229c4936b | d52ca14a-9c14-4b88-aa28-f81e4a052d56
 3bd30ad1-c5f3-4a04-8d7b-6916b2e68546 | d52ca14a-9c14-4b88-aa28-f81e4a052d56
 5fccdeb7-12cf-4f2f-98d8-eee78795e113 | d52ca14a-9c14-4b88-aa28-f81e4a052d56
 7081da2d-3d7f-4f34-893f-6228d7257c43 | d52ca14a-9c14-4b88-aa28-f81e4a052d56
(5 rows)

-- Once the vcenter IDs have been found in the previous 3 tables they can be deleted
platform=# delete from vcenter where id = '5fccdeb7-12cf-4f2f-98d8-eee78795e113';
DELETE 1
platform=# delete from vm_and_vm_type_and_domain where vm_id = '5fccdeb7-12cf-4f2f-98d8-eee78795e113';
DELETE 1
platform=# delete from vcenter_and_psc where vcenter_id ='5fccdeb7-12cf-4f2f-98d8-eee78795e113';
DELETE 1

-- Now run the following select query to find the workload domain ID - in this example WLD3 requires removal
platform=# select id,creation_time,name,status,type from domain;
                  id                  | creation_time | name |   status   |    type
--------------------------------------+---------------+------+------------+------------
 ae4e39d1-641b-49c4-b9fa-d2d05c91b3ad | 1619439518503 | MGMT | ACTIVE     | MANAGEMENT
 1982d879-a129-4042-87f6-83117317262e | 1619702192513 | WLD1 | ACTIVE     | VI
 1e52a225-70a0-4de1-9aee-8baeee7385ff | 1620222985945 | WLD3 | ACTIVATING | VI
 f38c4dbd-9d67-44b8-aa7a-fad5b6e11650 | 1619770686665 | WLD2 | ACTIVE     | VI
 a48c4525-9ceb-4226-a0aa-744b28c02a7e | 1620298875717 | WLD4 | ACTIVATING | VI
(5 rows)

-- Delete the problem workload domain
platform=# delete from domain where id ='1e52a225-70a0-4de1-9aee-8baeee7385ff';
DELETE 1

-- Re-run the select query to ensure it has now been removed
platform=# select id,creation_time,name,status,type from domain;
                  id                  | creation_time | name |   status   |    type
--------------------------------------+---------------+------+------------+------------
 ae4e39d1-641b-49c4-b9fa-d2d05c91b3ad | 1619439518503 | MGMT | ACTIVE     | MANAGEMENT
 1982d879-a129-4042-87f6-83117317262e | 1619702192513 | WLD1 | ACTIVE     | VI
 f38c4dbd-9d67-44b8-aa7a-fad5b6e11650 | 1619770686665 | WLD2 | ACTIVE     | VI
 a48c4525-9ceb-4226-a0aa-744b28c02a7e | 1620298875717 | WLD4 | ACTIVATING | VI
(4 rows)
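Because the same vCenter ID has to be pasted into three separate DELETE statements, a typo is easy to make. The sketch below is one way to generate the statements from the two IDs so they can be reviewed before being pasted into psql. The table and column names are taken from the session above; the function name build_cleanup_sql and the BEGIN/COMMIT wrapper are my own additions and not part of the original procedure, so treat it as an illustration rather than a supported method.

```python
import uuid


def build_cleanup_sql(vcenter_id, domain_id):
    """Return the four DELETE statements from the walkthrough as one script.

    Hypothetical helper: the transaction wrapper is an assumption added here
    so that a failure part-way through leaves the tables untouched.
    """
    # uuid.UUID raises ValueError on a malformed or truncated ID, so a bad
    # paste is caught before any SQL is produced.
    vc = str(uuid.UUID(vcenter_id))
    dom = str(uuid.UUID(domain_id))
    statements = [
        "BEGIN;",
        f"DELETE FROM vcenter WHERE id = '{vc}';",
        f"DELETE FROM vm_and_vm_type_and_domain WHERE vm_id = '{vc}';",
        f"DELETE FROM vcenter_and_psc WHERE vcenter_id = '{vc}';",
        f"DELETE FROM domain WHERE id = '{dom}';",
        "COMMIT;",
    ]
    return "\n".join(statements)


# IDs for tgdevwvc03 and WLD3 from the query output above.
print(build_cleanup_sql(
    "5fccdeb7-12cf-4f2f-98d8-eee78795e113",
    "1e52a225-70a0-4de1-9aee-8baeee7385ff",
))
```

The statements are emitted in the same order as they were run by hand above. Taking a backup of the platform database before running anything destructive is sensible either way.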
Deregister vCenter from SSO topology prior to removal
Once this is completed, the following command needs running from one of your active vCenter appliances to deregister the problematic vCenter appliance from SSO. SSH to the appliance and run the following:
root@tgdevmvc01 [ ~ ]# /bin/cmsso-util unregister --node-pnid tgdevwvc03.tg.local --username administrator@vsphere.local --passwd VxR@il123!
Solution users, computer account and service endpoints will be unregistered
2021-05-06T17:18:52.833Z Running command: ['/usr/lib/vmware-vmafd/bin/dir-cli', 'service', 'list', '--login', 'administrator@vsphere.local']
2021-05-06T17:18:52.863Z Done running command
2021-05-06T17:18:53.150Z RC = 1
Stopping all the services ...
All services stopped.
Starting all the services ...
Started all the services.
Success
Reboot the SDDC manager and the workload domain will be removed, allowing you to delete the vCenter appliance and then begin the process of re-adding a new workload domain from vCenter.
SSO Ring Topology Validation Fault
I’m not going to cover creating a new VI workload domain in SDDC Manager as it’s a fairly basic wizard to follow. Once the wizard is complete, SDDC Manager begins a validation process prior to deploying a new vCenter for the workload domain. This validation phase is where I hit problems…
This caused me some headaches; however, I then came across this VMware KB article, which pointed me in the right direction:
https://kb.vmware.com/s/article/2127057
To try and determine the problem with SSO I ran the following command on each vCenter server to identify the SSO replication partners that each vCenter server was configured with.
Management Domain vCenter: MVC01
root@tgdevmvc01 [ /usr/lib/vmware-vmdir/bin ]# ./vdcrepadmin -f showpartners -h tgdevmvc01.tg.local -u administrator
password:
ldap://tgdevwvc04.tg.local
Workload vCenter 1: WVC01
root@tgdevwvc01 [ /usr/lib/vmware-vmdir/bin ]# ./vdcrepadmin -f showpartners -h tgdevwvc01.tg.local -u administrator
password:
ldap://tgdevwvc02.tg.local
ldap://tgdevwvc04.tg.local
Workload vCenter 2: WVC02
root@tgdevwvc02 [ /usr/lib/vmware-vmdir/bin ]# ./vdcrepadmin -f showpartners -h tgdevwvc02.tg.local -u administrator
password:
ldap://tgdevwvc01.tg.local
Workload vCenter 4: WVC04
root@tgdevwvc04 [ /usr/lib/vmware-vmdir/bin ]# ./vdcrepadmin -f showpartners -h tgdevwvc04.tg.local -u administrator
password:
ldap://tgdevmvc01.tg.local
ldap://tgdevwvc01.tg.local
Taking the above outputs into account, I drew out the results into the simple diagram below. The partner counts of 1 for MVC01 and WVC02 align with the error message from SDDC Manager, which complained that the PSCs don’t all have matching partner counts. My only conclusion is that when I deregistered the problematic vCenter (WVC03), it ripped out its partner agreements with MVC01 and WVC02, leaving the topology in this unequal state.
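The same comparison can be done without drawing anything. The sketch below parses the ldap:// lines from each showpartners output, keyed by the vCenter named in each heading above, and lists any node with fewer than two partners. The figure of two partners per node is my assumption about what a closed ring of three or more nodes needs, and all function names are made up for illustration.

```python
import re

# showpartners output as captured above, keyed by the node that was queried.
SHOWPARTNERS = {
    "tgdevmvc01.tg.local": "password:\nldap://tgdevwvc04.tg.local",
    "tgdevwvc01.tg.local": "password:\nldap://tgdevwvc02.tg.local\nldap://tgdevwvc04.tg.local",
    "tgdevwvc02.tg.local": "password:\nldap://tgdevwvc01.tg.local",
    "tgdevwvc04.tg.local": "password:\nldap://tgdevmvc01.tg.local\nldap://tgdevwvc01.tg.local",
}


def parse_partners(output):
    """Extract partner hostnames from the ldap:// lines of showpartners output."""
    return re.findall(r"ldap://(\S+)", output)


def partner_map(raw):
    """Map each queried node to the set of partners it reported."""
    return {node: set(parse_partners(text)) for node, text in raw.items()}


def under_connected(topology, expected=2):
    """Nodes with fewer partners than a ring is assumed to need."""
    return sorted(n for n, partners in topology.items() if len(partners) < expected)


def one_way_agreements(topology):
    """Pairs (a, b) where a lists b as a partner but b does not list a."""
    return sorted(
        (a, b)
        for a, partners in topology.items()
        for b in partners
        if a not in topology.get(b, set())
    )


topology = partner_map(SHOWPARTNERS)
print(under_connected(topology))
print(one_way_agreements(topology))
```

With the outputs above, the first list contains tgdevmvc01.tg.local and tgdevwvc02.tg.local, and the second is empty, meaning every existing agreement is two-way and the only problem is the missing link between those two nodes.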
At this point I decided to manually create a new replication agreement between the two problem nodes in the hope this would push the validation process onwards.
root@tgdevmvc01 [ /usr/lib/vmware-vmdir/bin ]# ./vdcrepadmin -f createagreement -2 -h tgdevmvc01.tg.local -H tgdevwvc02.tg.local -u administrator
password:

# Check the agreement has been created
root@tgdevmvc01 [ /usr/lib/vmware-vmdir/bin ]# ./vdcrepadmin -f showpartners -h tgdevmvc01.tg.local -u administrator
password:
ldap://tgdevwvc04.tg.local
ldap://tgdevwvc02.tg.local
Re-running the create VI workload domain task within SDDC Manager now moves through the SSO Topology Check, proving this was the problem after all!
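For anyone facing the same broken ring with different hostnames, here is a rough sketch of how the missing agreement could be worked out from the partner lists. It assumes the ring has degraded into a single chain, so exactly two nodes have one partner each, and it only prints the createagreement command using the same flags as the one I ran above; it does not run anything. The function names are hypothetical.

```python
import shlex

# Partner lists as reported by showpartners before the fix.
TOPOLOGY = {
    "tgdevmvc01.tg.local": {"tgdevwvc04.tg.local"},
    "tgdevwvc01.tg.local": {"tgdevwvc02.tg.local", "tgdevwvc04.tg.local"},
    "tgdevwvc02.tg.local": {"tgdevwvc01.tg.local"},
    "tgdevwvc04.tg.local": {"tgdevmvc01.tg.local", "tgdevwvc01.tg.local"},
}


def chain_ends(topology):
    """Return the two nodes with a single partner, or None if there are not exactly two."""
    ends = sorted(node for node, partners in topology.items() if len(partners) == 1)
    return tuple(ends) if len(ends) == 2 else None


def createagreement_command(source, target, user="administrator"):
    """Build the command line only; flags are copied from the command above."""
    args = ["./vdcrepadmin", "-f", "createagreement", "-2",
            "-h", source, "-H", target, "-u", user]
    return " ".join(shlex.quote(arg) for arg in args)


def with_agreement(topology, a, b):
    """Return a copy of the topology with a two-way agreement added between a and b."""
    updated = {node: set(partners) for node, partners in topology.items()}
    updated[a].add(b)
    updated[b].add(a)
    return updated


ends = chain_ends(TOPOLOGY)
if ends:
    print(createagreement_command(*ends))
    repaired = with_agreement(TOPOLOGY, *ends)
    # Every node should now report two partners.
    print({node: len(partners) for node, partners in repaired.items()})
```

For the topology in this post, the suggested agreement is between tgdevmvc01.tg.local and tgdevwvc02.tg.local, which matches the one created by hand.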
Hopefully this helps someone out.