Microsoft Monitoring Agent - Clear cache and HybridWorker Group
The Issue
Some time ago I wrote how to clear cache of Microsoft Monitoring Agent on your on-premises servers. A few days back I experienced a weird issue with Azure Automation.
Currently I have around 200 OSes (mainly VMs) connected to one of our workspaces. Some of them partialy
lost connectivity.
They could report back to Azure Automation about patch status, but were unable to receive any jobs - including patch deployment schedules!
Those machines were stuck in Failed to start
status:
I’ve checked network connectivity, run on-premises troubleshooting script, but that didn’t get me anywhere. All tests came green.
So I went to System hybrid worker groups
tab to check if my worker is there:
- go to your Azure Automation account
- Select Hybrid Worker groups under
Process Automation
menu:
This got me the list of all of my systems connected, and most important - the last time they were seen.
When I compared all the Not configured
instances and those worker groups that didnt report in last few hours - they all matched.
We have a lead…
Clean up time
There were also OSes that didn’t report becuase… they were already decommissioned. I could delete the worker group from the portal but… that would work for a single instance. I had more of them and I wanted to automate it for my decommission process as well.
If you’d like to remove from the GUI, then click on the server and then click delete:
AzureRM.Automation
There are functions in AzureRM.Automation module that I used to query all Hybrid Worker Groups and delete them. Those are Get-AzureRmAutomationHybridWorkerGroup and Remove-AzureRmAutomationHybridWorkerGroup. As you can see at the docs pages Get-
is available in 5.7.0 version and Remove-
in 6.13.0.
I wrote a small script that will:
- Check if required AzureRM module is installed
- If so, will connect to Azure (interactive logon - mainly because I’m using MFA and that cannot be ‘bypassed programaticaly for non-service accounts’)
- Using Out-GridView as interactive menu will:
- Select proper subscription
- Select resource groupName
- Select azure automation account
- Query for all Hybrid Worker Group accounts
- Select only those that didn’t respond in
PastDays
days or more - Remove Hybrid Worker group account
- Use the same list of servers to connect to them using Invoke-Command and clear cache (code from my previous post)
- Will use errorAction preference of
SilentlyContinue
to avoid issues with already deleted machines.
- Will use errorAction preference of
So here it is:
$WorkspaceID = 'xxxxx-da5f-yyyy-bfbf-zzzzzzzzzz'
$WorkspaceKey = 'YourWorkspaceSuperSecretKey'
$PastDays = 3 #How aggressive to clean up
#region Remove Hybrid Worker from Azure
if ((Get-Command Get-AzureRmAutomationHybridWorkerGroup).Version.Major -le 5) {
Write-Host "Please Update AzureRM.Automation module by running : 'Update-Module azurerm.automation -force' as Administrator"
} else {
Connect-AzureRmAccount
Get-AzureRmSubscription | Out-GridView -passthru | Select-AzureRmSubscription
$ResourceGroupName = Get-AzureRmResourceGroup | Out-GridView -PassThru | Select-Object -ExpandProperty ResourceGroupName
$AutomationAccountName = Get-AzureRmAutomationAccount | Out-GridView -PassThru | Select-Object -ExpandProperty AutomationAccountName
$date = (Get-Date).adddays(-$PastDays)
$workers = Get-AzureRmAutomationHybridWorkerGroup -ResourceGroupName $ResourceGroupName -AutomationAccountName $AutomationAccountName
$workers | where-object {$PSItem.RunbookWorker.LastSeenDateTime -le $date} | foreach-object {
Write-Host "Processing {$($PSItem.RunbookWorker.Name)}"
$PSItem | Remove-AzureRmAutomationHybridWorkerGroup
}
}
#endregion
$servers = (($workers | where-object {$PSItem.RunbookWorker.LastSeenDateTime -le $date}).RunbookWorker).Name
#region Reload configuration and report back
Invoke-Command -ComputerName $servers -ErrorAction SilentlyContinue -ScriptBlock {
$WorkspaceID=$USING:WorkspaceID
$WorkspaceKey = $USING:WorkspaceKey
#Create COM Object to manipulate MMA configuration
$AgentCfg = New-Object -ComObject AgentConfigManager.MgmtSvcCfg
# Remove desired OMS Workspace
if($AgentCfg.GetCloudWorkspace($WorkspaceID)) {
$AgentCfg.RemoveCloudWorkspace($WorkspaceID)
}
Stop-Service -ServiceName 'HealthService'
Start-Sleep -Seconds 3
#Remove files
Remove-item -path 'C:\Program Files\Microsoft Monitoring Agent\Agent\Health Service State' -Force -confirm:$false -Recurse
#Remove registry
Get-ChildItem 'HKLM:\software\microsoft\hybridrunbookworker' | Remove-Item -Force -confirm:$false -Recurse -ErrorAction SilentlyContinue
#let it rest a while. It was a hard task! :)
Start-sleep -Seconds 5
Start-Service -ServiceName 'HealthService'
Start-Sleep -Seconds 3
# Add OMS Workspace
$AgentCfg.AddCloudWorkspace($WorkspaceID,$WorkspaceKey)
$AgentCfg.ReloadConfiguration()
Start-Sleep -seconds 3
$AgentStatus = $AgentCfg.GetCloudWorkspaces() | select-object WorkspaceID,ConnectionStatus,ConnectionStatusText
$ServiceStatus = Get-Service 'HealthService' | select-object Status,Name
$HybridWorkerRegister = Get-ChildItem 'HKLM:\software\microsoft\hybridrunbookworker'
[pscustomobject]@{
ComputerName = $env:ComputerName
AgentWorkspaceID = $AgentStatus.WorkspaceID
AgentConnectionStatus = $AgentStatus.ConnectionStatus
AgentConnectionStatusText = $AgentStatus.ConnectionStatusText
ServiceName = $ServiceStatus.Name
ServiceStatus = $ServiceStatus.Status
HybridWorkerName = $HybridWorkerRegister.Name
}
}
#endregion
Summary
Sometimes clearing the cache only on the agent is not enough.
I still don’t know what caused the issue where some of my workers stopped responding. I have my blind guess that it was a network fault at some point. It’s always the network. Or DNS.
Anyway - my Azure Automation account is back in full swing and my schedules are patching and rebooting while I can do something else - like watch Ignite sessions!
Cheers!
Leave a comment