Following the SLA badge incident in Pod 13, several remediation steps were taken: reinvestigating how environment variables are organized and passed so that they are available when the system restarts, improving the turnaround time for fixing broken SLAs by updating the 'funfiller', and reviewing monitoring and alerts.
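As an illustration of the environment-variable remediation, a service can check at startup that the variables it depends on are present, and refuse to start if they are not. The following is a minimal sketch of that general idea, not Zendesk's actual implementation; the function name, the exception class, and the variable names `REDIS_HOST` and `REDIS_PORT` are hypothetical.

```python
import os


class MissingEnvironmentError(RuntimeError):
    """Raised when required environment variables are absent or empty."""


def check_required_env(required, environ=None):
    """Return the required variables as a dict, or raise if any are missing.

    `required` is a list of variable names (hypothetical names in this sketch).
    `environ` defaults to the real process environment; a plain dict can be
    passed instead to make the check easy to test.
    """
    env = os.environ if environ is None else environ
    # A variable counts as missing if it is unset or set to an empty string.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise MissingEnvironmentError(
            "missing environment variables: " + ", ".join(sorted(missing))
        )
    return {name: env[name] for name in required}


if __name__ == "__main__":
    # Hypothetical variable names; a real service would list its own.
    try:
        config = check_required_env(["REDIS_HOST", "REDIS_PORT"])
        print("environment ready:", sorted(config))
    except MissingEnvironmentError as exc:
        print("startup aborted:", exc)
```

Failing at startup in this way makes a missing variable visible immediately after a restart, rather than later when a dependent component is first used.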
The missing SLA badges on February 9, 2024, were due to a malfunction in one of the Kubernetes pods in Pod 13. This pod experienced an unplanned restart, which disrupted the 'redis' host, a critical component for the Metric Event Service (MES)…
The SLA badge issue for Zendesk Pod 13 was resolved by redeploying the malfunctioning Kubernetes pod. This action restored the missing SLA events, which were then backfilled. However, the backfill process inadvertently removed SLA data on closed…
The SLA badge issue had a significant impact on closed tickets in Zendesk. During the resolution process, the backfill/restoration of data inadvertently removed SLA data on closed tickets, resulting in 'Null' SLA data in Explore. This was an…
For more information about Zendesk system incidents, you can check the system status page. This page provides current system status information and usually includes a summary of post-mortem investigations a few days after an incident has ended. If…