The root cause of the Zendesk incident on July 23, 2024, was the rollout of a new feature for managing team members' permissions. The feature allowed agents in custom roles to manage other team members and their role assignments. The rollout significantly increased requests to the internal permissions service, saturating the capacity of its database cluster.
The increased traffic pushed the cluster to its maximum network bandwidth, which caused a networking failure between the cluster and the service’s app servers. This failure caused the access issues that customers on Pod 29 experienced.
On July 23, 2024, Zendesk experienced a service incident affecting customers on Pod 29. From 10:58 UTC to 14:57 UTC, users faced issues accessing Zendesk products, including the Admin Center, through the Product Tray. Approximately 1% of customer…
Zendesk resolved the service incident on July 23, 2024, by initially increasing the capacity of the permissions service’s database instance. This provided a short-term recovery while the root cause was being identified. Once the root cause was…
To prevent future incidents similar to the one on July 23, 2024, Zendesk planned several remediation items. These include reducing network traffic from permissions checks, which is currently in progress, and scheduling additional monitors and…
During the Zendesk incident on July 23, 2024, customers on Pod 29 experienced several errors. These included the inability to access Zendesk products through the Product Tray and receiving 503 errors when accessing authenticated features within…