Maintenance downtime logsheet and graph

Between 02:20 UTC and 07:30 UTC on 23 March 2023 you may have experienced issues using Azure Resource Manager (ARM) when performing resource management operations in the West Europe region. This impacted users of Azure CLI, Azure PowerShell, the Azure portal, as well as Azure services which depend upon ARM for their internal resource management operations. The primary source of impact was limited to ARM API calls being processed in our West Europe region. This caused up to 50% of customer requests to this region to fail (approximately 3% of global requests at the time). This principally affected customers and workloads in geographic proximity to our West Europe region, while customers geographically located elsewhere would not have been impacted, with limited exceptions for VPN users and those on managed corporate networks. Additionally, Azure services that leverage the ARM API as part of their own internal workflows, and customers of these services, may have experienced issues managing Azure resources located in West Europe as a result. This incident was the result of a positive feedback loop leading to saturation on the ARM web API tier. This was caused by high-volume, short-held lock contention on the request serving path, which triggered a significant increase in spin-waits against these locks, driving up CPU load and preventing threads from picking up asynchronous background work.
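
To make the shape of that feedback loop concrete, the toy model below has many request workers repeatedly taking a single short-held lock. It is an illustrative sketch only, written in Go for brevity; it is not ARM's implementation, the numbers are invented, and the exact spin-wait behavior differs across runtimes. Running it typically shows lock throughput flattening or degrading as the worker count grows, because time shifts from useful work to contending for the lock.

```go
// Toy model of short-held lock contention on a hot request path.
// Illustrative sketch only: names and numbers are invented and this does not
// reproduce ARM's actual web API code or runtime.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	for _, workers := range []int{8, 64, 512} {
		var (
			mu     sync.Mutex
			served int64
			wg     sync.WaitGroup
			stop   = make(chan struct{})
		)

		for i := 0; i < workers; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				for {
					select {
					case <-stop:
						return
					default:
					}
					// Each "request" holds the shared lock only briefly,
					// but every request must pass through it.
					mu.Lock()
					served++
					mu.Unlock()
				}
			}()
		}

		time.Sleep(500 * time.Millisecond)
		close(stop)
		wg.Wait()
		fmt.Printf("%4d request workers -> %d lock acquisitions in 500ms\n",
			workers, served)
	}
}
```

On a real service those wasted cycles also compete with asynchronous completions and timers, which is why, as the report describes, latency for asynchronous operations rose and timeouts followed well before any single request spent long inside the lock itself.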

As a result of this, latency for long-running asynchronous operations (such as outgoing database and web requests) increased, leading to timeouts. These timeouts caused both internal and external clients to retry requests, further increasing load and contention on these locks, eventually causing our Web API tier to saturate its available CPU capacity. There are several factors which contribute to increasing the feedback on this loop; however, the ultimate trigger was the recent introduction of a cache used to reduce the time spent parsing complex feature flag definitions in hot loops. This change was intended to reduce the performance impact of using feature flags on the request serving path, and had been previously load tested and validated in our internal testing and canary environments, demonstrating a significant reduction in performance impact in these scenarios. This change was rolled out following our standard safe deployment practices, progressively deployed to increasingly larger regions over the course of four days prior to being deployed to West Europe. Over this period, it was not exposed to the problematic call pattern, and none of these regions exhibited anomalous performance characteristics. When this change was deployed to our West Europe region, it was subjected to a call pattern unique to a specific internal service which exercised this cache path more heavily than the broad-spectrum workloads we had tested in our internal and canary environments. Approximately 24 hours after it was deployed to West Europe, a spike in traffic from this internal service that executed a daily cache refresh was able to induce enough lock contention to start this positive feedback loop across a significant portion of the ARM web API instances in the region. This cascaded as the service in question retried failed requests and, over the course of 20 minutes, the region progressed from a healthy to heavily saturated state. These recent contributors combined with several other factors to trigger and exacerbate the impact, including:

- A legacy API implementation, whose responses varied infrequently, made heavy use of costly data transforms on each request without caching.
- The introduction of a new feature flag, which influenced the data transforms applied to this legacy API, as well as several others, in support of ongoing improvements to regional expansion workflows.
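
The cache path described above can be pictured as something like the sketch below: a map of parsed feature-flag definitions guarded by a single mutex, where a scheduled refresh empties the cache and the next burst of requests serializes on the lock while re-parsing. All type, function, and flag names here are hypothetical, chosen only to illustrate the failure mode; the actual ARM implementation is not described in this write-up.

```go
// Hypothetical sketch of a lock-guarded cache of parsed feature-flag
// definitions. None of these names come from ARM's code base; they only
// illustrate how a refresh plus a traffic spike can funnel every request
// through one exclusive lock.
package main

import (
	"fmt"
	"strings"
	"sync"
)

type flagDefinition struct {
	rules []string // stand-in for a complex, expensive-to-parse definition
}

type flagCache struct {
	mu     sync.Mutex
	parsed map[string]*flagDefinition
}

// get returns the cached parsed definition, parsing it on a miss.
// Both the hit and the miss path take the same exclusive lock, so a burst
// of misses after a refresh makes every request wait its turn here.
func (c *flagCache) get(name, raw string) *flagDefinition {
	c.mu.Lock()
	defer c.mu.Unlock()
	if def, ok := c.parsed[name]; ok {
		return def
	}
	def := &flagDefinition{rules: strings.Split(raw, ";")} // the real parse is far more costly
	c.parsed[name] = def
	return def
}

// refresh discards every parsed entry, e.g. when definitions are reloaded
// on a schedule. Immediately afterwards, all requests miss at once.
func (c *flagCache) refresh() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.parsed = make(map[string]*flagDefinition)
}

func main() {
	c := &flagCache{parsed: make(map[string]*flagDefinition)}
	def := c.get("regional-expansion-flag", "ruleA;ruleB;ruleC")
	fmt.Println("parsed", len(def.rules), "rules")
	c.refresh() // after this, the next wave of get calls all re-parse under the lock
}
```

A read-mostly design (for example a sync.RWMutex, or atomically swapping an immutable map on refresh) is one common way to keep cache hits off an exclusive lock; the report above does not state what change was ultimately made.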
