An update to our codebase that introduced new performance logging caused a slight performance degradation. During peak hours, this overloaded our servers and forced them to restart, which led to failed requests and left the site unavailable for short periods of time.
Between 12:07 and 14:47 UTC on Aug 27th, guides and Knowledge Bases intermittently failed to load, returning a 500 response code. Reloading the page should have resolved the error.
After analyzing the issue, we're implementing a set of improvements to our code, infrastructure, and processes:
- We’ll use an alternative approach to performance monitoring.
- We’ll improve our performance testing to better mimic production traffic patterns.
- We’ll roll out changes that may affect performance gradually, one server at a time, so that any impact is limited to a single server rather than the whole platform.
- We’ll redo our capacity planning and consider adding more CPU power to better accommodate increases in traffic and load.
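
To illustrate the one-server-at-a-time rollout, here is a minimal Python sketch. It is not our actual deployment tooling: the function `rolling_deploy`, its `deploy`, `is_healthy`, and `rollback` callbacks, and the server names are all hypothetical and chosen only for the example. The idea is to update a single server, check it, and stop as soon as one looks unhealthy, so the rest of the platform stays on the previous version.

```python
from typing import Callable, Iterable, List


def rolling_deploy(
    servers: Iterable[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Deploy to one server at a time and halt at the first unhealthy one.

    All callbacks are hypothetical placeholders for real tooling.
    Returns the servers that were updated and passed the health check.
    """
    updated: List[str] = []
    for server in servers:
        deploy(server)
        if not is_healthy(server):
            # Revert only the affected server and stop the rollout,
            # leaving the remaining servers untouched.
            rollback(server)
            break
        updated.append(server)
    return updated


# Example run with stand-in callbacks: "web-2" fails its health check.
deployed: List[str] = []
rolled_back: List[str] = []
result = rolling_deploy(
    ["web-1", "web-2", "web-3"],
    deploy=deployed.append,
    is_healthy=lambda server: server != "web-2",
    rollback=rolled_back.append,
)
```

In this run the rollout stops at `web-2`: only `web-1` remains on the new version, `web-2` is rolled back, and `web-3` is never touched. A real health check would typically watch metrics such as CPU load or error rate for some time before moving on to the next server.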