SSO Downtime

Incident Report for Acenda

Postmortem

Postmortem Report: System Downtime at Acenda

Incident Overview

Date and Time of Downtime: November 14, 2023, 5:00 AM to 7:00 AM PST (1:00 PM to 3:00 PM GMT)

Services Impacted: Front-end Admin Tool

Services Unaffected: All Backend Services

Root Cause Analysis
The primary cause of the downtime was advanced logging that had been activated in our Single Sign-On (SSO) tool and was never deactivated. The enhanced logging wrote far more data to the database than anticipated, and the resulting load overburdened the system and degraded the performance and availability of our front-end admin tools.

Impact
The downtime affected only the front-end admin tools. Users who tried to access these tools during the outage experienced disruptions and could not perform their usual tasks. All backend services continued to operate normally, so the core functionality of our services remained available to end users.

Resolution and Recovery
The issue was identified and resolved by disabling the advanced logging feature, which immediately relieved the load on the database. The system then returned to normal operation.
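
The fix in this incident was a manual switch-off. As a minimal sketch of how verbose logging could instead be time-boxed so that it reverts on its own, the following Python example uses only the standard `logging` and `threading` modules. It is illustrative only: this report does not describe the SSO tool's actual logging mechanism, and the function name `enable_verbose_logging` and the logger name `sso` are hypothetical.

```python
import logging
import threading


def enable_verbose_logging(logger, duration_seconds):
    """Raise `logger` to DEBUG and schedule an automatic revert.

    Returns the timer so the caller can cancel it early or wait for it.
    (Hypothetical helper; not part of any SSO product.)
    """
    previous_level = logger.level

    def revert():
        # Restore whatever level was configured before verbose mode.
        logger.setLevel(previous_level)

    logger.setLevel(logging.DEBUG)
    timer = threading.Timer(duration_seconds, revert)
    timer.daemon = True  # do not keep the process alive just for the revert
    timer.start()
    return timer


if __name__ == "__main__":
    sso_logger = logging.getLogger("sso")  # assumed logger name
    sso_logger.setLevel(logging.WARNING)
    # Verbose logging for 30 minutes, then back to WARNING automatically.
    enable_verbose_logging(sso_logger, 30 * 60)
```

An in-process timer like this does not survive a restart, so a real deployment would likely also need the expiry recorded somewhere persistent.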

Preventive Measures and Future Steps

Improved Monitoring: We are implementing enhanced monitoring for our SSO tool so that any unusual database storage activity is flagged immediately, giving us time to act before it causes a similar outage.

Logging Policy Review: We will conduct a thorough review of our logging policies and procedures to ensure that logging, especially advanced logging features, is used judiciously and deactivated after use.

Staff Training: Our IT staff will receive additional training on the potential impact of changes to system settings, especially in critical tools such as SSO.

Regular System Audits: We will institute regular audits of system settings and resource usage to catch potential issues before they escalate into system-wide problems.
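
As a rough illustration of the kind of storage-growth check described under Improved Monitoring, the sketch below compares two database size samples and flags growth above a threshold. It is not our actual monitoring implementation: the names `SizeSample` and `growth_alert` and the threshold value are assumptions made for the example, and obtaining the size readings from the database is left out.

```python
from dataclasses import dataclass


@dataclass
class SizeSample:
    taken_at_hours: float  # hours since an arbitrary reference point
    size_bytes: int        # database size observed at that time


def growth_alert(older, newer, max_bytes_per_hour):
    """Return True if storage grew faster than `max_bytes_per_hour`."""
    elapsed = newer.taken_at_hours - older.taken_at_hours
    if elapsed <= 0:
        raise ValueError("samples must be in chronological order")
    rate = (newer.size_bytes - older.size_bytes) / elapsed
    return rate > max_bytes_per_hour


if __name__ == "__main__":
    # Made-up readings for illustration; not figures from the incident.
    before = SizeSample(taken_at_hours=0.0, size_bytes=10_000_000)
    after = SizeSample(taken_at_hours=2.0, size_bytes=50_000_000)
    if growth_alert(before, after, max_bytes_per_hour=5_000_000):
        print("Unusual database growth detected")
```

Comparing a growth rate rather than an absolute size is a design choice for the example: it would flag a sudden change, such as a newly enabled logging mode, before the storage limit itself is reached.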

Conclusion
This incident highlights the need for vigilant monitoring and management of system resources, especially when implementing changes that could significantly impact system performance. The corrective steps and preventive measures outlined above are aimed at bolstering our systems against similar incidents and ensuring the reliability and efficiency of our services.

Posted Nov 15, 2023 - 10:33 PST

Resolved

**Incident Overview**
- **Date and Time of Downtime**: November 14, 2023, 5:00 AM to 7:00 AM PST (1:00 PM to 3:00 PM GMT)
- **Services Impacted**: Front-end Admin Tool
- **Services Unaffected**: All Backend Services

Posted Nov 14, 2023 - 05:00 PST