Partial Outage of Pingboard Web Application
Incident Report for Pingboard
Postmortem

Earlier today there was an outage affecting much of the Pingboard application. We know how disruptive this can be and we apologize for the impact on you and your operations. In addition to the following description of the cause of the incident, we have identified several next steps, outlined below, to help prevent similar issues in the future.

What was the impact of this incident?

This was a partial outage of the Pingboard application lasting approximately 5 hours, from ~2AM Central time to ~7AM Central time. The outage affected most of the pages of the Pingboard application. Unaffected areas included user profile pages, the org chart page, unauthenticated web pages, and most API endpoints. Users with existing valid sessions retained more access and were able to navigate more of the site.

Why was the site unavailable?

The Pingboard application uses a Redis database to support many of its functions. At approximately 2AM we lost connectivity to that database, at first intermittently, then completely. The aspects of the application that rely on the Redis database started to return errors.

Why did the connectivity to Redis fail?

We have contacted our vendor for additional details and will update this report as we learn more, but we do have some insight based on how the incident was resolved. Our Redis instance was an older version approaching its EOL date. The new version requires a different method of connectivity: prior versions used stunnel to encrypt traffic between the application and Redis, while newer versions support direct TLS connections. We had a version upgrade scheduled and planned for the upcoming weekend, prior to the EOL date. It appears a change on the vendor side caused the existing method of connectivity to start failing ahead of the planned changeover, which triggered the incident.
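To illustrate the difference between the two connection methods, here is a minimal sketch using a generic Redis client library; the hostnames, ports, and credentials are placeholders, not Pingboard's actual configuration. With stunnel, the application speaks plaintext to a local tunnel process that handles TLS to the managed Redis instance; with direct TLS support, the client encrypts the connection itself.

    import redis

    # Old approach (sketch): the app connects in plaintext to a local stunnel
    # endpoint, and stunnel forwards the traffic over TLS to the managed
    # Redis instance. "localhost:6380" is a placeholder for the tunnel port.
    legacy_client = redis.Redis(host="localhost", port=6380)

    # New approach (sketch): the client connects directly over TLS using a
    # rediss:// URL; no local tunnel process is involved. The URL below is
    # illustrative only.
    tls_client = redis.from_url(
        "rediss://:password@redis.example.com:6379/0",
        ssl_cert_reqs="required",  # verify the server certificate
    )

    # A simple connectivity check works the same way with either client.
    tls_client.ping()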

Why didn’t an alert go off when the incident began?

Pingboard’s traffic varies widely by time of day. At the time the incident began, the ratio of failing traffic to succeeding traffic was not high enough to trigger an alert. Alerts are triggered when more than 5% of traffic results in an error, and the vast majority of the traffic at that hour was automated API requests, which were succeeding. As more human traffic came on starting around 6AM Central time, the ratio of failing requests exceeded our threshold and alerts fired.
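As a rough illustration of why the threshold was not crossed overnight (the request counts below are hypothetical, not actual Pingboard metrics): when succeeding automated API requests dominate the traffic mix, even a complete failure of human-facing page loads can stay under a 5% overall error rate.

    # Hypothetical traffic numbers for illustration only.
    THRESHOLD = 0.05  # alert when > 5% of requests fail

    def error_rate(failed, total):
        return failed / total if total else 0.0

    # ~2AM: mostly automated API traffic, which was unaffected.
    # ~6AM: human traffic ramps up and most of it hits affected pages.
    samples = {"overnight": (40, 1000),   # 4.0% -> below threshold, no alert
               "morning":   (300, 2000)}  # 15.0% -> alert fires

    for label, (failed, total) in samples.items():
        rate = error_rate(failed, total)
        print(f"{label}: {rate:.1%} failing -> alert: {rate > THRESHOLD}")

With these hypothetical numbers, a 3% threshold would have fired during the overnight window as well, which is the motivation for the follow-up action described below.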

How was the problem addressed?

Once the incident was identified as being related to Redis connectivity, the decision was made to proceed with the version upgrade ahead of schedule and roll out the new method of connectivity. This was successful in resolving the issue.

What follow up actions have been identified?

  • Add a health check and alert specifically for Redis connectivity (a sketch of such a check follows this list).
  • Lower the threshold for total errors to trigger alerts. The threshold may require some tuning to avoid too many false alarms, but initial research shows that a 3% threshold would have triggered alerts promptly for this incident, so we have made that change immediately.
  • Analyze ways to limit the impact of any future Redis outages. Since some areas of the application remained functional during this incident, it may be possible to make other areas more tolerant of this type of failure so that any future impact is minimized.
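For the first follow-up item, here is a minimal sketch of what a Redis-specific connectivity check could look like, assuming a generic Redis client and alerting hook; the URL, timeout, and alert call are placeholders, not our actual monitoring configuration.

    import redis

    def check_redis(url="rediss://redis.example.com:6379/0", timeout=2.0):
        """Return True if Redis answers PING within the timeout."""
        try:
            client = redis.from_url(url,
                                    socket_connect_timeout=timeout,
                                    socket_timeout=timeout)
            return bool(client.ping())
        except redis.RedisError:
            return False

    def run_health_check():
        if not check_redis():
            # Placeholder for the real paging/alerting integration.
            print("ALERT: Redis connectivity check failed")

    if __name__ == "__main__":
        run_health_check()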
Posted May 17, 2023 - 12:51 CDT

Resolved
There was a partial outage of the Pingboard application lasting approximately 5 hours, from ~2AM Central time to ~7AM Central time. The outage affected most of the pages of the Pingboard application. Unaffected areas included user profile pages, the org chart page, unauthenticated web pages, and most API endpoints. Users with existing valid sessions retained more access and were able to successfully navigate much of the site.
Posted May 17, 2023 - 02:00 CDT