Earlier today there was an outage affecting much of the Pingboard application. We know how disruptive this can be, and we apologize for the impact on you and your operations. In addition to the following description of the cause of the incident, we have outlined several next steps below describing our efforts to prevent similar issues in the future.
This was a partial outage of the Pingboard application lasting approximately 5 hours, from ~2AM Central time to ~7AM Central time. The outage affected most pages of the Pingboard application. Unaffected areas included user profile pages, the org chart page, unauthenticated web pages, and most API endpoints. Users with existing valid sessions retained broader access and were able to navigate more of the site.
The Pingboard application uses a Redis database to support many of its functions. At approximately 2AM we lost connectivity to that database, at first intermittently, then completely. The aspects of the application that rely on the Redis database started to return errors.
We have contacted our vendor for additional details and will update this as we find out more, but we do have some insight based on how the incident was resolved. Our Redis instance was an older version approaching its EOL date. The new version requires a different method of connectivity: prior versions used stunnel to encrypt traffic between the application and Redis, while newer versions support direct secure TLS connections. We had a version upgrade scheduled and planned for the upcoming weekend, prior to the EOL date. It appears there was a change on the vendor side which caused the existing method of connectivity to start failing ahead of the planned changeover, and that triggered the incident.
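To illustrate the connectivity change, here is a minimal sketch of the two modes. All hostnames and ports are hypothetical, and the function name is ours; the point is that the older setup sent plaintext to a local stunnel listener that handled encryption, while the newer setup has the client speak TLS to Redis directly:

```python
import ssl


def redis_connection_params(use_direct_tls: bool) -> dict:
    """Return client connection parameters for the two connectivity modes."""
    if use_direct_tls:
        # Newer setup: the client encrypts traffic itself over TLS.
        return {
            "host": "redis.example.com",  # hypothetical vendor hostname
            "port": 6380,                 # commonly used TLS port for Redis
            "ssl_context": ssl.create_default_context(),
        }
    # Legacy setup: plaintext to a local stunnel listener, which
    # forwards the traffic to the vendor over an encrypted tunnel.
    return {
        "host": "127.0.0.1",
        "port": 6379,
        "ssl_context": None,  # stunnel handles encryption, not the client
    }
```

With this shape, an upgrade like ours amounts to flipping the flag and retiring the local stunnel process.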
Pingboard’s traffic varies widely by time of day. At the time the incident began, the proportion of traffic that was failing was not high enough to trigger an alert. Alerts are triggered when > 5% of traffic results in an error. The vast majority of the traffic at that time was automated API requests, which were succeeding. As more human traffic came on starting around 6AM Central time, the ratio of failing requests exceeded our threshold and alerts fired.
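The alerting condition described above amounts to a simple ratio check. A minimal sketch (the function name and request counts are illustrative; only the 5% threshold comes from our actual alerting rule):

```python
def should_alert(failed: int, total: int, threshold: float = 0.05) -> bool:
    """Fire an alert when the error ratio exceeds the threshold."""
    if total == 0:
        return False
    return failed / total > threshold


# Overnight: mostly successful automated API requests keep the ratio
# under 5%, so no alert fires even though some pages are erroring.
overnight = should_alert(failed=40, total=1000)   # 4% -> no alert

# Morning: human traffic hits the failing pages and pushes the error
# ratio past the threshold, and the alert fires.
morning = should_alert(failed=120, total=1000)    # 12% -> alert
```

This is why the incident went undetected for roughly four hours: a ratio-based threshold is insensitive to a failure mode that affects only a small slice of overnight traffic.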
Once the incident was identified as a Redis connectivity problem, we decided to proceed with the version upgrade ahead of schedule and roll out the new method of connectivity. This resolved the issue.