Degraded Performance and Inability to Access Portal
Incident Report for Nextup.ai
Postmortem

Standard procedure is to add new compute resources to the Kubernetes cluster before removing the old resources. The new resources run side by side with the old ones without issue so long as no major version updates are being implemented. For major version updates, an outage window is scheduled even when no disruption is anticipated.
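For illustration only (these are not the exact commands used during the migration, and the node name is a placeholder), this side-by-side procedure typically involves confirming both node groups are serving workloads and then draining the old nodes:

    # List nodes with their managed node group label to confirm that the
    # old and new node groups are serving workloads side by side.
    kubectl get nodes -L eks.amazonaws.com/nodegroup

    # Once the new nodes are healthy, cordon and drain an old node so its
    # pods reschedule onto the new compute resources (node name is a placeholder).
    kubectl cordon ip-10-0-12-34.eu-central-1.compute.internal
    kubectl drain ip-10-0-12-34.eu-central-1.compute.internal \
      --ignore-daemonsets --delete-emptydir-data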

The new compute resources were provisioned using a different method, AWS launch templates, whereas the previous servers relied solely on the default EKS node group provisioning. The new methodology was part of a hotfix to overcome a Linux kernel limitation (https://github.com/awslabs/amazon-eks-ami/issues/1179) that intermittently prevented services from starting properly, which in turn caused occasional performance degradation during scale events.
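As a rough sketch of the new provisioning method (the cluster, node group, subnet, role, and launch template values below are placeholders, not our actual configuration), a managed node group backed by a custom launch template can be created like this:

    # Placeholder example: create a managed node group backed by a custom
    # launch template instead of the node group defaults.
    aws eks create-nodegroup \
      --cluster-name nextup-prod \
      --nodegroup-name node-group-b \
      --subnets subnet-aaa111 subnet-bbb222 \
      --node-role arn:aws:iam::123456789012:role/eksNodeInstanceRole \
      --launch-template id=lt-0abc123def4567890,version=1 \
      --scaling-config minSize=3,maxSize=12,desiredSize=6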

Anecdotal commentary on the Internet indicates that mixing the two provisioning methods results in different network security group rules being applied to the nodes, which can prevent pods running on compute resources in the different node groups from communicating with each other.
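A hedged example of how such a mismatch can be checked, assuming the eks:nodegroup-name tag that EKS managed node groups apply to their instances and placeholder node group names:

    # Compare the security groups attached to the instances in each
    # node group; a difference between the two lists points to the
    # pod-to-pod communication problem described above.
    for ng in node-group-a node-group-b; do
      echo "== $ng =="
      aws ec2 describe-instances \
        --filters "Name=tag:eks:nodegroup-name,Values=$ng" \
                  "Name=instance-state-name,Values=running" \
        --query 'Reservations[].Instances[].SecurityGroups[].GroupId' \
        --output text | tr '\t' '\n' | sort -u
    done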

During the migration, the cluster DNS pods moved from node group set “A” to node group set “B”, leaving all pods still running on node group set “A” unable to query DNS; only pods running on node group set “B” could resolve names successfully. Because all Nextup Slack services were running on node group set “A” at the time, their DNS queries failed and no outbound communications (Jira, Slack, APM systems, etc.) were possible.
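A sketch of the kind of test that confirms this behaviour, assuming the standard eks.amazonaws.com/nodegroup node label and placeholder node group names:

    # Run a short-lived pod pinned to each node group and test DNS
    # resolution from inside it; the lookup fails only from the node
    # group that cannot reach the DNS pods.
    for ng in node-group-a node-group-b; do
      kubectl run dns-test-$ng --rm -i --restart=Never \
        --image=busybox:1.36 \
        --overrides="{\"apiVersion\":\"v1\",\"spec\":{\"nodeSelector\":{\"eks.amazonaws.com/nodegroup\":\"$ng\"}}}" \
        -- nslookup kubernetes.default.svc.cluster.local
    done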

The issue was resolved by relaunching all services on nodes provisioned with the new methodology and removing all of the old nodes, at which point the security groups and security group rules were in complete alignment and all communications were restored.
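A sketch of this kind of remediation, with placeholder cluster, namespace, and node group names rather than our actual values:

    # Restart workloads so they reschedule onto the new nodes, then
    # drain and retire the old node group.
    kubectl rollout restart deployment -n nextup
    kubectl drain -l eks.amazonaws.com/nodegroup=node-group-a \
      --ignore-daemonsets --delete-emptydir-data
    aws eks delete-nodegroup \
      --cluster-name nextup-prod \
      --nodegroup-name node-group-a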

Posted Jun 16, 2023 - 16:20 EDT

Resolved
This incident has been resolved.
Posted Jun 16, 2023 - 16:08 EDT
Monitoring
We've identified the issue and have implemented a fix. We are working to confirm that the fix has taken effect and that all services are functioning normally.
Posted Jun 16, 2023 - 15:51 EDT
Update
We are continuing to investigate this issue.
Posted Jun 16, 2023 - 14:17 EDT
Investigating
We are aware of an issue causing slowness in the application and preventing users from logging in to the Nextup administrative portal. We are currently investigating and will update this page once we've isolated the cause.
Posted Jun 16, 2023 - 14:17 EDT
This incident affected: HelpDesk+ (Jira Service, Slack Service), Jira Integration+ (Jira Service, Slack Service), and Docs+ (German Data Center).