Issues accessing platform
Incident Report for Field Nation
Postmortem

On April 20, 2023, we experienced a DNS outage that significantly disrupted our infrastructure. The outage lasted 47 minutes, from 15:18 CT to 16:05 CT, and was resolved by identifying and restarting a single bad DNS service instance out of a set of 10.

Our team responded quickly and handled the situation calmly and carefully, but much of the time was spent investigating areas that were not the root cause. DNS errors were visible, but other secondary issues caused confusion and misdirection. Once the team confirmed that DNS was the central issue, it was quick work to identify and resolve the problematic DNS server instance. That instance had recently been added by an automated scaling operation. It was passing health checks and was believed to be operating correctly, but it was actually serving bad DNS responses to other services throughout our infrastructure.

Moving forward, we will continue to research the issue with the DNS instance and identify how to improve our health checking so that it fails when a DNS instance is serving bad responses. Once the health check detects the issue and reports the instance as unhealthy, our system will auto-correct by restarting the bad instance before it causes widespread problems. Additionally, we plan to identify alerting and monitoring mechanisms that help our response teams more quickly recognize DNS resolution failures as the core problem among other secondary issues.
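
As an illustration of the kind of health check described above, the following is a minimal sketch and not our actual implementation. It assumes a set of "canary" records whose correct answers are known in advance, and it reports an instance as unhealthy when a lookup fails, returns nothing, or returns an unexpected address. The names `CANARY_RECORDS`, `system_resolver`, and `dns_instance_healthy`, along with the hostname and IP address, are hypothetical. The resolver is passed in as a function; a real check would need one that queries the specific DNS instance under test rather than the host's default resolver.

```python
import socket
from typing import Callable, Dict, Set

# Hypothetical canary records: names whose correct answers are known ahead of time.
CANARY_RECORDS: Dict[str, Set[str]] = {
    "canary.internal.example": {"10.0.0.10"},
}

# A resolver takes a hostname and returns the set of IPv4 addresses it resolved to.
Resolver = Callable[[str], Set[str]]


def system_resolver(hostname: str) -> Set[str]:
    """Resolve through the host's configured resolver.

    This is only a stand-in: a real check would query the single DNS
    instance being evaluated, not whatever the host happens to use.
    """
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return {info[4][0] for info in infos}


def dns_instance_healthy(
    resolve: Resolver, canaries: Dict[str, Set[str]] = CANARY_RECORDS
) -> bool:
    """Return True only if every canary resolves to expected addresses."""
    for hostname, expected in canaries.items():
        try:
            answers = resolve(hostname)
        except OSError:
            # socket.gaierror is a subclass of OSError: the lookup failed outright.
            return False
        # An empty answer, or any address outside the expected set, is a bad response.
        if not answers or not answers <= expected:
            return False
    return True
```

The point of the sketch is that the check validates the content of the response, not just that the instance answers, so an instance serving wrong answers would fail the check even though it is reachable.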

Posted Apr 25, 2023 - 07:25 CDT

Resolved
We are confident this incident is fully resolved.
Posted Apr 20, 2023 - 16:30 CDT
Monitoring
We observed network instability within our infrastructure and believe we have found and resolved the core issue. Platform access appears to have been restored. We are continuing to monitor to ensure everything is operating normally.
Posted Apr 20, 2023 - 16:10 CDT
Investigating
We are investigating issues accessing our platform.
Posted Apr 20, 2023 - 15:23 CDT
This incident affected: Web App, Mobile App, and API.