Our Linux platform is configured to update automatically so that our operating systems always have the latest security updates and protection. The most recent update was applied on Thursday morning (2 Feb). On the following Saturday, some minor behavioral issues related to user sessions began to surface.
We began investigating immediately and found that the issues appeared every time a new server was added to our web cluster, and disappeared once that server had been online for about 15 minutes. The root cause was hard to spot earlier because we normally add servers automatically at 4:00 am and 6:00 am, during off-peak hours, in preparation for busier traffic later in the day. Those servers were fully operational well before peak hours, so most people likely never encountered or noticed a problem. On Saturday, however, warmer weather brought a surge of traffic that triggered our cluster to add two more servers at around 10:00 am. For roughly 15 minutes, those two servers produced the session errors.
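The correlation described above, errors clustering in the first 15 minutes after a server joins the cluster, can be checked with a short script. The following is only a minimal sketch: the function name, the log shape (server ID plus timestamp), and the data structures are assumptions made for illustration, not a description of our actual tooling.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

# Warm-up window taken from the roughly 15 minutes observed in this incident.
WARMUP = timedelta(minutes=15)


def fraction_of_errors_in_warmup(
    launch_times: Dict[str, datetime],
    errors: List[Tuple[str, datetime]],
    warmup: timedelta = WARMUP,
) -> float:
    """Return the share of errors logged within `warmup` of the server's launch.

    `launch_times` maps a (hypothetical) server ID to its launch time;
    `errors` is a list of (server ID, error timestamp) pairs.
    """
    if not errors:
        return 0.0

    hits = 0
    for server_id, error_time in errors:
        launched = launch_times.get(server_id)
        if launched is None:
            continue  # unknown server: cannot attribute the error to a launch
        age = error_time - launched
        if timedelta(0) <= age <= warmup:
            hits += 1
    return hits / len(errors)
```

A value close to 1.0 would suggest that nearly all errors come from freshly launched servers, which is the pattern that pointed us toward the new servers rather than the application itself.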
During the access issue on Saturday, our engineering team worked hard to identify and resolve the problem. Not yet knowing the root cause, they spun up 6 new servers and terminated the old ones in an attempt to fix it. This made things worse, because all 6 new servers were problematic for their first 15 minutes. The team continued to investigate possible causes, and the issue eventually subsided once the new servers had been online for more than 15 minutes.
Working into Saturday night, the engineering team traced the problem to a bug in the latest Amazon AWS platform update. The update had disabled automatic time sync, which left every new server out of sync with the master clock by about 1 second. After 20 minutes the servers would resync their clocks and the problem would resolve itself. We reported the bug to Amazon AWS, and they confirmed that it is a genuine issue and that their development team would begin working on a resolution. We are currently awaiting that resolution.
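To illustrate how a 1-second skew like this can be detected before a server takes traffic, here is a minimal sketch using the standard NTP-style offset estimate from four timestamps. It is not the check we or AWS run: the `TimeExchange` type, the function names, and the 0.1-second tolerance are all assumptions chosen for the example, and the timestamps would have to come from a real exchange with a reference clock.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeExchange:
    """Four timestamps (in seconds) from one request/reply with a reference clock."""
    t0: float  # local clock when the request was sent
    t1: float  # reference clock when the request arrived
    t2: float  # reference clock when the reply was sent
    t3: float  # local clock when the reply arrived


def clock_offset(x: TimeExchange) -> float:
    """Estimate how far the reference clock is ahead of the local clock.

    Uses the NTP-style formula ((t1 - t0) + (t2 - t3)) / 2, which cancels
    the network delay when it is roughly equal in both directions.
    A positive result means the local clock is behind the reference.
    """
    return ((x.t1 - x.t0) + (x.t2 - x.t3)) / 2.0


def ready_for_traffic(x: TimeExchange, max_skew: float = 0.1) -> bool:
    """Hypothetical gate: admit a server only if its clock is within max_skew seconds."""
    return abs(clock_offset(x)) <= max_skew
```

With a gate of this kind, a new server whose clock is about 1 second off would be held out of rotation until its clock resyncs, instead of serving requests during that window.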
What foreUP has done:
1. Reported the bug to AWS.
2. Rolled back all clusters to version 2.3.0.
3. Turned off automatic updates. We will review alternative solutions with AWS support once Amazon has rectified the issue.