Skype Exec Explains Massive Failure

So, why did Skype experience a giant outage during its most heavily utilized time of year? Server clusters, Windows bugs, and supernodes. The VoIP provider’s CIO, Lars Rabbe, explained on the company’s blog today why the network went down for roughly 24 hours on December 22 and 23.

According to Rabbe, several servers became overloaded on the 22nd, setting into motion a series of unfortunate events culminating in the big crash. “As a result of this overload, some Skype clients received delayed responses from the overloaded servers,” said Rabbe. “In a version of the Skype for Windows client (version 5.0.0152), the delayed responses from the overloaded servers were not properly processed, causing Windows clients running the affected version to crash.”

Due to that crash, 25 to 30 percent of the system’s supernodes failed. Is a supernode important? I’m glad you asked. Rabbe again:

“A supernode is important to the P2P network because it takes on additional responsibilities compared to regular nodes, acting like a directory, supporting other Skype clients, helping to establish connections between them and creating local clusters typically of several hundred peer nodes per each supernode.”

Once a supernode has failed, even when restarted, it takes some time to become available as a resource to the P2P network again. As a result, the P2P network was left with 25-30% fewer supernodes than normal. This caused a disproportionate load on the remaining available supernodes.

Rabbe promised that the company is doing its best to avoid such issues in the future, through bug fixes, problem detection, and infrastructure reviews. “Lessons will be learned and we will use this as an opportunity to identify and introduce areas of improvement to our software, further assess and invest in capacity and stability, and develop better processes for outage recovery and communications to our user base.”
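
To put a rough number on the extra load that the surviving supernodes had to absorb: if the same client population is spread evenly over whatever supernodes remain, losing a fraction f of them multiplies each survivor's share by 1 / (1 - f). The short Python sketch below works that out for the 25 and 30 percent figures Rabbe cites. The even-spread model and the function name `load_multiplier` are illustrative assumptions, not a description of how Skype actually balances its network.

```python
def load_multiplier(failed_fraction: float) -> float:
    """Per-supernode load, relative to normal, after some supernodes fail.

    Assumption (not from Skype): the total client load stays the same and
    is shared evenly among the supernodes that are still running.
    """
    if not 0 <= failed_fraction < 1:
        raise ValueError("failed_fraction must be in [0, 1)")
    return 1 / (1 - failed_fraction)


# The two ends of the range quoted in the article.
for failed in (0.25, 0.30):
    extra = (load_multiplier(failed) - 1) * 100
    print(f"{failed:.0%} of supernodes lost -> about {extra:.0f}% more load per survivor")
```

Under that simplified model, a 25 to 30 percent loss of supernodes translates into roughly a third to just over two fifths more work for each one left standing, which gives a sense of why the remaining nodes came under strain.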