Skype Intermittent Queue Issues

Service

Skype for Business

Type

Incident

Description

*** UPDATE: March 24, 12:00pm
It has been over 24 hours since the fiber issues were resolved, and we have observed zero packet loss and no errors on any of our Skype systems since. All telemetry data indicates that no calling issues are being reported. We are closing this alert at this time. If anyone continues to have issues with any calling scenario, please do not hesitate to open an RT so IST can investigate. Thanks again for everyone's patience while many of us worked together to resolve this issue.

*** UPDATE: March 23, 12:45pm
At this point every Skype system in the environment is running clean, and there are no known application or network connection failures to the back-end. We will continue to monitor the situation and provide updates shortly. All telemetry data reviewed this morning after the network maintenance shows no known or unexpected call failures. We are in contact with groups who had known issues to validate that they are no longer experiencing problems before we officially close this case.

*** UPDATE: March 23, 10:30am
IST performed maintenance on the fiber connection from the network switch where the primary SQL server is hosted, and this appears to have stabilized the packet loss. Multiple tests indicate the packet loss has been resolved. We will monitor the SQL system for any application-level failures over the next hour or two and go from there. If this does not resolve the issue for the remaining affected users, we will continue to work with them to fix any remaining specific Skype calling issues. Further remediation attempts on the Skype side were not successful yesterday, but now that we have a stable system again it will be easier to pinpoint any lingering issues and resolve them as soon as possible.


*** UPDATE: March 22, 9:00am
On Friday night IST ended up performing a pool failover of the MC users to the EC2 pool so we could take the affected SQL server out of service to replace the network card. This work started around 7pm and the entire system was back in service at 10:30pm. The system seems to be in a slightly better state, but we are still seeing the database timeouts and network loss occur. A priority case will be opened with the Microsoft SQL team to assist with the next steps. We will also continue to attempt further remediation steps for affected users, try to determine specifically who is affected, and communicate whether our efforts are successful.

*** UPDATE: March 19, 5:22pm
IST continues to investigate the packet loss, and this activity will continue for some time today. Unless this testing confirms that we can resolve the issue, we may not provide further updates to this case until Monday morning, when we will report on any weekend activities that take place.

*** UPDATE: March 19, 3:15pm
We have swapped the network port on the physical SQL server to another available port, and the issue still appears to persist. Next troubleshooting steps are to be determined, but we may need to replace the network card in the SQL server, which will require more effort and will be communicated as needed. All other operations appear to be working as previously stated.

*** UPDATE: March 19, 11:27am
We are updating the description to remove the reference to the Mac client, as the issue is affecting Windows clients as well. We have found a consistent 1% packet loss occurring on the SQL database server and are attempting to determine its cause. IST may replace the network cable to the server and possibly the network card in the server itself. Once we begin any of these tasks we will let everyone know, in case the work requires downtime.
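
As a rough illustration of how a loss figure like the 1% above can be derived, the sketch below computes packet loss from probe counts. The function name and the sample counts are hypothetical and are not taken from IST's actual tooling or measurements.

```python
def packet_loss_percent(sent: int, received: int) -> float:
    """Return the percentage of probes that received no reply."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    if not 0 <= received <= sent:
        raise ValueError("received must be between 0 and sent")
    return 100.0 * (sent - received) / sent


# Hypothetical sample: 1000 probes sent, 990 replies received.
loss = packet_loss_percent(1000, 990)
print(f"{loss:.1f}% packet loss")
```

A small but steady loss rate on a database link is worth tracking over many probes, since a short test may show no loss at all.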

*** UPDATE: March 18, 3:43pm
IST has been working diligently to determine the cause of the intermittent database connection failures and has not been able to come to a resolution at this time. Upon further testing, our records indicate that the following call types appear to be affected:

- Calls fail when a Windows client user homed on the MC Skype pool contacts a Skype queue answered by an agent in the EC2 Skype pool.
- Calls fail when a Mac client user homed on the MC Skype pool contacts a Skype queue answered by an agent in the MC Skype pool.

All other call types in our testing appear to be unaffected.

*** UPDATE: March 18, 11:22am
To ensure we are not adding more users to the MC pool, we have revised all our account creation automation scripts to add new users only to the EC2 pool until we resolve the issue with the MC pool. This may put a halt on the EC2 maintenance that was scheduled for next week. Updates on that will follow.
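
The kind of change described above can be sketched as a pool-selection rule that skips any pool closed to new accounts. This is a minimal illustration only: the names `POOLS`, `BLOCKED_POOLS`, and `choose_pool` are hypothetical, the pool labels are the short names used in this notice, and the actual provisioning scripts are not shown here.

```python
# Hypothetical pool labels taken from this notice; real pool names will differ.
POOLS = ("MC", "EC2")

# Assumption: pools temporarily closed to new accounts during the incident.
BLOCKED_POOLS = {"MC"}


def choose_pool(preferred: str) -> str:
    """Return the pool a new user should be homed on."""
    if preferred not in POOLS:
        raise ValueError(f"unknown pool: {preferred}")
    if preferred not in BLOCKED_POOLS:
        return preferred
    # Fall back to the first pool that is still open to new accounts.
    for pool in POOLS:
        if pool not in BLOCKED_POOLS:
            return pool
    raise RuntimeError("no pool is open to new accounts")


print(choose_pool("MC"))
```

Keeping the blocked pools in one place means the restriction can be lifted with a single change once the MC pool is healthy again.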

*** UPDATE: March 18, 11:00am
Attempts to remediate the issue by moving affected users have uncovered an intermittent database connection issue with the MC Skype pool SQL back-end server. Moving users did not resolve the issue because the process to migrate the user data to the EC2 pool did not complete properly. We are working with Network Services to address the issue ASAP. All users originally homed on the EC2 pool are still functioning properly with all clients and call types.

*** UPDATE: March 18, 10:00am
After the pool failback was completed last night, the issues still appear to be occurring. During troubleshooting we determined that this is not specifically a Mac client issue but a problem with the MC Skype pool itself. We are attempting some remediation steps to see if we can move affected enterprise voice users to the EC2 pool, which seems to be unaffected. Updates on this will be provided shortly.

*** UPDATE: March 17, 1:46pm
The issues appear to be related to the scheduled maintenance that we are currently performing. Since the initial maintenance has been completed, we will be putting the system back into normal operation tonight to see if this resolves the issues. If we still have issues we will continue to troubleshoot from there.

IST has been made aware of some potential calling scenarios that are not working with the Skype for Business Mac client. We are investigating the situation to determine which scenarios are impacted and will provide updates for remediation once they are determined.

Start time

Wednesday, March 17, 2021 10:00 AM

End time

Wednesday, March 24, 2021 12:00 PM

Notice submitted

Wednesday, March 24, 2021 11:59 AM