For the last couple of days I was hunting an interesting bug (FWC-16). During our build process we run automated unit tests with NUnit. This usually works pretty well, until suddenly one of my co-workers ran into trouble. The build process just hung while running one particular test fixture. No crash, no errors, it just never finished.
We tracked down the problem and pretty soon discovered that it was related to changing the database provider string of our C++/COM OLE DB implementation. We used to specify SQLOLEDB as the provider in the connection string, but with SQL Server 2005 some of our users got crashes, apparently because the database connection somehow got lost: often after resuming from standby, but occasionally for no obvious reason. Changing the connection string to use SQLNCLI (SQL Server Native Client) as the provider seems to solve this problem. But it introduced the hang in the build.
Investigating further, I discovered that the hang happens in NUnit tests that instantiate the OLE DB COM object, which in turn instantiates the SQLNCLI provider. My first hypothesis was that it happened only in tests that also accessed the clipboard, but it turned out that it happens even without accessing the clipboard.
I wrote a test project to dig further into this problem. I tried using .NET’s OleDbConnection, and that worked without problems. I tried just instantiating our OLE DB COM object, and it hung. I wrote a simple ATL COM object that instantiates and initializes SQLNCLI, and it hung. I wrote an application that called the test and the fixture tear down methods, and it worked. Then I stepped through the code in the debugger, and it hung!
It turns out that it has to do with the STA/MTA threading model. The application had the [STAThread] attribute on the Main() method, and that caused it to hang. When I removed the attribute, it didn’t hang. The same thing turned out to be true for our NUnit tests: when I ran them in an MTA, they worked. Unfortunately that doesn’t work as a solution for our project, since our COM objects use the apartment threading model and so the tests have to run in an STA.
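To compare the two apartment types without editing the [STAThread] attribute each time, a small helper can run the same code on a thread whose apartment state is set explicitly. This is only a sketch: ApartmentRunner and its Run method are hypothetical names, not part of our test project, and STA/MTA apartments are only available on Windows.

```csharp
using System;
using System.Threading;

public static class ApartmentRunner
{
    // Hypothetical helper: runs 'body' on a new thread that was put into the
    // requested apartment, waits for it, and rethrows any failure.
    public static void Run(ApartmentState state, Action body)
    {
        Exception failure = null;
        var worker = new Thread(() =>
        {
            try { body(); }
            catch (Exception ex) { failure = ex; }
        });

        // The apartment state has to be set before the thread is started.
        worker.SetApartmentState(state);
        worker.Start();
        worker.Join();

        if (failure != null)
            throw new InvalidOperationException("Body failed on the worker thread.", failure);
    }
}
```

Calling ApartmentRunner.Run(ApartmentState.STA, ...) and then ApartmentRunner.Run(ApartmentState.MTA, ...) with the code that creates the COM object should show whether the hang depends on the apartment, much like adding or removing [STAThread] on Main() did.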
Here’s the explanation that I came up with for this behavior: SQLNCLI uses the threading model “Both”. It creates a separate thread that posts messages and expects answers back. If we use an MTA, the framework creates a “.NET SystemEvents” thread that can handle these messages. But if we use an STA, we don’t have that thread, so the main thread has to handle the messages. If the main thread is busy with something else or is waiting for something, we’re stuck.
So I looked up the documentation for STA, MTA and the COM threading models and got an idea which turned out to solve this problem: if we call CoFreeUnusedLibraries() in the fixture tear down method, then the SQLNCLI DLL gets unloaded (actually, it’s sqlnclir.rll, which you can see if you run NUnit from the debugger with unmanaged debugging turned on), and NUnit doesn’t hang.
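From managed code, CoFreeUnusedLibraries() can be reached through P/Invoke, since it is exported by ole32.dll and takes no arguments. The following is a minimal sketch rather than our actual test base class: ComCleanup and FreeUnusedLibraries are hypothetical names, and forcing a garbage collection first is an assumption of mine (COM only unloads a DLL that reports it has no live objects, so releasing the .NET wrappers beforehand seems sensible).

```csharp
using System;
using System.Runtime.InteropServices;

public static class ComCleanup
{
    // Exported by ole32.dll (Windows only); no parameters, no return value.
    [DllImport("ole32.dll")]
    private static extern void CoFreeUnusedLibraries();

    // Hypothetical helper meant to be called from a fixture tear down method.
    public static void FreeUnusedLibraries()
    {
        // Assumption, not from our original fix: run finalizers so that
        // unreferenced COM wrappers release their objects first.
        GC.Collect();
        GC.WaitForPendingFinalizers();

        // Ask COM to unload DLLs that no longer have objects in use.
        CoFreeUnusedLibraries();
    }
}
```

In an NUnit 2.x fixture this would be called from the method marked [TestFixtureTearDown] (in NUnit 3 the attribute is [OneTimeTearDown]).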
So we changed one of our test base classes to call CoFreeUnusedLibraries() and ran all test assemblies, and all of them worked except one. It seems we’re holding on to something, so that SQLNCLI or something related is still in use, which prevents sqlnclir.rll from unloading. We ended up ignoring the three tests that caused NUnit to hang and asked someone with more insight to look at those tests.