Castle.Core lock contention issue

A while back, a customer reported that their site’s performance was very “stop and go”. Suddenly, it would just stop delivering web pages. Neither the database, web servers nor any other infrastructure components seemed to be experiencing any excessive stress. If anything, the database was more idle than normal.

I asked them to run my colleague Mattias Lövström’s excellent StackDump tool while the site was not performing. You give it a .NET process Id on the command line, and it outputs a wall of text with the stack traces of all the managed threads. You use it like this:

where 3922 would be the w3wp.exe’s Process Id. The generated file contains all the managed stack traces in the specified application. This gives you a really good idea of what the application is doing, or not doing. For this application, the file that StackDump generated grew to 3MB in size(!). Browsing through the managed threads, they all seemed to be stuck in an identical call stack:

with the first few lines being a call to either EnterMyLockSpin or WaitOneNative. The highlighted line, ObtainProxyType in the BaseProxyGenerator of Castle.Core, is the method that tries to acquire the lock.

The lock that they try to enter is the ReaderWriterLockSlim’s Upgradeable Read Lock, and it’s taken on the first line in ObtainProxyType (pre 3.3.1):

What’s important to know is that Upgradeable Read Locks are mutually exclusive: only one thread at a time can hold one. They do, however, allow other threads to hold Read Locks concurrently. Since no Read Locks are taken, in this method or anywhere else in the code, no two threads can be inside the core implementation of ObtainProxyType at the same time!
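To make the lock semantics concrete, here is a minimal sketch (not taken from Castle.Core; the class and method names are made up for illustration). One thread holds the Upgradeable Read Lock while a second thread probes, without waiting, whether it can take a second Upgradeable Read Lock and whether it can take a plain Read Lock.

```csharp
using System;
using System.Threading;

// Hypothetical demo class, not part of Castle.Core.
public static class UpgradeableLockDemo
{
    // Reports what a second thread can acquire while the calling thread
    // holds the upgradeable read lock.
    public static (bool SecondUpgradeable, bool Read) Probe()
    {
        using (var rwLock = new ReaderWriterLockSlim())
        {
            bool upgradeable = false;
            bool read = false;

            rwLock.EnterUpgradeableReadLock();
            try
            {
                // The probes must run on another thread: the lock tracks
                // ownership per thread and does not allow recursion by default.
                var probe = new Thread(() =>
                {
                    // Timeout 0 means "try once, do not wait".
                    upgradeable = rwLock.TryEnterUpgradeableReadLock(0);
                    if (upgradeable) rwLock.ExitUpgradeableReadLock();

                    read = rwLock.TryEnterReadLock(0);
                    if (read) rwLock.ExitReadLock();
                });
                probe.Start();
                probe.Join();
            }
            finally
            {
                rwLock.ExitUpgradeableReadLock();
            }

            return (upgradeable, read);
        }
    }
}
```

Calling UpgradeableLockDemo.Probe() should report that the second upgradeable attempt fails while the read attempt succeeds, which is exactly why a method that only ever takes the Upgradeable Read Lock ends up fully serialized.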

I reworked the code so that it minimizes the time spent holding exclusive locks. This is how it turned out:

The method first takes a Read Lock (line 4) and checks the cache. Failing that, it enters an Upgradeable Read Lock (line 15) and re-checks the cache. The re-check is needed because, while this thread was waiting for the Upgradeable Read Lock, another thread may have just completed the proxy generation for the requested type.

If the re-check also misses, the thread starts the proxy generation while still holding the Upgradeable Read Lock. Other threads continue to get cache hits while the missing proxy type is being generated, since the first cache check only needs a Read Lock. Only when the proxy generation is complete and the cache needs to be written to is the Upgradeable Read Lock upgraded to a write lock (line 34). At that point, any new Read Locks (the ones taken by other threads passing line 4) are blocked, and the currently held Read Locks drain. When no other thread holds a lock, the upgrade succeeds, the cache is written to and the lock is released.
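The same three-phase pattern can be sketched on a generic cache. This is an illustration under assumptions, not the actual Castle.Core patch: ReadMostlyCache, GetOrAdd and the factory delegate are hypothetical names, and the numbered comments refer to the phases described above, not to line numbers in the original listing.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical stand-in for the proxy type cache; not Castle.Core's code.
public sealed class ReadMostlyCache<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> items = new Dictionary<TKey, TValue>();
    private readonly ReaderWriterLockSlim rwLock = new ReaderWriterLockSlim();

    public TValue GetOrAdd(TKey key, Func<TKey, TValue> factory)
    {
        TValue value;

        // 1. Fast path: a shared read lock, so cache hits never serialize.
        rwLock.EnterReadLock();
        try
        {
            if (items.TryGetValue(key, out value)) return value;
        }
        finally
        {
            rwLock.ExitReadLock();
        }

        // 2. Slow path: only one thread at a time gets past this point,
        //    but readers in phase 1 are still let through.
        rwLock.EnterUpgradeableReadLock();
        try
        {
            // Re-check: another thread may have added the entry while
            // this thread was waiting for the upgradeable lock.
            if (items.TryGetValue(key, out value)) return value;

            // The expensive work (proxy generation in the article) runs
            // without blocking readers.
            value = factory(key);

            // 3. Upgrade to a write lock only for the brief cache update.
            rwLock.EnterWriteLock();
            try
            {
                items[key] = value;
            }
            finally
            {
                rwLock.ExitWriteLock();
            }

            return value;
        }
        finally
        {
            rwLock.ExitUpgradeableReadLock();
        }
    }
}
```

The trade-off in this sketch is that two different missing keys are still generated one after the other, because the Upgradeable Read Lock is held during generation. In exchange, the same key is never generated twice, and cache hits stay concurrent.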

Who is affected by this?

TL;DR

Any CMS + PageTypeBuilder site, any CMS site (≥ 7 && < 9.6.1), and anything else using Castle.Core prior to 3.3.1.

Details

Because of the way NuGet’s dependency resolution algorithm works, it will by default install the lowest version of a dependency that falls within the allowed version range. Up until the release of Episerver CMS 9.6.1 on the 25th of January, 2016, the requirement on Castle.Core was (≥ 3.2 && < 4.0). In practice, this meant that Castle.Core 3.2.0 was installed.

Now, when you upgrade to or install Episerver CMS 9.6.1, NuGet will install a version of Castle.Core that includes the bug fix. The fix was released in Castle.Core 3.3.1.

Joel Abrahamsson and Lee Crowe’s PageTypeBuilder (NuGet page) has a dependency on Castle.Core ≥ 3.0.0.4001, so CMS 6/6R2 users with PageTypeBuilder might experience this one too. In fact, because of the way PageTypeBuilder replaces the untyped PageData instances, the problem is even more pronounced there. Every time a PageData is requested from Episerver’s DataFactory class, an event is fired, and PageTypeBuilder obtains a proxy type from Castle and returns a newly created strongly typed instance. None of the strongly typed instances created by PageTypeBuilder are cached.

Mitigation

The panic fix is to simply drop the updated Castle.Core DLL into your bin folder and update the bindingRedirect section in your web.config. You don’t need to recompile the application. Note that the assemblyVersion and fileVersion of a file may differ; the assemblyVersion for Castle.Core.dll with fileVersion 3.3.3.58 is 3.3.0.0. If you upgrade with NuGet, this will be done automatically for you. Otherwise, you may have to add/tweak the bindingRedirects:
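As a rough sketch of what such a redirect can look like, using the assemblyVersion 3.3.0.0 mentioned above: the oldVersion range is an assumption, and the publicKeyToken shown is the one Castle.Core is commonly signed with, so verify both against the DLL you actually deploy.

```xml
<configuration>
  <runtime>
    <assemblyBinding xmlns="urn:schemas-microsoft-com:asm.v1">
      <dependentAssembly>
        <!-- Assumption: check the token against your own Castle.Core.dll -->
        <assemblyIdentity name="Castle.Core" publicKeyToken="407dd0808d44fbdc" culture="neutral" />
        <!-- newVersion is the assemblyVersion, not the fileVersion -->
        <bindingRedirect oldVersion="0.0.0.0-3.3.0.0" newVersion="3.3.0.0" />
      </dependentAssembly>
    </assemblyBinding>
  </runtime>
</configuration>
```

If your web.config already has an assemblyBinding element, add only the dependentAssembly entry to it rather than a second runtime section.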

Supporting tools

Here’s a little PowerShell script that helps gather stack traces from misbehaving web applications. You just call it like so:

The appPools parameter names the application pools to capture stack traces for. If omitted, the script captures stack traces for all running worker processes. The howMany parameter sets how many stack traces to capture, and interval is the number of milliseconds between them (default 10000 ms, that is, 10 seconds). Set the filePath parameter if you want the output files written to a directory other than the current one. Lastly, the app parameter tells the script which stackdump.exe to use (it defaults to the one in the current working directory; choose the x86 or x64 build to match your target processes).
