.NET 4 Linq Expression Lock Convoy

A few months ago, I was involved in a performance case involving an old CMS6R2 site. They explained that the site seemed to go down for a couple of minutes and then recover on its own. During the downtime, any browser connecting to the site would be spinning at “Waiting for reply…”, and this could potentially go on for much longer than a few minutes. Their webops team had been standing by to restart the site night and day, and this was now wearing on them as well as on the business.

Conclusion

When I come home babbling about these cases, the missus complains and asks me to “get to the point, already!”. So here it is:

There’s been a relatively recent change in the .NET Framework affecting applications using System.Linq. A lock that was taken deep inside the Linq internals (System.Linq.Expressions.Expression.ValidateLambdaArgs) has been removed, eliminating yet another point of lock contention in your (web?) application.

If your web application is still running .NET Framework 4.0, consider upgrading to take advantage of new features and improved performance!

Who’s affected?

Anyone using Linq expressions on an older version of the .NET Framework where the lock has not yet been removed.

When is the lock taken?

In affected .NET versions, this lock is taken each time a Linq Expression is created in your code. I don’t know for sure, but I assume that the expression just lives for the duration of the method it is created in.

That is, each time you do something like this:
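A minimal sketch of the kind of code meant here, assuming a hypothetical ArticlePage class and a hypothetical NameOf helper (neither comes from the actual site). The point is only that a lambda passed as an Expression<...> is rebuilt as an expression tree every time the line runs:

```csharp
using System;
using System.Linq.Expressions;

// Hypothetical page type, for illustration only.
public class ArticlePage
{
    public string PageName { get; set; }
}

public static class ExpressionDemo
{
    // Hypothetical helper: reads the property name out of a lambda expression.
    public static string NameOf<T, TProp>(Expression<Func<T, TProp>> selector)
    {
        var member = (MemberExpression)selector.Body;
        return member.Member.Name;
    }

    public static void Main()
    {
        // Because the parameter type is Expression<...>, the compiler emits
        // calls to Expression.Lambda<...>(...) here. Unlike a plain delegate,
        // the expression tree is not cached, so it is rebuilt on every call.
        string name = NameOf<ArticlePage, string>(p => p.PageName);
        Console.WriteLine(name);
    }
}
```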

You will take the lock. Any other LambdaExpression will do it too, not just PageTypeBuilder properties. How do you know a lock is taken? Because, when the site hung, you would see that almost all threads were stuck waiting to acquire the lock:

If this property is frequently requested during a single page request, such as in the stack trace above where a bunch of pages from a search result are converted to strings by calling the DisplayPageName property, you will be in trouble.

What happens is that multiple threads all fight over the same lock, causing a behavior that goes under the name “lock convoy”, or simply lock contention. In this particular instance, the term lock convoy is more descriptive. The locks that the threads take are short-lived, but they are taken frequently and many threads are doing the same thing. Since they all keep getting blocked, none of the fighting threads gets to make much progress. A thread processes one page, gets stuck again trying to process the next, and passes control over to another thread, which processes one page and passes control over to the next. Your process spends more time passing control between threads (context switching) than actually letting the threads do any work.
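To make the pattern concrete, here is a rough sketch of many threads repeatedly taking one short-lived lock. All names (ConvoyDemo, Gate, Worker, Run) are made up for illustration, and whether a real convoy forms depends on the runtime version, the hardware, and the load, so treat the timings as something to experiment with rather than a guaranteed result:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class ConvoyDemo
{
    // One shared lock object that every worker fights over.
    private static readonly object Gate = new object();
    private static long _counter;

    // Each worker takes the shared lock many times, holding it very briefly.
    private static void Worker(int iterations)
    {
        for (int i = 0; i < iterations; i++)
        {
            lock (Gate)
            {
                _counter++;
            }
        }
    }

    // Runs the same total amount of work on the given number of threads.
    private static TimeSpan Run(int threadCount, int iterationsPerThread)
    {
        _counter = 0;
        var threads = new Thread[threadCount];
        var stopwatch = Stopwatch.StartNew();
        for (int t = 0; t < threadCount; t++)
        {
            threads[t] = new Thread(() => Worker(iterationsPerThread));
            threads[t].Start();
        }
        foreach (var thread in threads)
        {
            thread.Join();
        }
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    public static void Main()
    {
        const int total = 4000000;
        Console.WriteLine("1 thread:   " + Run(1, total));
        Console.WriteLine("16 threads: " + Run(16, total / 16));
    }
}
```

If the multi-threaded run is not faster than the single-threaded one (or is slower), the threads are spending their time handing the lock around instead of working.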

How do you recognize a lock convoy?

It’s a hunch, really. If you check a memory dump with WinDbg and you use SOS’ !syncblk command, you may see that there’s a free lock that dozens of threads are waiting for. Raymond Chen at Microsoft wrote about it almost a decade ago. He linked to Larry Osterman’s even older post which I recommend reading, especially the section about the boxcars:

Essentially this is the same situation that you get when you have a bunch of boxcars in a trainyard. The engine at the front of the cars starts to pull. The first car moves a little bit, then it stops because the slack between its rear hitch and the front hitch of the second car is removed. And then the second car moves a bit, then IT stops because the slack between its rear hitch and the front hitch of the 3rd car is removed. And so forth – each boxcar moves a little bit and then stops. And that’s just what happens to your threads. You spend all your valuable CPU time executing context switches between the various threads and none of the CPU time is spent actually processing work items.
— Larry Osterman

So how does a free lock with many waiters indicate a lock convoy? A free lock is… well, free. It means we’re currently in the process of giving the lock to one of the waiting threads. You are most likely to catch a process in the middle of passing a lock around if that’s what it is mostly doing. Passing locks around between multiple threads is one of the symptoms of a lock convoy.

What exactly was changed?

Previously, I guess before .NET 4.5, the lock looked like this:

https://github.com/mscottford/ironruby/blob/master/ndp/fx/src/core/microsoft/scripting/Ast/LambdaExpression.cs#L289

However, a later version of the same method looks like this:

And the TryGetValue implementation, which is where you would suspect a lock to be located, looks like this:

http://referencesource.microsoft.com/#System.Core/Microsoft/Scripting/Ast/LambdaExpression.cs,514

They have completely removed that lock. Without even touching the codebase, an upgrade of the .NET Framework fixed this issue!
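To illustrate the general idea, and not the framework’s actual code, here is a sketch that assumes the lock guarded a lookup in a shared cache, as the mention of TryGetValue suggests. The class and method names are hypothetical, and ConcurrentDictionary is my stand-in for whatever thread-safe cache the framework uses internally:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Reflection;

public static class InvokeMethodCache
{
    private static readonly Dictionary<Type, MethodInfo> Locked =
        new Dictionary<Type, MethodInfo>();

    private static readonly ConcurrentDictionary<Type, MethodInfo> LockFree =
        new ConcurrentDictionary<Type, MethodInfo>();

    // "Before" pattern: every caller takes the same lock, even on a cache hit.
    public static MethodInfo GetWithLock(Type delegateType)
    {
        lock (Locked)
        {
            MethodInfo method;
            if (!Locked.TryGetValue(delegateType, out method))
            {
                method = delegateType.GetMethod("Invoke");
                Locked[delegateType] = method;
            }
            return method;
        }
    }

    // "After" pattern: reads of keys already in the cache do not take a lock.
    public static MethodInfo GetWithoutLock(Type delegateType)
    {
        return LockFree.GetOrAdd(delegateType, t => t.GetMethod("Invoke"));
    }

    public static void Main()
    {
        Console.WriteLine(GetWithLock(typeof(Func<int, string>)).ReturnType);
        Console.WriteLine(GetWithoutLock(typeof(Func<int, string>)).ReturnType);
    }
}
```

With the second pattern, threads that hit an already cached entry never queue up behind each other, which removes the kind of hot lock that lets a convoy form.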

 
