Playing with PLINQ Performance using the StackOverflow Data Dump

Not having made use of PLINQ in an actual product yet, I decided to have a play with how it works, and to try and obtain my own small metrics on it’s performance benefits. PLINQ is part of a larger push from the .NET teams at Microsoft to get concurrent/parallel processing out of the box in your C# and VB.NET code. As for performance analysis there are already some great posts out there, not just from the Parallel Team at MS but also from great breakdowns with nice charts such as this.

Right off the bat, I’d like to stress that adding a .AsParallel() to your code won’t magically speed it up. Knowing this I still had unrealistic expectations when I began creating a demo to specifically show performance improvements. Often enough the level of processing I was performing (even on larger sets of data), did not benefit from being made concurrent across 2 cores. A level of variation in my results, leads me to believe part of the issue is also the ability to obtain enough resources to make effective use of 2+ cores. For example running out of the 4GB ram I have available, interference from other processes on the machine (Firefox, TweetDeck, virus scanner).

In my attempts at re-creating the “Baby Names” demo Scott Hanselman previewed at the 2009 NDC Conference in his great presentation: “Whirlwind Tour of .NET 4“. I first got a hold of the preview code samples back from 2008 for PLINQ that were part of the Parallel Extenstions CTP.

I then went on to from scratch create my own simple PLINQ – Windows Presentation Foundation (WPF) application.

I chose WPF to test a small feature I hadn’t made use of yet only because I happened to stumble upon it on that day; Routed Events see this StackOverflow question.

Once I completed my take on a LINQ processing demo based on 2 minutes of video showing the operation of ‘Baby Names’, I discovered (by accident*) the Visual Studio 2010 and .NET Framework 4 Training Kit – May Preview, which contains the demo code for what I was trying to re-create.

*The accident in which I discovered the Training Kit, was I actually performed a google image search on the term ‘PLINQ’ to see what came up for ideas for a graphic to add to this post. The 11th image (centre screen) was the baby name graph displayed in theWhirlwind Tour of .NET 4 presentation. The post that had the image was from Bruno Terkaly, the post was about the tool kit, great!

VS 2010 Training Kit May Preview

VS 2010 Training Kit May Preview

None the less, my not-as polished demo application, makes use of the StackOverflow creative commons data dump (actually the Sep 09 drop).

Some background: I grabbed the StackOverflow data dump via the LegalTorrents link, then I followed this great post from Brent Ozar, where he supplies code for 5 stored procedures to create a table schema and import the XML data into SQL Server. It was as simple as running them, and then writing 5 exec statements and waiting the ~1 hour to load the data (resulting for me in a 2.5 gig DB).

The way I structured a lengthy processing task that can benefit from parallel processing, is by making use of the Posts data (questions and answers), in particular questions with an accepted answer. I make an attempt through a repetitive simple string comparison process to determine how valid the tags on the question are, by scanning the question text for the tags, and counting frequency. Then timing the processing of sequential operation vs the parallel operation as I pipe varying levels of data into the function.

First I extract the data into memory from SQL Server (using LINQ to SQL Entities). Just a note on the specifics of the SO Data Dump structure; ‘Score’ is a nullable int so just to keep the data set down in volume I select posts that have some score and greater than a selected input (usually 10+ at least 1 person liked it), same with a reasonable amount of views (on average 200+).

private IEnumerable<Post> GetPosts(int score, int views)
   var posts = from p in db.Posts
           where (p.Score ?? 0) > score
           && p.ViewCount > views
           select p;

   return posts.ToList();

The next step was to create a function that would take some time to process, and of course potentially benefit from being run in parallel. Each post and it’s tags are operated in isolation, so this is clearly prime for separation over multiple cores. Sadly my development laptop only has 2 cores.

private bool IsDescriptive(Post p)
   //lengthy boring code
   //pseudocode instead:

   var words = extract_all_unique_words_from_the_post();
     //excluding punctuation
     //and other formatting details (markup).

   var tags = extract_tags_from_post();

   return were_the_tags_used_enough_in_post(words, tags);

Note: A more sophisticated algorithm here could help actually determine (and recommend) more appropriate tags based on word frequencies, but that’s beyond what I have time to implement for performance testing purposes. It would need to know to avoid common used words such as ‘the’, ‘code’, ‘error’, ‘problem’, ‘unsure’, etc (you get the point). It would then need to go further and know what words actually make sense to describe the technology (language/environment) the stack overflow question is about.

The Parallel operation is applied to a ‘Where’ filtering of data and this is where the timing and the reporting of the performance is based on. Making use of System.Diagnostics.StopWatch.

//running sequentially:
posts.Where(p => IsDescriptive(p));

// vs making use of parallel processing:
posts.AsParallel().Where(p => IsDescriptive(p));

On average this function making use of .AsParallel(), for varying records quantities from 100k to 300k would result in a 1.75 times speed up over the function operating sequentially on a single core. Which is what I was hoping to see.

All this was performed on a boot from VHD instance of Windows 7 (a great setup guide by Scott Hanselman here) with Visual Studio 2010 Beta 1 and SQL Server 2008, so I do understand there was some performance hit (both running as a VHD and having SQL on the same machine) but on average for effective PLINQ setup functions there was at least a 1.6 times factor speed up.

It’s that simple; it doesn’t do much yet, but there is potential for improved/more-interesting data analysis and performance measuring of it too. I will make time to clean up the demo application and post the solution files in a future post, so stay tuned for that. When I get a chance I’ll also try to investigate more of the data manipulations people are performing via data mining techniques and attempt to re-create them just for more performance tests. When I do I’ll be starting here.

That’s it, I’ll have a follow up post with some more details in particular the types of queries I had that did not benefit from PLINQ once I get a chance to determine how they were flawed or if they simply just run better on a single thread/core.

The source code for the demo app is available up on GitHub.


7 thoughts on “Playing with PLINQ Performance using the StackOverflow Data Dump

  1. […] much memory are you currently using? Following on from my previous post about a demo PLINQ application, I had some small discoveries about memory usage and wanted to blog […]

  2. […] I thought I would post a simple follow up to my Sep 2009 entry about PLINQ and the Stack Overflow data-dump where at the time I was using a Core 2 Duo at 2.53 Ghz to run PLINQ vs LINQ speed-up […]

  3. Glad my post on importing was helpful. There’s an even faster way to import the XML files now using Sam’s SoSlow.exe tool. You give it a connection string (including the database name) and it’ll create the tables and import the data. Just FYI – it doesn’t warn you, but it does delete and recreate the import tables every time. It’s dramatically faster too.

  4. […] Playing with PLINQ Performance using the StackOverflow Data Dump […]

  5. […] my initial post where the core of what I was doing was outlined, at the time the popular (and quickly found) option […]

  6. […] My post covering the main objective of PLINQ tests on the StackOverflow dataset […]

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s