How much memory are you currently using?

Following on from my previous post about a demo PLINQ application, I made some small discoveries about memory usage and wanted to blog them.

As I was loading large chunks of data into memory and monitoring my little demo application's RAM footprint, I discovered that even though the garbage collector had already cleaned up my now out-of-scope variables, the RAM utilisation reported to Windows did not necessarily drop.

This, it turns out, is quite a logical thing to happen and works this way for good reason; I just hadn't thought about it. I was simply expecting memory being freed in my code to instantly translate into freed memory in Windows.

The process was just holding onto a reserved amount of memory, the .NET runtime assuming I would soon need that space again, and it was correct in its assumption; I did need that memory very soon.
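If you want to watch this gap for yourself, a minimal sketch along these lines (purely illustrative, not code from the demo app) compares what the garbage collector thinks is allocated on the managed heap against the working set Windows reports for the process:

using System;
using System.Diagnostics;

static class MemoryReport
{
   // Compares the managed heap size with the working set Windows reports.
   // The two can diverge: the GC may have collected objects while the
   // runtime keeps the freed pages reserved for future allocations.
   public static void Print(string label)
   {
      long managedBytes = GC.GetTotalMemory(false);
      long workingSetBytes = Process.GetCurrentProcess().WorkingSet64;

      Console.WriteLine("{0}: managed heap ~{1:N0} MB, working set ~{2:N0} MB",
          label,
          managedBytes / (1024 * 1024),
          workingSetBytes / (1024 * 1024));
   }
}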

Use More Memory!

To prove this: subsequent button clicks to load more data (roughly the same volume) didn't make the application's memory footprint grow any larger. Another interesting thing I noted, with the Task Manager window open the whole time, was that when I loaded additional applications and my demo app didn't need the memory, it would free up the few hundred MB it had hold of but wasn't currently making use of.

Demo Application Memory Usage

Playing with PLINQ Performance using the StackOverflow Data Dump

Not having made use of PLINQ in an actual product yet, I decided to have a play with how it works, and to try to obtain my own small metrics on its performance benefits. PLINQ is part of a larger push from the .NET teams at Microsoft to get concurrent/parallel processing out of the box in your C# and VB.NET code. As for performance analysis, there are already some great posts out there, not just from the Parallel Team at MS but also great breakdowns with nice charts such as this.

Right off the bat, I'd like to stress that adding a .AsParallel() to your code won't magically speed it up. Knowing this, I still had unrealistic expectations when I began creating a demo specifically to show performance improvements. Often enough the level of processing I was performing (even on larger sets of data) did not benefit from being made concurrent across 2 cores. The variation in my results leads me to believe part of the issue is also the ability to obtain enough resources to make effective use of 2+ cores; for example, running out of the 4 GB of RAM I have available, or interference from other processes on the machine (Firefox, TweetDeck, virus scanner).

In my attempts at re-creating the "Baby Names" demo Scott Hanselman previewed at the 2009 NDC Conference in his great presentation "Whirlwind Tour of .NET 4", I first got hold of the 2008 preview code samples for PLINQ that were part of the Parallel Extensions CTP.

I then went on to create, from scratch, my own simple PLINQ – Windows Presentation Foundation (WPF) application.

I chose WPF to test a small feature I hadn't made use of yet, only because I happened to stumble upon it that day: Routed Events (see this StackOverflow question).

Once I completed my take on a LINQ processing demo based on 2 minutes of video showing the operation of ‘Baby Names’, I discovered (by accident*) the Visual Studio 2010 and .NET Framework 4 Training Kit – May Preview, which contains the demo code for what I was trying to re-create.

*The accident in which I discovered the Training Kit: I performed a Google image search on the term 'PLINQ' for ideas for a graphic to add to this post. The 11th image (centre screen) was the baby name graph displayed in the Whirlwind Tour of .NET 4 presentation. The post that had the image was from Bruno Terkaly, and it was all about the Training Kit. Great!

VS 2010 Training Kit May Preview

Nonetheless, my not-as-polished demo application makes use of the StackOverflow Creative Commons data dump (actually the Sep 09 drop).

Some background: I grabbed the StackOverflow data dump via the LegalTorrents link, then followed this great post from Brent Ozar, where he supplies code for 5 stored procedures to create a table schema and import the XML data into SQL Server. It was as simple as running them, writing 5 exec statements, and waiting the ~1 hour for the data to load (resulting, for me, in a 2.5 GB database).

The way I structured a lengthy processing task that can benefit from parallel processing is by making use of the Posts data (questions and answers), in particular questions with an accepted answer. Through a repetitive, simple string-comparison process I attempt to determine how valid the tags on a question are, by scanning the question text for the tags and counting frequency. I then time the sequential operation against the parallel operation as I pipe varying amounts of data into the function.

First I extract the data into memory from SQL Server (using LINQ to SQL entities). A note on the specifics of the SO data dump structure: 'Score' is a nullable int, so to keep the data set down in volume I select posts whose score is greater than a selected input (usually 10+, so at least a few people liked it), and similarly for a reasonable number of views (on average 200+).

private IEnumerable<Post> GetPosts(int score, int views)
{
   // Treat a null Score as zero, and only pull back posts above the
   // requested score and view thresholds.
   var posts = from p in db.Posts
               where (p.Score ?? 0) > score
                  && p.ViewCount > views
               select p;

   // ToList() forces the query to execute now, pulling the data into memory.
   return posts.ToList();
}

The next step was to create a function that would take some time to process, and of course potentially benefit from being run in parallel. Each post and its tags are operated on in isolation, so this is clearly prime for separation over multiple cores. Sadly my development laptop only has 2 cores.

private bool IsDescriptive(Post p)
{
   //lengthy boring code
   //pseudocode instead:

   var words = extract_all_unique_words_from_the_post();
     //excluding punctuation
     //and other formatting details (markup).

   var tags = extract_tags_from_post();

   return were_the_tags_used_enough_in_post(words, tags);
}
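For the curious, the 'lengthy boring code' boils down to something like the following simplified sketch. It assumes the Post entity exposes Title, Body and Tags as plain strings (per the data dump import), and the real version strips markup a little more carefully:

private bool IsDescriptive(Post p)
{
   // Break the question text into lower-cased words, dropping common
   // punctuation and markup characters along the way.
   var separators = new[] { ' ', '\t', '\r', '\n', '.', ',', ';', ':',
                            '<', '>', '/', '(', ')', '"', '\'' };
   var words = (p.Title + " " + p.Body)
       .ToLowerInvariant()
       .Split(separators, StringSplitOptions.RemoveEmptyEntries);

   // Tags are stored in the form "<c#><linq><performance>".
   var tags = (p.Tags ?? string.Empty)
       .ToLowerInvariant()
       .Split(new[] { '<', '>' }, StringSplitOptions.RemoveEmptyEntries);

   if (tags.Length == 0) return false;

   // Count how often the tags appear in the question text, and call the
   // post "descriptive" if on average each tag is mentioned at least once.
   int mentions = tags.Sum(tag => words.Count(w => w == tag));
   return mentions >= tags.Length;
}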

Note: A more sophisticated algorithm here could actually determine (and recommend) more appropriate tags based on word frequencies, but that's beyond what I have time to implement for performance testing purposes. It would need to know to avoid commonly used words such as 'the', 'code', 'error', 'problem', 'unsure', etc. (you get the point). It would then need to go further and know what words actually make sense to describe the technology (language/environment) the StackOverflow question is about.

The parallel operation is applied to a 'Where' filtering of the data, and this is what the timing and performance reporting is based on, making use of System.Diagnostics.Stopwatch.


//running sequentially:
posts.Where(p => IsDescriptive(p));

// vs making use of parallel processing:
posts.AsParallel().Where(p => IsDescriptive(p));
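
The timing harness around these two calls is nothing fancy, roughly the following (variable names here are just for illustration; the Count() calls force the deferred queries to actually execute inside the timed block):

var timer = System.Diagnostics.Stopwatch.StartNew();
int sequentialMatches = posts.Where(p => IsDescriptive(p)).Count();
timer.Stop();
long sequentialMs = timer.ElapsedMilliseconds;

timer = System.Diagnostics.Stopwatch.StartNew();
int parallelMatches = posts.AsParallel().Where(p => IsDescriptive(p)).Count();
timer.Stop();
long parallelMs = timer.ElapsedMilliseconds;

// The two counts should match; the interesting number is the ratio,
// which the WPF UI displays as the speed-up factor.
double speedUp = (double)sequentialMs / parallelMs;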

On average, for record quantities varying from 100k to 300k, the version making use of .AsParallel() resulted in a 1.75 times speed-up over the function operating sequentially on a single core, which is what I was hoping to see.

All this was performed on a boot-from-VHD instance of Windows 7 (there's a great setup guide by Scott Hanselman here) with Visual Studio 2010 Beta 1 and SQL Server 2008, so I do understand there was some performance hit (both from running off a VHD and from having SQL on the same machine), but on average the queries that suited PLINQ still showed at least a 1.6 times speed-up.

It's that simple; it doesn't do much yet, but there is potential for more interesting data analysis and for measuring its performance too. I will make time to clean up the demo application and post the solution files in a future post, so stay tuned for that. When I get a chance I'll also investigate more of the data manipulations people are performing via data mining techniques and attempt to re-create them for further performance tests. When I do, I'll be starting here.

That's it. I'll have a follow-up post with some more details, in particular the types of queries I had that did not benefit from PLINQ, once I get a chance to determine how they were flawed or whether they simply run better on a single thread/core.

Update:
The source code for the demo app is available up on GitHub.

Quarterly Technology Briefing, Melbourne Q3 2009

This morning I attended the Thoughtworks Melbourne Quarterly Technology Briefing event at the Westin.

It was an interesting session with 2 key speakers. The format was basically 2 presentations of approximately 40 minutes each, each presenter giving their take on the single topic of 'Agile Governance' and how adopting it improved their respective organisations: delivering better quality products and meeting their specific process improvement needs. The speakers were introduced by Lindy Stephens, Professional Services Manager at Thoughtworks, who gave a 10 minute introduction to some history around Agile.

First up was Nigel Dalton, GM of IT at Lonely Planet, whose presentation focussed on how agile was the most logical and most valuable shift Lonely Planet undertook a few years ago. Nigel was a terrific presenter with solid and interesting metaphors for the process.

His first slide made reference to a historical event involving a "burning boat", with a nice image of a large ship engulfed in flames. This extreme situation served as a parallel to committing soldiers (consultants/developers/resources) to a battlefield (client engagement) and burning the transport mechanism to prevent a revolt (no choice, no turning back, keep going forward). This has some scary truths in how an organisation will commit so extensively that it becomes next to impossible to withdraw (pause and regroup), even when the logical and wise choice is to withdraw and try again; of course not always at the highest level (i.e. the entire project), but for aspects, features and deliverables.

As a side note, the burnt boats scenario was a reference to a documented historical event involving the conquistador Hernán Cortés during the Spanish conquest of the Aztec Empire. The concept was summed up well in a discussion with my friend the Urban Mariner:

Conquistadors were different from regular armies, in that they operated more like privateers. They were motivated purely by the promise of a share of the spoils, rather than service to the crown. As such, a captain maintained control only by exerting control over the prize. Cortés 'burning' his ships motivated his men to keep their eyes on the mission, as that was their only hope of escaping and claiming their share.

Nigel went on to discuss one of the greatest hurdles to overcome on the path to successful agile adoption: combating ingrained "illusions of control". His recommendation is to prove to stakeholders that the improved transparency of agile will be of far greater benefit, in that there will be more timely and accurate information, and no forced 'spin' on the results in a single monthly status meeting. A downside of this (initially) is the level of criticism invited by having everyone able to see detailed metrics such as daily burn-down and velocity, but this should not be a disincentive to proceed with agile adoption. The advice here is to be patient and bring those willing to be involved as deep into the process as you can spare time for.

A key concern for those resisting agile adoption, in particular the lawyers and bean counters (accountants), is often the question: where are the guarantees? Nigel described a (still not completely resolved) situation arising from a long-standing practice of accountants: the desire to classify a project/product as "finished" and "final" purely to simplify their accounting, so they can begin depreciating the value of that asset. The challenge here is to shift the understanding and help revise other business processes to better suit what is required under an agile methodology, which is at its heart forever iterative.

The final comment from Nigel was a 'yes' to a simple question about using outsourcing in conjunction with an agile delivery model. It had proven workable in their situation, for their style of product, by making use of technology to still perform the necessary aspects of running agile: scrums, storyboards, cards, etc. They were able to successfully work with a team with a 2 hour time difference, though potentially larger delays could hinder necessary activities like scrum sessions.

The second speaker was Josh Melville, Executive Manager at Suncorp. Josh discussed how Suncorp undertook a large, complex corporate consolidation involving 9 separate divisions obtained through various mergers and acquisitions, and how adopting a corporate-wide agile approach to all aspects of their business allowed them to consolidate and improve key processes. Moving a large corporate mesh of business units onto one single method required buy-in from all levels, and Josh outlined that the greatest challenges were in fact people. People were also a concern for Nigel, who made the bold but accurate statement that a key issue was "quality of talent".

The approach Suncorp took was to allow their business units to be self-organising and self-directing in line with a larger business plan. Josh went further to outline how buy-in from typically resistant parts of the business, i.e. "Risk & Compliance" departments, was essential. Supporting the claim made by Nigel, once these departments were made aware of the increased visibility, it allowed for much greater risk assessment through such agile tactics as investigating, building and testing the components with the greatest risk first, allowing artefacts of a project that will not succeed, or haven't been thought out well enough, to be dropped before they cause damage (waste resources). As a result Suncorp was able to get more extensive buy-in from a broader base of their business.

Key Takeaways:

  • Agile will work if there is complete buy-in.
  • People need to be made aware of the benefits and allowed to adapt in time to the changes.
  • Old thinking of "final deadline", "total cost" and "total time" is no longer valid.
  • It is absolutely necessary to have people questioning the flow and hitting an imaginary alarm button to get attention.
  • Stay lean.
  • Illusion of control is dangerous and detrimental.
  • There is immense value in true transparency.
  • Just like the coding practice: fail early, fail fast and fail cheaply, then recover and move forward.

LINQ Basics (Part 2) – LINQ to SQL in .NET 4.0

Continuing my 2 part series on LINQ basics, here I will discuss a few improvements to LINQ to SQL coming as part of the upcoming .NET 4.0 release. A key post on the 4.0 changes is by Damien Guard, who works on LINQ (among other products) at Microsoft. This lets me cover some of the changes, but also go into a bit more detail, extending the "basics discussion" started in the previous post.

Note: In October 2008 there was a storm of discussion about the future of LINQ to SQL in response to Microsoft's shift of focus towards the Entity Framework; the key reference post for this is again one by Damien G.

The upcoming 4.0 release doesn't damage the capability of LINQ to SQL as people may have worried (i.e. no disabling/deactivation in light of the discussions of October 2008). There are actually a reasonable number of improvements, though as always there's a great deal of potential improvement that there was not the capacity/desire to implement for a technology set that is no longer a major focus. LINQ to SQL is still capable enough to be used for business systems.

First up there are some performance improvements, in particular surrounding caching of lookups and query plans in its interaction with SQL Server. There are a few, but not all, of the desired improvements in the class designer subsystem, including fixes for some flaws with data types, precision, foreign key associations and general UI behaviour.

There is a discussion on Damien’s post about 2 potential breaking changes due to bug fixes.

  1. A multiple foreign key association issue – which doesn’t seem to be a common occurrence (in any business systems I’ve been involved in).
  2. A call to .Skip() with a zero input: .Skip(0) is no longer treated as a special case.

A note on .Skip(): such a call is translated into a subquery with the SQL NOT EXISTS clause. This comes in handy with the .Take() function to achieve a paging effect on your data set. There seems to be some discussion of potential performance issues on large data sets; nothing conclusive came up in my searches, but this post by Magnus Paulsson was an interesting investigation.

As for the bug fix: it does seem logical to have treated zero as a special case, possibly simplifying code that might be passing a zero (or an equivalent empty set), but if it was a case where .Skip() with a value greater than zero would be invalid, it will now also fail for zero, forcing you to improve the query or its surrounding logic.

Just for the sake of an example, here is .Skip() & .Take(). Both methods have an alternate use pattern too: SkipWhile() and TakeWhile(). The 'while' variants take a predicate as input instead of a count (there's a quick in-memory illustration of them after the paging example below).

var sched = (from p in db.Procedures
             where p.Scheduled == DateTime.Today
             select new {
                      p.ProcedureType.ProcedureTypeName,
                      p.Treatment.Patient.FullName
                     }
            ).Skip(10).Take(10);
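
Since, as far as I'm aware, the LINQ to SQL provider won't translate the 'while' variants to SQL, here's a quick LINQ to Objects illustration of what they do:

// SkipWhile/TakeWhile illustrated on an in-memory sequence:
var scores = new[] { 55, 42, 17, 9, 3, 28, 60 };

// Keep taking items while the predicate holds, stopping at the first failure:
var leadingHighScores = scores.TakeWhile(s => s > 10);   // 55, 42, 17

// Or skip items while the predicate holds, then keep everything after:
var theRest = scores.SkipWhile(s => s > 10);             // 9, 3, 28, 60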

There are also improvements to SqlMetal, the command line code generation tool. As a side note, to make use of SqlMetal to generate a DBML file for a SQL Server database, execute a command like this in a Visual Studio command prompt:

    sqlmetal /server:SomeServer /database:SomeDB /dbml:someDBmeta.dbml

LINQ Basics

As part of preparation work I'm doing for a presentation on LINQ and PLINQ to the Melbourne Patterns & Practices group in October, I thought I would cover off the basics of LINQ in .NET 3.5 and 3.5 SP1 in this post, the changes in .NET 4.0 in the next post, and then a discussion of PLINQ in a third post.

OK, so let's sum up the basics. The core concept of LINQ is to easily query any kind of data you have access to in a type-safe fashion, be it stored in SQL Server, a collection (i.e. something implementing IEnumerable) or an XML data structure. A further addition to this power is the ability, through C# 3.0, to create projections onto new anonymous types on the fly. See the first code sample below for a projection; in that simple example it's the creation of an anonymous type that has 2 properties.

Continuing on with my “medical theme” used for the WCF posts, here is a simple schema layout of the system, consisting of a patient and a medical treatment/procedure hierarchy. This is given the title of ‘MedicalDataContext’ to be used in our LINQ queries.

Medical System Basic Schema

These items have been dragged onto a new ‘LINQ to SQL Classes’ diagram from the Server Explorer > Data Connections view of a database.

Server Explorer Window

To create the ‘LINQ to SQL Classes’ diagram simply add …

Add New Item


a new …
Linq to SQL Classes

Back to the logic. We have a patient who's undergoing a certain type of treatment, and that treatment has associated procedures. To obtain a collection of procedures for today, and the name of the patient who will be attending, we simply build up a query as such:

var db = new MedicalDataContext();

var sched = from p in db.Procedures
            where p.Scheduled == DateTime.Today
            select new {
                     p.ProcedureType.ProcedureTypeName,
                     p.Treatment.Patient.FullName
                    };

Note: The patient table structure doesn't have a field called 'FullName'; I make use of a partial class to add a read-only representation. Check out this post by Chris Sainty for more info on making use of partial classes with LINQ.
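
For completeness, the partial class is roughly this (a sketch assuming the generated Patient entity maps FirstName and LastName columns; adjust to the actual column names):

// Lives alongside the LINQ to SQL generated code: same namespace, same class name.
public partial class Patient
{
   // Read-only convenience property combining the mapped name columns.
   public string FullName
   {
      get { return FirstName + " " + LastName; }
   }
}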

At this point we can now iterate over each item in our ‘sched’ (scheduled procedures) collection.

foreach (var procedure in sched)
{
   //process/display/etc
}

This brings me to another key point: 'delayed execution' (or 'deferred execution'). Check out Derik Whittaker's blog post Linq and Delayed execution for a more detailed walkthrough.

Basically the query we defined earlier is only a representation of the possible result set. Therefore when you first make a call to operate on the variable representing the query results, that’s when execution will occur.
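
A tiny LINQ to Objects example (separate from the medical schema) shows the same idea:

var numbers = new List<int> { 1, 2, 3 };
var evens = numbers.Where(n => n % 2 == 0);   // nothing executes yet

numbers.Add(4);                               // modify the source afterwards

Console.WriteLine(evens.Count());             // 2 -- the query runs here,
                                              // so it sees the 4 as well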

So it becomes a program flow decision whether to execute the query live each time it's processed (i.e. for the most up-to-date data) or to force a single execution and then reuse that data in the current process flow. A forced execution can be achieved several ways; the simplest is to create a new list via ToList(), which executes the query and fetches the data.

var allProcedures = sched.ToList();

So far this has all revolved around accessing data in a SQL Server Database (the setup of a LINQ to SQL class). LINQ’s purpose is to be able to query any form of collection of data.

Now let's say we wanted to obtain some information about a class by interrogating it using reflection.

Note: a business case for the use of reflection is often tied very deeply to some limitation, special case, etc., so a more specific example would take us well away from the topic of LINQ. I'll keep this example trivial; the same approach could just as easily interrogate a more useful class.

var instanceMethods =
    from m in typeof(string).GetMethods()
    where !m.IsStatic
    orderby m.Name
    group m by m.Name into g
    select new
      {
         Method = g.Key,
         Overloads = g.Count()
      };

Which, written out element by element, generates this output:

  { Method = Clone, Overloads = 1 }
  { Method = CompareTo, Overloads = 2 }
  { Method = Contains, Overloads = 1 }
  { Method = CopyTo, Overloads = 1 }
  { Method = EndsWith, Overloads = 3 }
  . . .

For an off-topic discussion relating to the use of 'typeof' in my query above, check out Vance Morrison's post titled Drilling into .NET Runtime microbenchmarks: 'typeof' optimizations.

For a more typical business case, here's an example using a LINQ statement to fetch a collection of controls and operate on them (in this case disabling buttons). Here we're operating on a Control collection.

Panel rootControl;

private void DisableVisibleButtons(Control root)
{
   // OfType<Control>() turns the non-generic ControlCollection into a
   // typed sequence we can query with LINQ.
   var controls = from r in root.Controls.OfType<Control>()
                  where r.Visible
                  select r;

   foreach (var c in controls)
   {
       if (c is Button) c.Enabled = false;
       DisableVisibleButtons(c);  // recursive call into child controls
   }
}

//kick off the recursion:
DisableVisibleButtons(rootControl);

Summary:

  • LINQ to SQL generates classes that map directly to your database.
  • LINQ helps you manipulate strongly typed results (including intellisense support).
  • “LINQ to SQL” is basically “LINQ to SQL Server” as that is the only connection type it supports.
  • There is a large set of extension methods out of the box; for samples check out the 101 LINQ Samples.