Post

Collection Intersect vs. For Each Loop

In .NET framework, Visual Basic on June 12, 2011 by jwavila

Most of us are familar with a For Each loop. If you have a WinForms app with any type of collection (List(Of T), array, ListBox Items, etc) there’s a good chance you have some type of loop, maybe one comparing two collections. Did you know there is a .NET method that has some serious advantages over a loop such as this. If you’d like to learn more,

A funny thing happened to me on the way to this blog…I learned something. Two things actually!
After seeing another post in the Forums about comparing lines in a TextBox, I decided to write an article on one of my favorite, but rarely seen, VB.NET methods: Intersect. The next article will discuss it’s counterpart: Except.

In the thread the OP laid out this situation: there are two multiline TextBoxes, each of which has a list of names from an XML file. The first contained one list, and the second contained a partial list of the first. The question was how to find the names in the first TextBox that were not in the second.

Immediately, the .NET Collection Except method came to mind (which will be covered in the next article). In this article we will review the Intersect method, because that was where I found some startling information. The reason for the discovery was because after posting a solution using Except, I had a thought: “I wonder if there really is any advantage of one over the other?”.

In order to compare Intersect over a For Each loop, which others had posted, I created an app with two ListBoxes, each with randomly generated items. The code to populate the ListBoxes, along with the Class level variables used:

<code>
 Dim rndm As New Random
    Dim sw As Stopwatch

    Private Sub btnAddData_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles btnAddData.Click
        For Each lb In Controls.OfType(Of ListBox)()
            lb.Items.Clear()
        Next

        For i As Integer = 1 To 10000
            lbNames1.Items.Add("A" & rndm.Next(0, 100001))
            lbNames2.Items.Add("A" & rndm.Next(0, 100001))
        Next
        Debug.WriteLine("Item count in base collection: " & lbNames1.Items.Count)
    End Sub
</code>

I then used the Intersect method in one Button Click event and a loop in another Button Click event. I also added a StopWatch to determine the amount of time each took. I wasn’t expecting all that much difference – previously when I’ve used a loop on a collection with a few dozen items, the results came back almost immediately.

<code>
 Private Sub btnIntersect_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles btnIntersect.Click
        lbIntersectItems.Items.Clear()
        sw = New Stopwatch
        sw.Start()
        Dim arrSameLB1toLB2 As IEnumerable(Of String) = lbNames1.Items().OfType(Of String).Intersect(lbNames2.Items().OfType(Of String))
        lbIntersectItems.Items.AddRange(arrSameLB1toLB2.Cast(Of String).ToArray)
        sw.Stop()

        Label1.Text = "Matches: " & lbIntersectItems.Items.Count.ToString & vbCrLf & _
        "Elapsed: " & sw.ElapsedMilliseconds.ToString
        Debug.WriteLine("Intersect:")
        Debug.WriteLine(vbTab & "Matches: " & lbIntersectItems.Items.Count)
        Debug.WriteLine(vbTab & "Elapsed: " & sw.ElapsedMilliseconds)

        sw.Reset()
    End Sub

    Private Sub btnLoop_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles btnLoop.Click
        lbLoopItems.Items.Clear()
        sw = New Stopwatch
        sw.Start()
        For Each s As String In lbNames1.Items
            If lbNames2.Items.Contains(s) Then
                lbLoopItems.Items.Add(s)
            End If
        Next

        sw.Stop()

        Label2.Text = "Matches: " & lbLoopItems.Items.Count.ToString & vbCrLf & _
        "Elapsed: " & sw.ElapsedMilliseconds.ToString
        Debug.WriteLine("Loop:")
        Debug.WriteLine(vbTab & "Matches: " & lbLoopItems.Items.Count)
        Debug.WriteLine(vbTab & "Elapsed: " & sw.ElapsedMilliseconds)
        sw.Reset()
    End Sub
</code>

And with each ListBox containing 100 items, that was pretty much the case. However as I increased the number of items in each ListBox, the time difference grew substantially.

Here is a breakdown of the times (in milliseconds) to run each (note that I ran each 5 times, except the 100,000, which I only ran 3 times):

.                                        Intersect                       Loop

.                          Range(ms)      Avg(ms)        Range(ms)     Avg(ms)

# Items

100            .               1-1                 1                         2-5                 3

1000         .               2-2                2                      28-36             33

10000      .             34-47            40                 2099-2180      2134

100000   .           189-197          193                       ???               ??

As you can see, with only 100 items, both were comparable in average elapsed time. And even at 1000 items, while there is the beginning of a separation, an average of 33 ms for the loop is not a problem. However, once you get to 10,000 items, there is a significant difference: 40 ms for Intersect versus 2134 ms for the loop. Then at 100,000 items, Intersect still came in at a respectable 193 ms (substantially faster than the loop using 10,000 items). There are no times for the loop using 100,000 items because my app became unresponsive while running it, and after 3 tries at it and failing each time, I gave up.

That was my first interesting discovery – the time difference. The second came about because I noticed the number of matches being displayed in the labels were different. For example, in the 10,000 item test, if Intersect displayed 900 matches, the loop displayed 950. At first I was confused, then I thought “maybe Intersect is not only reporting the matches, but the distinct matches!”. To test this, I then added a LINQ query to the items returned from the loop and compared that number  to Intersect. And guess what? They were the same.

<code>
Private Sub btnLoopDist_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles btnLoopDist.Click
        lbLoopItemsDistinct.Items.Clear()
        sw = New Stopwatch
        sw.Start()
        For Each s As String In lbNames1.Items
            If lbNames2.Items.Contains(s) Then
                lbLoopItemsDistinct.Items.Add(s)
            End If
        Next
        Dim distitem = ((From d In lbLoopItemsDistinct.Items).Distinct().ToArray).Count

        sw.Stop()

        Label3.Text = "Matches: " & distitem.ToString & vbCrLf & _
        "Elapsed: " & sw.ElapsedMilliseconds.ToString
        Debug.WriteLine("Loop w/ Distinct:")
        Debug.WriteLine(vbTab & "Matches: " & distitem)
        Debug.WriteLine(vbTab & "Elapsed: " & sw.ElapsedMilliseconds)

        sw.Reset()
    End Sub
</code>

So not only is Intersect much faster the greater the number of items being compared, but it throws in a Distinct query also, free of charge.

Advertisement

One Response to “Collection Intersect vs. For Each Loop”

  1. [...] the first part of this article found here I reviewed the Collection Intersect method, explaining why it can be a valuable tool when [...]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.