Skip navigation

Once in a while I get deathly bored of writing business code all day long and am in desperate need of some programming puzzle.  One of the sources I use is the webpages for some of the courses I TA’d for during my college years.  The entry level programming courses usually feature algorithmically interesting problems and I like to see how easily (or how not-so-easily) I can implement the solutions.

The most recent challenge I implemented was this – find the “minimum snippet” of a text source that contains all the “search terms.”

The requirement here is better described this way: create a function that returns all substrings of a source string which contain all of a set of possible substrings.

I used C# and LINQ, and it took me about an hour from start to finish – that includes debug time and testing.

At first I figured the problem was trivial – a few loops and some string.Substring functions would do it.  But as I began to think about it a little harder I realized that it isn’t as easy as it sounds if you have more than one search term.

I believe this can probably be done in a single line by constructing the right regular expression, but I decided not to implement it that way because it would defeat my purpose.  I wanted a programming challenge, not to spend an afternoon looking up RegEx syntax.

I started thinking about the problem in terms of numbers and positions.  Given a source string of length n and up to m search terms, you can produce a sequence of indexes into your source string, one index for position where one of the m search terms exist.  At first glance this suggests that at most you’d have n indices, because the very first thing you’d do is ensure that your search terms are not duplicated.

But this is actually not the case.  Imagine a source string of “baa baa black sheep”.  Suppose I search for “b” and “ba”.  I would produce index 0 and index 4 twice, because both “b” and “ba” occur at these positions.

This means that in the worst case, you have n * m total indices, which is a bummer, because m is not necessarily smaller than n (although it would probably be fair to invalidate input where m is larger, because it doesn’t make sense.  You can’t really legitimately search for more substrings than there are letters in the string).

Still, knowing where the search terms occur is a good way to start, because otherwise you can fall into the trap of assuming that the search terms have to be ordered – that is not a requirement of the function.  We must return all substrings that contain all the terms, regardless of which order the terms occur.

However, producing a flat list of positions where any search term occurs is no good either.  We need to produce substrings that contain each term and it is easily foreseeable that a single search term will be repeated many times.  If we produced just a set of positions in the string and didn’t associate them with the search term that exists there, we wouldn’t be able to protect ourselves from duplication.

So I started by essentially looping through each search term and, for each search term, producing a set of indices where that term appears in the source string.   I did this by creating an extension method on string to return an enumeration of all indices of a substring.  I did this using a while loop and a yield return.  In reality, this is a method I had implemented a long time ago that did this, part of my standard C# extension method toolkit.

So I used LINQ to select the set of indices of each m on the source string, which produced me a set of sets of integer.

This algorithm is now simple.  Produce all tuples which contain one member from each set.  In other words, combine them.

The requirement of the project is to produce only the minimum string as fast as possible.  I thought about using a type of greedy algorithm that would go something like this:

Pick an index from the first set, call it a.  Search all other sets for b, which is picked by being closest in position to a (either greater or smaller).  Search the remaining sets for c, which is picked again by having the minimum impact on the resulting length of the string.

The way we pick the “best” item from the remaining sets is by adding it to your existing set of “best” items, then sorting this set, and then subtracting the smallest value from the largest one.  This gives us the distance between the starting position of the first term and the starting position of the last term.  If we don’t know which set our new candidate item is coming from we can’t make this calculation right, because we’re trying to find the minimum string, and that string must include the search term.  If we don’t add the length of the search term to our distance calculation that we can produce bad results.

For example, consider the string “abcdefghijklmnopqrstuvwxyz”, with search terms “a”, “g”, and “efghijk”.  If I use my greedy algorithm without considering the length of the search term, then “abcdefg” would be considered longer than “abcdefghijk” because “efghijk” occurs at position 4 (“e”) and “a” occurs at position 0, and “g” occurs at position 6.  Since 4-0 is less than 6-0, we’d compute our distance wrong.

Once we have produced a candidate minimum length string using the first index in the first set, we now repeat this algorithm for every other index in the first set.  But, at every step, we compare the length we compute for each candidate against the length we’ve already computed.  As long as our length is less than our current best candidate we keep processing.  As soon as our new index causes our candidate to become longer, we abandon that index and don’t process any other sets that include that index.  If we arrive at the last set and find an index that when added to the candidate produces a length shorter than our current best, our new candidate becomes our current best and we repeat until we’ve exhausted all of the starting indexes in the first set.

The resulting best candidate is our result.

In reality, I didn’t implement this algorithm because while it satisfies the CMSC 132 assignment in the best way, it isn’t as useful as a general purpose function, so I wrote a progressive algorithm which returns all possible substrings and simply returns them in order from shrotest to longest.  In big(O) world it’s equally efficient because in the worst case you won’t find the best candidate until you’ve searched every possible combination.  I’m sure there are ways to tweak the above algorithm using some fancy sorts to eliminate a lot of the possible indices to reduce the worst case scenario, probably to something like O(n log n). 

The project challenges students to come up with an O(n) solution “assuming the number of search terms is constant”.  I believe this algorithm satisfies it that requirement but I’m not going to go about proving it.

It was fun to implement and I can see this being pretty challenging for newbie programmers, having worked with them for as long as I have.  I’m glad to see a CS degree still means something.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: