Monday, August 13, 2007

Using Groovy to grep XML

After attending some compelling presentations by Scott Davis at No Fluff Just Stuff, I have been playing with Groovy here and there when I've gotten the chance. At work, we've been working with some software that's currently producing a pretty massive log file. We tried using Chainsaw to slice and dice it, but it wasn't giving us the functionality that we wanted. So, this was a perfect time to play with some Groovy.

Our input looks something like this:

<root>
  <entry level="ERROR">
    <message>
      An error has occurred while parsing column FOO with value of BAR 23 in row 234
    </message>
  </entry>
  <entry level="ERROR">
    <message>
      An error has occurred while parsing column FOO with value of BAR 52 in row 234
    </message>
  </entry>
  <entry level="ERROR">
    <message>
      An error has occurred while parsing column FOO with value of FOOBAR 34 in row 234
    </message>
  </entry>
  <entry level="ERROR">
    <message>
      An error has occurred while parsing column FOO with value of FOO52 in row 234
    </message>
  </entry>
</root>


Of course, the file is too massive to read through. For this particular error, we were interested in the unique values of FOO that we weren't handling. Here is the Groovy to pop open the XML file and find the unique values:


// Groovy 4 removed groovy.util.XmlParser, so newer versions need this import.
// Versions before 3.0 have no groovy.xml.XmlParser and should drop the line.
import groovy.xml.XmlParser

def findUniqueEntries(String inputPath, String search, String extract) {
    def uniqueMatches = new HashMap()

    // Parse the whole XML file into memory
    def root = new XmlParser().parse(new File(inputPath))

    // All the child nodes of the root node will be <entry> elements
    for (entry in root.children()) {

        // Assumes each entry has exactly one <message> child
        def text = entry.message[0].text()

        // Do a cheap substring search first
        if (text.contains(search)) {

            // Pull out the failing value with a regular expression.
            // 'extract' must contain one capture group; messages it does not match are skipped.
            def matcher = text =~ extract
            if (matcher.find()) {
                def uniqueValue = matcher.group(1)

                // Start new values at 1; increment values we've seen before
                uniqueMatches[uniqueValue] = (uniqueMatches[uniqueValue] ?: 0) + 1
            }
        }
    }

    // Print the values along with their occurrence counts
    for (match in uniqueMatches) {
        println match
    }

    // Print the number of unique matches and the number of total matches.
    // sum() returns null for an empty collection, so fall back to 0.
    def uniqueMatchCount = uniqueMatches.size()
    def totalMatchCount = uniqueMatches.values().sum() ?: 0
    println('\nFound ' + uniqueMatchCount + ' unique matches in ' +
            totalMatchCount + ' total matches.\n')
}
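
The post doesn't show the arguments that were actually passed in, so here is a small sketch of what they might look like for the sample input. The search string and the regular expression are my own guesses, not the originals. The only real requirement is that the expression has one capture group around the failing value.

```groovy
// Hypothetical arguments: the post does not show the ones actually used.
def search = 'parsing column FOO'

// One capture group around the failing value.
def extract = /value of (.+) in row/

// One message copied from the sample input.
def message = 'An error has occurred while parsing column FOO with value of BAR 23 in row 234'

def value = null
if (message.contains(search)) {
    def matcher = message =~ extract
    if (matcher.find()) {
        value = matcher.group(1)
    }
}

// prints: BAR 23
println value
```

With arguments like these, the call would be something like findUniqueEntries('errors.xml', search, extract), where errors.xml is just a placeholder path.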


There are a few interesting things that made this really fun code to write:

1 - Navigating XML with Groovy is easy, and the syntax reads quite well. The code communicates the structure of the XML document as well as could be expected, I think.

2 - Groovy makes it really easy to work with regular expressions. The =~ operator builds the matcher directly from a string, so there is no explicit Pattern.compile() call.

3 - The Groovy sum() extension means that we don't need to track the total number of matches separately, nor do we need to write a loop over the HashMap at the end just to add up the counts.
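
The three points above can be seen together in a short sketch. It parses a two-entry copy of the sample log from a string instead of a file. The regular expression is an assumption on my part, and the import is only needed on newer Groovy versions (before 3.0, drop it).

```groovy
import groovy.xml.XmlParser

// Two entries copied from the sample input.
def xml = '''
<root>
  <entry level="ERROR">
    <message>An error has occurred while parsing column FOO with value of BAR 23 in row 234</message>
  </entry>
  <entry level="ERROR">
    <message>An error has occurred while parsing column FOO with value of FOO52 in row 234</message>
  </entry>
</root>
'''

def root = new XmlParser().parseText(xml)

// 1 - Property access walks child elements; an '@' prefix reads an attribute.
def levels = root.entry.collect { it.'@level' }
def messages = root.entry.collect { it.message[0].text().trim() }

// 2 - The =~ operator turns a string into a java.util.regex.Matcher.
//     (The pattern is an assumed one, not taken from the original post.)
def values = messages.collect { msg ->
    def m = msg =~ /value of (.+) in row/
    m.find() ? m.group(1) : null
}

// 3 - sum() adds up any collection, including the values of a map.
def counts = [:]
values.each { counts[it] = (counts[it] ?: 0) + 1 }
def total = counts.values().sum()

println levels
println values
println total
```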

All in all, I'm enjoying playing with Groovy at the moment. I've found the barrier to entry to be pretty low. It will be interesting to see how we continue to use Groovy in the future.

Throw Away Code Must Be Thrown Away

From time to time, it’s advantageous to take off my TDD hat and fling a small bit of code. I’ve found this is the quickest way to gain a bit of confidence in working with libraries that I haven’t touched before. This allows me to be sure that I know how to interface with the functionality that I need. After a quick proof-of-concept, I’ll know what parameters and classes are needed to get what I want.

My own personal problem with this has been developing the discipline to remove the code from the system once I’ve figured out what I’m after. Why do I think I need to discipline myself to do this? Simply put, my worst code is always the code that wasn’t written “test-first.” The spikes that I pull over into production code often cause problems when trying to get them under test coverage.

Conversely, I’ve found that following Test Driven Development yields code that is easier to test and, frankly, better designed. TDD prevents a great deal of speculative development and over-design. Therefore, it’s much better to step back from the initial spike and start over by writing tests.

I have found it much easier to prevent the spikes from getting attached to the project by putting them in a completely separate class. Name it “class ThrowAway” to help yourself remember.
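
As a rough illustration, a spike kept apart like this might look something like the sketch below. The question it answers (how to read an attribute and an element's text with XmlParser) and the probe method are made up for the example. The import applies to newer Groovy versions only; before 3.0, drop it.

```groovy
import groovy.xml.XmlParser

// Spike only: delete this class once the real, test-first code exists.
class ThrowAway {

    // Hypothetical spike question: how do I read an attribute and an
    // element's text from the first <entry> with XmlParser?
    static List probe(String xml) {
        def root = new XmlParser().parseText(xml)
        def entry = root.entry[0]
        return [entry.'@level', entry.message[0].text()]
    }

    static void main(String[] args) {
        println probe('<root><entry level="ERROR"><message>boom</message></entry></root>')
    }
}
```

Nothing in the production code refers to ThrowAway, so deleting the file later breaks nothing.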

Bottom line: Throw-away code must be thrown away.