Monday, August 13, 2007

Using Groovy to grep XML

After attending some compelling presentations by Scott Davis at No Fluff Just Stuff, I have been playing with Groovy here and there when I've gotten the chance. At work, we've been working with some software that's currently producing a pretty massive log file. We tried using Chainsaw to slice and dice it, but it wasn't giving us the functionality that we wanted. So, this was a perfect time to play with some Groovy.

Our input looks something like this:

<root>
  <entry level="ERROR">
    <message>
      An error has occurred while parsing column FOO with value of BAR 23 in row 234
    </message>
  </entry>
  <entry level="ERROR">
    <message>
      An error has occurred while parsing column FOO with value of BAR 52 in row 234
    </message>
  </entry>
  <entry level="ERROR">
    <message>
      An error has occurred while parsing column FOO with value of FOOBAR 34 in row 234
    </message>
  </entry>
  <entry level="ERROR">
    <message>
      An error has occurred while parsing column FOO with value of FOO52 in row 234
    </message>
  </entry>
</root>


Of course, the file is too massive to read through. For this particular error, we were interested in the unique values of FOO that we weren't handling. Here is the Groovy to pop open the XML file and find the unique values:


def findUniqueEntries(String inputPath, String search, String extract) {
    def uniqueMatches = new HashMap()

    //Input all of the XML
    //(on Groovy 4 and later this needs 'import groovy.xml.XmlParser' at the top of the script)
    def root = new XmlParser().parse(new File(inputPath))

    //All the child nodes of the root node will be entry elements
    for (entry in root.children()) {

        //Assumes each entry has exactly one message child
        def text = entry.message[0].text()

        //Do a substring search first
        if (text.contains(search)) {

            //Strip out the failing value with a Regular Expression;
            //the first capture group in extract is the value we want
            def matcher = text =~ extract

            //Skip entries where the pattern does not match
            if (!matcher.find()) {
                continue
            }
            def uniqueValue = matcher.group(1)

            //For values we've seen before, increment the count
            if (uniqueMatches[uniqueValue] != null) {
                uniqueMatches[uniqueValue] += 1
            }

            //For a new value, initialize the count
            else {
                uniqueMatches[uniqueValue] = 1
            }
        }
    }

    //Print the values along with their occurrence count
    for (match in uniqueMatches) {
        println match
    }

    //Print the number of unique matches, and the number of total matches.
    //sum() returns null for an empty collection, so fall back to 0
    def uniqueMatchCount = uniqueMatches.size()
    def totalMatchCount = uniqueMatches.values().sum() ?: 0
    println ('\nFound ' + uniqueMatchCount + ' unique matches in ' +
        totalMatchCount + ' total matches.\n')
}
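
To give a feel for how the search and extract arguments fit together, here is a minimal, self-contained sketch of the same counting idea run against a small inline sample instead of a file. The sample XML, the search string, and the extract pattern are made up for illustration; they are not the values we used at work. The only real requirement is that the regular expression's first capture group isolates the failing value.

```groovy
// On Groovy 3+ XmlParser lives in groovy.xml; on older versions
// groovy.util.XmlParser is auto-imported, so drop this import there.
import groovy.xml.XmlParser

// Hypothetical sample mirroring the log format (note the repeated "BAR 23")
def xml = '''<root>
  <entry level="ERROR">
    <message>An error has occurred while parsing column FOO with value of BAR 23 in row 234</message>
  </entry>
  <entry level="ERROR">
    <message>An error has occurred while parsing column FOO with value of BAR 23 in row 512</message>
  </entry>
  <entry level="ERROR">
    <message>An error has occurred while parsing column FOO with value of FOO52 in row 234</message>
  </entry>
</root>'''

// Hypothetical arguments: a plain substring to pre-filter on, and a
// pattern whose first capture group is the value we want to count
def search  = 'parsing column FOO'
def extract = /with value of (.+) in row/

def counts = [:]
new XmlParser().parseText(xml).entry.each { entry ->
    def text = entry.message[0].text()
    if (text.contains(search)) {
        def matcher = text =~ extract
        // Only count entries where the pattern actually matches
        if (matcher.find()) {
            def value = matcher.group(1)
            counts[value] = (counts[value] ?: 0) + 1
        }
    }
}

println counts   // prints [BAR 23:2, FOO52:1]
```

The cheap contains() check runs before the regular expression, so entries for unrelated errors never reach the matcher.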


There are a few interesting things that made this really fun code to write:

1 - Navigating XML with Groovy is easy, and the syntax reads quite well. The code communicates the structure of the XML document as well as could be expected, I think.

2 - Groovy makes it really easy to work with Regular Expressions. No more pattern compiling.

3 - The Groovy sum() extension means that we don't need to track the total number of matches, nor do we need to iterate through the HashMap at the end.
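
To make the second and third points a little more concrete, here is a small sketch that puts the =~ operator next to the Java-style compile step it replaces, and then totals some counts with sum(). The message text and the counts map are illustrative values only, not output from our real log.

```groovy
import java.util.regex.Pattern

def text = 'An error has occurred while parsing column FOO with value of BAR 23 in row 234'

// The =~ operator builds a java.util.regex.Matcher straight from a pattern string
def matcher = text =~ /with value of (.+) in row/
def viaOperator = matcher.find() ? matcher.group(1) : null

// The equivalent Java-style version, with the explicit compile step
def javaMatcher = Pattern.compile('with value of (.+) in row').matcher(text)
def viaCompile = javaMatcher.find() ? javaMatcher.group(1) : null

// sum() totals the counts without an explicit loop
// (hypothetical counts, for illustration only)
def counts = ['BAR 23': 2, 'FOO52': 1]
def total = counts.values().sum()

println "operator: ${viaOperator}, compile: ${viaCompile}, total: ${total}"
```

One thing to watch for: as far as I can tell, sum() returns null rather than 0 on an empty collection, so a fallback such as sum() ?: 0 is worth having if there may be no matches at all.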

All in all, I'm enjoying playing with Groovy at the moment. I've found the barrier to entry to be pretty low. It will be interesting to see how we continue to use Groovy in the future.
