XML input and Hadoop – custom InputFormat
Today I finally hit the task I had been dreading for so long: processing large XML files on Hadoop. I won’t tell you how long I crawled the Internet trying to find a working solution (not that anyone wants to know). Eventually, I came up with a solution of my own. Even though I hate re-inventing the wheel, in this particular case all the wheels I found were either square or utterly incompatible with my model of car.
To keep things simple, I won’t include the full source code. I won’t even include the whole InputFormat class. So, to make yourself comfortable, please do the following:
- Open LineRecordReader from org.apache.hadoop.mapreduce.lib.input so you can see it.
- Open TextInputFormat from the same package.
- Create an input format and a record reader of your own, just by copying and pasting the code from the aforementioned classes.
- Change the createRecordReader() method of your input format class so it returns your newly defined record reader.
Now we’re almost there. Here is the code for nextKeyValue(), which turned out to be the most critical method. Hold on tight:
    public boolean nextKeyValue() throws IOException {
        StringBuilder sb = new StringBuilder();
        if (key == null) {
            key = new LongWritable();
        }
        key.set(pos);
        if (value == null) {
            value = new Text();
        }
        int newSize = 0;
        boolean xmlRecordStarted = false;
        Text tmpLine = new Text();
        while (pos < end) {
            newSize = in.readLine(tmpLine, maxLineLength,
                    Math.max((int) Math.min(Integer.MAX_VALUE, end - pos), maxLineLength));
            if (newSize == 0) {
                break; // nothing left to read
            }
            // Count every consumed line, including the one with the closing tag.
            pos += newSize;
            String line = tmpLine.toString();
            if (line.contains("<document ")) {
                xmlRecordStarted = true; // opening tag: start collecting lines
            }
            if (xmlRecordStarted) {
                sb.append(line.replaceAll("\n", " "));
            }
            if (line.contains("</document>")) {
                xmlRecordStarted = false; // closing tag: the record is complete
                this.value.set(sb.toString());
                break;
            }
        }
        if (newSize == 0) {
            key = null;
            value = null;
            return false;
        } else {
            return true;
        }
    }
WTF — you will say? It’s the same code? Well — yes, and no. It’s almost the same. Take a look at this line:
if (tmpLine.toString().contains("<document "))
and this line:
if (tmpLine.toString().contains("</document>"))
This is where we actually split the document into chunks. The code is pretty much self-explanatory, so I won’t add anything else.
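If you want to play with the splitting logic without a cluster, here is a minimal sketch that runs the same start/end-tag scan over an in-memory string instead of a Hadoop split. The class name XmlChunker and the method split are made up for this illustration, and unlike the record reader it trims each line and joins lines with a single space; only the two contains() checks are taken from the code above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: mimics the tag scan in nextKeyValue().
public class XmlChunker {
    public static final String START = "<document ";
    public static final String END = "</document>";

    // Returns one string per <document ...>...</document> block,
    // with the block's lines trimmed and joined by single spaces.
    public static List<String> split(String input) {
        List<String> records = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        boolean started = false;
        for (String line : input.split("\n")) {
            if (line.contains(START)) {
                started = true;   // opening tag: start a fresh record
                sb.setLength(0);
            }
            if (started) {
                sb.append(line.trim()).append(' ');
            }
            if (started && line.contains(END)) {
                records.add(sb.toString().trim()); // closing tag: emit the record
                started = false;
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<corpus>\n"
                + "<document id=\"1\">\n"
                + "  <title>First</title>\n"
                + "</document>\n"
                + "<document id=\"2\">\n"
                + "  <title>Second</title>\n"
                + "</document>\n"
                + "</corpus>\n";
        // Prints each record on its own line; lines outside the tags are skipped.
        for (String record : split(xml)) {
            System.out.println(record);
        }
    }
}
```

Anything outside a document block (the corpus wrapper here) is simply dropped, which is the same thing the record reader does.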
Now, it’s not the cleanest or most streamlined solution, and I will probably spend a while tomorrow making it more production-ready and good-looking, but compared to other solutions it has a few major benefits:
- It uses very little custom code (you remember, we copied and pasted all the classes?). Unfortunately you cannot just inherit the class — some fields are private, and we clearly want to modify them.
- It’s configurable: you can easily change the <document and </document> strings to anything else (and again, I will do it tomorrow, but now I feel too lazy).
- It works.
There are a few limitations to this approach. One is that the check is line-based, so if a single line contains something like </document><document> it obviously won’t work. Another is that you still need to parse the elements in your mapper (although you can easily change that by parsing records in your record reader into a Writable-compatible class).
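As for the parsing part: the record that reaches the mapper is just a string, so something still has to pull the fields out of it. Here is a minimal sketch using the DOM parser that ships with the JDK. The RecordParser class, the id attribute and the title element are all invented for the example; your documents will have their own structure.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper: turns one <document ...>...</document> string into fields.
public class RecordParser {

    // Returns "id<TAB>title"; assumes the record has an id attribute
    // and at least one <title> child (both are assumptions of this sketch).
    public static String parse(String record) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document dom = builder.parse(new InputSource(new StringReader(record)));
            Element root = dom.getDocumentElement();
            String id = root.getAttribute("id");
            String title = root.getElementsByTagName("title").item(0).getTextContent();
            return id + "\t" + title;
        } catch (Exception e) {
            // Malformed or unexpected record: surface it as an unchecked exception.
            throw new IllegalArgumentException("Cannot parse record: " + record, e);
        }
    }

    public static void main(String[] args) {
        String record = "<document id=\"1\"> <title>First</title> </document>";
        // Prints the id and the title separated by a tab.
        System.out.println(parse(record));
    }
}
```

In a mapper you would presumably call something like this on value.toString(); doing it inside the record reader instead would mean emitting your own Writable rather than Text.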

