Today I finally hit the task I was scared for so long — processing large XML files on Hadoop. I won’t tell you for how long I crawled the Internet trying to find some working solution… not that anyone wants to know? Eventually, I came out with the solution of my own — even though I hate re-inventing the wheel, in this particular case all the wheels I found were either square or were utterly incompatible with my model of car.

To make things more simple, I won’t include the full source code. I won’t even include the whole InputFormat class. So, to make yourself comfortable, please do following:

  1. Open LineRecordReader from org.apache.hadoop.mapreduce.lib.input so you can see it
  2. Open TextInputFormat from the same package.
  3. Create the input format and record reader of your own, just by copying and pasting the code from aforementioned classes.
  4. Change the constructor of your input format class so it’ll return your newly-defined record reader.

Now, we’re almost there. Now I’ll include the piece of code for nextKeyValue() which turned out to be the most critical method here. Hold on tight:

public boolean nextKeyValue() throws IOException{    StringBuilder sb = new StringBuilder();    if (key == null)    {        key = new LongWritable();    }    key.set(pos);    if (value == null)    {        value = new Text();    }    int newSize = 0;

boolean xmlRecordStarted = false; Text tmpLine = new Text();

while (pos < end) { newSize = in.readLine(tmpLine, maxLineLength, Math.max((int) Math.min(Integer.MAX_VALUE, end - pos), maxLineLength));

if (newSize == 0) { break; }

if (tmpLine.toString().contains("<document ")) { xmlRecordStarted = true; }

if (xmlRecordStarted) { sb.append(tmpLine.toString().replaceAll("n", " ")); }

if (tmpLine.toString().contains("</document>")) { xmlRecordStarted = false; this.value.set(sb.toString()); break; }

pos += newSize;

}

if (newSize == 0) { key = null; value = null; return fal se; } else { return true; }}

WTF — you will say? It’s the same code? Well — yes, and no. It’s almost the same. Take a look at this line:

if (tmpLine.toString().contains("<document")) 

and this line:

if (tmpLine.toString().contains("</document>")) 

This is where we actually split the document into chunks. Code is pretty-much self-explaining so I won’t add anything else.

Now, it’s not the most clean and streamlined solution and I probably will spend a while tomorrow making it more production-ready and good-looking, but compared to other solutions, it has few major benefits:

  1. It uses very little custom code (you remember, we copied and pasted all the classes?). Unfortunately you cannot just inherit the class — some fields are private, and we clearly want to modify them.
  2. It’s configurable — you can easily change the <document and </document> strings to anything else (and again, I will do it tomorrow, but now I feel too lazy).
  3. It works.

There’re few limitations of this approach. One of them is that if the document contains something like </document><document> it obviously won’t work. Another is — you still need to parse elements in your mapper (although you can easily change it by parsing records in your record reader into Writable-compatible class).

Have fun!

Update: As you can see, I have added a space in “<document ” string constant – today I realised that “<documenttype” elements has been successfully used for splits, hence producing inconsistent results.

Tagged with:
 

11 Responses to XML input and Hadoop – custom InputFormat

  1. Holger Dürer says:

    I am not sure I have quite understood yet what you are trying to achieve. You have one big XML file and are trying to create records where each record contains one <document> element?

  2. Roman Kirillov says:

    Pretty much so. For most of standard Hadoop jobs you have a flat text file where every record is a line. In this example we deal with an XML file where every record is represented by one <document> element and all it’s content.

  3. Holger Dürer says:

    Also: what are pos and end? The approximate byte range from which to read the records, right? A record will usually not fall on these boundaries. That's why you skip lines at the beginning, right? But doesn't that mean you should finish reading a record past 'end'? Otherwise your last record might terminate prematurely….

  4. Roman Kirillov says:

    Pretty much so. For most of standard Hadoop jobs you have a flat text file where every record is a line. In this example we deal with an XML file where every record is represented by one <document> element and all it’s content. 

  5. Roman Kirillov says:

    pos and end are members from LineRecordReader. Don’t have the source handy, but I they declare end = Long.MAX_VALUE; (or expect user to specify the end-offset but this isn’t the case).

  6. Holger Dürer says:

    I am still convinced you need to change the “while (pos < end)” to keep looking past the split’s end if you are in the middle of parsing a document — otherwise your last document in the split will be truncated. I.e. something like “while (pos < end || xmlRecordStarted)

  7. Roman Kirillov says:

    I think I’ll give it a try – for now I’m not that really concerned about it, but in future (i.e. when we’ll deal with you-know-which-product) we’ll probably need to sort it out in a proper way.

  8. Holger Dürer says:

    For TPWMNBN you also need to sort out the newline thing. You cannot just replace newlines with spaces — that can be changing the meaning of the contents.

  9. Roman Kirillov says:

    Actually I fixed it – there’s no newline replacing now (mainly because it serves no purpose and we still can handle it with no problems at all). Also, for TPWMNBN we can instead read content straight into objects (by changing the RecordReader to return some Object rather than Text). Once you figure all this stuff out it’s quite simple!

  10. Paul Ingles says:

    Hi,Just found your post. I’ve been trying to do the same thing. In the end, I went with the XmlInputFormat from Mahout’s Bayesian Classifier. Seems to do everything I need it to (and works without going screwy like the streaming one).I posted about it here: http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html

  11. Laure Drenth says:

    Hello there, Are you going to be publishing a follow up piece? My husband and me have squandered some time browsing over your web page and surprisingly enough you touched on some thing we were discussing only the other week with our accountant. We often notice ourselves quarrelling over the smallest of issues, isn’t it childish? At any rate we wish you greatest wishes from the Usa.

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre lang="" line="" escaped="" highlight="">