<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>sigizmund.com &#187; xml</title>
	<atom:link href="http://sigizmund.com/tag/xml/feed/" rel="self" type="application/rss+xml" />
	<link>http://sigizmund.com</link>
	<description>Geeky fairytales</description>
	<lastBuildDate>Wed, 29 Feb 2012 10:51:50 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>XML input and Hadoop – custom InputFormat</title>
		<link>http://sigizmund.com/xml-input-and-hadoop-%e2%80%93-custom-inputformat/</link>
		<comments>http://sigizmund.com/xml-input-and-hadoop-%e2%80%93-custom-inputformat/#comments</comments>
		<pubDate>Wed, 04 Nov 2009 17:18:00 +0000</pubDate>
		<dc:creator>sigizmund</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[geek]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://sigizmund.wordpress.com/2009/11/04/xml-input-and-hadoop-%e2%80%93-custom-inputformat</guid>
		<description><![CDATA[<p></p> <p style="margin-top:0;">Today I finally hit the task I had been dreading for so long: processing large XML files on Hadoop. I won&#8217;t tell you how long I crawled the Internet trying to find a working solution&#8230; not that anyone wants to know. Eventually, I came up with a solution of my own [...]]]></description>
			<content:encoded><![CDATA[<p><span style="font-family:Trebuchet MS, Helvetica, sans-serif;font-size:13px;"></p>
<p style="margin-top:0;">Today I finally hit the task I had been dreading for so long: processing large XML files on Hadoop. I won&rsquo;t tell you how long I crawled the Internet trying to find a working solution&hellip; not that anyone wants to know. Eventually, I came up with a solution of my own. Even though I hate re-inventing the wheel, in this particular case all the wheels I found were either square or utterly incompatible with my model of car.</p>
<p>To keep things simple, I won&rsquo;t include the full source code. I won&rsquo;t even include the whole InputFormat class. So, to make yourself comfortable, please do the following:</p>
<ol>
<li style="margin-top:0;">Open&nbsp;<code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;margin-top:0;">LineRecordReader</code>&nbsp;from&nbsp;<code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;">org.apache.hadoop.mapreduce.lib.input</code>&nbsp;so you can see it</li>
<li>Open&nbsp;<code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;margin-top:0;">TextInputFormat</code>&nbsp;from the same package.</li>
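<li>Create an input format and a record reader of your own, just by copying and pasting the code from the aforementioned classes.</li>
<li>Change the&nbsp;<code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;">createRecordReader()</code>&nbsp;method of your input format class so it returns your newly defined record reader.</li>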
</ol>
<p>Now we&rsquo;re almost there. Here is the code for&nbsp;<code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;margin-top:0;">nextKeyValue()</code>, which turned out to be the most critical method here. Hold on tight:</p>
<pre style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;background-color:#f0f0f0;border-color:#cccbba;border-style:solid;border-width:1px;padding:10px 10px 10px 20px;">public boolean nextKeyValue() throws IOException
{
    StringBuilder sb = new StringBuilder();
    if (key == null)
    {
        key = new LongWritable();
    }
    key.set(pos);
    if (value == null)
    {
        value = new Text();
    }
    int newSize = 0;
    boolean xmlRecordStarted = false;
    Text tmpLine = new Text();

    while (pos &lt; end)
    {
        // Read one line, exactly as LineRecordReader does.
        newSize = in.readLine(tmpLine,
            maxLineLength,
            Math.max((int) Math.min(Integer.MAX_VALUE, end - pos),
                     maxLineLength));

        if (newSize == 0)
        {
            break;
        }

        // Advance before any break below so the offset stays accurate.
        pos += newSize;

        // Opening tag: start collecting lines for this record.
        if (tmpLine.toString().contains("&lt;document "))
        {
            xmlRecordStarted = true;
        }

        if (xmlRecordStarted)
        {
            sb.append(tmpLine.toString().replaceAll("\n", " "));
        }

        // Closing tag: emit everything collected so far as the value.
        if (tmpLine.toString().contains("&lt;/document&gt;"))
        {
            xmlRecordStarted = false;
            this.value.set(sb.toString());
            break;
        }
    }

    if (newSize == 0)
    {
        key = null;
        value = null;
        return false;
    }
    else
    {
        return true;
    }
}</pre>
<p>WTF, you will say, it&rsquo;s the same code? Well, yes and no. It&rsquo;s almost the same. Take a look at this line:</p>
<pre style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;background-color:#f0f0f0;border-color:#cccbba;border-style:solid;border-width:1px;padding:10px 10px 10px 20px;"><code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;margin-top:0;">if (tmpLine.toString().contains("&lt;document ")) </code></pre>
<p>and this line:</p>
<pre style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;background-color:#f0f0f0;border-color:#cccbba;border-style:solid;border-width:1px;padding:10px 10px 10px 20px;"><code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;margin-top:0;">if (tmpLine.toString().contains("&lt;/document&gt;")) </code></pre>
<p>This is where we actually split the document into chunks. The code is pretty much self-explanatory, so I won&rsquo;t add anything else.</p>
<p>Now, it&rsquo;s not the cleanest or most streamlined solution, and I will probably spend a while tomorrow making it more production-ready and good-looking, but compared to other solutions it has a few major benefits:</p>
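<p>As a rough illustration of the splitting logic alone, here is a minimal sketch in plain Java with no Hadoop dependencies. The class name <code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;">LineBasedXmlSplitter</code> and its <code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;">split</code> method are made up for this example, and it works on an in-memory string rather than an input split, so treat it as a way to experiment with the idea and not as the record reader itself.</p>

```java
import java.util.ArrayList;
import java.util.List;

final class LineBasedXmlSplitter {
    // Same marker strings as in the post.
    static final String START = "<document ";
    static final String END = "</document>";

    /** Collects every START..END span, looking at the input one line at a time. */
    static List<String> split(String input) {
        List<String> records = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        boolean started = false;
        for (String line : input.split("\n")) {
            if (line.contains(START)) {
                started = true;              // opening tag seen: start buffering
            }
            if (started) {
                sb.append(line).append(' '); // join lines with a single space
            }
            if (started && line.contains(END)) {
                records.add(sb.toString().trim());
                sb.setLength(0);             // reset the buffer for the next record
                started = false;
            }
        }
        return records;                      // an unterminated record is dropped
    }
}
```

<p>Unlike the reader in the post, this sketch ignores a closing tag that appears before any opening tag, and it silently drops a record that is never closed.</p>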
<ol>
<li style="margin-top:0;">It uses very little custom code (you remember, we copied and pasted all the classes?). Unfortunately, you cannot simply subclass the originals: some fields are private, and we clearly need to modify them.</li>
<li>It&rsquo;s configurable &mdash; you can easily change the&nbsp;<code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;margin-top:0;">&lt;document</code>&nbsp;and&nbsp;<code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;">&lt;/document&gt;</code>&nbsp;strings to anything else (and again, I will do it tomorrow, but now I feel too lazy).</li>
<li>It works.</li>
</ol>
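<p>One possible shape for the &ldquo;configurable&rdquo; part is to look the two marker strings up in the job settings and fall back to the hard-coded values. The sketch below is only an assumption about how that could look: it uses <code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;">java.util.Properties</code> as a stand-in for the Hadoop configuration, and the class name <code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;">XmlMarkers</code> and the property names <code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;">xmlinput.start</code> and <code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;">xmlinput.end</code> are hypothetical.</p>

```java
import java.util.Properties;

final class XmlMarkers {
    // Hypothetical property names, chosen for this sketch only.
    static final String START_KEY = "xmlinput.start";
    static final String END_KEY = "xmlinput.end";

    final String start;
    final String end;

    XmlMarkers(Properties conf) {
        // Fall back to the strings hard-coded in the post.
        this.start = conf.getProperty(START_KEY, "<document ");
        this.end = conf.getProperty(END_KEY, "</document>");
    }

    /** True if the line contains the configured opening marker. */
    boolean opens(String line) {
        return line.contains(start);
    }

    /** True if the line contains the configured closing marker. */
    boolean closes(String line) {
        return line.contains(end);
    }
}
```

<p>In a real record reader the same two lookups would presumably go through the job configuration when the reader is initialised, with <code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;">opens()</code> and <code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;">closes()</code> replacing the two hard-coded <code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;">contains()</code> checks.</p>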
<p>There are a few limitations to this approach. One is that if an opening tag follows a closing tag on the same line, as in&nbsp;<code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;margin-top:0;">&lt;/document&gt;&lt;document&gt;</code>, it obviously won&rsquo;t work, because records are only ever split at line boundaries. Another is that you still need to parse the elements in your mapper (although you can easily change that by parsing records in your record reader into a&nbsp;<code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;">Writable</code>-compatible class).</p>
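<p>For the same-line case, one alternative (not what the reader in this post does) is to search for the markers by character offset instead of line by line. The sketch below assumes the whole text fits in memory, which a real record reader could not rely on, and the class name <code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;">IndexBasedXmlSplitter</code> is invented for the example.</p>

```java
import java.util.ArrayList;
import java.util.List;

final class IndexBasedXmlSplitter {
    /** Returns every substring that runs from a start marker to the next end marker. */
    static List<String> split(String text, String start, String end) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int s = text.indexOf(start, from);
            if (s < 0) {
                break;                    // no further opening marker
            }
            int e = text.indexOf(end, s + start.length());
            if (e < 0) {
                break;                    // unterminated record: stop here
            }
            int stop = e + end.length();
            records.add(text.substring(s, stop));
            from = stop;                  // continue right after the closing marker
        }
        return records;
    }
}
```

<p>Because the search resumes immediately after each closing marker, two records that share a line come out as two separate strings.</p>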
<p />
<div>Have fun!</div>
<p />
<div><strong>Update: </strong>As you can see, I have added a space to the &#8220;&lt;document &#8221; string constant. Today I realised that &#8220;&lt;documenttype&#8221; elements were also being matched as record starts, producing inconsistent results.</div>
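<p>The trailing space fixes the &#8220;&lt;documenttype&#8221; case, but it also means a bare &#8220;&lt;document&gt;&#8221; with no attributes is not recognised. A slightly stricter check, sketched here under the assumption that an element name is always followed by whitespace, &#8220;&gt;&#8221;, &#8220;/&#8221; or the end of the line, could look like this (the class and method names are hypothetical):</p>

```java
final class StartTagMatcher {
    /** True if the line contains "<name" followed by a character that ends the tag name. */
    static boolean opensElement(String line, String name) {
        String prefix = "<" + name;
        int i = line.indexOf(prefix);
        while (i >= 0) {
            int next = i + prefix.length();
            if (next == line.length()) {
                return true;              // name ends the line; attributes may follow on the next one
            }
            char c = line.charAt(next);
            if (Character.isWhitespace(c) || c == '>' || c == '/') {
                return true;
            }
            i = line.indexOf(prefix, i + 1); // a longer name matched; keep looking
        }
        return false;
    }
}
```

<p>With this check, <code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;">opensElement(line, "document")</code> accepts both the attribute and the no-attribute form while still rejecting longer element names that merely start with the same letters.</p>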
<p></span></p>
]]></content:encoded>
			<wfw:commentRss>http://sigizmund.com/xml-input-and-hadoop-%e2%80%93-custom-inputformat/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
	</channel>
</rss>

