<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>sigizmund.com &#187; hadoop</title>
	<atom:link href="http://sigizmund.com/tag/hadoop/feed/" rel="self" type="application/rss+xml" />
	<link>http://sigizmund.com</link>
	<description>Geeky fairytales</description>
	<lastBuildDate>Tue, 21 Jun 2011 14:53:11 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Hadoop&#8217;s &#8220;DistributedFileSystem vs DistributedCache&#8221; mystery</title>
		<link>http://sigizmund.com/hadoops-distributedfilesystem-vs-distributedcache-mystery/</link>
		<comments>http://sigizmund.com/hadoops-distributedfilesystem-vs-distributedcache-mystery/#comments</comments>
		<pubDate>Tue, 23 Mar 2010 16:52:22 +0000</pubDate>
		<dc:creator>sigizmund</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[geek]]></category>
		<category><![CDATA[hadoop]]></category>

		<guid isPermaLink="false">http://sigizmund.com/hadoops-distributedfilesystem-vs-distributedcache-mystery/</guid>
		<description><![CDATA[try &#123; FileSystem dfs = DistributedFileSystem.get&#40;hadoopJobConfiguration&#41;; final FileStatus&#91;&#93; sts = dfs.listStatus&#40;new Path&#40;this.hdfsDirectory&#41;&#41;; for &#40; FileStatus s : sts &#41; &#123; if &#40; s.getPath&#40;&#41;.toString&#40;&#41;.endsWith&#40;&#34;.jar&#34;&#41; &#41; &#123; log.info&#40;&#34;Jar found: &#34; + s.getPath&#40;&#41;.toString&#40;&#41;&#41;; DistributedCache.addFileToClassPath&#40;new Path&#40;s.getPath&#40;&#41;.toUri&#40;&#41;.getPath&#40;&#41;&#41;, hadoopJobConfiguration&#41;; &#125; &#125; &#125; catch &#40;IOException [...]]]></description>
			<content:encoded><![CDATA[
<div class="wp_syntax"><table><tr><td class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">try</span>
<span style="color: #009900;">&#123;</span>
    FileSystem dfs <span style="color: #339933;">=</span> DistributedFileSystem.<span style="color: #006633;">get</span><span style="color: #009900;">&#40;</span>hadoopJobConfiguration<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">final</span> FileStatus<span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> sts <span style="color: #339933;">=</span> dfs.<span style="color: #006633;">listStatus</span><span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Path<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">this</span>.<span style="color: #006633;">hdfsDirectory</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span> FileStatus s <span style="color: #339933;">:</span> sts <span style="color: #009900;">&#41;</span>
    <span style="color: #009900;">&#123;</span>
        <span style="color: #000000; font-weight: bold;">if</span> <span style="color: #009900;">&#40;</span> s.<span style="color: #006633;">getPath</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>.<span style="color: #006633;">toString</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>.<span style="color: #006633;">endsWith</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;.jar&quot;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#41;</span>
        <span style="color: #009900;">&#123;</span>
            log.<span style="color: #006633;">info</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;Jar found: &quot;</span> <span style="color: #339933;">+</span> s.<span style="color: #006633;">getPath</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>.<span style="color: #006633;">toString</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
            DistributedCache.<span style="color: #006633;">addFileToClassPath</span><span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Path<span style="color: #009900;">&#40;</span>s.<span style="color: #006633;">getPath</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>.<span style="color: #006633;">toUri</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>.<span style="color: #006633;">getPath</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>, hadoopJobConfiguration<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
    <span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span>
<span style="color: #000000; font-weight: bold;">catch</span> <span style="color: #009900;">&#40;</span><span style="color: #003399;">IOException</span> e<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
    <span style="color: #000000; font-weight: bold;">throw</span> <span style="color: #000000; font-weight: bold;">new</span> MyException<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;FileSystem exception while caching JAR files: &quot;</span>, e<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

<p>Hadoop still manages to surprise me every day. You would think it makes sense to take a <font face="Menlo">Path</font> object from <font face="Monaco">DistributedFileSystem</font> and feed it straight to <font face="Monaco">DistributedCache&#8217;s</font> <font face="Monaco">addFileToClassPath</font>. It does make sense. But it doesn&#8217;t work.</p>
<p>In fact, a full <code>Path</code> in Hadoop looks like <font face="Monaco">hdfs://hadoop-master-host:9000/path/to/the/file</font>. But if you want to use this path with <font face="Monaco">DistributedCache</font>, you need to chop off everything except the path component itself, which is <font face="Monaco">/path/to/the/file</font> in this example. And of course, there&#8217;s no way to find this out other than by trying (in fact, I only figured it out because I had some hard-coded constants which did work, while the nice and clean code didn&#8217;t).</p>
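<p>To see what the <code>toUri().getPath()</code> chain keeps and what it drops, here is a minimal sketch that needs no Hadoop on the classpath. It assumes that <code>java.net.URI</code> behaves like the URI returned by Hadoop&#8217;s <code>Path.toUri()</code>; the class name <code>PathOnly</code>, the host name and the port are made up for illustration.</p>

```java
import java.net.URI;

// Illustration only: java.net.URI stands in for the result of Hadoop's Path.toUri(),
// and the host name and port are made-up values.
final class PathOnly {
    // Returns just the path component, dropping scheme, host and port.
    static String stripToPath(String fullyQualified) {
        return URI.create(fullyQualified).getPath();
    }

    public static void main(String[] args) {
        String full = "hdfs://hadoop-master-host:9000/path/to/the/file";
        System.out.println(stripToPath(full)); // prints /path/to/the/file
    }
}
```

<p>The string that comes out is what gets wrapped in a new <code>Path</code> before it is handed to <code>addFileToClassPath</code> in the snippet at the top of this post.</p>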
]]></content:encoded>
			<wfw:commentRss>http://sigizmund.com/hadoops-distributedfilesystem-vs-distributedcache-mystery/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hadoop article</title>
		<link>http://sigizmund.com/hadoop-article/</link>
		<comments>http://sigizmund.com/hadoop-article/#comments</comments>
		<pubDate>Tue, 10 Nov 2009 09:27:00 +0000</pubDate>
		<dc:creator>sigizmund</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[hadoop]]></category>

		<guid isPermaLink="false">http://sigizmund.wordpress.com/2009/11/10/hadoop-article</guid>
		<description><![CDATA[<p>Recently I wrote a Hadoop article in Russian for one of very popular Russian IT blogs. After giving this idea a second thought, I translated this article (or, rather, first part of this article as the second is still in progress) to English and uploaded it to my website (Posterous format isn&#8217;t very good for [...]]]></description>
			<content:encoded><![CDATA[<p>Recently I wrote a Hadoop article in Russian for one of the very popular Russian IT blogs. After giving the idea a second thought, I translated the article (or rather its first part, as the second is still in progress) into English and uploaded it to my website (the Posterous format isn&#8217;t very good for such long articles).
<p /> Check it out here: <a href="http://romankirillov.info/hadoop.html">http://romankirillov.info/hadoop.html</a>
<p /> (and don&#8217;t be mad at my clumsy English!)</p>
<p>&nbsp;</p>
<p>P. S. in case you can read Russian: <a href="http://sigizmund.habrahabr.ru/blog/74792/">http://sigizmund.habrahabr.ru/blog/74792/</a></p>
]]></content:encoded>
			<wfw:commentRss>http://sigizmund.com/hadoop-article/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>More Hadoop Mysteries &#8211; order of initialisation</title>
		<link>http://sigizmund.com/more-hadoop-mysteries-order-of-initialisation/</link>
		<comments>http://sigizmund.com/more-hadoop-mysteries-order-of-initialisation/#comments</comments>
		<pubDate>Fri, 06 Nov 2009 14:46:00 +0000</pubDate>
		<dc:creator>sigizmund</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[geek]]></category>
		<category><![CDATA[hadoop]]></category>

		<guid isPermaLink="false">http://sigizmund.wordpress.com/2009/11/06/more-hadoop-mysteries-order-of-initialisation</guid>
		<description><![CDATA[<p></p> <p style="margin-top:0;">Hey out there! Still not tired of my Hadoop experiments? Not yet? That&#8217;s another one for you!</p> <p>What&#8217;d you think the difference is between two snippets of code? Say, this:</p> SomeCodeWhichChangesConfig.initialise(getConf()); Job job = new Job(conf, "MyHadoopJob"); // ... setting the job details <p />if (!job.waitForCompletion(true)) { System.err.println("FAILED, cannot continue"); } <p>&#8230; and [...]]]></description>
			<content:encoded><![CDATA[<p><span style="font-family:Trebuchet MS, Helvetica, sans-serif;font-size:13px;"></p>
<p style="margin-top:0;">Hey out there! Still not tired of my Hadoop experiments? Not yet? Here&rsquo;s another one for you!</p>
<p>What do you think the difference is between these two snippets of code? Say, this:</p>
<pre>SomeCodeWhichChangesConfig.initialise(getConf())<span style="margin-top:0;color:#808030;">;</span>
Job job = new Job(conf, "MyHadoopJob")<span style="color:#808030;">;</span>
<span style="color:#696969;">// ... setting the job details</span>

if (!job.waitForCompletion(true)) <span style="color:#800080;">{</span>
    <span style="color:#bb7977;font-weight:bold;">System</span><span style="color:#808030;">.</span>err<span style="color:#808030;">.</span>println<span style="color:#808030;">(</span><span style="color:#0000e6;">"FAILED, cannot continue"</span><span style="color:#808030;">)</span><span style="color:#800080;">;</span>
<span style="color:#800080;">}</span></pre>
<p>&hellip; and this:</p>
<pre>Job job = new Job(conf, "MyHadoopJob")<span style="margin-top:0;color:#808030;">;</span>
<span style="color:#696969;">// ... setting the job details</span>

SomeCodeWhichChangesConfig.initialise(getConf())<span style="color:#808030;">;</span>
if (!job.waitForCompletion(true)) <span style="color:#800080;">{</span>
    <span style="color:#bb7977;font-weight:bold;">System</span><span style="color:#808030;">.</span>err<span style="color:#808030;">.</span>println<span style="color:#808030;">(</span><span style="color:#0000e6;">"FAILED, cannot continue"</span><span style="color:#808030;">)</span><span style="color:#800080;">;</span>
<span style="color:#800080;">}</span></pre>
<p>No difference, you say? Not quite right, sir: the difference is that&nbsp;<span style="margin-top:0;">whatever you do to&nbsp;<code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;margin-top:0;">conf</code></span><em>&nbsp;after creating a job</em>&nbsp;has no effect on that job. That is, the&nbsp;<code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;">Job</code>&nbsp;constructor apparently copies all the data and doesn&rsquo;t link&nbsp;<em>your</em>&nbsp;copy of the&nbsp;<code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;">Configuration</code>&nbsp;object with <em>its</em>&nbsp;copy. Brilliant, no?</p>
<p>(I spent a couple of hours trying to understand why the distributed cache worked properly in one app and didn&rsquo;t work at all in another.) So now you know. Be warned.</p>
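<p>The effect can be reproduced without Hadoop at all. The sketch below is only an illustration of copy-on-construction: <code>FakeJob</code> and <code>CopyDemo</code> are hypothetical names, <code>java.util.Properties</code> stands in for <code>Configuration</code>, and it assumes, as observed above, that the constructor takes a snapshot of whatever configuration it is given.</p>

```java
import java.util.Properties;

// Hypothetical stand-in for a Hadoop Job: it snapshots the settings in its constructor.
final class FakeJob {
    private final Properties settings = new Properties();

    FakeJob(Properties conf) {
        settings.putAll(conf); // defensive copy; later changes to conf are not seen
    }

    String get(String key) {
        return settings.getProperty(key);
    }

    // Changes made through the job's own copy do take effect.
    void set(String key, String value) {
        settings.setProperty(key, value);
    }
}

final class CopyDemo {
    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("set.before", "yes");

        FakeJob job = new FakeJob(conf);
        conf.setProperty("set.after", "yes"); // too late, the job already copied conf

        System.out.println(job.get("set.before")); // yes
        System.out.println(job.get("set.after"));  // null
    }
}
```

<p>With the real API, the job&rsquo;s own copy should be reachable through <code>job.getConfiguration()</code>, so anything that has to change after the job is created is better applied there.</p>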
<p></span></p>
]]></content:encoded>
			<wfw:commentRss>http://sigizmund.com/more-hadoop-mysteries-order-of-initialisation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>XML input and Hadoop – custom InputFormat</title>
		<link>http://sigizmund.com/xml-input-and-hadoop-%e2%80%93-custom-inputformat/</link>
		<comments>http://sigizmund.com/xml-input-and-hadoop-%e2%80%93-custom-inputformat/#comments</comments>
		<pubDate>Wed, 04 Nov 2009 17:18:00 +0000</pubDate>
		<dc:creator>sigizmund</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[geek]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://sigizmund.wordpress.com/2009/11/04/xml-input-and-hadoop-%e2%80%93-custom-inputformat</guid>
		<description><![CDATA[<p></p> <p style="margin-top:0;">Today I finally hit the task I was scared for so long &#8212; processing large XML files on Hadoop. I won&#8217;t tell you for how long I crawled the Internet trying to find some working solution&#8230; not that anyone wants to know? Eventually, I came out with the solution of my own &#8212; [...]]]></description>
			<content:encoded><![CDATA[<p><span style="font-family:Trebuchet MS, Helvetica, sans-serif;font-size:13px;"></p>
<p style="margin-top:0;">Today I finally hit the task I had been dreading for so long: processing large XML files on Hadoop. I won&rsquo;t tell you how long I crawled the Internet trying to find a working solution&hellip; not that anyone wants to know. Eventually, I came up with a solution of my own. I hate re-inventing the wheel, but in this particular case all the wheels I found were either square or utterly incompatible with my model of car.</p>
<p>To keep things simple, I won&rsquo;t include the full source code. I won&rsquo;t even include the whole InputFormat class. So, to make yourself comfortable, please do the following:</p>
<ol>
<li style="margin-top:0;">Open&nbsp;<code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;margin-top:0;">LineRecordReader</code>&nbsp;from&nbsp;<code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;">org.apache.hadoop.mapreduce.lib.input</code>&nbsp;so you can see it</li>
<li>Open&nbsp;<code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;margin-top:0;">TextInputFormat</code>&nbsp;from the same package.</li>
<li>Create the input format and record reader of your own, just by copying and pasting the code from aforementioned classes.</li>
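<li>Change the&nbsp;<code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;">createRecordReader()</code>&nbsp;method of your input format class so that it returns your newly defined record reader.</li>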
</ol>
<p>Now we&rsquo;re almost there. Here is the code for&nbsp;<code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;margin-top:0;">nextKeyValue()</code>, which turned out to be the most critical method here. Hold on tight:</p>
<pre style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;background-color:#f0f0f0;border-color:#cccbba;border-style:solid;border-width:1px;padding:10px 10px 10px 20px;">public boolean nextKeyValue() throws IOException
{
    StringBuilder sb = new StringBuilder();
    if (key == null)
    {
        key = new LongWritable();
    }
    key.set(pos);
    if (value == null)
    {
        value = new Text();
    }
    int newSize = 0;

    boolean xmlRecordStarted = false;
    Text tmpLine = new Text();

    while (pos &lt; end)
    {
        newSize = in.readLine(tmpLine,
            maxLineLength,
            Math.max((int) Math.min(Integer.MAX_VALUE, end - pos),
                maxLineLength));

        if (newSize == 0)
        {
            break;
        }

        // Count the line as consumed right away, so the closing line is included too
        pos += newSize;

        // A record starts at the line holding the opening tag
        if (tmpLine.toString().contains("&lt;document "))
        {
            xmlRecordStarted = true;
        }

        // Accumulate the lines of the current record, with newlines turned into spaces
        if (xmlRecordStarted)
        {
            sb.append(tmpLine.toString().replaceAll("\n", " "));
        }

        // The closing tag ends the record, which becomes the emitted value
        if (tmpLine.toString().contains("&lt;/document&gt;"))
        {
            xmlRecordStarted = false;
            this.value.set(sb.toString());
            break;
        }
    }

    if (newSize == 0)
    {
        key = null;
        value = null;
        return false;
    }
    else
    {
        return true;
    }
}</pre>
<p>WTF &mdash; you will say? It&rsquo;s the same code? Well &mdash; yes, and no. It&rsquo;s almost the same. Take a look at this line:</p>
<pre style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;background-color:#f0f0f0;border-color:#cccbba;border-style:solid;border-width:1px;padding:10px 10px 10px 20px;"><code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;margin-top:0;">if (tmpLine.toString().contains("&lt;document ")) </code></pre>
<p>and this line:</p>
<pre style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;background-color:#f0f0f0;border-color:#cccbba;border-style:solid;border-width:1px;padding:10px 10px 10px 20px;"><code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;margin-top:0;">if (tmpLine.toString().contains("&lt;/document&gt;")) </code></pre>
<p>This is where we actually split the document into chunks. The code is pretty much self-explanatory, so I won&rsquo;t add anything else.</p>
<p>Now, it&rsquo;s not the cleanest or most streamlined solution, and I will probably spend a while tomorrow making it more production-ready and good-looking, but compared to other solutions it has a few major benefits:</p>
<ol>
<li style="margin-top:0;">It uses very little custom code (you remember, we copied and pasted all the classes?). Unfortunately you cannot just inherit the class &mdash; some fields are private, and we clearly want to modify them.</li>
<li>It&rsquo;s configurable &mdash; you can easily change the&nbsp;<code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;margin-top:0;">&lt;document</code>&nbsp;and&nbsp;<code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;">&lt;/document&gt;</code>&nbsp;strings to anything else (and again, I will do it tomorrow, but now I feel too lazy).</li>
<li>It works.</li>
</ol>
<p>There are a few limitations to this approach. One of them is that if the input contains something like&nbsp;<code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;margin-top:0;">&lt;/document&gt;&lt;document&gt;</code>&nbsp;on a single line, it obviously won&rsquo;t work. Another is that you still need to parse the elements in your mapper (although you can easily change that by parsing records in your record reader into a&nbsp;<code style="font-size:12px;font-family:LuxiMono, Bitstream Vera Sans Mono, Monaco, Courier New, monospace;color:#1c360c;">Writable</code>-compatible class).</p>
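<p>The configurable-marker idea and the same-line limitation can both be tried in plain Java, with no Hadoop involved. This is only a sketch under assumptions: <code>MarkerSplitter</code> is a hypothetical class, it works on an array of lines that are already in memory instead of a Hadoop input split, and it joins lines with a single space.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: a plain-Java version of the line-based splitting idea, with the
// start and end markers passed in instead of hard-coded. Not Hadoop code.
final class MarkerSplitter {
    private final String startMarker;
    private final String endMarker;

    MarkerSplitter(String startMarker, String endMarker) {
        this.startMarker = startMarker;
        this.endMarker = endMarker;
    }

    // Collects every run of lines from a start-marker line to an end-marker line.
    List<String> split(String[] lines) {
        List<String> records = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        boolean started = false;
        for (String line : lines) {
            if (line.contains(startMarker)) {
                started = true;
            }
            if (started) {
                sb.append(line).append(' ');
            }
            if (line.contains(endMarker)) {
                if (started) {
                    records.add(sb.toString().trim());
                }
                sb.setLength(0);
                started = false;
            }
        }
        return records;
    }
}
```

<p>Constructed as <code>new MarkerSplitter("&lt;document ", "&lt;/document&gt;")</code>, it returns one record per well-formed block. If a closing tag and the next opening tag share a line, the first record swallows that whole line and the second record is lost, which is exactly the limitation described above.</p>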
<p />
<div>Have fun!</div>
<p />
<div><strong>Update: </strong>As you can see, I have added a space to the &#8220;&lt;document &#8221; string constant &ndash; today I realised that &#8220;&lt;documenttype&#8221; elements were also being treated as record starts, producing inconsistent results.</div>
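<p>A quick way to check the effect of that trailing space is shown below. It is a throwaway sketch: <code>MarkerCheck</code> and the sample line are made up, and it uses the same <code>String.contains</code> test as the record reader.</p>

```java
// Shows why the trailing space in the start marker matters.
final class MarkerCheck {
    static boolean startsRecord(String line, String marker) {
        return line.contains(marker);
    }

    public static void main(String[] args) {
        String line = "<documenttype>report</documenttype>";
        System.out.println(startsRecord(line, "<document"));  // true, a false positive
        System.out.println(startsRecord(line, "<document ")); // false
    }
}
```

<p>The flip side is that a bare <code>&lt;document&gt;</code> tag with no attributes no longer matches the marker either, so this constant only suits input where the opening tag always carries attributes.</p>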
<p></span></p>
]]></content:encoded>
			<wfw:commentRss>http://sigizmund.com/xml-input-and-hadoop-%e2%80%93-custom-inputformat/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Debugging Hadoop applications using your Eclipse</title>
		<link>http://sigizmund.com/debugging-hadoop-applications-using-your-eclipse/</link>
		<comments>http://sigizmund.com/debugging-hadoop-applications-using-your-eclipse/#comments</comments>
		<pubDate>Thu, 17 Sep 2009 15:30:00 +0000</pubDate>
		<dc:creator>sigizmund</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[geek]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[work]]></category>

		<guid isPermaLink="false">http://sigizmund.wordpress.com/2009/09/17/debugging-hadoop-applications-using-your-eclipse</guid>
		<description><![CDATA[<p>Well, it can be annoying &#8211; it can be awfully annoying, in fact, to debug Hadoop applications. But sometimes you need it, because logging doesn&#8217;t show anything, and you&#8217;ve tried anything but still cannot get under the Hadoop&#8217;s cover. In this case, do few simple steps.</p> <p /> 1. Download and unpack Hadoop to your [...]]]></description>
			<content:encoded><![CDATA[<p>Well, it can be annoying &#8211; awfully annoying, in fact &#8211; to debug Hadoop applications. But sometimes you need to, because logging doesn&#8217;t show anything and you&#8217;ve tried everything but still cannot get under Hadoop&#8217;s hood. In that case, follow a few simple steps.</p>
<p />
<div>1. Download and unpack Hadoop to your local machine.&nbsp;</div>
<div>2. Prepare a small set of data to run the test on.</div>
<div>3. Check that you actually can run Hadoop locally, something like this (don&#8217;t forget to set <span style="font-family:Monaco;font-size:small;"><span style="font-size:12px;">$HADOOP_CLASSPATH</span></span> first!):&nbsp;</div>
<p />
<div><span style="font-family:Monaco;font-size:small;"><span style="font-size:12px;"><span class="Apple-tab-span"> </span>bin/hadoop jar yourprogram.jar com.company.project.HadoopApp </span></span></div>
<div><span style="font-family:Monaco;font-size:small;"><span style="font-size:12px;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; tiny.txt ./out</span></span></div>
<p />
<div>4. Go to Hadoop&#8217;s directory and copy the file <span style="font-family:Monaco;font-size:small;"><span style="font-size:12px;">bin/hadoop</span></span> to <span style="font-family:Monaco;font-size:small;"><span style="font-size:12px;">bin/hdebug</span></span>.</div>
<div>5. Now we need to make Hadoop start in debug mode. To do that, add one line of text to the startup script:</div>
<p />
<div><a href='http://sigizmund.info/sigizmund.com/wp-content/uploads/2010/03/pastedgraphic-2.png'><img src="http://sigizmund.info/sigizmund.com/wp-content/uploads/2010/03/pastedgraphic-2.png?w=300" width="500"></a>
</div>
<p />
<div>Yes, that&#8217;s it. Copy it from here:</div>
<p />
<div>
<pre>-Xdebug -Xrunjdwp:transport=dt_socket,address=8001,server=y,suspend=y</pre>
</div>
<p />
<div>Basically, this tells Java to start in debug mode and wait for a remote debugger to connect over a socket on port 8001; execution is suspended at startup until the debugger connects.</p>
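<div>To make the individual settings easier to read, here is a small sketch that splits the <code>-Xrunjdwp</code> value into its key=value pairs. <code>JdwpOptions</code> is a hypothetical helper written for this post, not part of Hadoop or the JDK; it only parses the string and does not start a debugger.</div>

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: splits the value of -Xrunjdwp into key=value parts to show what is configured.
final class JdwpOptions {
    static Map<String, String> parse(String jvmFlag) {
        String prefix = "-Xrunjdwp:";
        if (!jvmFlag.startsWith(prefix)) {
            throw new IllegalArgumentException("not a -Xrunjdwp flag: " + jvmFlag);
        }
        // LinkedHashMap keeps the options in the order they were written
        Map<String, String> options = new LinkedHashMap<>();
        for (String pair : jvmFlag.substring(prefix.length()).split(",")) {
            String[] kv = pair.split("=", 2);
            options.put(kv[0], kv.length > 1 ? kv[1] : "");
        }
        return options;
    }

    public static void main(String[] args) {
        Map<String, String> o =
            parse("-Xrunjdwp:transport=dt_socket,address=8001,server=y,suspend=y");
        // prints {transport=dt_socket, address=8001, server=y, suspend=y}
        System.out.println(o);
    }
}
```

<div>Read one by one: <code>transport=dt_socket</code> selects a TCP socket, <code>address=8001</code> is the port, <code>server=y</code> makes the JVM listen instead of connecting out, and <code>suspend=y</code> holds execution until a debugger attaches. On newer JVMs the same options are normally passed as <code>-agentlib:jdwp=...</code>.</div>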
<p />
<div>Now, go and start your grid application like you did in step 3, but this time use the <span style="font-family:Monaco;font-size:small;"><span style="font-size:12px;">bin/hdebug</span></span> script we&#8217;ve created. If you&#8217;ve done everything correctly, the program should output something like this:</div>
<p />
<div><span style="font-family:Monaco;font-size:small;"><span style="font-size:12px;">Listening for transport dt_socket at address: 8001</span></span></div>
<p />
<div>and wait for the debugger. So, let&#8217;s get it a debugger then! Fire up Eclipse with your project (you likely have it open already, since you&#8217;re trying to debug something) and add a new Debug configuration:</div>
<p />
<div><a href='http://sigizmund.files.wordpress.com/2009/09/pastedgraphic-4.png'><img src="http://sigizmund.files.wordpress.com/2009/09/pastedgraphic-4.png?w=300" width="500"></a>
</div>
<p />
<div>After you&#8217;ve set everything up, click &#8220;Apply&#8221; and close the window for now &ndash; you will probably want to set some breakpoints before starting the actual debugging. Go and do that, then simply choose the debug configuration you created &#8211; and off you go! If everything worked properly, you should soon get a standard debugger window, with all the nice things Java can offer you. Hope it&#8217;ll help some of us in our difficult business of writing distributed grid-enabled applications! :)</div>
<p />
</div>
]]></content:encoded>
			<wfw:commentRss>http://sigizmund.com/debugging-hadoop-applications-using-your-eclipse/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

