<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: XML input and Hadoop – custom InputFormat</title>
	<atom:link href="http://sigizmund.com/xml-input-and-hadoop-%e2%80%93-custom-inputformat/feed/" rel="self" type="application/rss+xml" />
	<link>http://sigizmund.com/xml-input-and-hadoop-%e2%80%93-custom-inputformat/</link>
	<description>Geeky fairytales</description>
	<lastBuildDate>Wed, 28 Mar 2012 14:57:35 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>By: Paul Ingles</title>
		<link>http://sigizmund.com/xml-input-and-hadoop-%e2%80%93-custom-inputformat/comment-page-1/#comment-28</link>
		<dc:creator>Paul Ingles</dc:creator>
		<pubDate>Wed, 20 Jan 2010 02:36:52 +0000</pubDate>
		<guid isPermaLink="false">http://sigizmund.wordpress.com/2009/11/04/xml-input-and-hadoop-%e2%80%93-custom-inputformat#comment-28</guid>
		<description>Hi, just found your post. I&#039;ve been trying to do the same thing. In the end, I went with the XmlInputFormat from Mahout&#039;s Bayesian Classifier. It seems to do everything I need it to (and works without going screwy like the streaming one). I posted about it here: http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html</description>
		<content:encoded><![CDATA[<p>Hi, just found your post. I&#8217;ve been trying to do the same thing. In the end, I went with the XmlInputFormat from Mahout&#8217;s Bayesian Classifier. It seems to do everything I need it to (and works without going screwy like the streaming one). I posted about it here: <a href="http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html" rel="nofollow">http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Roman Kirillov</title>
		<link>http://sigizmund.com/xml-input-and-hadoop-%e2%80%93-custom-inputformat/comment-page-1/#comment-27</link>
		<dc:creator>Roman Kirillov</dc:creator>
		<pubDate>Fri, 06 Nov 2009 11:58:21 +0000</pubDate>
		<guid isPermaLink="false">http://sigizmund.wordpress.com/2009/11/04/xml-input-and-hadoop-%e2%80%93-custom-inputformat#comment-27</guid>
		<description>Actually, I fixed it - there&#039;s no newline replacing now (mainly because it serves no purpose and we can still handle it with no problems at all). Also, for TPWMNBN we can instead read content straight into objects (by changing the RecordReader to return some Object rather than Text). Once you figure all this stuff out, it&#039;s quite simple!</description>
		<content:encoded><![CDATA[<p>Actually, I fixed it &#8211; there&#8217;s no newline replacing now (mainly because it serves no purpose and we can still handle it with no problems at all). Also, for TPWMNBN we can instead read content straight into objects (by changing the RecordReader to return some Object rather than Text). Once you figure all this stuff out, it&#8217;s quite simple!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Holger Dürer</title>
		<link>http://sigizmund.com/xml-input-and-hadoop-%e2%80%93-custom-inputformat/comment-page-1/#comment-26</link>
		<dc:creator>Holger Dürer</dc:creator>
		<pubDate>Fri, 06 Nov 2009 11:53:23 +0000</pubDate>
		<guid isPermaLink="false">http://sigizmund.wordpress.com/2009/11/04/xml-input-and-hadoop-%e2%80%93-custom-inputformat#comment-26</guid>
		<description>For TPWMNBN you also need to sort out the newline thing. You cannot just replace newlines with spaces -- that can change the meaning of the contents.</description>
		<content:encoded><![CDATA[<p>For TPWMNBN you also need to sort out the newline thing. You cannot just replace newlines with spaces &#8212; that can change the meaning of the contents.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Roman Kirillov</title>
		<link>http://sigizmund.com/xml-input-and-hadoop-%e2%80%93-custom-inputformat/comment-page-1/#comment-25</link>
		<dc:creator>Roman Kirillov</dc:creator>
		<pubDate>Fri, 06 Nov 2009 11:39:56 +0000</pubDate>
		<guid isPermaLink="false">http://sigizmund.wordpress.com/2009/11/04/xml-input-and-hadoop-%e2%80%93-custom-inputformat#comment-25</guid>
		<description>I think I&#039;ll give it a try - for now I&#039;m not really that concerned about it, but in future (i.e. when we deal with you-know-which-product) we&#039;ll probably need to sort it out properly.</description>
		<content:encoded><![CDATA[<p>I think I&#8217;ll give it a try &#8211; for now I&#8217;m not really that concerned about it, but in future (i.e. when we deal with you-know-which-product) we&#8217;ll probably need to sort it out properly.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Holger Dürer</title>
		<link>http://sigizmund.com/xml-input-and-hadoop-%e2%80%93-custom-inputformat/comment-page-1/#comment-24</link>
		<dc:creator>Holger Dürer</dc:creator>
		<pubDate>Fri, 06 Nov 2009 11:33:29 +0000</pubDate>
		<guid isPermaLink="false">http://sigizmund.wordpress.com/2009/11/04/xml-input-and-hadoop-%e2%80%93-custom-inputformat#comment-24</guid>
		<description>I am still convinced you need to change the &quot;while (pos &lt; end)&quot; to keep looking past the split&#039;s end if you are in the middle of parsing a document -- otherwise the last document in the split will be truncated. I.e. something like &quot;while (pos &lt; end &#124;&#124; xmlRecordStarted)&quot;.</description>
		<content:encoded><![CDATA[<p>I am still convinced you need to change the &#8220;while (pos &lt; end)&#8221; to keep looking past the split&#8217;s end if you are in the middle of parsing a document &#8212; otherwise the last document in the split will be truncated. I.e. something like &#8220;while (pos &lt; end || xmlRecordStarted)&#8221;.</p>
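<p>A minimal sketch of that idea, assuming records are delimited by plain &lt;document&gt; and &lt;/document&gt; tags with no attributes or nesting. It scans an in-memory byte array rather than going through Hadoop&#8217;s LineRecordReader, and the names SplitRecordSketch, readSplit and indexOf are made up for illustration. A record belongs to the split in which its opening tag starts, and it is read up to its closing tag even when that lies past end.</p>

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not the RecordReader from the post.
final class SplitRecordSketch {
    // Assumption: records are delimited by exactly these tags, never nested.
    static final byte[] OPEN = "<document>".getBytes(StandardCharsets.UTF_8);
    static final byte[] CLOSE = "</document>".getBytes(StandardCharsets.UTF_8);

    // Index of the first occurrence of pattern in data at or after from, or -1.
    static int indexOf(byte[] data, byte[] pattern, int from) {
        for (int i = Math.max(from, 0); i + pattern.length <= data.length; i++) {
            int j = 0;
            while (j < pattern.length && data[i + j] == pattern[j]) {
                j++;
            }
            if (j == pattern.length) {
                return i;
            }
        }
        return -1;
    }

    // Returns every record whose opening tag starts inside [start, end).
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        while (pos < end) {
            // Skips the tail of a record that began in an earlier split.
            int open = indexOf(data, OPEN, pos);
            if (open < 0 || open >= end) {
                break; // the next record, if any, belongs to a later split
            }
            int close = indexOf(data, CLOSE, open + OPEN.length);
            if (close < 0) {
                break; // unterminated record: nothing complete left to emit
            }
            // The record started before end, so finish it even past end.
            int recordEnd = close + CLOSE.length;
            records.add(new String(data, open, recordEnd - open, StandardCharsets.UTF_8));
            pos = recordEnd;
        }
        return records;
    }
}
```

<p>Calling readSplit for consecutive byte ranges of the same data should return every document exactly once: the split where a record starts reads it to completion, and the following split skips ahead to the next opening tag at or after its own start offset.</p>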
]]></content:encoded>
	</item>
	<item>
		<title>By: Roman Kirillov</title>
		<link>http://sigizmund.com/xml-input-and-hadoop-%e2%80%93-custom-inputformat/comment-page-1/#comment-23</link>
		<dc:creator>Roman Kirillov</dc:creator>
		<pubDate>Wed, 04 Nov 2009 21:08:20 +0000</pubDate>
		<guid isPermaLink="false">http://sigizmund.wordpress.com/2009/11/04/xml-input-and-hadoop-%e2%80%93-custom-inputformat#comment-23</guid>
		<description>pos and end are members of LineRecordReader. I don&#039;t have the source handy, but I think they declare end = Long.MAX_VALUE; (or expect the user to specify the end offset, but that isn&#039;t the case here).</description>
		<content:encoded><![CDATA[<p>pos and end are members of LineRecordReader. I don&#8217;t have the source handy, but I think they declare end = Long.MAX_VALUE; (or expect the user to specify the end offset, but that isn&#8217;t the case here).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Roman Kirillov</title>
		<link>http://sigizmund.com/xml-input-and-hadoop-%e2%80%93-custom-inputformat/comment-page-1/#comment-22</link>
		<dc:creator>Roman Kirillov</dc:creator>
		<pubDate>Wed, 04 Nov 2009 21:04:15 +0000</pubDate>
		<guid isPermaLink="false">http://sigizmund.wordpress.com/2009/11/04/xml-input-and-hadoop-%e2%80%93-custom-inputformat#comment-22</guid>
		<description>Pretty much so. For most standard Hadoop jobs you have a flat text file where every record is a line. In this example we deal with an XML file where every record is represented by one &lt;document&gt; element and all its content.</description>
		<content:encoded><![CDATA[<p>Pretty much so. For most standard Hadoop jobs you have a flat text file where every record is a line. In this example we deal with an XML file where every record is represented by one &lt;document&gt; element and all its content.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Holger Dürer</title>
		<link>http://sigizmund.com/xml-input-and-hadoop-%e2%80%93-custom-inputformat/comment-page-1/#comment-21</link>
		<dc:creator>Holger Dürer</dc:creator>
		<pubDate>Wed, 04 Nov 2009 21:03:15 +0000</pubDate>
		<guid isPermaLink="false">http://sigizmund.wordpress.com/2009/11/04/xml-input-and-hadoop-%e2%80%93-custom-inputformat#comment-21</guid>
		<description>Also: what are pos and end?  The approximate byte range from which to read the records, right? A record will usually not fall on these boundaries.  That&#039;s why you skip lines at the beginning, right?  But doesn&#039;t that mean you should finish reading a record past &#039;end&#039;? Otherwise your last record might terminate prematurely.... </description>
		<content:encoded><![CDATA[<p>Also: what are pos and end?  The approximate byte range from which to read the records, right? A record will usually not fall on these boundaries.  That&#039;s why you skip lines at the beginning, right?  But doesn&#039;t that mean you should finish reading a record past &#039;end&#039;? Otherwise your last record might terminate prematurely&#8230;.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

