<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>pushin&#039; and poppin&#039; your eax &#187; linux</title>
	<atom:link href="http://blog.mahmoudimus.com/category/linux/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.mahmoudimus.com</link>
	<description>a hacker&#039;s moleskine</description>
	<lastBuildDate>Sat, 24 Dec 2011 00:13:51 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Reading and Writing Null-Terminated CSV Files in Python</title>
		<link>http://blog.mahmoudimus.com/2010/09/reading-and-writing-null-terminated-csv-files-in-python/</link>
		<comments>http://blog.mahmoudimus.com/2010/09/reading-and-writing-null-terminated-csv-files-in-python/#comments</comments>
		<pubDate>Mon, 13 Sep 2010 01:42:03 +0000</pubDate>
		<dc:creator>Mahmoud</dc:creator>
				<category><![CDATA[linux]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://blog.mahmoudimus.com/?p=132</guid>
		<description><![CDATA[I&#8217;ve recently had to do some work that required sorting a very large CSV file, containing fields with embedded newlines, quickly. As it turns out, Linux comes with a sort implementation that has a &#8220;--zero-terminated&#8221; option, which sorts on null-terminated delimited strings instead of the default newline separator. Writing null-terminated CSV files Since I was [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>I&#8217;ve recently had to quickly sort a very large CSV file whose fields contain embedded newlines. As it turns out, the GNU sort that ships with Linux has a <a href="http://linux.die.net/man/1/sort" target="_blank"><code>--zero-terminated</code></a> option, which treats a null byte, rather than a newline, as the end of each record.</p>
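<p>To make the idea concrete, here is a minimal Python 3 sketch of what sorting on null-terminated records means. It is only an illustration, not how <code>sort</code> is implemented: the function name <code>sort_zero_terminated</code> and the sample data are made up for this example, and it compares raw bytes, which roughly corresponds to running sort in the C locale.</p>

```python
def sort_zero_terminated(data):
    """Sort NUL-terminated records bytewise and re-terminate each with NUL."""
    records = data.split(b"\0")
    # a trailing NUL leaves one empty element at the end; drop it
    if records and records[-1] == b"":
        records.pop()
    return b"".join(record + b"\0" for record in sorted(records))

# two CSV records; the first one has a newline embedded in a quoted field
raw = b'2,"multi\nline"\x001,plain\x00'
result = sort_zero_terminated(raw)
print(result)
```

<p>Because the records are split on the null byte, the newline embedded in the first record stays inside that record instead of breaking it in two.</p>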
<p><strong><span style="text-decoration: underline;">Writing null-terminated CSV files</span></strong></p>
<p>Since I was writing the process that generates these CSV files, I figured I could just use Python&#8217;s <a href="http://docs.python.org/library/csv.html" target="_blank">CSV module</a>, which supports different dialects. By subclassing the built-in excel dialect, which is itself a <a href="http://docs.python.org/library/csv.html#csv.Dialect" target="_blank">csv.Dialect</a>, we can write a simple dialect that terminates every row with a null byte.</p>
<pre class="brush: python; gutter: false; title: ;">
import csv

class null_terminated(csv.excel):
    # same as the excel dialect, except that rows end with a null byte
    lineterminator = '\0'

csv.register_dialect(&quot;null-terminated&quot;, null_terminated)
</pre>
<p>Essentially, we&#8217;ve registered a global csv dialect called <code>"null-terminated"</code> that inherits from the <a href="http://docs.python.org/library/csv.html#csv.excel" target="_blank">excel</a> dialect, which has sensible standard defaults.</p>
<p>Here&#8217;s a simple snippet that shows the usage of the new <code>"null-terminated"</code> dialect that I created above.</p>
<pre class="brush: python; gutter: false; title: ;">
from csv import DictWriter

with open(&quot;/tmp/file.csv&quot;, &quot;w&quot;) as f:
    dwriter = DictWriter(f, fieldnames=[&quot;id&quot;, &quot;field&quot;], dialect=&quot;null-terminated&quot;)

    # no header row is written, only the four data rows
    for i, field in enumerate((&quot;foo&quot;, &quot;bar&quot;, &quot;baz&quot;, &quot;bif&quot;)):
        dwriter.writerow({&quot;id&quot;: i, &quot;field&quot;: field})
</pre>
<p>Now, <em>/tmp/file.csv</em> will contain four rows, each terminated by a null byte. As you can see, it&#8217;s pretty easy to write a null-terminated CSV file. Unfortunately, it&#8217;s a bit trickier to <em>read</em> one, due to some inflexible hard-coded defaults.</p>
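<p>A quick way to check what the dialect produces, without touching the disk, is to write into an in-memory buffer. This is a sketch assuming Python 3, where the terminator has to be a plain string; the dialect name <code>null-terminated-demo</code> and the buffer are only there for this example.</p>

```python
import csv
import io

class null_terminated(csv.excel):
    # Python 3 expects a str here, so the terminator is written as "\0"
    lineterminator = "\0"

# hypothetical name, chosen so it does not clash with the dialect registered earlier
csv.register_dialect("null-terminated-demo", null_terminated)

buf = io.StringIO(newline="")
writer = csv.DictWriter(buf, fieldnames=["id", "field"], dialect="null-terminated-demo")
for i, field in enumerate(("foo", "bar")):
    writer.writerow({"id": i, "field": field})

output = buf.getvalue()
print(repr(output))
```

<p>Each row ends with a null byte and no newline is written at all.</p>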
<p><strong><span style="text-decoration: underline;">Reading null-terminated CSV files</span></strong></p>
<p>Unintuitively, the CSV module&#8217;s reader ignores <a href="http://docs.python.org/library/csv.html#csv.Dialect.lineterminator" target="_blank">Dialect.lineterminator</a>: it is hard-coded to recognize <code>'\r'</code> or <code>'\n'</code> as the end of a line, which unfortunately means we need to handle null termination and implement the reading ourselves.</p>
<p>There are many ways to read null-terminated strings, but I figured the simplest algorithm is to read character by character, appending everything to a string until we reach a null byte, and then return that string. An implementation might go something like this:</p>
<pre class="brush: python; gutter: false; title: ;">
nullbyte = '\0'

def read(fobj):
    current_string = &quot;&quot;
    while True:
        char = fobj.read(1)
        if char and char != nullbyte:
            current_string += char
        elif char == nullbyte:
            # end of a record: hand it back and start a new one
            yield current_string
            current_string = &quot;&quot;
        elif not char:
            # end of file: flush any unterminated trailing record
            if current_string:
                yield current_string
            return
</pre>
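<p>Reading one character per call is simple but slow on very large files. As a possible alternative (not the approach used in the rest of this post), here is a Python 3 sketch that reads in blocks and splits each block on the null byte. The name <code>iter_null_terminated</code> and the <code>chunk_size</code> parameter are assumptions made for this example.</p>

```python
import io

def iter_null_terminated(fobj, chunk_size=4096):
    """Yield NUL-terminated records from a text-mode file object."""
    pending = ""
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        # everything before the last NUL is complete; the tail may be partial
        *records, pending = pending.split("\0")
        for record in records:
            yield record
    # flush an unterminated trailing record, if any
    if pending:
        yield pending

# a tiny chunk size forces records to span several reads
sample = io.StringIO("a\0b\nc\0d", newline="")
records = list(iter_null_terminated(sample, chunk_size=4))
print(records)
```

<p>The records come out the same as with the character-by-character version, including the one that contains a newline.</p>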
<p>That looks good, but how can we integrate it with the CSV module? Ideally it would plug straight into the existing module. A simple solution is to wrap the function above in a reader class that parses each record, like so:</p>
<pre class="brush: python; gutter: false; title: ;">
# we use StringIO since cStringIO has poor unicode support
from StringIO import StringIO
from csv import reader

null_byte = '\0'

class NullTerminatedDelimiterReader(object):
    &quot;&quot;&quot;
    A CSV reader which will iterate over lines in the CSV file 'f',
    which are line terminated by a null byte

    &quot;&quot;&quot;

    def __init__(self, f,  dialect, *args, **kwds):
        # satisfying DictReader instance
        self._line_num = 0
        self.fobj = f
        self.dialect = dialect
        self.reader = self._read()
        self.string_io = StringIO()

    def _properly_parse_row(self, current_string):
        self.string_io.write(current_string)
        # seek to the first byte
        self.string_io.seek(0)
        # we instantiate a reader here to properly parse the row
        # taking into account escaping, and various edge cases
        return next(reader(self.string_io, dialect=self.dialect))

    def _read(self):
        current_string = &quot;&quot;
        while True:
            char = self.fobj.read(1)  # read one byte
            if char and char != null_byte:
                # keep appending to the current string
                current_string += char
            elif char == null_byte:
                yield self._properly_parse_row(current_string)
                # increment instrumentation
                self._line_num += 1
                # clear internal reading buffer
                self.string_io.seek(0)
                self.string_io.truncate()
                # clear row
                current_string = &quot;&quot;
            elif not char:
                # end of file: flush any unterminated trailing row
                if current_string:
                    yield self._properly_parse_row(current_string)
                return

    @property
    def line_num(self):
        return self._line_num

    def next(self):
        return next(self.reader)

    def __iter__(self):
        return self
</pre>
<p>To get rows back as dictionaries, we&#8217;ll inherit from the <a href="http://svn.python.org/projects/python/trunk/Lib/csv.py" target="_blank">DictReader</a> class and replace its internal reader object with our own. It&#8217;s the cleanest and simplest way of doing it.</p>
<pre class="brush: python; gutter: false; title: ;">
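import csv

class NullByteDictReader(csv.DictReader):
    def __init__(self, f, *args, **kwds):
        csv.DictReader.__init__(self, f, *args, **kwds)
        # swap in our own reader (pass dialect by keyword)
        self.reader = NullTerminatedDelimiterReader(f, *args, **kwds)

# the file written earlier has no header row, so name the fields explicitly
with open(&quot;/tmp/file.csv&quot;, &quot;r&quot;) as f:
    rows = NullByteDictReader(f, fieldnames=[&quot;id&quot;, &quot;field&quot;], dialect=&quot;null-terminated&quot;)
    for line in rows:
        print line[&quot;id&quot;], line[&quot;field&quot;]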
</pre>
<p>Voilà :)</p>
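<p>The classes above target Python 2. For readers on Python 3, the same idea can be sketched roughly as follows: split the input on null bytes, then let the CSV module parse each record on its own so that quoting and embedded newlines are still handled. The names <code>iter_records</code> and <code>null_terminated_dict_rows</code> are hypothetical, and the sample data is made up for the example.</p>

```python
import csv
import io

def iter_records(fobj, chunk_size=4096):
    """Yield NUL-terminated records from a text-mode file object."""
    pending = ""
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        *records, pending = pending.split("\0")
        for record in records:
            yield record
    if pending:
        yield pending

def null_terminated_dict_rows(fobj, fieldnames, dialect="excel"):
    """Parse each NUL-terminated record as one CSV row and yield it as a dict."""
    for record in iter_records(fobj):
        if not record:
            continue  # skip empty records
        # a fresh reader per record handles quoting and embedded newlines
        values = next(csv.reader(io.StringIO(record, newline=""), dialect=dialect))
        yield dict(zip(fieldnames, values))

sample = io.StringIO('0,"foo\nbar"\x001,baz\x00', newline="")
rows = list(null_terminated_dict_rows(sample, ["id", "field"]))
for row in rows:
    print(row["id"], repr(row["field"]))
```

<p>As with the Python 2 version, the field names are passed in explicitly because the sample data has no header row.</p>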
<p><strong>Conclusions and Future Work</strong></p>
<p>Something that might be interesting to pursue further is writing, or wrapping a Python interface around, a <a href="http://www.kilabit.org/" target="_blank">C library</a> as a substitute for the current CSV module. It should support different line terminators, multi-byte delimiters, and unicode detection out of the box, which happen to be my three main gripes with the CSV module.</p>
<p>You can find the working source code for this <a href="http://gist.github.com/576675" target="_blank">here</a>, and you can follow me on Twitter <a href="http://twitter.com/mahmoudimus">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mahmoudimus.com/2010/09/reading-and-writing-null-terminated-csv-files-in-python/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Verifying Python64 builds</title>
		<link>http://blog.mahmoudimus.com/2009/07/verifying-python64-builds/</link>
		<comments>http://blog.mahmoudimus.com/2009/07/verifying-python64-builds/#comments</comments>
		<pubDate>Mon, 06 Jul 2009 15:24:01 +0000</pubDate>
		<dc:creator>mahmoud</dc:creator>
				<category><![CDATA[linux]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[32bit]]></category>
		<category><![CDATA[64bit]]></category>
		<category><![CDATA[verification]]></category>

		<guid isPermaLink="false">http://blog.mahmoudimus.com/?p=10</guid>
		<description><![CDATA[At work, I&#8217;m migrating over python to our 64bit machines and one thing that I&#8217;ve noticed was that there really was no standard python 64bit verification method to ensure the build was really 64bit or not. I&#8217;ve read somewhere previously, especially for the Mac OS X crowd, that the LDFLAGS="-arch x86_64" flag had to be [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>At work, I&#8217;m migrating <a title="Python" href="http://www.python.org/" target="_blank">Python</a> over to our 64-bit machines, and one thing I noticed is that there is no standard way to verify that a Python build is really 64-bit. I&#8217;ve read <a title="somewhere" href="http://www.corepy.org/wiki/index.php?title=How_To_Build_a_64-bit_Python_and_use_Corepy/x86_64_on_OSX" target="_blank">somewhere</a> previously, especially for the Mac OS X crowd, that <code>LDFLAGS="-arch x86_64"</code> had to be set before building on a 64-bit machine.</p>
<p>It looks like Python 2.6 changed what is required to build 64-bit binaries. On a standard Linux x86_64 machine, the usual installation steps worked for me:</p>
<pre class="brush: bash; title: ;">
./configure
make &amp;&amp; make test
make install
</pre>
<p>Surprisingly, I hit a segmentation fault during both the build and the tests. I&#8217;ve never seen this before, but for those of you who are interested, the error message was:</p>
<pre class="brush: bash; title: ;">
Parser/pgen ./Grammar/Grammar ./Include/graminit.h ./Python/graminit.c
make: *** [Include/graminit.h] Segmentation fault
Parser/pgen ./Grammar/Grammar ./Include/graminit.h ./Python/graminit.c
make: *** [Python/graminit.c] Segmentation fault
</pre>
<p>The verification step is actually pretty simple. An easy way to check that the interpreter is a 64-bit build is to look at <code>sys.maxint</code>, the largest value a plain Python int can hold. Luckily for us, Python makes this very easy to check.</p>
<p>For comparison, on a regular 32-bit Python installation I ran:</p>
<pre class="brush: python; title: ;">
h[1] &gt;&gt;&gt; import sys
h[1] &gt;&gt;&gt; sys.maxint
2147483647
</pre>
<p>On the 64-bit machine, I ran:</p>
<pre class="brush: python; title: ;">
h[2] &gt;&gt;&gt; import sys
h[2] &gt;&gt;&gt; sys.maxint
9223372036854775807
</pre>
<p>Clearly, my 64-bit installation worked :)</p>
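<p>A small caveat: <code>sys.maxint</code> follows the size of a C long, so it stays at 2147483647 on 64-bit Windows builds, and it no longer exists in Python 3. As a sketch of a check that should work on Python 2.6+ and Python 3 alike, the pointer size and <code>sys.maxsize</code> can be used instead. The helper name <code>interpreter_bits</code> is made up for this example.</p>

```python
import struct
import sys

def interpreter_bits():
    """Return the pointer width of the running interpreter, in bits."""
    # "P" is the struct format code for a native pointer (void *)
    return struct.calcsize("P") * 8

# sys.maxsize exceeds 2**32 only on 64-bit builds
is_64bit = sys.maxsize > 2**32

print(interpreter_bits(), is_64bit)
```

<p>Both values describe the interpreter binary rather than the hardware, so a 32-bit Python running on a 64-bit machine still reports 32.</p>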
<p>Hope this helps some of you.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mahmoudimus.com/2009/07/verifying-python64-builds/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>python -c &#039;print &quot;hello world!&quot;&#039;</title>
		<link>http://blog.mahmoudimus.com/2009/07/hello-world/</link>
		<comments>http://blog.mahmoudimus.com/2009/07/hello-world/#comments</comments>
		<pubDate>Sat, 04 Jul 2009 20:19:19 +0000</pubDate>
		<dc:creator>mahmoud</dc:creator>
				<category><![CDATA[algorithms]]></category>
		<category><![CDATA[engineering]]></category>
		<category><![CDATA[general]]></category>
		<category><![CDATA[ide]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[math]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[ubuntu]]></category>
		<category><![CDATA[vim]]></category>
		<category><![CDATA[canonical]]></category>
		<category><![CDATA[hello]]></category>
		<category><![CDATA[jVI]]></category>
		<category><![CDATA[mathematics]]></category>
		<category><![CDATA[musings]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[netbeans]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[vocabulary]]></category>
		<category><![CDATA[world]]></category>
		<category><![CDATA[writing]]></category>

		<guid isPermaLink="false">http://mahmoudimus.com/blog/?p=1</guid>
		<description><![CDATA[And so, we meet again, world. I&#8217;ve finally gotten around to registering a home online, installing WordPress, and ready to share my ideas with the world. I&#8217;ve given a lot of topics some thought, and I think I might be able to influence and/or help others with my various migrations. First, I&#8217;d like to thank [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>And so, we meet again, world. I&#8217;ve finally gotten around to registering a home online, installing WordPress, and getting ready to share my ideas with the world. I&#8217;ve given a lot of topics some thought, and I think I might be able to influence and/or help others with my various migrations.</p>
<p>First, I&#8217;d like to thank Canonical for Ubuntu, which made my migration from a Windows desktop to a Linux one possible. There were some kinks here and there, but overall I think it was pretty flawless. I&#8217;ve also started a large push towards using vim as my primary editor, instead of constantly switching between IDEs. I believe NetBeans 6.7 has come out. I haven&#8217;t had the opportunity to play around with it, but if it is anything like NetBeans 6.5, then hey, that&#8217;s +1 for them! What a great IDE, especially with the fantastic jVI extension.</p>
<p>I&#8217;ll post here and there with some musings about Python (what a language), some software releases, mathematical musings, natural language processing tidbits (including really cool algorithms to generate domain names), and various interesting ideas that I&#8217;ve had some time to play around with.</p>
<p>My primary reasons for starting this blog are to share some of my thoughts with the world, improve my writing, and try to contribute to the open source world. I think there&#8217;s a lot of work to do and I can NOT wait to start. Well, world, I hope to speak to you soon.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mahmoudimus.com/2009/07/hello-world/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

