Reading and Writing Null-Terminated CSV Files in Python

by Mahmoud on September 12, 2010

I’ve recently had to do some work that required sorting a very large CSV file, containing fields with embedded newlines, quickly. As it turns out, Linux comes with a sort implementation that has a “–zero-terminated” option, which sorts on null-terminated delimited strings instead of the default newline separator.

Writing null-terminated CSV files

Since I was writing a process to generate these CSV files, I figured I can just use Python’s CSV module, which has support for different types of dialects. Inheriting from csv.Dialect, we can write a simple dialect that will allow us to terminate all lines with a null byte.

import csv
import struct

class null_terminated(csv.excel):
    lineterminator = struct.pack('B', 0)

csv.register_dialect("null-terminated", null_terminated)

Essentially, we’ve registered a global csv dialect called "null-terminated" that inherits from the excel dialect, which has sensible standard defaults.

Here’s a simple snippet that shows the usage of the new "null-terminated" dialect that I created above.

from csv import DictWriter

with open("/tmp/file.csv", "w") as f:
	dwriter = DictWriter(f, fieldnames=["id","field"], dialect="null-terminated")

	for i, field in enumerate(("foo", "bar", "baz", "bif")):
    		dwriter.writerow({"id": i, "field": field})

Now, /tmp/file.csv will contain a file with four rows that are separated by a null-terminator. As you can see, it’s pretty easy to write a null-terminated CSV file, but unfortunately, it’s a bit tricky to read a null-terminated csv file due to some inflexible hardcoded defaults.

Reading null-terminated CSV files

The CSV module’s unintuitive restriction for Dialect.lineterminator is hard-coded to recognize '\r' or '\n' as the end of line terminator, which unfortunately, means we will need to handle null-termination and implement reading ourselves.

There are many ways of writing a procedure to read null-terminated strings, but I figured the simplest algorithm is to read character-by-character, concatenating everything into a string until we reach a null byte, then we can just return the string. I’d figure an implementation might go something like this:

def read(fobj):
    current_string = ""
    while True:
        char = fobj.read(1)
        if char and char != nullbyte:
            current_string += char
        elif char == nullbyte:
            yield current_string
            current_string = ""
        elif not char:
            if current_string:
                yield current_string
            raise StopIteration

Looks awesome, but, how can we integrate this into the CSV module? We would want to just plug and play with the existing CSV module. A simple solution is to wrap the function above to iterate over each line, like so:

 # we use StringIO since cStringIO has poor unicode support
from StringIO import StringIO
from csv import reader

class NullTerminatedDelimiterReader(object):
    """
    A CSV reader which will iterate over lines in the CSV file 'f',
    which are line terminated by a null byte

    """

    def __init__(self, f,  dialect, *args, **kwds):
        # satisfying DictReader instance
        self._line_num = 0
        self.fobj = f
        self.dialect = dialect
        self.reader = self._read()
        self.string_io = StringIO()

    def _properly_parse_row(self, current_string):
        self.string_io.write(current_string)
        # seek to the first byte
        self.string_io.seek(0)
        # we instantiate a reader here to properly parse the row
        # taking into account escaping, and various edge cases
        return next(reader(self.string_io, dialect=self.dialect))

    def _read(self):
        current_string = ""
        while True:
            char = self.fobj.read(1)  # read one byte
            if char and char != null_byte:
                # keep appending to the current string
                current_string += char
            elif char == null_byte:
                yield self._properly_parse_row(current_string)
                # increment instrumentation
                self._line_num += 1
                # clear internal reading buffer
                self.string_io.seek(0)
                self.string_io.truncate()
                # clear row
                current_string = ""
            elif not char:
                if current_string:
                    yield self._properly_parse_row(current_string)
                raise StopIteration

    @property
    def line_num(self):
        return self._line_num

    def next(self):
        return next(self.reader)

    def __iter__(self):
        return self

To use the DictReader class, we’ll inherit from the DictReader class and override the reader object. It’s the cleanest and simplest way of doing it.

class NullByteDictReader(csv.DictReader):
    def __init__(self, f, *args, **kwds):
        csv.DictReader.__init__(self, f, *args, **kwds)
        self.reader = NullTerminatedDelimiterReader(f, *args, **kwds)

with open("/tmp/file.csv", "r") as f:
    for line in NullByteDictReader(f, dialect="null-terminated"):
        print line["id"], line["field"]

Voila :)

Conclusions and Future Work

Something that might be interesting to pursue further is the possibility of writing, or wrapping a python interface around, a C library as a substitute for the current CSV module. It should be able to support different line terminators, multi-byte delimiters, and have unicode detection outside the box, which happen to be my main three gripes with the CSV module.

You can find working source code and implementation of this here and you should follow me on twitter here.

  • Guest

    Is there a reason to use struct.pack(‘B’, 0) instead of ‘x00′ ?

  • Justin

    Pretty sure you can just do this, using the ‘read’ function you wrote:

    with open(“/tmp/file.csv”, “r”) as f:
    for line in csv.DictReader(read(f)):
    print line["id"], line["field"]

  • http://www.facebook.com/abdelkader Mahmoud Mack Abdelkader

    You’re absolutely right.

  • http://mahmoudimus.com/ Mahmoud Abdelkader

    You’re absolutely right. That would work.

  • Guest

    reading a file byte by byte is not very effecient. It would be much more usable if you read(4096), split(“”) and continue that process until you reach EOF. The trick is the first element should be concatenated with the last element from the previous iteration, and the last element in the list should be skipped until EOF is reached, which it would be “”

  • http://mahmoudimus.com/ Mahmoud Abdelkader

    That’s a good idea. I’ll see if I can do that and get some benchmarks as well. I’ll also have a follow up post about this because there are a few more tips and tricks I’d like to share regarding parsing CSV files.

blog comments powered by Disqus

Previous post:

Next post: