Tuesday, March 15, 2016

Unix split command in Python

By Vasudev Ram

Recently, there was an HN thread about the implementation (not just use) of text editors. Someone mentioned that some editors, including vim, have problems opening large files. Various people suggested workarounds or solutions, some within vim itself and some using other tools.

I commented that you can use the Unix command bfs (for big file scanner), if you have it on your system, to open the file read-only and then move around in it, like you can in an editor.

I also said that the Unix commands split and csplit can be used to split a large file into smaller chunks, edit the chunks as needed, and then combine the chunks back into a single file using the cat command.

This made me think of writing, just for fun, a simple version [1] of the split command in Python. So I did that, and then tested it some [2]. Seems to be working okay so far.

[1] I have not implemented the full functionality of the POSIX split command, only a subset, for now. I may enhance it later with a few command-line options or more functionality, e.g. the ability to split binary files. I've also not implemented the default size of 1000 lines, or the ability to take input from standard input if no filename is specified. (Both are easy; a sketch follows.)
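
For instance, argument handling with those defaults might look something like this - a hypothetical sketch, not part of the split.py listed below:

import sys

DEFAULT_LINES_PER_FILE = 1000  # the POSIX split default

def parse_args(argv):
    '''Hypothetical sketch: fall back to standard input and to 1000
    lines per file when the corresponding arguments are not given.'''
    in_fil = open(argv[1], "r") if len(argv) > 1 else sys.stdin
    lines_per_file = int(argv[2]) if len(argv) > 2 else DEFAULT_LINES_PER_FILE
    return in_fil, lines_per_file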

However, I am not sure whether the binary file splitting feature should be a part of split, or should be a separate command, considering the Unix philosophy of doing one thing and doing it well. Binary file splitting seems like it should be a separate task from text file splitting. Maybe it is a matter of opinion.

[2] I tested split.py with various valid and invalid values for the lines_per_file argument (such as -3, -2, -1, 0, 1, 2, 3, 10, 50, 100) on each of these input files:

in_file_0_lines.txt
in_file_1_line.txt
in_file_2_lines.txt
in_file_3_lines.txt
in_file_10_lines.txt
in_file_100_lines.txt

where the meaning of the filenames should be self-explanatory.

Of course, I also checked, after each test run, that the output file(s) contained the right data.

(There may still be some bugs, of course. If you find any, I'd appreciate hearing about it.)
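
In case you want to reproduce the tests, the input files can be generated with a small loop like this (a sketch; it produces exactly the filenames listed above):

# Sketch: generate the numbered test input files named above.
for n in (0, 1, 2, 3, 10, 100):
    suffix = "line" if n == 1 else "lines"
    with open("in_file_{}_{}.txt".format(n, suffix), "w") as f:
        for i in range(1, n + 1):
            f.write("line {}\n".format(i))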

Here is the code for split.py:

import sys
import os

OUTFIL_PREFIX = "out_"

def make_out_filename(prefix, idx):
    '''Make a filename with a serial number suffix.'''
    return prefix + str(idx).zfill(4)

def split(in_filename, lines_per_file):
    '''Split the input file in_filename into output files of
    lines_per_file lines each. The last file may have fewer lines.'''
    in_fil = open(in_filename, "r")
    outfil_idx = 1
    out_filename = make_out_filename(OUTFIL_PREFIX, outfil_idx)
    out_fil = open(out_filename, "w")
    # Using chain assignment feature of Python.
    line_count = tot_line_count = file_count = 0
    # Loop over the input and split it into multiple files.
    # A text file is an iterable sequence, from Python 2.2,
    # so the for line below works.
    for lin in in_fil:
        # Bump vars; change to next output file.
        if line_count >= lines_per_file:
            tot_line_count += line_count
            line_count = 0
            file_count += 1
            out_fil.close()
            outfil_idx += 1
            out_filename = make_out_filename(OUTFIL_PREFIX, outfil_idx)
            out_fil = open(out_filename, "w")
        line_count += 1
        out_fil.write(lin)
    in_fil.close()
    out_fil.close()
    sys.stderr.write("Output is in file(s) with prefix {}\n".format(OUTFIL_PREFIX))
        
def usage():
    sys.stderr.write(
    "Usage: {} in_filename lines_per_file\n".format(sys.argv[0]))

def main():

    if len(sys.argv) != 3:
        usage()
        sys.exit(1)

    try:
        # Get and validate in_filename.
        in_filename = sys.argv[1]
        # If input file does not exist, exit.
        if not os.path.exists(in_filename):
            sys.stderr.write("Error: Input file '{}' not found.\n".format(in_filename))
            sys.exit(1)
        # If input is empty, exit.
        if os.path.getsize(in_filename) == 0:
            sys.stderr.write("Error: Input file '{}' has no data.\n".format(in_filename))
            sys.exit(1)
        # Get and validate lines_per_file.
        lines_per_file = int(sys.argv[2])
        if lines_per_file <= 0:
            sys.stderr.write("Error: lines_per_file cannot be less than or equal to 0.\n")
            sys.exit(1)
        # If all checks pass, split the file.
        split(in_filename, lines_per_file) 
    except ValueError as ve:
        sys.stderr.write("Caught ValueError: {}\n".format(repr(ve)))
    except IOError as ioe:
        sys.stderr.write("Caught IOError: {}\n".format(repr(ioe)))
    except Exception as e:
        sys.stderr.write("Caught Exception: {}\n".format(repr(e)))
        raise

if __name__ == '__main__':
    main()

You can run split.py like this:

$ python split.py
Usage: split.py in_filename lines_per_file

which will give you the usage help. And like this to actually split text files, in this case, a 100-line text file into 10 files of 10 lines each:

$ python split.py in_file_100_lines.txt 10
Output is in file(s) with prefix out_

Here are a couple of runs with invalid values for either the input file or the lines_per_file argument:

$ python split.py in_file_100_lines.txt 0
Error: lines_per_file cannot be less than or equal to 0.

$ python split.py not-there.txt 0
Error: Input file 'not-there.txt' not found.
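
As mentioned near the top, the chunks can be combined back into a single file with cat. Since the output names are zero-padded, the shell expands out_* in the right order, so a round trip is easy to check (cmp prints nothing when the files are identical):

$ cat out_* > recombined.txt
$ cmp in_file_100_lines.txt recombined.txt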

As an aside, thinking about whether to use 0 or 1 as the initial value for some of the _count variables in the program made me remember this topic:

Why programmers count from 0

See the first few hits for some good answers.

And finally, speaking of zero, check out this earlier post by me:

Bhaskaracharya and the man who found zero

- Enjoy.

- Vasudev Ram - Online Python training and programming

6 comments:

Jay R. Wren said...

It looks like it will crash or perform very poorly on a large binary file with no newline characters in it. Have you tested it in that use case?

Vasudev Ram said...

Thanks for your comment.

No, I haven't tested it in that use case. My post does indicate (though perhaps implicitly rather than explicitly) that binary files are not supported, at least in this initial version. I know that the original Unix split (which I've linked to) supports both text and binary files in the same program, but I do not plan to do that; I plan to make the binary case a separate program. See the point about the Unix philosophy in the post.

I will also make it explicit via another comment, that this version does not support binary files.

I'll comment again in a short while with some more of my thoughts on this matter.

Vasudev Ram said...


Note to readers: this version does not support binary files. Also see previous comment.

Vasudev Ram said...


@Jay Wren: Here's the follow-up comment I said I would write (a bit long, sorry):

I just reviewed the part of my post about the features this split command supports and what I said there about handling text vs. binary files, and thought about it a bit more. Here's what I think:

Of course, it's possible to handle splitting of both text files and binary files in the same program. But I prefer not to do that. From my early days of doing database CRUD programming, where some colleagues used to have C functions with names like priamd (standing for PRInt, Add, Modify, Delete - I know :), I've disliked that style of intertwining the code for different CRUD operations in the same function, which some people did just to save a few lines of code or to avoid writing separate functions for each operation. IMO that is a false economy; code is cleaner and more maintainable if one writes a separate function for each logically distinct operation. I think the same applies here. I can easily write another program, called, say, bsplit (for binary split).
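
To illustrate, a minimal bsplit need not think in lines at all; it could just copy fixed-size byte chunks. A rough sketch, reusing make_out_filename and OUTFIL_PREFIX from split.py:

def bsplit(in_filename, bytes_per_file):
    '''Sketch of a binary split: fixed-size byte chunks, no notion of lines.'''
    idx = 1
    with open(in_filename, "rb") as in_fil:
        while True:
            chunk = in_fil.read(bytes_per_file)
            if not chunk:  # EOF
                break
            with open(make_out_filename(OUTFIL_PREFIX, idx), "wb") as out_fil:
                out_fil.write(chunk)
            idx += 1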

However, that still leaves us with the point you raised:

>It looks like it will crash or perform very poorly on a large binary file with no newline characters in it.

An interesting point. I do know that you cannot read lines of unlimited length with the built-in Python file object's read*() functions (one of which, probably readline(), my for loop is calling under the hood), unless you read the file a character/byte at a time until EOF. Or rather, you can attempt to do so, but as you said, the program may either crash (due to a buffer overflow) or perform very slowly because of allocating a lot of virtual memory (on disk). I haven't really experienced such a situation myself, so I'll experiment, try to simulate it, and see what happens to the current split program. But leaving that aside, there are a couple of things that can be done:

- inform users (via the program's documentation) to use it only on text files, and only on files where no line is extremely long.

- it may be possible to write some heuristic code near the start of the program which detects very long lines and aborts with a warning, so the program does not hang or crash later. Something like what the Unix "file" command does - it reads the first few hundred bytes of any file given as an argument, to try to detect what type of file it is, based on known file headers, the typical content of C programs vs. shell scripts vs. other types of files, etc. This heuristic would have to read maybe a few thousand bytes before deciding.

However, even that heuristic may not work if the file has many lines of normal length at the start, and then somewhere in the middle or near the end there is a very long line or a huge chunk of binary data with no newline in it.
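
For what it's worth, here is a rough sketch of such an up-front check. The sample size and line-length threshold are arbitrary, and, as just noted, it can be fooled by binary data further into the file:

def looks_like_text(in_filename, sample_size=4096, max_line_len=1024):
    '''Heuristic sketch: sample the start of the file and flag
    null bytes or very long lines as probably-binary.'''
    with open(in_filename, "rb") as f:
        sample = f.read(sample_size)
    if b"\0" in sample:  # null bytes strongly suggest binary data
        return False
    # Treat any sampled line longer than max_line_len as suspicious.
    return all(len(line) <= max_line_len for line in sample.split(b"\n"))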

So I guess the ultimate answer is that no program can handle every situation that can be thrown at it, and the user has to take some care too.

One final point - I think (though I'd need to verify) that it may be possible to handle the issue you mention (huge binary data with no newline) by reading the file character by character using in_fil.read(1), looking for newlines in the characters read, and handling the data accordingly. My selpg Linux utility does something like that when invoked with the option to handle pages demarcated by form feeds. But then how do you decide where to split the data into lines, since no newlines exist?

Or use the in_fil.readline(size) method with some large positive value for size. That would still need some extra code to handle splitting the output into files of lines_per_file lines each, because a newline may only occur after many read calls - or not at all.
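
To make that concrete, here is a sketch of a bounded read loop along those lines; a chunk is a complete line only if it ends with a newline, so a very long line arrives as several bounded chunks. In split, the output-file rotation would then happen only at a newline boundary:

def bounded_copy(in_fil, out_fil, max_chunk=64 * 1024):
    '''Sketch: copy in_fil to out_fil using reads of at most max_chunk
    bytes, returning the number of complete lines seen.'''
    line_count = 0
    while True:
        chunk = in_fil.readline(max_chunk)
        if not chunk:  # EOF
            break
        out_fil.write(chunk)
        if chunk.endswith("\n"):  # only full lines bump the count
            line_count += 1
    return line_count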

Anyway, good point - it made me think of some possibilities ...

Vasudev Ram said...

I just realized another point, regarding selpg - when used in the line mode (not the form feed mode), it can act like a rudimentary version of the split command. But you would have to invoke it many times, once for each output file you want to split the input file into; i.e. if you have a file of 100 lines, call selpg once to extract lines 1 to 10, again to extract lines 11 to 20, etc. It would be inefficient to use it that way compared to split, which creates all the output files in one run.

But of course that is not selpg's intended use. It is only meant to extract one range of pages from the input, for purposes such as printing, etc.
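
As an aside, extracting one such range of lines in Python is short with itertools.islice; a sketch of the selpg-style, one-range-per-invocation approach:

import sys
from itertools import islice

def extract_lines(in_filename, start, end):
    '''Sketch: write lines start..end (1-based, inclusive) of the input
    to stdout, the way a single selpg invocation extracts one range.'''
    with open(in_filename, "r") as f:
        for line in islice(f, start - 1, end):
            sys.stdout.write(line)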

Vasudev Ram said...


Here's the code for the selpg.c utility, in which, when the form feed mode is used, I read character by character, using getc(fin):

selpg.c code

A similar approach could be taken in Python using in_fil.read(1), but as I said, that does not solve the problem of where to split the file into lines if it has no newlines. In fact, such a file is then a binary file, so I would prefer to treat it separately, using the bsplit program I mentioned. I think it would rarely be the case that you do not know whether the file you want to process is a text file or a binary file; if that happens, there may be a problem upstream, which should be fixed first.