Friday, September 27, 2013

Extract text from Word .docx files with python-docx


By Vasudev Ram



python-docx is a Python library that can be used to extract the text content from Microsoft Word files that are in the .docx format.

Here is a program (modified a bit from the python-docx examples) that shows how to do it:

# extract_docx_text.py

import sys
from docx import opendocx, getdocumenttext

def extract_docx_text(infil, outfil):

    # Extract the text from the DOCX document object infil (as returned
    # by opendocx) and write it to the text file object outfil.

    paragraphs = getdocumenttext(infil)

    # getdocumenttext returns unicode strings; encode each paragraph as
    # UTF-8 bytes so that non-ASCII text can be written to the file.
    new_paragraphs = [paragraph.encode("utf-8") for paragraph in paragraphs]

    outfil.write('\n'.join(new_paragraphs))

def usage():

    return "Usage: python extract_docx_text.py infile.docx outfile.txt\n"

def main():

    if len(sys.argv) != 3:
        print usage()
        sys.exit(1)

    try:
        infil = opendocx(sys.argv[1])
        outfil = open(sys.argv[2], 'w')
    except Exception as e:
        print "Exception: " + repr(e) + "\n"
        sys.exit(1)

    # Close the output file even if the extraction fails.
    try:
        extract_docx_text(infil, outfil)
    finally:
        outfil.close()

if __name__ == '__main__':
    main()

# EOF

Save the program as extract_docx_text.py and run it with:

python extract_docx_text.py input_file.docx output_file.txt

That should result in the text of the .docx file being extracted and written to the .txt file.
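For the curious, here is a rough sketch of what this kind of extraction involves under the hood, using only the standard library. A .docx file is a ZIP archive whose main text lives in word/document.xml, with paragraphs stored as w:p elements and their text in w:t elements. This is not python-docx's own code; the function name docx_paragraphs is made up for illustration, and the sketch ignores headers, footers, footnotes and other parts of the document.

```python
# docx_text_stdlib.py
# Sketch only: pull paragraph text out of a .docx with the standard library.
# docx_paragraphs is a hypothetical helper, not part of python-docx.

import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_paragraphs(path_or_file):
    """Return a list with the text of each non-empty paragraph."""
    # A .docx file is a ZIP archive; the body text is in word/document.xml.
    with zipfile.ZipFile(path_or_file) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # The text of one paragraph can be split across several w:t elements.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return paragraphs
```

Calling docx_paragraphs("input_file.docx") should give a list of strings, one per non-empty paragraph, which can then be joined with newlines and written out as in the program above.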

- Vasudev Ram - Dancing Bison Enterprises

