pdffile

A library to access PDF file content from Python

pdffile.pyc 16 Feb 2006 82.3k
pdf2txt 6 Feb 2006 0.6k
pdffile.py 11 May 2004 49.1k

Description

This is a Python module for accessing the object tree within a PDF file. It has worked on nearly every PDF I have tried it on.

>>> from pdffile import PDFFile
>>> f = PDFFile()
>>> f.load_file('14882.pdf')
>>> f.Root.value()
<{'PageMode': /UseOutlines, 'Names': <{'JavaScript': <7661 0 R>}>, 'AcroForm': <7667 0 R>, 'Type': /Catalog, 'Pages': <7694 0 R>, 'Outlines': <5710 0 R>}>
>>> f.Root['Pages']['Kids'][0].value()
<{'Count': <216>, 'Kids': <[<7668 0 R>, <7669 0 R>, <7670 0 R>, <7671 0 R>, <7672 0 R>, <7673 0 R>]>, 'Type': /Pages, 'Parent': <7694 0 R>}>
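
The session above shows that the catalog's 'Pages' entry leads to a tree whose intermediate nodes list their children under 'Kids'. As a rough illustration of how such a tree can be walked, here is a minimal sketch that uses plain Python dicts in place of pdffile objects. The dict layout, the sample tree and the count_pages helper are assumptions made for illustration and are not part of pdffile.

```python
# Plain dicts stand in for resolved PDF objects; pdffile itself is not used here.

def count_pages(node):
    """Count leaf 'Page' nodes under a 'Pages' node by following 'Kids'."""
    if node.get('Type') == 'Page':
        return 1
    # Intermediate 'Pages' nodes hold their children in a 'Kids' list.
    return sum(count_pages(kid) for kid in node.get('Kids', []))


# Hypothetical miniature page tree: a root, one intermediate node, three pages.
tree = {
    'Type': 'Pages',
    'Kids': [
        {'Type': 'Pages', 'Kids': [{'Type': 'Page'}, {'Type': 'Page'}]},
        {'Type': 'Page'},
    ],
}

print(count_pages(tree))
```

In a real file an intermediate node also carries a 'Count' entry (216 in the session above), so a walk like this is mostly useful as a cross-check or when the individual page objects are needed.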

Limitations: it supports only some of the standard stream decoders (filters), and for encrypted files it provides only user-level password access for the standard (revision 2) security handler.
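
To make the decoder limitation concrete, here is a minimal sketch of how a stream's filter name might be mapped to a decoding function, with unsupported filters rejected. It is not pdffile's own code: the DECODERS table and the decode_stream, flate_decode and ascii_hex_decode names are hypothetical, and which filters pdffile actually handles is not stated here. The sketch covers only FlateDecode (via the standard zlib module) and ASCIIHexDecode.

```python
import binascii
import zlib


def flate_decode(data):
    """Decode a FlateDecode stream (zlib/deflate compressed bytes)."""
    return zlib.decompress(data)


def ascii_hex_decode(data):
    """Decode an ASCIIHexDecode stream: hex digits, whitespace ignored, '>' ends the data."""
    digits = b''.join(data.split())   # drop all ASCII whitespace
    end = digits.find(b'>')           # '>' is the end-of-data marker
    if end != -1:
        digits = digits[:end]
    if len(digits) % 2:               # an odd final digit is treated as if followed by 0
        digits += b'0'
    return binascii.unhexlify(digits)


# Hypothetical dispatch table: only the filters implemented above are listed.
DECODERS = {
    'FlateDecode': flate_decode,
    'ASCIIHexDecode': ascii_hex_decode,
}


def decode_stream(filter_name, data):
    """Decode raw stream bytes with the named filter, or fail for unsupported ones."""
    try:
        decoder = DECODERS[filter_name]
    except KeyError:
        raise NotImplementedError('unsupported filter: %s' % filter_name)
    return decoder(data)


print(decode_stream('ASCIIHexDecode', b'48 65 6C 6C 6F>'))
```

A stream that uses a filter missing from the table fails loudly instead of returning undecoded bytes, which is one reasonable way for a partial implementation to behave.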

Licence

This software has been placed in the public domain.