A library to access PDF file content from Python
pdffile.pyc | 16 Feb 2006 | 82.3k |
pdf2txt | 6 Feb 2006 | 0.6k |
pdffile.py | 11 May 2004 | 49.1k |
This is a Python module to access the object tree within a PDF file. It seems to run okay on pretty much every PDF I can throw at it.
>>> from pdffile import PDFFile
>>> f = PDFFile()
>>> f.load_file('14882.pdf')
>>> f.Root.value()
<{'PageMode': /UseOutlines, 'Names': <{'JavaScript': <7661 0 R>}>, 'AcroForm': <7667 0 R>, 'Type': /Catalog, 'Pages': <7694 0 R>, 'Outlines': <5710 0 R>}>
>>> f.Root['Pages']['Kids'][0].value()
<{'Count': <216>, 'Kids': <[<7668 0 R>, <7669 0 R>, <7670 0 R>, <7671 0 R>, <7672 0 R>, <7673 0 R>]>, 'Type': /Pages, 'Parent': <7694 0 R>}>
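The session above shows a /Pages node whose 'Kids' entry points at further nodes. As a rough illustration of how such a page tree is usually flattened, here is a minimal sketch (written for Python 3) that walks a tree of plain dictionaries shaped like the output above. It deliberately does not use pdffile's own object types: the dictionaries, the `iter_pages` helper, and the 'Label' key are stand-ins invented for this example, so adapting it to real pdffile objects is left as an assumption.

```python
# Stand-in data only: plain dicts shaped like a PDF page tree.
# Nothing here calls into pdffile; 'Label' is a made-up key for the demo.

def iter_pages(node):
    """Yield the leaf page dictionaries of a page tree in document order."""
    if node.get("Type") == "/Pages":
        # Intermediate node: descend into each child in order.
        for kid in node.get("Kids", []):
            yield from iter_pages(kid)
    else:
        # Leaf node: a single page.
        yield node


page_tree = {
    "Type": "/Pages",
    "Count": 3,
    "Kids": [
        {
            "Type": "/Pages",
            "Count": 2,
            "Kids": [
                {"Type": "/Page", "Label": "p1"},
                {"Type": "/Page", "Label": "p2"},
            ],
        },
        {"Type": "/Page", "Label": "p3"},
    ],
}

pages = list(iter_pages(page_tree))
print(len(pages), [p["Label"] for p in pages])
```

In a well-formed tree, the number of leaves found this way should match the 'Count' entry of the root /Pages node.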
Limitations: it supports only some of the standard decoders, and it provides only user-level password access for the standard (revision 2) security handler.
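For readers unfamiliar with the term, a "decoder" here is a PDF stream filter such as FlateDecode or ASCIIHexDecode. The sketch below shows what two of these filters do, using only the Python 3 standard library. It is independent of pdffile: the `decode_stream` function is a hypothetical helper written for this example, and this page does not say which filters the module actually implements.

```python
import binascii
import zlib


def decode_stream(data, filter_name):
    """Decode raw stream bytes for two common PDF filters (illustration only)."""
    if filter_name == "FlateDecode":
        # FlateDecode streams are zlib/deflate compressed.
        return zlib.decompress(data)
    if filter_name == "ASCIIHexDecode":
        # Drop whitespace and the '>' end-of-data marker.
        hex_digits = b"".join(data.split()).rstrip(b">")
        # An odd number of digits is padded with a trailing zero.
        if len(hex_digits) % 2:
            hex_digits += b"0"
        return binascii.unhexlify(hex_digits)
    # Any other filter is simply unsupported in this sketch.
    raise NotImplementedError(filter_name)


raw = b"BT /F1 12 Tf (Hello) Tj ET"
print(decode_stream(zlib.compress(raw), "FlateDecode") == raw)
print(decode_stream(b"48 65 6C6C 6F>", "ASCIIHexDecode"))
```

A stream that uses a filter outside this small set raises NotImplementedError, which is roughly the situation a caller faces when a library supports only some of the standard decoders.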
This software has been placed in the public domain.