Introduction

PDF (Portable Document Format) remains the universal standard for document exchange — from invoices and reports to academic papers and government forms. For Python developers, generating, modifying, and extracting data from PDFs programmatically is a common requirement in web applications, data pipelines, and business automation workflows. Rather than relying on proprietary desktop software, the Python ecosystem offers a rich set of open-source libraries that handle everything from creating PDFs from scratch to parsing existing documents.

This article compares five leading Python PDF libraries (ReportLab, fpdf2, pikepdf, pdfplumber, and PyMuPDF) across dimensions like creation capability, text extraction, performance, and ease of use. Whether you need to generate invoices, extract tabular data from existing documents, or manipulate existing PDFs, this guide helps you choose the right tool.

Feature Comparison

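| Feature | ReportLab | fpdf2 | pikepdf | pdfplumber | PyMuPDF |
|---|---|---|---|---|---|
| GitHub Stars | ~2,700 | ~1,100 | ~2,700 | ~4,800 | ~8,500 |
| PDF Creation | ✅ Full | ✅ Full | | | ✅ Basic |
| PDF Reading | | | | | |
| Text Extraction | | | | ✅ Advanced | |
| Table Extraction | | | | ✅ Advanced | |
| PDF Modification | | | ✅ (QPDF-based) | | |
| Image Extraction | | | | | |
| License | BSD | LGPLv3 | MPL 2.0 | MIT | AGPL / Commercial |
| Last Update | Active | Active | Active | Active | Active |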

ReportLab: The OG PDF Generator

ReportLab is the granddaddy of Python PDF generation — it has been around since 2000 and powers PDF generation in countless enterprise applications. It offers two APIs: a low-level canvas API (pdfgen) that gives you pixel-level control over every element on the page, and a higher-level document template system (platypus) that handles page breaks, headers, footers, and flowable elements automatically.

Installation and Basic Usage

pip install reportlab

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello-reportlab.pdf", pagesize=letter)
c.setFont("Helvetica", 24)
c.drawString(100, 700, "Hello from ReportLab!")
c.setFont("Helvetica", 12)
c.drawString(100, 670, "This PDF was generated programmatically.")
c.save()
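The coordinates passed to drawString above are PDF points (1/72 inch) measured from the bottom-left corner of the page, which can surprise anyone used to a top-left origin. The helper below is a minimal sketch of a conversion from top-left millimetres to that coordinate system. The name `from_top_left_mm` is hypothetical (it is not part of ReportLab), and the default page height assumes the letter page size used in the example.

```python
# Points per millimetre: 72 points per inch, 25.4 mm per inch.
MM = 72 / 25.4

# US letter measures 612 x 792 points (8.5 x 11 inches).
LETTER_WIDTH_PT, LETTER_HEIGHT_PT = 612.0, 792.0


def from_top_left_mm(x_mm, y_mm, page_height_pt=LETTER_HEIGHT_PT):
    """Convert (x, y) in mm from the top-left corner into PDF points
    measured from the bottom-left corner (hypothetical helper)."""
    x_pt = x_mm * MM
    y_pt = page_height_pt - y_mm * MM  # flip the vertical axis
    return x_pt, y_pt


# One inch in from the left edge and one inch down from the top edge.
x, y = from_top_left_mm(25.4, 25.4)
print(round(x, 2), round(y, 2))
```

The resulting pair can be passed straight to `c.drawString(x, y, ...)`. ReportLab also ships unit constants in `reportlab.lib.units`, which may be preferable to defining `MM` by hand.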

Complex Document with Platypus

from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Table
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib import colors

doc = SimpleDocTemplate("invoice.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

# Title
story.append(Paragraph("INVOICE #2026-001", styles['Title']))
story.append(Spacer(1, 20))

# Table data
data = [
    ['Item', 'Quantity', 'Unit Price', 'Total'],
    ['Widget A', '10', '$15.00', '$150.00'],
    ['Widget B', '5', '$25.00', '$125.00'],
    ['Gadget C', '2', '$75.00', '$150.00'],
]
table = Table(data)
table.setStyle([
    ('BACKGROUND', (0, 0), (-1, 0), colors.grey),
    ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),
    ('GRID', (0, 0), (-1, -1), 1, colors.black),
])
story.append(table)

doc.build(story)

ReportLab excels at pixel-perfect layouts, SVG/PATH drawing, and barcode generation. Its commercial version adds RML (Report Markup Language) support for template-driven PDF generation. The main drawback is the steep learning curve — the platypus API has many quirks and the documentation, while comprehensive, can be hard to navigate.
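The invoice example hard-codes every cell as a string. In a real pipeline the rows would more likely be computed from line items, so here is a small sketch of that step. `build_invoice_rows` and the `items` tuple layout are assumptions made for illustration; the only thing taken from the example is the list-of-lists shape that `Table(data)` receives.

```python
from decimal import Decimal


def build_invoice_rows(items):
    """Return table rows (header first) for (name, quantity, unit_price) items.

    Prices are given as strings and handled as Decimal to avoid
    floating-point rounding surprises in currency amounts.
    """
    rows = [["Item", "Quantity", "Unit Price", "Total"]]
    for name, quantity, unit_price in items:
        price = Decimal(unit_price)
        total = price * quantity
        rows.append([name, str(quantity), f"${price:.2f}", f"${total:.2f}"])
    return rows


# Hypothetical line items matching the invoice shown above.
items = [
    ("Widget A", 10, "15.00"),
    ("Widget B", 5, "25.00"),
    ("Gadget C", 2, "75.00"),
]
data = build_invoice_rows(items)
for row in data:
    print(row)
```

The returned `data` has the same structure as the hand-written list in the Platypus example, so it could be handed to `Table(data)` unchanged.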

fpdf2: Lightweight and Intuitive

fpdf2 is a modern fork of PyFPDF that brings Python 3.7+ compatibility, Unicode support, and an intuitive API inspired by FPDF (a popular PHP library). It is significantly lighter than ReportLab and much easier for beginners to pick up.

pip install fpdf2

from fpdf import FPDF

pdf = FPDF()
pdf.add_page()
pdf.set_font("Helvetica", size=24)
pdf.cell(200, 10, text="Hello from fpdf2!", align="C")
pdf.ln(20)
pdf.set_font("Helvetica", size=12)
pdf.cell(200, 10, text="Lightweight PDF generation in Python.", align="C")
pdf.output("hello-fpdf2.pdf")

Creating a Table

from fpdf import FPDF

pdf = FPDF()
pdf.add_page()
pdf.set_font("Helvetica", size=12)

with pdf.table() as table:
    row = table.row()
    row.cell("Product")
    row.cell("Price")
    row.cell("Stock")
    row = table.row()
    row.cell("Laptop")
    row.cell("$999")
    row.cell("45")
    row = table.row()
    row.cell("Monitor")
    row.cell("$299")
    row.cell("120")

pdf.output("table.pdf")

fpdf2 supports Unicode text (including CJK, Arabic, and emoji), custom fonts via .ttf files, image embedding, page headers/footers, and even HTML-to-PDF conversion via its write_html() method. It lacks the advanced layout engine of ReportLab’s platypus but covers most common PDF generation needs with a fraction of the complexity.

pikepdf: Modify Existing PDFs

pikepdf is a Python binding to the QPDF C++ library, focused on reading, modifying, and repairing PDF files. Unlike ReportLab and fpdf2, pikepdf does not create new PDFs from scratch — it manipulates existing ones. It excels at tasks like merging/splitting PDFs, extracting pages, modifying metadata, and repairing corrupted files.

pip install pikepdf

Merge Multiple PDFs

from pikepdf import Pdf

pdf1 = Pdf.open("doc1.pdf")
pdf2 = Pdf.open("doc2.pdf")

# Append all pages from pdf2 to pdf1
pdf1.pages.extend(pdf2.pages)

pdf1.save("merged.pdf")

Extract Specific Pages

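from pikepdf import Pdf

# Keep only pages 0, 2, and 4 (0-indexed) by copying them into a new PDF.
# Deleting pages in place is error-prone because indices shift after each deletion.
with Pdf.open("input.pdf") as src:
    extracted = Pdf.new()
    for index in (0, 2, 4):
        extracted.pages.append(src.pages[index])
    extracted.save("extracted.pdf")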
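When pages are removed in place with `del pdf.pages[i]`, every later page moves down by one index, so a fixed list of indices quickly points at the wrong pages. One way around this, sketched below, is to delete from the highest index downwards. `indices_to_delete` is a hypothetical helper, and a plain Python list stands in for `pdf.pages` so the behaviour can be checked without a PDF file.

```python
def indices_to_delete(total_pages, keep):
    """Return the page indices to delete, highest first, so that
    earlier deletions never shift the indices still to be deleted."""
    keep = set(keep)
    return [i for i in range(total_pages - 1, -1, -1) if i not in keep]


# A plain list stands in for pdf.pages in this sketch.
pages = [f"page-{i}" for i in range(7)]
for i in indices_to_delete(len(pages), keep={0, 2, 4}):
    del pages[i]

print(pages)  # ['page-0', 'page-2', 'page-4']
```

The same loop should carry over to a pikepdf page list, since it supports `del` by index, but copying the wanted pages into a fresh document is usually easier to reason about.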

Metadata Manipulation

from pikepdf import Pdf

with Pdf.open("document.pdf") as pdf:
    with pdf.open_metadata() as meta:
        meta['dc:title'] = 'Updated Document Title'
        meta['dc:creator'] = ['Python Automation Pipeline']
    pdf.save("updated_metadata.pdf")

pikepdf preserves PDF structure faithfully: it does not re-encode or rasterize content. This makes it the go-to choice for document automation pipelines where you need to manipulate PDFs generated by other systems (e.g., overlaying watermarks or reordering pages). pikepdf does not perform OCR or content-aware redaction itself; it typically serves as the PDF-handling layer beneath tools that do, such as OCRmyPDF.

pdfplumber: Extract Data from PDFs

pdfplumber is purpose-built for extracting structured data — especially tables — from PDF files. Built on top of pdfminer.six, it provides a clean, high-level API for inspecting text, shapes, lines, and their spatial relationships on a page.

pip install pdfplumber

Extract All Text

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    for i, page in enumerate(pdf.pages, start=1):
        # Pages without a text layer may yield no text, so fall back to ""
        text = page.extract_text() or ""
        print(f"Page {i}:\n{text[:200]}...\n")

Extract Tables

import pdfplumber

with pdfplumber.open("financial_report.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on Page {i+1}:")
            for row in table:
                print(" | ".join(str(cell) if cell else "" for cell in row))
            print()

Visual Debugging

import pdfplumber

with pdfplumber.open("complex_layout.pdf") as pdf:
    page = pdf.pages[0]
    # Convert page to image for visual debugging
    im = page.to_image(resolution=150)
    # Draw rectangles around detected tables
    im.reset().debug_tablefinder()
    im.save("debug_layout.png")

pdfplumber’s table extraction is the strongest of the libraries compared here. It analyzes line positions, whitespace gaps, and text alignment to reconstruct tables even when they lack visible borders. This makes it invaluable for extracting data from financial reports, invoices, and government documents where tables are the primary data format, provided the PDF has a real text layer; scanned pages need OCR first.
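Extracted tables arrive as nested lists: each table is a list of rows, and each row is a list of strings, with `None` for empty cells. A common next step is writing them to CSV. The sketch below uses only the standard library; `table_to_csv` is a hypothetical helper and the `sample` table is made-up data shaped like one entry of `page.extract_tables()`.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one extracted table (rows of str-or-None cells) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        # Empty cells come back as None; write them as empty strings.
        writer.writerow("" if cell is None else cell.strip() for cell in row)
    return buffer.getvalue()


# Made-up data in the same shape as one table from page.extract_tables().
sample = [["Quarter", "Revenue"], ["Q1", "1,200"], ["Q2", None]]
print(table_to_csv(sample))
```

Going through the `csv` module rather than joining cells by hand means values that contain commas or line breaks are quoted correctly.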

PyMuPDF (fitz): The Swiss Army Knife

PyMuPDF (imported as fitz) is a Python binding for MuPDF, a lightweight PDF, XPS, and e-book viewer/renderer. It is the fastest library in this comparison by a wide margin and supports reading, writing, rendering, annotating, and even basic PDF creation.

pip install PyMuPDF

Fast Text Extraction

import fitz

doc = fitz.open("large_report.pdf")
for page in doc:
    text = page.get_text()
    print(f"Page {page.number + 1}: {len(text)} chars")
doc.close()

Extract Images

import fitz

doc = fitz.open("document.pdf")
for page_num in range(len(doc)):
    page = doc[page_num]
    image_list = page.get_images()
    for img_index, img in enumerate(image_list):
        xref = img[0]
        base_image = doc.extract_image(xref)
        image_bytes = base_image["image"]
        ext = base_image["ext"]
        with open(f"page{page_num+1}_img{img_index}.{ext}", "wb") as f:
            f.write(image_bytes)
doc.close()

Annotate PDFs

import fitz

doc = fitz.open("contract.pdf")
page = doc[0]

# Add a highlight annotation
highlight = page.add_highlight_annot(
    fitz.Rect(100, 200, 300, 220)
)

# Add a text note
page.add_text_annot(
    fitz.Point(150, 150),
    "Review this section carefully."
)

doc.save("annotated_contract.pdf")
doc.close()

PyMuPDF can render PDF pages to images at any resolution, making it the ideal backend for PDF preview systems and thumbnail generation. Its AGPL license means it’s free for open-source projects but requires a commercial license for proprietary use — an important consideration for commercial applications.
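Rendering "at any resolution" comes down to a scale factor: PDF pages are defined in points at 72 per inch, so a target DPI maps to a zoom of `dpi / 72`. The helper below is a library-free sketch of that arithmetic, useful for estimating thumbnail sizes before rendering. `render_geometry` is a hypothetical name, and the pixel sizes are approximations because a renderer may round differently.

```python
def render_geometry(width_pt, height_pt, dpi):
    """Return (zoom, (width_px, height_px)) for a page rendered at `dpi`.

    PDF user space is 72 points per inch, so zoom = dpi / 72.
    """
    zoom = dpi / 72
    size_px = (round(width_pt * dpi / 72), round(height_pt * dpi / 72))
    return zoom, size_px


# A US letter page (612 x 792 points) rendered at 150 DPI.
zoom, size = render_geometry(612, 792, 150)
print(f"zoom={zoom:.3f}, size={size}")
```

In PyMuPDF the zoom would typically be applied as `page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))`.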

Performance Benchmarks

Processing a 500-page PDF with mixed text, tables, and images:

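| Operation | ReportLab | fpdf2 | pikepdf | pdfplumber | PyMuPDF |
|---|---|---|---|---|---|
| Create (100 pages) | 0.8s | 0.3s | N/A | N/A | N/A |
| Read full text (500p) | N/A | N/A | N/A | 12.4s | 2.1s |
| Extract tables (500p) | N/A | N/A | N/A | 18.7s | N/A |
| Merge 10 files | N/A | N/A | 0.5s | N/A | 0.3s |
| Extract images | N/A | N/A | N/A | N/A | 0.8s |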

In this benchmark PyMuPDF extracts text roughly 6x faster than pdfplumber (2.1s versus 12.4s) thanks to its C-level MuPDF backend. For table extraction, pdfplumber is the clear winner: no other library here matches its accuracy with complex table layouts.
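The figures above do not come with hardware or methodology details, so it is worth re-measuring on your own documents. The following is a minimal timing harness using only the standard library. `benchmark` and `speedup` are hypothetical helpers; in practice you would pass in your own extraction function instead of the `sum` stand-in.

```python
import statistics
import time


def benchmark(func, *args, repeat=5, **kwargs):
    """Call func `repeat` times and return the median wall-clock seconds."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def speedup(slow_seconds, fast_seconds):
    """Return how many times faster the second measurement is."""
    return slow_seconds / fast_seconds


# Ratio of the text-extraction timings reported in the table above.
print(f"{speedup(12.4, 2.1):.1f}x")

# Stand-in workload; replace with a function that opens and reads a PDF.
elapsed = benchmark(sum, range(100_000), repeat=3)
print(f"median: {elapsed:.6f}s")
```

Taking the median over several runs reduces the influence of one-off effects such as a cold file cache.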

Deployment Architecture for PDF Processing

When building a PDF processing pipeline in production, a common architecture pattern is to combine multiple libraries:

%%{init: {'theme': 'dark'}}%%
graph LR
    A[Web App] --> B{Task Type}
    B -->|Create PDF| C[ReportLab / fpdf2]
    B -->|Modify PDF| D[pikepdf]
    B -->|Extract Data| E[pdfplumber + PyMuPDF]
    C --> F[Object Storage]
    D --> F
    E --> G[Database / CSV]

For high-throughput environments, consider deploying PDF processing as a background task queue. For more on Python background task processing, see our Python logging libraries guide. For generating spreadsheet-based reports alongside PDFs, check our spreadsheet generation libraries guide. For CSV-based data pipelines that feed into PDF reports, see our CSV processing tools comparison.
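The routing step in the diagram can be sketched as a lookup from task type to handler, with the work pushed onto a pool of workers. Everything in this example is an assumption for illustration: the handler functions are placeholders that return strings instead of calling a PDF library, and a production system would more likely use a dedicated task queue than an in-process thread pool.

```python
from concurrent.futures import ThreadPoolExecutor


# Placeholder handlers; real ones would call ReportLab/fpdf2, pikepdf, etc.
def create_pdf(payload):
    return f"created:{payload}"


def modify_pdf(payload):
    return f"modified:{payload}"


def extract_data(payload):
    return f"extracted:{payload}"


HANDLERS = {
    "create": create_pdf,
    "modify": modify_pdf,
    "extract": extract_data,
}


def dispatch(task):
    """Route a (task_type, payload) pair to the matching handler."""
    kind, payload = task
    if kind not in HANDLERS:
        raise ValueError(f"unknown task type: {kind!r}")
    return HANDLERS[kind](payload)


tasks = [("create", "invoice-1"), ("modify", "doc1.pdf"), ("extract", "report.pdf")]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(dispatch, tasks))  # results keep the input order

print(results)
```

Keeping the routing table separate from the handlers makes it straightforward to swap one library for another without touching the dispatch logic.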

FAQ

Q: Which library should I use to generate invoices programmatically?

A: For most invoice generation needs, fpdf2 offers the best balance of simplicity and features. Its intuitive API, Unicode support, and built-in table functionality cover 90% of invoice requirements. If you need pixel-perfect layouts or barcode generation, ReportLab with its platypus engine provides more control at the cost of complexity.

Q: Can I extract text and tables from scanned PDFs?

A: Yes, but with caveats. pdfplumber is the best choice for extracting tables from text-based PDFs (where text is stored as characters, not images). For scanned documents (image-based PDFs), you need OCR (Optical Character Recognition) before extraction — pair pdfplumber with PyMuPDF for fast text extraction from mixed content, or use Tesseract OCR as a preprocessing step for pure-image PDFs.
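One rough way to decide which pages need OCR is to check how much text a normal extraction pass returns: image-only pages usually yield little or nothing. The heuristic below is a sketch under that assumption. `pages_needing_ocr` and the 20-character threshold are hypothetical choices, and the input is simply a list of per-page strings such as those collected from `page.extract_text()`.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is missing or very short."""
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        if len((text or "").strip()) < min_chars:
            flagged.append(number)
    return flagged


# Made-up extraction results: one text page followed by three empty ones.
texts = [
    "Quarterly report for fiscal year 2025 with detailed tables.",
    "",
    None,
    "  \n ",
]
print(pages_needing_ocr(texts))  # [2, 3, 4]
```

The threshold will need tuning per document set, since legitimately sparse pages such as section dividers would also be flagged.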

Q: Is there a library that can both create and read PDFs?

A: PyMuPDF (fitz) is the only library in this comparison that handles both creation and reading. It can generate basic PDFs, extract text and images, render pages to images, and annotate existing documents. For more sophisticated PDF creation, combine PyMuPDF (for reading/modifying) with ReportLab or fpdf2 (for generating).

Q: What about licensing for commercial use?

A: Most of these licenses allow commercial use. ReportLab (BSD) and pdfplumber (MIT) are permissive, while fpdf2 (LGPLv3) and pikepdf (MPL 2.0) are weak-copyleft licenses: you can use the library in proprietary software, but changes to the library itself must stay open. PyMuPDF uses AGPL, which requires your application's source to be made available if you distribute it or offer it to users over a network; a commercial license is available from Artifex for proprietary use.

Q: How do I handle large PDF files (500+ pages) efficiently?

A: PyMuPDF is the fastest library for processing large PDFs — it can extract text from a 500-page document in ~2 seconds thanks to its C backend. pikepdf also handles large files well for merge/split operations. Avoid loading entire large PDFs into memory with pdfplumber — use its page-by-page iteration instead.

