Introduction

When working with web data in Python, HTML parsing is a fundamental task. Whether you are extracting structured data from a webpage, cleaning up scraped content, or building a content transformation pipeline, choosing the right HTML parser significantly impacts performance, code readability, and reliability. Python offers a surprisingly rich ecosystem of HTML parsing libraries — each with different design philosophies, speed characteristics, and feature sets.

This article compares five popular Python HTML parsing libraries: BeautifulSoup4, lxml, selectolax, pyquery, and html5lib. We evaluate each on parsing speed, CSS selector support, API design, error tolerance, and real-world suitability.

| Library | GitHub Stars | Speed | CSS Selectors | XPath | HTML5 Tolerance | API Style |
| --- | --- | --- | --- | --- | --- | --- |
| BeautifulSoup4 | ~14K (requests-html ecosystem) | Medium | Good (L3+) | No | Very High | Object-oriented DOM |
| lxml | 3,041 | Very Fast | Good (cssselect) | Excellent | Moderate | ElementTree + XPath |
| selectolax | 1,647 | Fastest | Excellent | No | Good | Modest/Lexbor-based |
| pyquery | 2,380 | Fast | Excellent (jQuery) | No | Good | jQuery-like |
| html5lib | 1,224 | Slow | Limited | No | Perfect | Pure Python DOM |

BeautifulSoup4: The User-Friendly Default

BeautifulSoup4 is the most widely adopted HTML parsing library in the Python ecosystem. Its primary strength is its forgiving parser — it handles malformed HTML gracefully, producing a workable DOM tree even from severely broken markup.

from bs4 import BeautifulSoup

html = "<html><body><h1>Hello</h1><p class='content'>World</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Navigate the DOM tree
title = soup.find("h1").text
paragraph = soup.select_one("p.content").text
all_paragraphs = soup.find_all(["p", "div"])

# Extract links
for link in soup.find_all("a", href=True):
    print(f"{link.text}: {link['href']}")

BeautifulSoup supports multiple underlying parsers: html.parser (stdlib), lxml (fast), and html5lib (spec-compliant). Switching between them only requires changing the parser name passed to the constructor; the rest of your code stays the same, although the resulting tree can differ when the markup is malformed.
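One way to see how the backends treat the same input is to parse a small invalid snippet with each of them. The sketch below is illustrative only: the helper name `parse_with_each_backend` and the sample markup are assumptions, and backends that are not installed are skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_each_backend(markup, backends=("html.parser", "lxml", "html5lib")):
    """Return {backend name: serialized tree} for every backend that is installed."""
    results = {}
    for name in backends:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # The optional backend (lxml or html5lib) is not installed; skip it.
            continue
        results[name] = str(soup)
    return results


# An invalid snippet: a stray closing </p> inside an unclosed <a>
trees = parse_with_each_backend("<a></p>")
for backend, tree in trees.items():
    print(f"{backend:12} {tree}")
```

Comparing the printed trees on a sample of your real input is a quick check before committing to one backend.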

Best for: Beginners, ad-hoc scraping, and projects where HTML quality is unpredictable.

lxml: The Speed-Focused Powerhouse

lxml is a Python binding to the libxml2 and libxslt C libraries, making it one of the fastest XML/HTML parsers available. It supports both XPath and CSS selectors (via cssselect), giving developers powerful query capabilities.

from lxml import html, etree

htmldoc = "<div class='container'><a href='/page1'>Link 1</a><a href='/page2'>Link 2</a></div>"
tree = html.fromstring(htmldoc)

# CSS selector support (requires the separate cssselect package)
links = tree.cssselect("a")
for link in links:
    print(link.get("href"), link.text)

# XPath queries for complex selection
results = tree.xpath("//a[contains(@href, 'page')]/text()")
print(results)  # ['Link 1', 'Link 2']

# Serialize back to HTML string
clean_html = etree.tostring(tree, pretty_print=True, encoding="unicode")

lxml also provides lxml.objectify for working with XML as Python objects. Its HTML sanitizer, historically available as lxml.html.clean, has been distributed as the separate lxml_html_clean package since lxml 5.2. The performance advantage becomes apparent when processing thousands of documents: lxml can be roughly 5-10x faster than BeautifulSoup with its default parser.

Best for: High-throughput scraping pipelines, XML processing, and projects requiring XPath.

selectolax: The Modern Speed Champion

selectolax is a newer entry that binds to the Modest and Lexbor HTML5 engines, both written in C. It is designed for one thing: extremely fast CSS-selector-based extraction with HTML5-compliant parsing.

from selectolax.parser import HTMLParser

html_content = """
<div id="container">
    <article class="post"><h2>Title A</h2><p>Content A</p></article>
    <article class="post"><h2>Title B</h2><p>Content B</p></article>
</div>
"""

parser = HTMLParser(html_content)

# Fast CSS selector queries
for article in parser.css(".post"):
    title = article.css_first("h2").text()
    excerpt = article.css_first("p").text()
    print(f"{title}: {excerpt}")

# Iterate over all nodes efficiently
for node in parser.root.traverse():
    if node.tag == "a":
        href = node.attributes.get("href", "")
        text = node.text()
        print(f"Link: {text} -> {href}")

selectolax consistently outperforms lxml on pure CSS selector extraction tasks in benchmarks. Its Modest engine is optimized specifically for the CSS selector use case — common in web scraping — and avoids the overhead of lxml’s XPath engine when you do not need XPath.

Best for: Performance-critical scraping where CSS selectors are sufficient.

pyquery: The jQuery API for Python

pyquery provides a jQuery-like API for parsing and manipulating HTML documents. If you come from a frontend background and are comfortable with jQuery’s $() syntax, pyquery will feel instantly familiar.

from pyquery import PyQuery as pq

doc = pq("""
<div class="blog">
    <h1 class="title">My Blog</h1>
    <ul class="posts">
        <li><a href="/post1">First Post</a></li>
        <li><a href="/post2">Second Post</a></li>
    </ul>
</div>
""")

# jQuery-like selectors
title = doc("h1.title").text()
links = doc("ul.posts a")
for link in links.items():
    print(f"{link.text()}: {link.attr.href}")

# DOM manipulation
doc("h1.title").attr("class", "featured-title")
doc("ul.posts").append("<li><a href='/post3'>Third Post</a></li>")

# Serialize
print(doc.html())

Under the hood, pyquery uses lxml for parsing, so you get lxml-level performance with a jQuery-flavored API. It supports most CSS3 selectors and includes utility methods for attribute manipulation, class toggling, and DOM traversal that mirror jQuery conventions.

Best for: Developers familiar with jQuery, projects involving DOM manipulation, and rapid prototyping.

html5lib: The Spec-Compliant Parser

html5lib is a pure-Python implementation of the HTML5 parsing specification, following the exact tree construction algorithm defined in the WHATWG HTML standard. It creates the same DOM tree that a web browser would produce for any given HTML input.

import xml.etree.ElementTree as ET

import html5lib

malformed_html = "<p>Paragraph 1<p>Paragraph 2"  # Missing closing tags

# treebuilder="etree" returns a standard-library ElementTree element,
# so it must be serialized with xml.etree.ElementTree, not lxml.etree
doc = html5lib.parse(malformed_html, treebuilder="etree", namespaceHTMLElements=False)

# html5lib produces the same tree a browser would:
# it closes the open <p> tags and adds <html>, <head>, and <body>
print(ET.tostring(doc, encoding="unicode"))

html5lib is the most spec-compliant parser in the Python ecosystem. It handles edge cases that even lxml and selectolax may get wrong — implicit tag closures, misnested elements, invalid character references — exactly as specified by the HTML standard.

The tradeoff is performance: html5lib is substantially slower than lxml or selectolax, processing 100-1000x fewer documents per second in benchmarks. It is best used when correctness matters more than speed.

Best for: HTML validation, browser-compatible parsing, and cases where malformed HTML must be handled per spec.

Performance Benchmarks

Parsing a 180KB HTML document (a typical news article page) with each library gives the following approximate throughput. Relative speed is measured against BeautifulSoup4 with html.parser:

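| Library | Documents/sec | Relative Speed | Memory Usage |
| --- | --- | --- | --- |
| selectolax (Modest) | ~580 | 18x | Low |
| lxml | ~420 | 13x | Low |
| pyquery | ~400 | 12x | Low-Med |
| BeautifulSoup4 (lxml backend) | ~210 | 6.5x | Medium |
| BeautifulSoup4 (html.parser) | ~35 | 1x | Medium |
| html5lib | ~2 | 0.06x | High |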

For production pipelines processing thousands of pages per hour, selectolax or lxml offer significant throughput advantages. BeautifulSoup4 provides the best development experience at the cost of some performance.
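Throughput depends heavily on the documents and the hardware, so it is worth measuring on your own pages. The harness below is a rough sketch, not the method used for the figures above: the names `docs_per_second` and `parse_with_stdlib` and the synthetic document are assumptions, and only the standard library's html.parser is wired in so the example runs without extra installs.

```python
import time
from html.parser import HTMLParser


class _CountingParser(HTMLParser):
    """Stdlib parser that only counts start tags; a stand-in workload."""

    def __init__(self):
        super().__init__()
        self.tags = 0

    def handle_starttag(self, tag, attrs):
        self.tags += 1


def parse_with_stdlib(markup):
    """Parse markup with html.parser and return the number of start tags."""
    parser = _CountingParser()
    parser.feed(markup)
    parser.close()
    return parser.tags


def docs_per_second(parse, markup, repeats=20):
    """Call parse(markup) `repeats` times and return documents per second."""
    start = time.perf_counter()
    for _ in range(repeats):
        parse(markup)
    elapsed = time.perf_counter() - start
    return repeats / elapsed if elapsed > 0 else float("inf")


# Synthetic document (hypothetical); substitute real pages for meaningful numbers
document = "<html><body>" + "<div class='row'><p>cell</p></div>" * 2000 + "</body></html>"

# Map a label to any callable that takes markup; other libraries plug in the same way
candidates = {"html.parser (stdlib)": parse_with_stdlib}

for name, parse in candidates.items():
    print(f"{name}: {docs_per_second(parse, document):.1f} docs/sec")
```

To compare libraries, add one callable per library to `candidates` and run every candidate against the same set of documents.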

When to Use Each Library

  • BeautifulSoup4: Default choice for most projects. Excellent documentation, forgiving parser, and the largest community. Use the lxml backend for better performance.
  • lxml: When you need XPath queries, or processing speed is critical. Also the best choice for XML-heavy workloads.
  • selectolax: When you need maximum CSS selector extraction speed and can trade XPath support for a 30-40% performance gain over lxml.
  • pyquery: When you prefer a jQuery-style API and are comfortable with CSS selectors. Great for DOM manipulation and rapid data extraction.
  • html5lib: When spec compliance is non-negotiable — for HTML sanitization tools, validators, or browser-like tree construction.
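The list above can be condensed into a small decision helper. This is only a sketch of one possible reading: the function `suggest_parser`, its flag names, and the order in which the requirements are checked are all assumptions, not a rule from any of the libraries.

```python
def suggest_parser(needs_spec_compliance=False, needs_xpath=False,
                   speed_critical=False, prefers_jquery_api=False):
    """Map project requirements to one of the five libraries compared here.

    The priority order (compliance, then XPath, then speed, then API taste)
    is an assumption made for this sketch.
    """
    if needs_spec_compliance:
        return "html5lib"
    if needs_xpath:
        return "lxml"
    if speed_critical:
        return "selectolax"
    if prefers_jquery_api:
        return "pyquery"
    return "BeautifulSoup4 (lxml backend)"


print(suggest_parser())                                       # default choice
print(suggest_parser(speed_critical=True))                    # CSS-only, speed first
print(suggest_parser(needs_xpath=True, speed_critical=True))  # XPath outranks speed
```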

For related reading on web scraping infrastructure, see our self-hosted web scraping management guide and the C++ HTML parsing libraries comparison for the cross-language perspective.

FAQ

Which Python HTML parser is fastest for web scraping?

selectolax consistently benchmarks as the fastest CSS-selector-based parser, followed closely by lxml. For pure CSS extraction at scale, selectolax’s Modest engine processes 30-40% more documents per second than lxml. If you also need XPath, lxml is the fastest option.

Why would I use BeautifulSoup4 instead of a faster library like lxml?

BeautifulSoup4’s main advantages are its forgiving handling of broken markup and its extensive documentation. It is built to return a workable tree even from severely broken HTML, whichever backend does the parsing. Its API is also more intuitive for beginners: soup.find_all() and CSS selectors via .select() are easier to learn than XPath expressions.

Can I use lxml as the backend for BeautifulSoup4?

Yes. Install lxml (pip install lxml) and pass "lxml" as the parser: BeautifulSoup(html, "lxml"). This gives you lxml’s speed with BeautifulSoup’s API. It is the recommended configuration for production use.
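If the code may run where lxml is not installed, a fallback keeps it working. The helper below is a minimal sketch; the name `make_soup` is hypothetical, and it relies on BeautifulSoup raising `FeatureNotFound` when the requested backend is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with the lxml backend when available, else fall back to html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; the stdlib parser is always available
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p class='content'>World</p>")
print(soup.select_one("p.content").get_text())
```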

What is the difference between selectolax and lxml for CSS selectors?

selectolax uses the Modest HTML5 engine, which is purpose-built for CSS selector matching. lxml uses libxml2/libxslt (primarily an XML engine) and relies on cssselect as an adapter layer that translates each CSS selector into an XPath expression. For pure CSS extraction, selectolax is faster; for mixed CSS/XPath workloads, lxml is more versatile.

Is html5lib still maintained?

html5lib sees only occasional maintenance updates. pip vendored it for years to parse package index pages before moving to the standard library’s html.parser. For most practical purposes, lxml’s HTML parser, which recovers from malformed markup by default (recover=True), handles broken HTML well enough that html5lib’s spec compliance is rarely needed outside of validation and testing tools.
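As a rough illustration of that recovery behavior, the sketch below feeds lxml the same unclosed-paragraph input used in the html5lib section. Passing `recover=True` only spells out the default.

```python
from lxml import etree

malformed = "<p>Paragraph 1<p>Paragraph 2"  # neither <p> is closed

# recover=True is already the default for HTMLParser; it is written out for clarity
parser = etree.HTMLParser(recover=True)
root = etree.fromstring(malformed, parser)

# libxml2 closes the first <p> when the second one opens
paragraphs = [p.text for p in root.iter("p")]
print(paragraphs)
```

This covers common cases such as unclosed tags. The resulting tree is not guaranteed to match a browser's for every malformed input, which is where html5lib still has a role.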

Does pyquery support all jQuery selectors?

pyquery supports most CSS3 selectors and jQuery-specific extensions like :first, :last, :even, :odd, :contains(), :has(), and :header. It does not support jQuery’s animation or event handling — it is purely a DOM query and manipulation library.

