Introduction
When working with web data in Python, HTML parsing is a fundamental task. Whether you are extracting structured data from a webpage, cleaning up scraped content, or building a content transformation pipeline, choosing the right HTML parser significantly impacts performance, code readability, and reliability. Python offers a surprisingly rich ecosystem of HTML parsing libraries — each with different design philosophies, speed characteristics, and feature sets.
This article compares five popular Python HTML parsing libraries: BeautifulSoup4, lxml, selectolax, pyquery, and html5lib. We evaluate each on parsing speed, CSS selector support, API design, error tolerance, and real-world suitability.
| Library | GitHub Stars | Speed | CSS Selectors | XPath | HTML5 Tolerance | API Style |
|---|---|---|---|---|---|---|
| BeautifulSoup4 | ~14K (requests-html ecosystem) | Medium | Good (CSS Level 3+) | Limited | Very High | Object-oriented DOM |
| lxml | 3,041 | Very Fast | Good (cssselect) | Excellent | Moderate | ElementTree + XPath |
| selectolax | 1,647 | Fastest | Excellent | No | Good | Modest/Lexbor-based |
| pyquery | 2,380 | Fast | Excellent (jQuery) | No | Good | jQuery-like |
| html5lib | 1,224 | Slow | Limited | No | Perfect | Pure Python DOM |
BeautifulSoup4: The User-Friendly Default
BeautifulSoup4 is the most widely adopted HTML parsing library in the Python ecosystem. Its primary strength is its forgiving parser — it handles malformed HTML gracefully, producing a workable DOM tree even from severely broken markup.
BeautifulSoup supports multiple underlying parsers: html.parser (stdlib), lxml (fast), and html5lib (spec-compliant). You can switch between them by changing a single constructor argument while the rest of your code stays the same, which makes it very versatile.
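As a minimal sketch of the parser-switching idea, the snippet below runs one extraction function over deliberately malformed markup. The sample HTML and the `extract_links` helper are illustrative assumptions, not part of BeautifulSoup; only the stdlib `html.parser` backend is exercised here, so nothing beyond `beautifulsoup4` needs to be installed.

```python
from bs4 import BeautifulSoup

# Illustrative, deliberately malformed markup: the <li> and <p> tags are never closed.
BROKEN_HTML = (
    "<ul><li><a href='/a'>First</a><li><a href='/b'>Second</a></ul>"
    "<p>Trailing text"
)


def extract_links(html: str, parser: str = "html.parser") -> list[tuple[str, str]]:
    """Return (link text, href) pairs using the named BeautifulSoup backend."""
    soup = BeautifulSoup(html, parser)
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]


links = extract_links(BROKEN_HTML)
print(links)
```

Passing "lxml" or "html5lib" as the second argument selects those backends instead, provided the corresponding package is installed. The tree built from broken markup can differ between backends, so it is worth checking results when switching.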
Best for: Beginners, ad-hoc scraping, and projects where HTML quality is unpredictable.
lxml: The Speed-Focused Powerhouse
lxml is a Python binding to the libxml2 and libxslt C libraries, making it one of the fastest XML/HTML parsers available. It supports both XPath and CSS selectors (via cssselect), giving developers powerful query capabilities.
lxml also provides lxml.objectify for working with XML as Python objects, plus an HTML sanitizer that shipped as lxml.html.clean until lxml 5.2 and now lives in the separate lxml_html_clean package. Its performance advantage becomes apparent when processing thousands of documents: it can be around an order of magnitude faster than BeautifulSoup with its default html.parser backend.
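A short sketch of XPath extraction with lxml might look like the following. The sample document and the variable names are assumptions made for illustration; only `lxml.html.fromstring` and the element `xpath` method are used.

```python
from lxml import html

# Illustrative sample document (an assumption, not taken from a real site).
DOC = """
<html><body>
  <div class="article"><h2>Title A</h2><a href="/a">Read</a></div>
  <div class="article"><h2>Title B</h2><a href="/b">Read</a></div>
</body></html>
"""

tree = html.fromstring(DOC)

# text() returns the text nodes, @href returns the attribute values.
titles = tree.xpath('//div[@class="article"]/h2/text()')
hrefs = tree.xpath('//div[@class="article"]/a/@href')

print(titles)
print(hrefs)
```

With the optional cssselect package installed, `tree.cssselect("div.article a")` would select the same anchor elements using a CSS selector instead of XPath.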
Best for: High-throughput scraping pipelines, XML processing, and projects requiring XPath.
selectolax: The Modern Speed Champion
selectolax is a relatively new entry that binds to the Modest and Lexbor HTML5 engines, both written in C. It is designed for one thing: extremely fast CSS-selector-based parsing with HTML5-compliant output.
selectolax consistently outperforms lxml on pure CSS selector extraction tasks in benchmarks. Its Modest engine is optimized specifically for the CSS selector use case — common in web scraping — and avoids the overhead of lxml’s XPath engine when you do not need XPath.
Best for: Performance-critical scraping where CSS selectors are sufficient.
pyquery: The jQuery API for Python
pyquery provides a jQuery-like API for parsing and manipulating HTML documents. If you come from a frontend background and are comfortable with jQuery’s $() syntax, pyquery will feel instantly familiar.
Under the hood, pyquery uses lxml for parsing, so you get lxml-level performance with a jQuery-flavored API. It supports most CSS3 selectors and includes utility methods for attribute manipulation, class toggling, and DOM traversal that mirror jQuery conventions.
Best for: Developers familiar with jQuery, projects involving DOM manipulation, and rapid prototyping.
html5lib: The Spec-Compliant Parser
html5lib is a pure-Python implementation of the HTML5 parsing specification, following the exact tree construction algorithm defined in the WHATWG HTML standard. It creates the same DOM tree that a web browser would produce for any given HTML input.
html5lib is the most spec-compliant parser in the Python ecosystem. It handles edge cases that even lxml and selectolax may get wrong — implicit tag closures, misnested elements, invalid character references — exactly as specified by the HTML standard.
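A small sketch of implicit tag closure, assuming the default etree tree builder: the input below never closes its first paragraph and has no html, head, or body tags, yet the parser supplies all of them as the HTML standard requires. The input string is an illustrative assumption.

```python
import html5lib

# Two paragraphs with no closing tags and no <html>/<head>/<body> wrapper.
FRAGMENT = "<p>One<p>Two"

# namespaceHTMLElements=False keeps tag names plain ("p" rather than a
# namespaced name), which makes the ElementTree queries below simpler.
tree = html5lib.parse(FRAGMENT, treebuilder="etree", namespaceHTMLElements=False)

paragraphs = [p.text for p in tree.iter("p")]
has_head = tree.find("head") is not None
has_body = tree.find("body") is not None

print(tree.tag, paragraphs, has_head, has_body)
```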
The tradeoff is performance: html5lib is substantially slower than lxml or selectolax, by a factor of roughly 200-300 in the benchmark table later in this article. It is best used when correctness matters more than speed.
Best for: HTML validation, browser-compatible parsing, and cases where malformed HTML must be handled per spec.
Performance Benchmarks
The table below gives approximate throughput for each library when parsing a 180KB HTML document (a typical news article page). Relative speed is expressed against BeautifulSoup4 with the html.parser backend.
| Library | Documents/sec | Relative Speed | Memory Usage |
|---|---|---|---|
| selectolax (Modest) | ~580 | 18x | Low |
| lxml | ~420 | 13x | Low |
| pyquery | ~400 | 12x | Low-Med |
| BeautifulSoup4 (lxml backend) | ~210 | 6.5x | Medium |
| BeautifulSoup4 (html.parser) | ~35 | 1x | Medium |
| html5lib | ~2 | 0.06x | High |
For production pipelines processing thousands of pages per hour, selectolax or lxml offer significant throughput advantages. BeautifulSoup4 provides the best development experience at the cost of some performance.
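Numbers like these depend heavily on hardware, library versions, and the input document, so it is worth measuring on your own pages. The harness below is a rough sketch, not the method used for the table above: the synthetic `SAMPLE` document, the `docs_per_second` helper, and the repeat count are all assumptions, and the baseline is the raw stdlib parser rather than BeautifulSoup. Optional libraries are timed only if they are installed.

```python
import time
from html.parser import HTMLParser


class _NullParser(HTMLParser):
    """Stdlib parser that tokenizes the input and discards every event."""


def parse_stdlib(html: str) -> None:
    parser = _NullParser()
    parser.feed(html)
    parser.close()


def docs_per_second(parse, html: str, repeats: int = 20) -> float:
    """Time `repeats` parses of the same document and return the rate."""
    start = time.perf_counter()
    for _ in range(repeats):
        parse(html)
    return repeats / (time.perf_counter() - start)


# Synthetic document of roughly 90KB; swap in a real page for meaningful numbers.
SAMPLE = (
    "<html><body>"
    + "<div class='item'><a href='/x'>link</a></div>" * 2000
    + "</body></html>"
)

candidates = {"html.parser (stdlib)": parse_stdlib}

try:
    from selectolax.parser import HTMLParser as SelectolaxParser
    candidates["selectolax"] = lambda h: SelectolaxParser(h)
except ImportError:
    pass

try:
    import lxml.html
    candidates["lxml"] = lxml.html.fromstring
except ImportError:
    pass

results = {name: docs_per_second(fn, SAMPLE) for name, fn in candidates.items()}
baseline = results["html.parser (stdlib)"]

for name, rate in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name:22s} {rate:8.1f} docs/sec  {rate / baseline:5.1f}x")
```

This measures parsing only. A fuller comparison would also time selector queries, since that is where the libraries' query engines differ.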
When to Use Each Library
- BeautifulSoup4: Default choice for most projects. Excellent documentation, forgiving parser, and the largest community. Use the lxml backend for better performance.
- lxml: When you need XPath queries, or processing speed is critical. Also the best choice for XML-heavy workloads.
- selectolax: When you need maximum CSS selector extraction speed and can trade XPath support for a 30-40% performance gain over lxml.
- pyquery: When you prefer a jQuery-style API and are comfortable with CSS selectors. Great for DOM manipulation and rapid data extraction.
- html5lib: When spec compliance is non-negotiable — for HTML sanitization tools, validators, or browser-like tree construction.
For related reading on web scraping infrastructure, see our self-hosted web scraping management guide and the C++ HTML parsing libraries comparison for the cross-language perspective.
FAQ
Which Python HTML parser is fastest for web scraping?
selectolax consistently benchmarks as the fastest CSS-selector-based parser, followed closely by lxml. For pure CSS extraction at scale, selectolax’s Modest engine processes 30-40% more documents per second than lxml. If you also need XPath, lxml is the fastest option.
Why would I use BeautifulSoup4 instead of a faster library like lxml?
BeautifulSoup4’s main advantages are its forgiving parsing and extensive documentation. It copes with severely broken HTML that can trip up lxml or selectolax. Its API is also more intuitive for beginners: soup.find_all() and CSS selectors via .select() are easier to learn than XPath expressions.
Can I use lxml as the backend for BeautifulSoup4?
Yes. Install lxml (pip install lxml) and pass "lxml" as the parser: BeautifulSoup(html, "lxml"). This gives you lxml’s speed with BeautifulSoup’s API. It is the recommended configuration for production use.
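One way to use this configuration without making lxml a hard requirement is to fall back to the stdlib parser when lxml is missing. This is a sketch under that assumption; `make_soup` is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html: str) -> BeautifulSoup:
    """Prefer the lxml backend, falling back to html.parser if lxml is absent."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is not installed.
        return BeautifulSoup(html, "html.parser")


soup = make_soup("<p>Hello, <b>world</b></p>")
print(soup.p.get_text())
```

Because the two backends can build slightly different trees from malformed input, a silent fallback like this is probably best reserved for cases where the markup is reasonably clean.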
What is the difference between selectolax and lxml for CSS selectors?
selectolax uses the Modest HTML5 engine, which is purpose-built for CSS selector matching. lxml uses libxml2/libxslt (which is primarily an XML engine) with cssselect as an adapter layer. For pure CSS extraction, selectolax is faster; for mixed CSS/XPath workloads, lxml is more versatile.
Is html5lib still maintained?
html5lib sees occasional maintenance updates and is used as the standard parser in the official pip package manager’s documentation pipeline. For most practical purposes, lxml with recover=True handles malformed HTML well enough that html5lib’s spec compliance is rarely needed outside of validation and testing tools.
Does pyquery support all jQuery selectors?
pyquery supports most CSS3 selectors and jQuery-specific extensions like :first, :last, :even, :odd, :contains(), :has(), and :header. It does not support jQuery’s animation or event handling — it is purely a DOM query and manipulation library.