Why HTML Parsing Matters in Native Code
When building web scrapers, security scanners, content extractors, or browser engines, you need a fast and reliable HTML parser. While Python developers reach for BeautifulSoup and JavaScript developers use cheerio, C and C++ developers have a different set of options — native libraries that parse HTML5 at the speed of compiled code.
This article compares three of the most popular open-source C/C++ HTML parsing libraries: Google’s Gumbo Parser, Lexbor, and MyHTML. Each takes a different approach to the same problem, and the right choice depends on your performance requirements, API preferences, and maintenance expectations.
Library Overview
| Feature | Gumbo Parser | Lexbor | MyHTML |
|---|---|---|---|
| GitHub Stars | ~5,189 | ~2,028 | ~1,713 |
| Language | C99 | C | C |
| Last Updated | Jan 2026 | Jun 2026 | Jan 2025 |
| HTML5 Spec | Full | Full | Full |
| Threading | Single-threaded | Single-threaded | Multi-threaded |
| DOM Output | Parse tree | DOM tree + Render tree | DOM tree |
| CSS Selectors | Via third-party | Native | No |
| Memory Model | Arena allocator | Custom allocator | Custom allocator |
| License | Apache 2.0 | Apache 2.0 | LGPL 2.1 |
Google Gumbo Parser: The Battle-Tested Standard
Gumbo is an HTML5 parsing library written in pure C99 by Google. It was designed to be a fully conformant HTML5 parser with no external dependencies — just drop the source files into your project and compile.
Strengths:
- Zero dependencies — pure C99, compiles everywhere from embedded systems to mainframes
- Full HTML5 spec compliance: the project reports passing all html5lib-0.95 tests
- Arena-based allocation — fast, predictable memory usage, excellent for batch processing
- Simple, stable API — hasn’t changed significantly since release
Weaknesses:
- No CSS selector support: you must manually traverse the tree or use wrappers like gumbo-query
- Slower update cadence: last commit was January 2026 after years of minimal changes
- No streaming API — must parse the entire document before processing
Gumbo powers many popular projects including the Zircon kernel’s documentation tools and several Python HTML parsing backends.
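Without selector support, lookups against a parse tree are written as explicit traversals. The sketch below shows the general shape of such a walk in C. The `Node` struct and both function names are hypothetical stand-ins for illustration, not Gumbo's actual types, which carry more fields.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical minimal element node; real parser libraries expose richer types. */
typedef struct Node {
    const char *tag;          /* element name, e.g. "a" */
    struct Node **children;   /* array of child pointers, NULL when there are none */
    size_t child_count;
} Node;

/* Depth-first count of elements whose tag equals `tag` (case-sensitive). */
size_t count_by_tag(const Node *node, const char *tag) {
    assert(tag != NULL);
    if (node == NULL) return 0;
    size_t hits = (strcmp(node->tag, tag) == 0) ? 1 : 0;
    for (size_t i = 0; i < node->child_count; i++) {
        hits += count_by_tag(node->children[i], tag);
    }
    return hits;
}

/* Return the first matching element in document order, or NULL if none. */
const Node *find_first(const Node *node, const char *tag) {
    assert(tag != NULL);
    if (node == NULL) return NULL;
    if (strcmp(node->tag, tag) == 0) return node;
    for (size_t i = 0; i < node->child_count; i++) {
        const Node *hit = find_first(node->children[i], tag);
        if (hit != NULL) return hit;
    }
    return NULL;
}
```

Every query beyond a plain tag match (class, attribute, descendant relationships) needs another hand-written predicate of this kind, which is the cost that selector wrappers remove.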
Lexbor: The Full-Featured Modern Alternative
Lexbor takes a more ambitious approach — it’s not just an HTML parser but a complete HTML rendering engine in development. It parses HTML5 into a DOM tree and provides CSS selector matching natively.
Strengths:
- Native CSS selectors — query elements without third-party libraries
- Actively maintained — latest commit June 2026, vibrant development community
- Modular architecture — separate modules for HTML, CSS, URL, encoding
- Full DOM tree — supports mutation, insertion, and node manipulation after parsing
Weaknesses:
- Larger codebase: more modules to configure and a more involved CMake-based build
- Heavier memory footprint — full DOM with live collections uses more RAM than Gumbo’s arena model
- Less battle-tested — newer project with smaller deployment base than Gumbo
Lexbor is the best choice when you need CSS selectors without external dependencies and value active development.
MyHTML: The Multi-Threaded Speed Demon
MyHTML comes from the same developer as Lexbor (Alexander Borisov) but takes a fundamentally different approach. It is designed from the ground up for threaded parsing — you can parse multiple HTML documents in parallel using worker threads.
Strengths:
- Multi-threaded parsing — designed for high-throughput scenarios where you need to parse hundreds of documents simultaneously
- Streaming/chunked parsing — can parse HTML as it arrives over the network
- Shared lineage: MyHTML and Lexbor come from the same author and share architectural ideas
- Compact API surface — smaller and simpler than Lexbor
Weaknesses:
- Slower maintenance — last update January 2025, effectively in maintenance mode
- No CSS selectors — must traverse the DOM tree manually
- LGPL license — more restrictive than Apache 2.0 for commercial embedding
- Documentation gaps — less comprehensive than Gumbo’s docs
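Chunked parsing, listed among the strengths above, depends on the parser keeping its state between calls so that a tag split across two network reads is still recognized. The following is a toy illustration of that idea only: `TagScanner`, `scanner_init`, and `scanner_feed` are hypothetical names, and the state machine counts start tags without handling comments, scripts, or quoted `>` characters the way a real HTML5 tokenizer does.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef enum { ST_TEXT, ST_TAG_OPEN, ST_IN_TAG } ScanState;

/* State that must survive between chunks. */
typedef struct {
    ScanState state;
    size_t start_tags;   /* number of start tags seen so far */
} TagScanner;

void scanner_init(TagScanner *s) {
    assert(s != NULL);
    s->state = ST_TEXT;
    s->start_tags = 0;
}

/* Consume one chunk; may be called repeatedly as data arrives. */
void scanner_feed(TagScanner *s, const char *chunk, size_t len) {
    assert(s != NULL && (chunk != NULL || len == 0));
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)chunk[i];
        switch (s->state) {
        case ST_TEXT:
            if (c == '<') s->state = ST_TAG_OPEN;
            break;
        case ST_TAG_OPEN:
            if (isalpha(c)) {
                s->start_tags++;          /* "<" followed by a letter: start tag */
                s->state = ST_IN_TAG;
            } else if (c == '/' || c == '!' || c == '?') {
                s->state = ST_IN_TAG;     /* end tag, comment, or declaration */
            } else if (c != '<') {
                s->state = ST_TEXT;       /* a stray "<" was just text */
            }
            break;
        case ST_IN_TAG:
            if (c == '>') s->state = ST_TEXT;
            break;
        }
    }
}
```

Feeding `"<div><"`, then `"p>hi</p"`, then `"></div>"` yields the same count as feeding the whole string at once, because the position inside a tag is stored in the scanner rather than on the call stack.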
Setting Up Docker Environments for HTML Processing
For projects that run HTML parsing in containerized microservices, the C++ HTML processing service can be packaged and deployed with Docker Compose.
Performance Considerations
For most use cases, raw parsing speed differs by less than 15% between the three libraries. The real performance differentiators are:
- Memory allocation strategy: Gumbo’s arena allocator excels at batch processing (parse 10,000 small documents, free them all at once). Lexbor’s custom allocator is better for long-running server processes where documents live for different durations.
- Threading model: If you process documents in a worker pool, MyHTML’s thread-safe design eliminates mutex contention. With Gumbo or Lexbor, you would create one parser instance per thread.
- Streaming vs batch: MyHTML supports incremental/chunked parsing, making it the best choice for streaming HTML over network connections where latency matters.
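The batch pattern described in the first point (allocate many small objects, release them all at once) can be sketched with a minimal bump allocator. This is a generic illustration, not code from any of the three libraries: the `Arena` type, its function names, and the fixed 16-byte alignment are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed alignment, sufficient for typical node structs on common platforms. */
enum { ARENA_ALIGN = 16 };

typedef struct {
    unsigned char *base;   /* start of the backing block */
    size_t capacity;       /* size of the backing block in bytes */
    size_t used;           /* bytes handed out so far */
} Arena;

/* Reserve the backing block; returns 1 on success, 0 on failure. */
int arena_init(Arena *a, size_t capacity) {
    assert(a != NULL);
    a->base = malloc(capacity);
    a->capacity = (a->base != NULL) ? capacity : 0;
    a->used = 0;
    return a->base != NULL;
}

/* Bump-allocate `size` bytes; returns NULL when the arena is full. */
void *arena_alloc(Arena *a, size_t size) {
    assert(a != NULL);
    size_t start = (a->used + ARENA_ALIGN - 1) / ARENA_ALIGN * ARENA_ALIGN;
    if (start > a->capacity || size > a->capacity - start) return NULL;
    a->used = start + size;
    return a->base + start;
}

/* Discard every allocation at once; the block is kept for reuse. */
void arena_reset(Arena *a) {
    assert(a != NULL);
    a->used = 0;
}

/* Return the backing block to the system. */
void arena_free(Arena *a) {
    assert(a != NULL);
    free(a->base);
    a->base = NULL;
    a->capacity = 0;
    a->used = 0;
}
```

A batch job would call `arena_reset` between documents, so freeing a whole tree costs one assignment. The trade-off is that nothing can be released individually, which is why this model suits short-lived batches better than long-running servers.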
Why Self-Host Your HTML Processing Pipeline?
Running your own HTML processing infrastructure gives you complete control over data privacy and processing logic. For web scraping at scale, a self-hosted setup avoids the rate limits, IP blocks, and data-sharing risks of third-party scraping APIs. You can also customize the parsing pipeline to extract exactly the data structures your application needs.
For web scraping frameworks, see our web scraping tools guide. If you need to parse structured data formats beyond HTML, check our XML parsing libraries comparison. For URL handling before fetching pages, our URL parsing libraries guide covers the full pipeline.
FAQ
Which parser should I choose for a new project in 2026?
For new projects, Lexbor offers the best balance of modern features (native CSS selectors, modular design, active maintenance) and performance. Choose Gumbo if you need zero-dependency portability or are working in resource-constrained environments. Choose MyHTML only if you specifically need multi-threaded or streaming parsing at scale.
Can I use these libraries from Python or other languages?
Yes. Gumbo has excellent bindings for Python (gumbo on PyPI), Ruby, Rust, and Go. Lexbor has Python bindings under active development. MyHTML has Python bindings via myhtml-py. For production Python projects, the Gumbo Python bindings are the most mature.
How do these compare to libxml2’s HTML parser?
libxml2’s HTML parser (the HTMLparser module) is older and less HTML5-compliant than these three. It handles malformed HTML reasonably well but does not follow the full HTML5 tree construction algorithm. For modern web content with complex HTML5 features (custom elements, templates, Shadow DOM hints), use a dedicated HTML5 parser like Gumbo or Lexbor.
Are these parsers safe against malicious HTML input?
All three parsers are designed to handle malformed HTML gracefully — that is part of the HTML5 specification. However, they do not include built-in XSS filtering or sanitization. For security applications, pair the parser with a separate sanitizer that strips dangerous elements and attributes. The parser’s job is to build a correct tree; sanitization is a separate concern.
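The separate sanitizer step usually comes down to allowlist checks applied to each element and attribute while walking the parsed tree. The sketch below shows two such checks. All function names and the allowlist contents are hypothetical, and the rules are deliberately incomplete: a production sanitizer also has to deal with `data:` URLs, inline styles, and the control characters browsers ignore inside URL schemes.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* ASCII case-insensitive equality. */
int equals_ci(const char *a, const char *b) {
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) return 0;
        a++;
        b++;
    }
    return *a == *b;
}

/* ASCII case-insensitive test: does `s` begin with `prefix`? */
int starts_with_ci(const char *s, const char *prefix) {
    for (; *prefix; s++, prefix++) {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix)) return 0;
    }
    return 1;
}

/* Keep only elements on a small, illustrative allowlist. */
int is_allowed_tag(const char *tag) {
    static const char *const allowed[] = {
        "p", "a", "em", "strong", "ul", "ol", "li", "br"
    };
    assert(tag != NULL);
    for (size_t i = 0; i < sizeof allowed / sizeof allowed[0]; i++) {
        if (equals_ci(tag, allowed[i])) return 1;
    }
    return 0;
}

/* Reject event handlers and javascript: URLs in href/src. */
int is_safe_attribute(const char *name, const char *value) {
    assert(name != NULL && value != NULL);
    if (starts_with_ci(name, "on")) return 0;   /* onclick, onerror, ... */
    if (equals_ci(name, "href") || equals_ci(name, "src")) {
        while (isspace((unsigned char)*value)) value++;   /* skip leading whitespace */
        if (starts_with_ci(value, "javascript:")) return 0;
    }
    return 1;
}
```

An allowlist is used rather than a blocklist so that unknown or newly introduced elements are dropped by default.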
What about performance with very large documents?
For documents over 10MB, memory becomes the main concern. Gumbo keeps the whole parse tree alive until you destroy the output, so parsing many large documents without freeing them grows the process footprint quickly. Lexbor’s incremental memory management handles large documents more gracefully. MyHTML’s streaming mode is ideal for documents where you only need the first N elements and can discard the rest.
Can these parsers handle non-English content and encodings?
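Support varies. Gumbo accepts only UTF-8 input, so documents in other encodings must be converted before parsing. Lexbor and MyHTML each include an encoding component for handling legacy encodings such as Latin-1, Shift-JIS, and GBK. Bidirectional and right-to-left text passes through unchanged as Unicode code points. Unicode normalization is outside the scope of HTML parsing, so apply it separately if you need it.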
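Whichever component handles decoding, one of the first signals checked during encoding detection is a byte-order mark at the start of the input. A minimal sketch of that check follows. The `Bom` enum and `sniff_bom` function are hypothetical names for illustration, and the later detection steps (transport headers, `<meta>` prescan) are not shown.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { BOM_NONE, BOM_UTF8, BOM_UTF16BE, BOM_UTF16LE } Bom;

/* Inspect the first bytes of `buf`; on a match, store the BOM length
 * in *bom_len so the caller can skip it before parsing. */
Bom sniff_bom(const unsigned char *buf, size_t len, size_t *bom_len) {
    assert(bom_len != NULL && (buf != NULL || len == 0));
    *bom_len = 0;
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *bom_len = 3;
        return BOM_UTF8;
    }
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *bom_len = 2;
        return BOM_UTF16BE;
    }
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *bom_len = 2;
        return BOM_UTF16LE;
    }
    return BOM_NONE;   /* fall through to other detection steps */
}
```

When the result is `BOM_UTF16BE` or `BOM_UTF16LE` and the parser expects UTF-8, the input needs transcoding first; with `BOM_UTF8`, skipping `bom_len` bytes is enough.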