Why Self-Host a Graph Database
Graph databases excel at modeling and querying highly connected data — relationships are first-class citizens, not an afterthought computed via expensive JOINs at query time. They power recommendation engines, fraud detection pipelines, knowledge graphs, network topology maps, and social network analytics.
Self-hosting a graph database gives you complete control over your data:
- Data sovereignty: No third-party cloud vendor reads or monetizes your relationship data
- Cost predictability: Managed cloud graph databases typically bill by instance size, storage, or usage, so costs grow with the workload; a self-hosted deployment is a flat server cost regardless of query volume
- Custom integrations: Full access to the database engine, backup tooling, and monitoring APIs
- Compliance: Meet GDPR, HIPAA, or internal data residency requirements by keeping everything on your own infrastructure
- Performance tuning: Adjust memory allocation, storage engines, and replication factors to match your specific workload
Whether you are building a product recommendation system for an e-commerce store, modeling an IT infrastructure for a security audit, or running a knowledge graph over your organization’s documentation, a self-hosted graph database is a strong fit whenever control over data, cost, and tuning matters.
What Is a Graph Database?
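A graph database stores data as nodes (entities), edges (relationships), and properties (key-value pairs on both). In a relational database, relationships are implicit: foreign keys are resolved through index lookups or JOINs at query time. A graph database stores each relationship as a record of its own, and native engines such as Neo4j link those records with direct pointers (index-free adjacency). The cost of following one hop therefore stays roughly constant as the dataset grows, and the cost of a traversal depends on the number of relationships it visits rather than on the total size of the graph.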
Core Concepts
| Concept | Description | Example |
|---|---|---|
| Node | An entity or object | User, Product, Server |
| Edge | A directed relationship between two nodes | FRIEND_OF, BOUGHT, CONNECTED_TO |
| Property | Key-value data attached to a node or edge | name: "Alice", since: 2023 |
| Label/Type | Category classification for nodes or edges | Person, Company |
| Path | A sequence of connected nodes and edges | Alice → FRIEND_OF → Bob → BOUGHT → Laptop |
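The concepts in the table can be sketched in a few lines of Python. This is only an in-memory illustration, not how any of the three databases is implemented; the class and function names (`Node`, `Edge`, `connect`, `follow`) are made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An entity with a label and key-value properties."""
    label: str
    properties: dict
    # Outgoing edges live on the node itself, so following one does not
    # require a lookup in a global index or join table.
    out_edges: list = field(default_factory=list)


@dataclass
class Edge:
    """A directed, typed relationship that can carry its own properties."""
    type: str
    target: Node
    properties: dict = field(default_factory=dict)


def connect(source, edge_type, target, **properties):
    """Attach a directed edge from source to target."""
    source.out_edges.append(Edge(edge_type, target, properties))


def follow(start, *edge_types):
    """Walk one edge of each given type, in order, and describe the path."""
    parts = [start.properties["name"]]
    current = start
    for edge_type in edge_types:
        edge = next(e for e in current.out_edges if e.type == edge_type)
        parts += [edge.type, edge.target.properties["name"]]
        current = edge.target
    return " → ".join(parts)


alice = Node("Person", {"name": "Alice"})
bob = Node("Person", {"name": "Bob"})
laptop = Node("Product", {"name": "Laptop"})
connect(alice, "FRIEND_OF", bob, since=2023)
connect(bob, "BOUGHT", laptop)

print(follow(alice, "FRIEND_OF", "BOUGHT"))
# Alice → FRIEND_OF → Bob → BOUGHT → Laptop
```

Each step of `follow` only inspects the edges of the current node, which is why the work grows with the length of the path and not with the number of nodes in the graph.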
The Three Contenders
Neo4j — The Industry Standard
Neo4j is the most widely adopted graph database, created in 2007. It pioneered the property graph model and the Cypher query language. It is available in a free Community Edition (self-hosted, single-node) and a paid Enterprise Edition (clustering, causal consistency, online backups).
Best for: Teams that need mature tooling, extensive documentation, and the widest ecosystem support.
ArangoDB — The Multi-Model Contender
ArangoDB supports three data models in one engine: graph, document (JSON), and key-value. It uses AQL (ArangoDB Query Language), a SQL-like syntax extended with graph traversal operators. The open-source Community Edition is fully self-hostable and includes graph algorithms.
Best for: Teams that need graph capabilities alongside document storage without running separate databases.
NebulaGraph — The Distributed Scale-Out Option
NebulaGraph is designed from the ground up for massive, distributed graph workloads. It separates compute (graphd), storage (storaged), and metadata (metad) into independent services, allowing each to scale horizontally. It uses nGQL, a SQL-inspired query language. The Community Edition is open source and fully self-hostable.
Best for: Large-scale deployments with billions of nodes and trillions of edges where horizontal scaling is non-negotiable.
Feature Comparison
| Feature | Neo4j (Community) | ArangoDB (Community) | NebulaGraph (Community) |
|---|---|---|---|
| License | GPL-3.0 | Apache-2.0 up to 3.11; BSL 1.1 from 3.12 | Apache-2.0 |
| Query Language | Cypher | AQL | nGQL |
| Data Model | Property graph only | Multi-model (graph + document + key-value) | Property graph |
| Architecture | Single-node (CE) | Single-node or active-active cluster | Distributed (compute + storage separation) |
| Max Graph Size | ~34B nodes/edges (single node) | Limited by RAM/disk on single node | Virtually unlimited (horizontal scale) |
| ACID Transactions | ✅ Full | ✅ Full on a single node (limitations in cluster mode) | ⚠️ Limited (no general multi-statement transactions) |
| Graph Algorithms | ⚠️ Via the Graph Data Science plugin (free edition with limits) | ✅ Built-in (Pregel framework) | ✅ via NebulaGraph Algorithm |
| Full-Text Search | ✅ Built-in (Lucene-based full-text indexes) | ✅ Built-in | ⚠️ Via Elasticsearch plugin |
| Web UI | Neo4j Browser | ArangoDB Web UI | NebulaGraph Studio |
| Docker Support | ✅ Official image | ✅ Official image | ✅ Official Docker Compose setup |
| Kubernetes | ✅ Helm chart | ✅ Helm chart (KubeArangoDB) | ✅ Helm chart |
| Language Drivers | Java, Python, Go, .NET, JS, Rust | Java, Python, Go, .NET (C#), JS, Rust | Java, Python, Go, C++, Rust |
| Import Tools | neo4j-admin, APOC | arangoimport (CSV, JSON, TSV) | nebula-importer |
Query Language Comparison
All three databases let you express the same fundamental operations, but the syntax differs significantly.
Creating Nodes and Relationships
Neo4j (Cypher):
| |
ArangoDB (AQL):
| |
NebulaGraph (nGQL):
| |
Querying: Find All Products Bought by Friends of Alice
Neo4j (Cypher):
| |
ArangoDB (AQL):
| |
NebulaGraph (nGQL):
| |
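As a rough illustration of how the friends-of-Alice query might look in each language, the sketch below keeps one query string per database next to a plain Python reference implementation. The schema is an assumption made for this example: `Person` and `Product` labels in Neo4j, a `persons` vertex collection with `friend_of` and `bought` edge collections in ArangoDB, and a vertex ID of `"alice"` with `friend_of` and `bought` edge types in NebulaGraph. Adapt the names to your own data model.

```python
# One query per database, written against the assumed schema described above.
QUERIES = {
    "cypher": """
        MATCH (:Person {name: 'Alice'})-[:FRIEND_OF]->(:Person)-[:BOUGHT]->(p:Product)
        RETURN DISTINCT p.name AS product
    """,
    "aql": """
        FOR friend IN 1..1 OUTBOUND 'persons/alice' friend_of
          FOR product IN 1..1 OUTBOUND friend bought
            RETURN DISTINCT product.name
    """,
    "ngql": """
        GO FROM "alice" OVER friend_of YIELD dst(edge) AS friend
        | GO FROM $-.friend OVER bought YIELD DISTINCT properties($$).name AS product;
    """,
}

# Tiny in-memory dataset used to show what the queries are expected to return.
FRIEND_OF = {"alice": ["bob", "carol"], "bob": ["alice"]}
BOUGHT = {"bob": ["Laptop", "Mouse"], "carol": ["Laptop"], "alice": ["Phone"]}


def products_bought_by_friends(person):
    """Two-hop traversal: person -> friends -> products, without duplicates."""
    products = set()
    for friend in FRIEND_OF.get(person, []):
        products.update(BOUGHT.get(friend, []))
    return sorted(products)


print(products_bought_by_friends("alice"))
# ['Laptop', 'Mouse']
```

All three queries express the same two hops. Cypher draws the pattern, AQL nests one traversal inside another, and nGQL pipes the result of the first `GO` into the second.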
Key Takeaway
Cypher (Neo4j) has the most intuitive syntax — it visually resembles the graph you are querying. AQL (ArangoDB) feels closer to SQL with graph traversal extensions. nGQL (NebulaGraph) is also SQL-inspired but requires a schema definition phase before you can insert data.
Self-Hosted Installation Guides
Neo4j Community Edition
Neo4j Community runs as a single-node instance. It is straightforward to deploy with Docker.
| |
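After startup, access the web interface at http://localhost:7474. Log in as neo4j with the password configured at startup (selfhosted-password-2026 in this guide), and replace it with your own before exposing the instance to anyone else.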
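Containers can take a little while before the web interface answers. A small readiness check, sketched here with only the Python standard library, can make setup scripts more reliable. The function name `wait_for_http` and its defaults are illustrative choices, not part of any Neo4j tooling.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url, attempts=30, delay=2.0, timeout=3.0):
    """Return True once the URL answers over HTTP, False if it never does."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=timeout):
                return True
        except urllib.error.HTTPError:
            # The server replied with an error status (for example 401),
            # which still proves that it is up and listening.
            return True
        except OSError:
            # Connection refused, timeout, or DNS failure: not ready yet.
            if attempt < attempts - 1:
                time.sleep(delay)
    return False
```

Calling `wait_for_http("http://localhost:7474")` before opening the browser or running import jobs avoids racing the database during startup. The same helper works for any of the web interfaces in this guide.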
Docker Compose (persistent setup):
| |
ArangoDB Community Edition
ArangoDB supports both single-node and active-active cluster deployments.
| |
Access the web UI at http://localhost:8529.
Docker Compose with ArangoDB:
| |
NebulaGraph Community Edition
NebulaGraph uses a multi-service architecture. The recommended approach is Docker Compose with the official template.
| |
Custom Docker Compose:
| |
Connect to the cluster using the Nebula Console:
| |
Schema Design: Modeling a Social Network
To illustrate the practical differences, here is how you would model a social network with users, posts, likes, and follows in each database.
Neo4j Schema
Neo4j is schema-optional: you create nodes and relationships on the fly, and add indexes or constraints only where you need them:
| |
ArangoDB Schema
ArangoDB requires you to define vertex and edge collections:
| |
NebulaGraph Schema
NebulaGraph requires explicit schema definition before any data insertion:
| |
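The practical effect of a schema-first design can be sketched in Python: every insert is checked against declarations made up front, which is roughly what NebulaGraph does with tags and edge types. The property names (`name`, `joined`, `title`, `created`, `since`) and the `validate` helper are assumptions made for this illustration.

```python
# Hypothetical declarations for the social network: type name -> property types.
TAGS = {
    "user": {"name": str, "joined": int},
    "post": {"title": str, "created": int},
}
EDGE_TYPES = {
    "follows": {"since": int},  # user -> user
    "likes": {},                # user -> post, no properties
}


def validate(kind, name, properties):
    """Raise ValueError unless the record matches a declared tag or edge type."""
    catalog = TAGS if kind == "vertex" else EDGE_TYPES
    if name not in catalog:
        raise ValueError(f"undeclared {kind} type: {name}")
    declared = catalog[name]
    for key, value in properties.items():
        if key not in declared:
            raise ValueError(f"{name} has no property {key!r}")
        if not isinstance(value, declared[key]):
            raise ValueError(f"{name}.{key} expects {declared[key].__name__}")
    return True


validate("vertex", "user", {"name": "Alice", "joined": 2023})  # passes
validate("edge", "follows", {"since": 2024})                   # passes
```

A schema-optional database would accept a `comment` vertex or a misspelled property without complaint, while a call such as `validate("vertex", "comment", {})` fails immediately. That trade-off between flexibility and early error detection is the main modeling difference between the three systems.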
Performance Characteristics
When to Choose Each Database
| Scenario | Recommended | Why |
|---|---|---|
| Prototype / learn graph databases | Neo4j | Best documentation, largest community, Cypher is intuitive |
| Small to medium graphs (< 1B edges) | Neo4j or ArangoDB | Single-node performance is excellent for moderate datasets |
| Multi-model workloads | ArangoDB | Graph + document + key-value in one engine reduces infrastructure complexity |
| Large-scale graphs (10B+ edges) | NebulaGraph | Horizontal scale-out with storage-compute separation |
| Graph algorithms (PageRank, Louvain, shortest path) | ArangoDB | Built-in Pregel framework in Community Edition |
| Existing SQL team | ArangoDB or NebulaGraph | AQL and nGQL are SQL-inspired and feel familiar |
| Kubernetes-native deployment | NebulaGraph | Purpose-built for cloud-native with independent service scaling |
| Full-text search alongside graph | ArangoDB | Built-in full-text indexes without additional services |
Resource Requirements (Minimum)
| Database | RAM | CPU | Disk | Notes |
|---|---|---|---|---|
| Neo4j CE | 4 GB | 2 cores | 10 GB | JVM-based; heap size is the primary tuning knob |
| ArangoDB | 2 GB | 2 cores | 5 GB | RocksDB storage engine; V8 engine for JavaScript user-defined AQL functions |
| NebulaGraph | 8 GB | 4 cores | 20 GB | Three services running; each needs its own resources |
Backup and Recovery
Neo4j
| |
ArangoDB
| |
NebulaGraph
| |
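Whichever tool produces the backup files, an automated job usually needs a retention policy so the disk does not fill up. The sketch below keeps the newest files and deletes the rest. The `*.dump` pattern, the default of seven copies, and the function name `prune_backups` are assumptions chosen for this example.

```python
from pathlib import Path


def prune_backups(directory, keep=7, pattern="*.dump"):
    """Delete all but the `keep` most recently modified files matching pattern.

    Returns the names of the deleted files, oldest first.
    """
    backups = sorted(Path(directory).glob(pattern), key=lambda p: p.stat().st_mtime)
    # backups[:-0] would be empty, so keep=0 is handled explicitly.
    expired = backups[:-keep] if keep > 0 else backups
    for path in expired:
        path.unlink()
    return [path.name for path in expired]
```

Run it only after the new backup has completed and been verified, so that a failed backup never causes the last good copy to be deleted.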
Monitoring
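All three databases can be monitored with Prometheus and Grafana, although the mechanism differs. In Neo4j, the built-in metrics endpoints are part of Enterprise Edition, so check what your edition exposes before wiring up dashboards.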
Neo4j Metrics
| |
ArangoDB Metrics
ArangoDB includes a built-in _admin/metrics endpoint:
| |
Add to prometheus.yml:
| |
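Before wiring an endpoint into Prometheus, it can help to inspect what it returns. The sketch below parses the Prometheus text exposition format into a dictionary. The sample metric names are hypothetical and do not correspond to real ArangoDB metrics, and the parser assumes samples carry no trailing timestamp.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition format into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Split on the last space so label values containing spaces survive.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Hypothetical sample in the same format a metrics endpoint would return.
SAMPLE = """\
# HELP example_requests_total Hypothetical request counter
# TYPE example_requests_total counter
example_requests_total{method="GET"} 1027
example_memory_bytes 5.2e+07
"""

metrics = parse_metrics(SAMPLE)
print(metrics["example_memory_bytes"])
# 52000000.0
```

In practice the text would come from an HTTP request to the `_admin/metrics` endpoint, which normally requires authentication.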
NebulaGraph Metrics
| |
Security Hardening
Regardless of which database you choose, follow these practices for production self-hosted deployments:
- Never use default passwords — change all default credentials immediately after first startup
- Bind to localhost or internal network only — use a reverse proxy (Traefik, Caddy, or Nginx Proxy Manager) for external access with TLS termination
- Enable TLS for client connections — all three support encrypted connections between clients and the database server
- Restrict network access — use firewall rules to limit which IPs can reach database ports
- Enable audit logging — track all queries and administrative actions for compliance
- Regular backups — automate daily backups and test restoration procedures monthly
- Keep the database updated — graph databases receive regular security patches; subscribe to release notifications
Example: Reverse Proxy with Caddy
| |
Migration Between Graph Databases
If you need to migrate data between databases, the recommended approach is:
- Export to a neutral format — CSV or JSON with explicit node and edge files
- Transform the data — adapt property names, types, and relationship directions
- Import into the target — use the target database’s bulk import tool
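The transform step is often the most error-prone of the three, so a small sketch may help. It rewrites an edge CSV by renaming property columns and flipping the direction of selected relationship types. The column names (`src`, `dst`, `type`), the example edge types, and the function name `transform_edges` are assumptions made for illustration; real exports will use whatever headers your export query produces.

```python
import csv
import io


def transform_edges(source, rename=None, flip=None):
    """Rewrite edge CSV text: rename columns and reverse selected edge types.

    `source` must have `src`, `dst`, and `type` columns. `flip` maps an edge
    type to its replacement name; matching rows have src and dst swapped.
    """
    rename = rename or {}
    flip = flip or {}
    reader = csv.DictReader(io.StringIO(source))
    fieldnames = [rename.get(name, name) for name in reader.fieldnames]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row["type"] in flip:
            row["src"], row["dst"] = row["dst"], row["src"]
            row["type"] = flip[row["type"]]
        writer.writerow({rename.get(key, key): value for key, value in row.items()})
    return out.getvalue()


EDGES = "src,dst,type,since\nalice,bob,FRIEND_OF,2023\nlaptop,bob,BOUGHT_BY,2024\n"
print(transform_edges(EDGES, rename={"since": "start_year"}, flip={"BOUGHT_BY": "BOUGHT"}))
# src,dst,type,start_year
# alice,bob,FRIEND_OF,2023
# bob,laptop,BOUGHT,2024
```

Keeping the transformation in a script, and not in manual spreadsheet edits, makes the migration repeatable when you rehearse it on a staging environment.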
Example: Export from Neo4j to CSV
| |
Example: Import CSV into NebulaGraph
| |
| |
Final Verdict
Choose Neo4j if you value the largest ecosystem, the most intuitive query language, and have a dataset that fits on a single node. It is the safest choice for teams new to graph databases. The Community Edition is sufficient for development, prototyping, and moderate production workloads.
Choose ArangoDB if you need graph capabilities alongside document or key-value storage in a single engine. Its built-in graph algorithms, full-text search, and SQL-like query language make it the most versatile option. The active-active cluster mode provides high availability.
Choose NebulaGraph if you are dealing with massive graphs that exceed single-node capacity. Its storage-compute separation architecture, horizontal scalability, and cloud-native design make it the strongest of the three for graphs with billions of nodes and trillions of edges.
All three are production-ready, open source, and fully self-hostable. The best choice depends on your data volume, team expertise, and infrastructure requirements.
Frequently Asked Questions (FAQ)
Which one should I choose in 2026?
The best choice depends on your specific requirements:
- For beginners: Start with Neo4j; Cypher is the easiest of the three query languages to read, and its documentation and community are the largest
- For mixed workloads: Choose ArangoDB if you also need document or key-value storage in the same engine
- For very large graphs: Choose NebulaGraph when the data will not fit on a single node
- For privacy: All three can run entirely on your own infrastructure; review each project's telemetry settings before going to production
Refer to the Feature Comparison and When to Choose Each Database tables above for detailed breakdowns.
Can I migrate between these tools?
Yes. All three databases have bulk import and export tools, and the migration steps above apply in any direction. Always:
- Back up your current data
- Test the migration on a staging environment
- Check official migration guides in the documentation
Are there free versions available?
All tools in this guide offer free, open-source editions. Some also provide paid plans with additional features, priority support, or managed hosting.
How do I get started?
- Review the comparison table to identify your requirements
- Visit the official documentation for the database you picked
- Start with a Docker Compose setup for easy testing
- Join the community forums for troubleshooting