Code-only knowledge graph with vector search. A simplified version of Graphify focused on analyzing codebases using tree-sitter AST extraction and LanceDB vector embeddings.
- A Graphify clone
- Much simpler data model than Graphify
- Ingests code only; other formats such as video are not supported
- Supports semantic search
- Multi-language code analysis: Supports 26+ languages via tree-sitter (Python, JavaScript, TypeScript, Go, Rust, Java, C/C++, Ruby, Scala, PHP, etc.)
- Deterministic AST extraction: No LLM calls, fast and reproducible
- Community detection: Groups related code via Leiden/Louvain clustering
- Vector search: Find code by natural language queries, powered by sentence-transformers + LanceDB
- Graph traversal: Explore relationships: calls, imports, data flow
```bash
# Install dependencies in a virtual environment
uv sync --all-extras

# Or without Leiden (fallback to Louvain)
uv sync

# (Optional) Download embedding model for offline use
./download-model.sh                  # Linux/macOS
powershell -File download-model.ps1  # Windows
```

Or install with pip:

```bash
cd code-knowledge
pip install -e ".[leiden]"
```

This installs:

- `networkx` ≥3.0 — graph algorithms
- `tree-sitter` ≥0.23.0 + 21 language parsers
- `lancedb` — vector database
- `sentence-transformers` — embeddings (all-MiniLM-L6-v2)
- `graspologic` (optional `leiden` extra) — Leiden clustering (Linux/macOS, Python <3.13; Windows falls back to Louvain)
Analyze a codebase and build the graph + vector store:

```bash
uv run code-knowledge update /path/to/code
```

Outputs:

- `code-knowledge-out/graph.json` — 6-field minimal schema for nodes and edges
- `code-knowledge-out/vectors/` — LanceDB vector store

Search by natural language:

```bash
uv run code-knowledge query "what handles authentication"
```

Returns the top-10 matching nodes with metadata and neighbors.

Trace a path between two nodes, or inspect a single node:

```bash
uv run code-knowledge path "Session" "AuthToken"
uv run code-knowledge explain "ValidateToken"
```

`explain` shows metadata, outgoing calls, and incoming references.

```bash
uv run code-knowledge index
```

```bash
# Add to PATH for direct execution
uv tool install --editable .
code-knowledge update /path/to/code
code-knowledge query "what handles authentication"
```

By default, code-knowledge downloads the embedding model (22 MB) from Hugging Face Hub on first use. To avoid the HF warning and run fully offline:
```bash
# Download model once (requires internet)
./download-model.sh                  # Linux/macOS
powershell -File download-model.ps1  # Windows

# Then all subsequent uses work offline
uv run code-knowledge update /path/to/code
uv run code-knowledge query "..."
```

The model is cached in the project at `.cache/all-MiniLM-L6-v2/` and reused for all future runs. This makes the project portable and CI/CD-friendly:

- For CI/CD: commit `.cache/` to git or download it as part of setup
- For development: run `download-model.sh` once, then work offline
- Size: ~350 MB (all-MiniLM-L6-v2 + tokenizers)
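As a sketch of that lookup, a hypothetical helper could prefer the project-local cache and fall back to the Hub id. The cache path comes from the docs above; the helper name `model_source` and the fallback logic are illustrative assumptions, not the package's actual API.

```python
from pathlib import Path

# Cache location documented above; the helper name and the Hub-id fallback
# are illustrative assumptions, not the package's actual API.
HUB_ID = "sentence-transformers/all-MiniLM-L6-v2"
CACHE_DIR = Path(".cache/all-MiniLM-L6-v2")


def model_source(cache_dir: Path = CACHE_DIR) -> str:
    """Return the local cache path if present, else the Hub id
    (which triggers the one-time ~22 MB download on first use)."""
    if cache_dir.is_dir():
        return str(cache_dir)  # offline, portable, CI/CD-friendly
    return HUB_ID
```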
| Field | Type | Example |
|---|---|---|
| `id` | str | `"session_validatetoken"` |
| `label` | str | `"ValidateToken"` |
| `source_file` | str | `"auth/session.py"` |
| `source_location` | str | `"L42"` |
| `contributor` | str \| null | null |
| `community` | int | 2 |
| Field | Type | Example |
|---|---|---|
| `from` | str | `"session_validatetoken"` |
| `to` | str | `"session_authtoken"` |
| `relation` | str | `"calls"` |
Built-in relation types: calls, imports, imports_from, contains, inherits, extends, implements, references, cites, conceptually_related_to, shares_data_with, semantically_similar_to, rationale_for
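Combining the two schemas, a minimal self-check of a `graph.json` document could look like the sketch below. The field sets are copied from the tables above; the validator itself is illustrative, not the package's `validate.py`.

```python
import json

# Field names from the node/edge tables above; the exact file layout
# ({"nodes": [...], "edges": [...]}) is an assumption for illustration.
NODE_FIELDS = {"id", "label", "source_file", "source_location",
               "contributor", "community"}
EDGE_FIELDS = {"from", "to", "relation"}


def validate_graph(doc: dict) -> list[str]:
    """Return a list of schema problems (empty means the doc looks valid)."""
    errors, ids = [], set()
    for node in doc.get("nodes", []):
        missing = NODE_FIELDS - node.keys()
        if missing:
            errors.append(f"node {node.get('id')!r} missing {sorted(missing)}")
        ids.add(node.get("id"))
    for edge in doc.get("edges", []):
        missing = EDGE_FIELDS - edge.keys()
        if missing:
            errors.append(f"edge missing {sorted(missing)}")
        for endpoint in ("from", "to"):
            if edge.get(endpoint) not in ids:
                errors.append(f"edge references unknown node {edge.get(endpoint)!r}")
    return errors


doc = json.loads("""
{
  "nodes": [
    {"id": "session_validatetoken", "label": "ValidateToken",
     "source_file": "auth/session.py", "source_location": "L42",
     "contributor": null, "community": 2},
    {"id": "session_authtoken", "label": "AuthToken",
     "source_file": "auth/session.py", "source_location": "L10",
     "contributor": null, "community": 2}
  ],
  "edges": [
    {"from": "session_validatetoken", "to": "session_authtoken",
     "relation": "calls"}
  ]
}
""")
print(validate_graph(doc))  # []
```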
Each node is embedded as:

```
{label} community {community_id} {relation} {neighbor_label} {relation} {neighbor_label} ...
```

Example:

```
Session community 0 imports authutil imports crypto contains validatetoken
```
The embedding captures semantic context: the node's purpose (label), its community cohort, and its direct relationships.
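That text construction can be sketched in a few lines. The format string comes from the docs above; the function name and the `(relation, neighbor_label)` pair representation are illustrative assumptions.

```python
# Hypothetical sketch of the "{label} community {id} {relation} {neighbor} ..."
# text built for each node before embedding; names here are illustrative.
def embedding_text(label: str, community_id: int,
                   neighbors: list[tuple[str, str]]) -> str:
    """neighbors: (relation, neighbor_label) pairs for direct edges."""
    parts = [label, "community", str(community_id)]
    for relation, neighbor_label in neighbors:
        parts.extend([relation, neighbor_label])
    return " ".join(parts)


text = embedding_text("Session", 0, [
    ("imports", "authutil"),
    ("imports", "crypto"),
    ("contains", "validatetoken"),
])
print(text)
# Session community 0 imports authutil imports crypto contains validatetoken
```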
```
detect(root)             → find code files
        ↓
extract(files)           → AST extraction via tree-sitter
        ↓
build_graph(extractions) → merge into NetworkX DiGraph
        ↓
cluster(G)               → community detection (Leiden/Louvain)
        ↓
export(G, communities)   → graph.json
        ↓
sync_vectors(G)          → LanceDB vector store
```
Each stage is independent; you can skip vector syncing if you only need the graph.
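A driver mirroring those stages might look like this sketch. The stage functions are passed in explicitly to underline their independence; the real package's signatures may differ.

```python
# Hypothetical driver mirroring the pipeline stages; the real package's
# function signatures may differ from this sketch.
def run_pipeline(root, detect, extract, build_graph, cluster, export,
                 sync_vectors=None):
    files = detect(root)              # find code files
    extractions = extract(files)      # tree-sitter AST extraction
    G = build_graph(extractions)      # merge into one directed graph
    communities = cluster(G)          # Leiden/Louvain community detection
    export(G, communities)            # write graph.json
    if sync_vectors is not None:      # optional: skip if you only need the graph
        sync_vectors(G)
    return G, communities
```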
```bash
uv sync --all-extras
uv run code-knowledge update .
uv run code-knowledge query "what extracts code"
uv run code-knowledge explain "extract"
```

Module layout:

- `detect.py` — code file discovery and filtering
- `extract.py` — tree-sitter AST walkers (21 languages)
- `build.py` — merge extractions into NetworkX graph
- `cluster.py` — Leiden/Louvain community detection
- `export.py` — export to graph.json
- `vector.py` — LanceDB sync and semantic search
- `cache.py` — per-file extraction cache (by SHA256)
- `security.py` — path and label sanitization
- `validate.py` — schema validation
- `__main__.py` — CLI commands
- Install the `tree-sitter-<lang>` package
- Add an `extract_<lang>(path)` function in `extract.py`
- Register it in the `_DISPATCH` dict
- Add the extensions to `CODE_EXTENSIONS` in `detect.py`
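The shape of that wiring might look like the sketch below, using a hypothetical Lua extractor. The names `_DISPATCH` and `CODE_EXTENSIONS` come from the steps above; their exact structures in the package, and the `extractor_for` helper, are assumptions.

```python
from pathlib import Path

# Sketch of the registration steps above for a new language ("lua").
# The real _DISPATCH / CODE_EXTENSIONS structures may differ in detail.
CODE_EXTENSIONS = {".py", ".lua"}          # step 4: added in detect.py


def extract_lua(path):
    """Step 2: walk the tree-sitter AST (via the tree-sitter-lua package)
    and return nodes/edges; body omitted in this sketch."""
    raise NotImplementedError


_DISPATCH = {".lua": extract_lua}          # step 3: registered in extract.py


def extractor_for(path: str):
    """Map a file path to its extractor, or None if the file is filtered out."""
    ext = Path(path).suffix
    if ext not in CODE_EXTENSIONS:
        return None                        # detect.py skips this file
    return _DISPATCH.get(ext)


print(extractor_for("init.lua") is extract_lua)  # True
```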
Typical results on a 50K-file Python monorepo:
| Stage | Time | Notes |
|---|---|---|
| detect | 2s | filesystem walk |
| extract | 60s | 50 files/sec single-threaded |
| build | 5s | merge + dedup |
| cluster | 8s | 500K edges |
| export | 3s | JSON serialization |
| vector | 25s | embedding + LanceDB upsert |
Total: ~2 minutes for 50K files / 100K nodes.
- Tree-sitter coverage: Functional code idioms (closures, pipes, higher-order functions) are partially supported; anonymous lambdas are skipped. Semantic extraction is code-based only (no LLM).
- Large graphs: Vector search scales to ~1M nodes; UI graph visualization is limited to 5K nodes.
- Precision: Extraction is AST-based (deterministic but not perfect). Cross-file call resolution is heuristic.
MIT