Configuration
Semango reads semango.yml. You don’t need to write it by hand—semango init creates a fully-commented config with defaults.
semango initRecommended starting config (aligned with current loaders)
embedding:
provider: local
model: onnx-models/bge-small-en-v1.5-onnx
batch_size: 48
concurrent: 4
gpu: true
model_cache_dir: "${SEMANGO_MODEL_DIR:=~/.cache/semango}"
lexical:
enabled: true
index_path: ./semango/index/bleve
bm25_k1: 1.2
bm25_b: 0.75
hybrid:
vector_weight: 0.7
lexical_weight: 0.3
fusion: linear
files:
include:
- "**/*.{md,txt,go,js,ts,py,jsx,tsx}"
- "**/*.pdf"
- "**/*.csv"
- "**/*.tsv"
- "**/*.json"
- "**/*.jsonl"
exclude:
- ".git/**"
- "node_modules/**"
- "vendor/**"
chunk_size: 1000
chunk_overlap: 200
# Prevents indexing from hanging forever on a single PDF.
pdf_timeout_seconds: 900
# Emit a periodic heartbeat while extracting PDF text.
pdf_progress_interval_seconds: 30
server:
host: 0.0.0.0
port: 8181
ui:
enabled: true
tabular:
max_rows_embedded: 1000
sampling: random
min_text_tokens: 5
delimiter: ""semango init writes a more extensive template (including reranker, plugins, and mcp). Those sections are parsed but not implemented in the current codebase.
Section-by-section reference
embedding
provider:localoropenai.model: local ONNX model ID or OpenAI model name.dim: (Optional) Manual dimension override. Supports truncation for Matryoshka-compatible models.batch_size,concurrent: throughput controls.model_cache_dir: where ONNX models are stored.api_key,api_key_env,base_url,base_url_env: used foropenaiprovider.
If provider is omitted, the code defaults to openai.
lexical
enabled: enable BM25.index_path: Bleve index location.bm25_k1,bm25_b: BM25 tuning.
reranker
Parsed but not used in the current query pipeline. Keep disabled unless you are modifying the code.
hybrid
vector_weight,lexical_weight: fusion weights.fusion: currentlylinear(RRF not implemented in code).
files
Controls what gets indexed. The default include list is broader than what is currently supported by loaders (see Ingestion).
PDF-specific options:
pdf_timeout_seconds: max time allowed for extracting text from a single PDF before skipping it.pdf_progress_interval_seconds: how often to log a “still extracting” heartbeat for PDFs.
server
host,port: HTTP binding.auth: parsed but not enforced in the current server.tls_cert,tls_key: parsed; TLS is not wired up in the current HTTP server.
plugins
Parsed but not implemented in the current pipeline.
ui
Toggle the embedded UI.
mcp
Parsed but not implemented in the current build. See MCP status.
tabular
Controls CSV/JSON/JSONL ingestion.
max_rows_embedded: cap per file.sampling:randomorstratified.min_text_tokens: minimum tokens per row.delimiter: CSV delimiter (e.g.,"\t"for TSV).
Schema (secondary)
Semango validates configs using a CUE schema. This is for tooling and validation, not required reading.
- Download: config.cue