Documentation

Flatseek is a disk-first, serverless search platform. Index structured data (CSV, JSON, JSONL) using a trigram inverted index and query via REST API or web dashboard — without managing a search cluster.
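
The core idea behind trigram search can be sketched in a few lines of Python. This is an illustrative toy, not Flatseek's actual implementation: each indexed string is split into overlapping 3-character grams, and a query's grams are intersected to find candidate documents.

```python
from collections import defaultdict

def trigrams(text: str) -> set[str]:
    """Split a lowercased string into overlapping 3-character grams."""
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}

def build_index(docs: list[str]) -> dict[str, set[int]]:
    """Map each trigram to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index

def candidates(index: dict[str, set[int]], query: str) -> set[int]:
    """Docs containing every trigram of the query (a superset of true matches)."""
    grams = trigrams(query)
    sets = [index.get(g, set()) for g in grams]
    return set.intersection(*sets) if sets else set()

docs = ["john doe", "jane roe", "johnny"]
idx = build_index(docs)
print(candidates(idx, "john"))  # doc IDs 0 and 2 both contain "joh" and "ohn"
```

A disk-first engine stores those posting lists in memory-mapped files instead of a Python dict, but the lookup logic is the same.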

Installation

Requirements

  • Python 3.10+
  • pip or pip3
  • git (for installer script)

Install via pip

pip install flatseek

Install via installer script

curl -fsSL https://flatseek.io/install.sh | sh

This installs both the flatseek Python package and the flatlens dashboard to ~/.local/share/flatlens.

Install from source

git clone https://github.com/flatseek/flatseek.git
cd flatseek
pip install -e .

Verify installation

flatseek --version

Note: Flatseek is both a Python package (for programmatic access) and a CLI tool. The CLI is the primary interface for building indices and starting servers.

Quick Start

1. Build an index from CSV

flatseek build ./data/people.csv -o ./data

This creates a search index in ./data. Flatseek auto-detects column types (name, email, phone, date, etc.).

2. Start the API server + dashboard

cd ./data
flatseek serve

Starts on http://localhost:8000 by default.

  • API: http://localhost:8000
  • Dashboard: http://localhost:8000/dashboard
  • API Docs: http://localhost:8000/docs

3. Open the dashboard

Navigate to http://localhost:8000/dashboard. Select your index from the dropdown, enter a query, and explore results.

4. Search from CLI

flatseek search ./data "name:*john*"
flatseek search ./data "city:jakarta AND gender:L"
flatseek stats ./data

5. Search from Python

from flatseek import Flatseek

client = Flatseek("http://localhost:8000")
result = client.search(index="people", q="*john*", size=20)
print(result.total)
for doc in result.docs:
    print(doc)

Architecture

Flatseek consists of four components:

  • Flatseek Core — Disk-first trigram indexing engine. Reads from memory-mapped files without loading into heap.
  • Flatseek API — FastAPI-based REST layer with Elasticsearch-compatible endpoints.
  • Flatlens Dashboard — Web UI for uploading files, running queries, visualizing aggregations, and map views.
  • Python Client — Dual-mode client supporting both API mode (HTTP) and direct mode (local files).

Index Structure

When you build an index, Flatseek creates this directory structure:

./data/
├── index/              # Trigram posting lists (binary, memory-mapped)
│   └── *.bin
├── docs/               # Document store (compressed JSON)
│   └── *.zlib
├── column_map.json     # Column type mappings
├── manifest.json       # Index metadata
└── stats.json          # Index statistics

CLI: build

Build a trigram index from CSV, JSON, JSONL, or directory of files.

Synopsis

flatseek build <csv_dir> [options]

Options

Option            Description
-o, --output      Output directory (default: ./data)
-m, --map         Path to column_map.json
-s, --sep         CSV delimiter (default: comma). Use '#' for hash-separated files.
--columns         Comma-separated column names for headerless files
-w, --workers     Number of parallel workers (default: 1). Use >1 for multi-core builds.
--dataset         Dataset label (e.g. 'people', 'accounts')
--dedup           Skip duplicate rows during indexing
--dedup-fields    Dedup on specific columns only (e.g. --dedup-fields phone,nik)
--daemon          Memory mode: never write prefix buffers to disk at checkpoint. Faster but uses more RAM.
--estimate        Sample 5,000 rows before indexing to show speed/ETA estimate

Examples

# Basic build
flatseek build ./people.csv -o ./data

# Multi-file directory
flatseek build ./csv_folder/ -o ./data

# Hash-delimited file
flatseek build ./data.csv -o ./data -s '#'

# Headerless CSV
flatseek build ./data.csv -o ./data --columns "id,name,email,phone"

# Parallel build (4 workers)
flatseek build ./large.csv -o ./data -w 4

# With ETA estimate
flatseek build ./data.csv -o ./data --estimate

# Daemon mode (uses more RAM, faster indexing)
flatseek build ./data.csv -o ./data --daemon

Parallel builds: Use -w N to spawn N parallel workers. Each worker processes a portion of the data. Works for both single files (byte-range split) and multi-file directories.
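
The byte-range split for single files can be pictured with a small sketch (hypothetical, not Flatseek's actual splitter): chunk boundaries are snapped forward to the next newline so no row is cut between workers.

```python
def split_ranges(data: bytes, workers: int) -> list[tuple[int, int]]:
    """Split a buffer into roughly equal byte ranges, snapping each
    boundary forward to the next newline so no row is cut in half."""
    size = len(data)
    chunk = size // workers
    ranges, start = [], 0
    for i in range(1, workers):
        end = i * chunk
        # advance until the previous byte is a newline
        while end < size and data[end - 1:end] != b"\n":
            end += 1
        ranges.append((start, end))
        start = end
    ranges.append((start, size))
    return ranges

data = b"id,name\n1,john\n2,jane\n3,bob\n"
for lo, hi in split_ranges(data, 2):
    print(data[lo:hi])  # each chunk holds only whole rows
```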

Column Type Detection

Flatseek auto-detects semantic column types based on header names and sample values:

  • name — Person names (with phonetic normalization)
  • email — Email addresses
  • phone — Phone numbers (normalized to digits only)
  • birthday — Birth dates
  • gender — Gender codes (L/G/P normalized to F/M)
  • city — City names
  • province — Province/region names
  • address — Full addresses
  • date — Date/datetime fields
  • number — Numeric values
  • status — Status enums
  • id_number — ID numbers (NIK, KTP, etc.)
  • string — Plain text (trigram-indexed)
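
A detector in this spirit can be sketched as a toy heuristic. The real rules are Flatseek's own; the header keywords and value patterns below are assumptions for illustration only.

```python
import re

def guess_type(header: str, samples: list[str]) -> str:
    """Toy semantic-type heuristic: check the header name first,
    then fall back to inspecting sample values."""
    h = header.lower()
    if "mail" in h:
        return "email"
    if "phone" in h:
        return "phone"
    if all(re.fullmatch(r"\d{4}-\d{2}-\d{2}", s) for s in samples):
        return "date"
    if all(re.fullmatch(r"[\d+\-() ]{7,}", s) for s in samples):
        return "phone"
    if all(re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", s) for s in samples):
        return "email"
    return "string"

print(guess_type("email_address", ["a@b.com"]))  # email (by header)
print(guess_type("dob", ["1990-01-01"]))         # date (by value)
```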

CLI: serve

Start the API server and Flatlens dashboard, serving data from the current directory.

Synopsis

flatseek serve [options]

Options

Option         Description
-d, --data     Data directory to serve (default: current directory)
-p, --port     Port number (default: 8000)
--host         Host to bind to (default: 0.0.0.0)
--no-reload    Disable auto-reload on code changes

Examples

# Start on default port 8000 (serves current directory)
flatseek serve

# Serve a specific data directory
flatseek serve -d ./data

# Start on custom port
flatseek serve -p 9000 -d ./my-index

# Bind to localhost only
flatseek serve --host 127.0.0.1

Also available: api

Start API server only (no dashboard):

flatseek api -d ./data

CLI: classify

Detect and set column semantic types without building an index.

Synopsis

flatseek classify <csv_dir> [options]

Options

Option         Description
-o, --output   Output path for column_map.json
-s, --sep      CSV delimiter
--columns      Comma-separated column names

Examples

flatseek classify ./data.csv -o ./column_map.json
flatseek classify ./csv_folder/

CLI: stats

Show index statistics.

Synopsis

flatseek stats <data_dir>

Output includes

  • Total documents indexed
  • Number of columns
  • Column names and types
  • Index size on disk
  • Indexing date

CLI: compress

Compress index files in-place with zlib. Run it once the build is finished and no further data will be added.

Warning: Do not run compress if you plan to do incremental builds. The builder appends raw bytes and cannot extend a compressed file without re-reading it.

Synopsis

flatseek compress <data_dir> [options]

Options

Option          Description
-l, --level     Compression level 1-9 (default: 6). Higher = smaller but slower.
-w, --workers   Parallel workers (default: min(8, cpu_count))

Examples

flatseek compress ./data
flatseek compress ./data -l 9
flatseek compress ./data -w 4
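
The level trade-off is easy to demonstrate with Python's own zlib module, the same algorithm named above (the payload here is just a stand-in, not a real index file):

```python
import zlib

payload = b"posting list bytes " * 1000  # stand-in for an index .bin file

for level in (1, 6, 9):
    out = zlib.compress(payload, level)
    print(f"level {level}: {len(payload)} -> {len(out)} bytes")

# compression is lossless at every level
assert zlib.decompress(zlib.compress(payload, 9)) == payload
```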

CLI: encrypt / decrypt

Encrypt or decrypt index files in-place with ChaCha20-Poly1305.

Synopsis

flatseek encrypt <data_dir> [options]
flatseek decrypt <data_dir> [options]

Options

Option         Description
--passphrase   Encryption/decryption passphrase. If omitted, prompted interactively.

Examples

flatseek encrypt ./data --passphrase "mysecretpass"
flatseek decrypt ./data --passphrase "mysecretpass"

How encryption works: A random 32-byte salt is stored in encryption.json. The passphrase is derived via PBKDF2-HMAC-SHA256 (600,000 iterations). All .bin and .zlib files are encrypted with ChaCha20-Poly1305. There is no recovery without the passphrase.
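
The key-derivation step described above maps directly onto Python's standard library. The parameters match the text; the on-disk file format itself is Flatseek-internal.

```python
import hashlib
import os

def derive_key(passphrase: str, salt: bytes) -> bytes:
    """Derive a 32-byte ChaCha20-Poly1305 key as described above:
    PBKDF2-HMAC-SHA256 with 600,000 iterations."""
    return hashlib.pbkdf2_hmac("sha256", passphrase.encode(), salt,
                               600_000, dklen=32)

salt = os.urandom(32)  # stored alongside the index in encryption.json
key = derive_key("mysecretpass", salt)
print(len(key))  # 32
```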

CLI: dedup

Remove duplicate documents from an existing index. Works on both single-worker and parallel builds.

Synopsis

flatseek dedup <data_dir> [options]

Options

Option          Description
--fields        Comma-separated columns to use for fingerprinting (default: all non-meta fields)
--dry-run       Report duplicates without making changes
-w, --workers   Parallel workers for rewrite phase

Examples

# Dedup on all fields
flatseek dedup ./data

# Dedup on specific fields
flatseek dedup ./data --fields phone,nik

# Dry run (report only)
flatseek dedup ./data --dry-run

Resume support: If dedup is interrupted, re-running with the same arguments will resume from the checkpoint at dedup_checkpoint.json.

CLI: delete

Delete an index directory quickly using parallel rm -rf on subdirectories.

Synopsis

flatseek delete <data_dir> [options]

Options

Option          Description
-y, --yes       Skip confirmation prompt
-w, --workers   Parallel workers (default: min(16, cpu_count))

Examples

flatseek delete ./data
flatseek delete ./data --yes
flatseek delete ./data -w 8

CLI: join

Perform a cross-dataset join on a shared field.

Synopsis

flatseek join <data_dir> <query_a> <query_b> --on <field>

Examples

flatseek join ./data "_dataset:people" "_dataset:accounts" --on phone
flatseek join ./data "_dataset:orders" "_dataset:customers" --on customer_id

Datasets: The join uses the _dataset field to distinguish between datasets. When building an index, use --dataset to label documents (e.g. flatseek build ./people.csv --dataset people).
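
Conceptually this is a hash join: index one side by the join key, then probe with the other. A minimal sketch (not Flatseek's implementation):

```python
from collections import defaultdict

def hash_join(docs_a: list[dict], docs_b: list[dict], on: str) -> list[tuple[dict, dict]]:
    """Classic hash join: index side A by the join key, probe with side B."""
    by_key = defaultdict(list)
    for doc in docs_a:
        if on in doc:
            by_key[doc[on]].append(doc)
    return [(a, b) for b in docs_b for a in by_key.get(b.get(on), [])]

people   = [{"name": "John", "phone": "0812"}, {"name": "Jane", "phone": "0813"}]
accounts = [{"user": "j.doe", "phone": "0812"}]

for a, b in hash_join(people, accounts, on="phone"):
    print(a["name"], "<->", b["user"])  # John <-> j.doe
```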

CLI: chat

Interactive natural language query interface powered by Ollama.

Synopsis

flatseek chat <data_dir> [options]

Options

Option       Description
--model      LLM model name (default: qwen2.5-coder)
--api-base   Ollama API base URL (default: http://localhost:11434/v1)

Examples

flatseek chat ./data
flatseek chat ./data --model llama3

Requirements: Ollama must be running locally with a model pulled. Install from ollama.com.

Dashboard: Upload Data

Click + Upload in the top-right corner of the Dashboard to open the upload wizard.

Supported file formats

  • CSV — comma, semicolon, tab, pipe, colon, or hash delimiters (auto-detected)
  • JSON — array of objects: [{"name":"Alice","age":30}, ...]
  • JSONL — one object per line (newline-delimited)
  • XLS / XLSX — Excel files with sheet selection

File size: 500 MB per file via the Dashboard. For larger files, use the CLI.

Step 1: Select File

Drag and drop files onto the dropzone, or click Browse Files. Multiple files can be selected.

Step 2: Preview & Mapping

After selecting a file, Flatseek parses the first few rows and shows a preview.

Column Mapping Options

  • Exclude — skip this column during indexing
  • Insert As — rename the column at insertion time
  • Type — click the type badge to set the semantic type

Format Options

  • Separator — auto-detected, but can be manually overridden
  • First row is header — checkbox (default: on)
  • Edit Headers — modify column names as JSON array

Step 3: Configure Index

  • Upload to — "Create New Index" or "Existing Index"
  • Index Name — lowercase, letters/numbers/underscores only
  • Encrypt Index — enable password protection
  • ID Field — specify a column as document ID (auto-generated if empty)
  • Batch Size — documents sent per batch (default: 5000)

Step 4: Upload Progress

Live progress bar shows documents processed, throughput (docs/sec), and ETA.
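
The ETA shown is the standard constant-rate estimate: remaining documents divided by throughput so far. A sketch of the arithmetic:

```python
def eta_seconds(done: int, total: int, elapsed: float) -> float:
    """Remaining time assuming throughput stays constant."""
    rate = done / elapsed  # docs/sec so far
    return (total - done) / rate

print(eta_seconds(done=25_000, total=100_000, elapsed=10.0))  # 30.0
```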

Step 5: Complete

Summary shows total documents indexed and index size on disk.

Dashboard: Aggregations

Go to the Aggregations tab to build aggregate summaries.

Aggregation Types

Type             Description               Output
terms            Group by unique values    Top N terms with counts
stats            Statistical summary       Count, min, max, sum, avg
date_histogram   Time-series grouping      Counts per time bucket
avg              Average value             Single numeric average
min              Minimum value             Single numeric minimum
max              Maximum value             Single numeric maximum
sum              Sum of values             Single numeric total
cardinality      Unique value count        Single approximate count
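
The semantics of terms and stats are easy to make concrete by reproducing them over plain Python dicts (illustrative; the engine computes these over the index, not in Python):

```python
from collections import Counter

docs = [
    {"city": "jakarta", "age": 30},
    {"city": "bandung", "age": 25},
    {"city": "jakarta", "age": 41},
]

# terms: group by unique values, top-N with counts
terms = Counter(d["city"] for d in docs).most_common(10)
print(terms)  # [('jakarta', 2), ('bandung', 1)]

# stats: count / min / max / sum / avg over a numeric field
ages = [d["age"] for d in docs]
stats = {"count": len(ages), "min": min(ages), "max": max(ages),
         "sum": sum(ages), "avg": sum(ages) / len(ages)}
print(stats)
```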

Chart Types

Toggle between Bar, Line, Donut, and Pie charts. Use Table view for raw data.

Common Use Cases

Goal                      Field          Type             Size
Top 10 cities             city           terms            10
Daily signups (30 days)   created_at     date_histogram   30
Average order value       order_amount   avg
Unique customers          customer_id    cardinality
Age statistics            age            stats

Dashboard: Map View

Go to the Map tab to plot geo-tagged documents on an interactive Leaflet map.

Requirements

  • Two numeric fields: latitude (-90 to 90) and longitude (-180 to 180)
  • Up to 50,000 documents per map view

Combined lat,lng Field

If your data has a single field like "-6.2088, 106.8456" or "-6.2088 106.8456", use the combined lat,lng field selector instead of selecting latitude and longitude separately.

Map Controls

  • Marker clustering — nearby points cluster automatically
  • Click cluster — zoom in
  • Click marker — see document details in popup
  • Size input — control max documents plotted (1–50,000)

Dashboard: Index Management

Click the cluster status button (top-right) to view all indices and their stats.

Index Actions

  • View stats — document count, index size, columns, type mappings
  • Rename — click the rename button, enter new name
  • Encrypt — click encrypt button, enter passphrase twice
  • Decrypt — click decrypt button, enter passphrase
  • Delete — click delete button, confirm deletion

Encrypted Indices

When accessing an encrypted index, a password modal appears. Enter the passphrase to authenticate. The session stores the password in memory for subsequent queries on that index.

Upload Progress

While uploading, the index card shows upload progress with live stats (documents processed, files done, ETA).

Query Syntax Reference

Basic Operators

Pattern           Description                           Example
word              Match documents containing this word  john → all docs with "john"
*ord              Wildcard at start                     *ohn → "john", "mohn"
wo*d              Wildcard in middle                    j*doe → "jackson doe"
"exact phrase"    Exact phrase match                    "john doe"
word1 AND word2   Both words required                   john AND jakarta
word1 OR word2    Either word matches                   john OR jane
NOT word          Exclude word                          NOT john
( )               Grouping                              (john OR jane) AND jakarta

Field Prefixes

Pattern              Description                        Example
name:value           Field contains value               city:jakarta
name:*value*         Field wildcard                     email:*@gmail.com
field:[min TO max]   Range query (date/number fields)   birthday:[1990-01-01 TO 1999-12-31]

Special Characters

These characters have special meaning and must be escaped with \ to search literally:

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /

Example: to search for john+doe, use john\+doe
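
A small helper can apply that escaping automatically (illustrative; not part of the Flatseek client API):

```python
import re

def escape_query(text: str) -> str:
    """Backslash-escape every query special character for a literal match."""
    return re.sub(r'([+\-&|!(){}\[\]^"~*?:\\/])', r"\\\1", text)

print(escape_query("john+doe"))  # john\+doe
```

Note that && and || are two-character operators, but escaping each character individually still works.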

Query Examples

Basic Search

*john*              # All documents containing "john"
"john doe"          # Exact phrase match
john doe            # Both words must appear (implicit AND)

Field-Specific

name:john                 # name field = john
city:jakarta              # city field = jakarta
email:*@gmail.com         # emails from gmail
phone:*1234*              # phone containing "1234"

Boolean Logic

john AND jakarta          # Both must appear
john OR jane              # Either appears
john AND NOT jakarta      # john but not jakarta
city:jakarta AND gender:L # Both conditions

Wildcards

*john*                    # Contains john anywhere
j*                        # Starts with j
*doe                      # Ends with doe
na*e                      # na + anything + e

Grouping

(john OR jane) AND jakarta     # Either john or jane, in jakarta
name:(john OR doe)             # john or doe in name field

Date Ranges (date fields only)

birthday:[1990-01-01 TO 1999-12-31]   # Born in the 1990s
created_at:[2024-01-01 TO 2024-12-31]  # Created in 2024

Column Types

Column types determine how values are indexed and queried.

Type      Indexing                  Query Style                     Example
TEXT      Trigrams + tokenization   Wildcard, contains              Free text, descriptions
KEYWORD   Exact value               Equals, terms aggregation       Tags, status, category
DATE      ISO date (YYYYMMDD)       Range queries, date_histogram   created_at, birthday
FLOAT     Numeric value             Range, stats aggregation        price, latitude
INT       Integer value             Range, stats aggregation        age, quantity
BOOL      Boolean                   Equals                          is_active, is_verified
ARRAY     JSON array                Contains                        tags, interests
OBJECT    JSON object               Dot-path access                 address.city, profile.name
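
The YYYYMMDD encoding used for DATE columns is convenient because range queries reduce to plain integer comparisons. A sketch of the encoding:

```python
def date_key(iso: str) -> int:
    """Encode YYYY-MM-DD as the integer YYYYMMDD, so date ranges
    become simple integer range checks."""
    y, m, d = iso.split("-")
    return int(y) * 10_000 + int(m) * 100 + int(d)

lo, hi = date_key("1990-01-01"), date_key("1999-12-31")
print(date_key("1995-06-15"))              # 19950615
print(lo <= date_key("1995-06-15") <= hi)  # True
```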

Semantic Types

Flatseek also auto-detects semantic types (name, email, phone, etc.) which apply additional normalization:

  • name — Double Metaphone phonetic matching
  • email — Domain extraction for domain queries
  • phone — Digits-only normalization (strips +, -, spaces)
  • gender — L/G/P normalized to F/M
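
The digits-only phone normalization is effectively a one-line transform; a sketch of the idea (not Flatseek's exact code):

```python
import re

def normalize_phone(raw: str) -> str:
    """Digits-only normalization as described above: strip +, -, spaces
    and any other non-digit characters."""
    return re.sub(r"\D", "", raw)

print(normalize_phone("+62 812-3456-7890"))  # 6281234567890
```

This is why phone:*1234* matches a number regardless of how it was punctuated in the source file.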

API Overview

The Flatseek API is a REST interface with Elasticsearch-compatible endpoints. It runs on port 8000 by default when using flatseek serve.

Base URL

http://localhost:8000

API Documentation

Interactive API docs (Swagger UI) available at:

http://localhost:8000/docs

Authentication

For encrypted indices, include the passphrase in the X-Index-Password header:

curl -H "X-Index-Password: mypass" \
     http://localhost:8000/my_index/_search?q=*john*

API: Upload

Bulk Index Documents

POST /{index}/_bulk
[{"name": "John", "city": "Jakarta"}, {"name": "Jane", "city": "Bandung"}]

Accepts a JSON array of documents. Returns a summary of the indexed count and any errors.

Upload Progress

GET /{index}/_upload_progress

Returns current progress of an in-progress upload.

Flush Index

POST /{index}/_flush

Force-write all pending buffers to disk.

API: Index Management

List Indices

GET /_indices

Create Index

PUT /{index_name}
{
  "description": "My index",
  "encrypt": true
}

Delete Index

DELETE /{index_name}

Rename Index

POST /{old_name}/_rename
{"new_name": "new_index_name"}

Create/Update Mapping

PUT /{index_name}/_mapping
{
  "columns": [
    {"name": "id", "type": "INT"},
    {"name": "name", "type": "TEXT"},
    {"name": "email", "type": "KEYWORD"}
  ]
}

API: Encryption

Check if Index is Encrypted

GET /{index}/_is_encrypted

Authenticate (Decrypt)

POST /{index}/_authenticate
{"passphrase": "mysecretpass"}

Logout (Clear Session)

DELETE /{index}/_authenticate

Encrypt Index

POST /{index}/_encrypt
{"passphrase": "mysecretpass"}

Returns a job ID. Poll progress with:

GET /{index}/_encrypt_progress?job_id={job_id}

Decrypt Index

POST /{index}/_decrypt
{"passphrase": "mysecretpass"}

Python Package: Client

Installation

pip install flatseek

API Mode (HTTP client)

from flatseek import Flatseek

client = Flatseek("http://localhost:8000")

# Search
result = client.search(index="people", q="*john*", size=20)
print(f"Found: {result.total}")
for doc in result.docs:
    print(doc["name"], doc["city"])

# Count
count = client.count(index="people", q="city:jakarta")
print(f"Jakarta residents: {count.count}")

# Aggregations
result = client.aggregate(
    index="people",
    q="*",
    body={
        "aggs": {
            "by_city": {"terms": {"field": "city", "size": 10}}
        }
    }
)

# Index a document
client.index(index="people", body={"name": "John", "city": "Jakarta"})

# Bulk index
client.bulk_insert(index="people", docs=[
    {"name": "Alice", "city": "Bandung"},
    {"name": "Bob", "city": "Surabaya"},
])

# Cluster health
health = client.cluster()
print(health)

Response Objects

  • Response — search results with .hits, .total, .docs
  • CountResponse — count result with .count
  • AggsResponse — aggregation results with .total, .aggs

Python Package: Direct Mode

Direct mode accesses the index files directly without needing the API server. Faster for local scripts.

from flatseek import Flatseek

# Open local index directory
qe = Flatseek("./data")
qe = Flatseek("./data", index="people")  # named sub-index

# Search
result = qe.search(q="name:*john*", size=10)

# Count
count = qe.count(q="city:jakarta")

# Aggregations
result = qe.aggregate(q="name:*john*", aggs={
    "by_city": {"terms": {"field": "city", "size": 10}},
    "birth_stats": {"stats": {"field": "birthday"}}
})
print(result.aggs["by_city"]["buckets"])

Python Package: Aggregations

Terms Aggregation

result = qe.aggregate(q="*", aggs={
    "top_cities": {"terms": {"field": "city", "size": 10}}
})
buckets = result.aggs["top_cities"]["buckets"]
for b in buckets:
    print(f"{b['key']}: {b['doc_count']}")

Stats Aggregation

result = qe.aggregate(q="*", aggs={
    "age_stats": {"stats": {"field": "age"}}
})
stats = result.aggs["age_stats"]
print(f"Count: {stats['count']}, Avg: {stats['avg']}, Min: {stats['min']}, Max: {stats['max']}")

Date Histogram

result = qe.aggregate(q="*", aggs={
    "daily_signups": {"date_histogram": {"field": "created_at", "interval": "day"}}
})

Multiple Aggregations

result = qe.aggregate(q="*", aggs={
    "by_city": {"terms": {"field": "city", "size": 10}},
    "by_gender": {"terms": {"field": "gender"}},
    "avg_age": {"avg": {"field": "age"}},
    "age_range": {"stats": {"field": "age"}}
})