CocoIndex is the world's first open-source engine that supports both custom transformation logic and incremental updates specialized for data indexing.
With CocoIndex, users declare the transformation; CocoIndex creates and maintains an index, and keeps the derived index up to date as the sources change, with minimal recomputation. If you're new to CocoIndex, we recommend checking out the Documentation and the Quick Start Guide.
- Install the CocoIndex Python library:

  ```sh
  pip install cocoindex
  ```
- Set up Postgres with the pgvector extension, or bring up a Postgres database using Docker Compose:
  - Make sure Docker Compose is installed: docs
  - Start a Postgres database for CocoIndex using our Docker Compose config (a quick connectivity check is sketched after these steps):

    ```sh
    docker compose -f <(curl -L https://raw.githubusercontent.com/cocoindex-io/cocoindex/refs/heads/main/dev/postgres.yaml) up -d
    ```
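If you want to sanity-check the database before building a flow, a small script like the one below can help. It is a minimal sketch, assuming the default credentials from the Compose config above (user, password, and database all `cocoindex` on `localhost:5432`); adjust the URL if your setup differs.

```python
import psycopg  # pip install "psycopg[binary]"

# Assumed connection URL for the Compose setup above; adjust if your setup differs.
DATABASE_URL = "postgresql://cocoindex:cocoindex@localhost:5432/cocoindex"

with psycopg.connect(DATABASE_URL) as conn, conn.cursor() as cur:
    # Confirm the server is reachable and the pgvector extension is available.
    cur.execute("SELECT count(*) FROM pg_available_extensions WHERE name = 'vector'")
    (available,) = cur.fetchone()
    print("Postgres reachable; pgvector available:", bool(available))
```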
Follow the Quick Start Guide to define your first indexing flow. A common indexing flow looks like this:
@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
# Add a data source to read files from a directory
data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="markdown_files"))
# Add a collector for data to be exported to the vector index
doc_embeddings = data_scope.add_collector()
# Transform data of each document
with data_scope["documents"].row() as doc:
# Split the document into chunks, put into `chunks` field
doc["chunks"] = doc["content"].transform(
cocoindex.functions.SplitRecursively(
language="markdown", chunk_size=300, chunk_overlap=100))
# Transform data of each chunk
with doc["chunks"].row() as chunk:
# Embed the chunk, put into `embedding` field
chunk["embedding"] = chunk["text"].transform(
cocoindex.functions.SentenceTransformerEmbed(
model="sentence-transformers/all-MiniLM-L6-v2"))
# Collect the chunk into the collector.
doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
text=chunk["text"], embedding=chunk["embedding"])
# Export collected data to a vector index.
doc_embeddings.export(
"doc_embeddings",
cocoindex.storages.Postgres(),
primary_key_fields=["filename", "location"],
vector_index=[("embedding", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
This defines an indexing flow that reads markdown files from a local directory, splits each document into chunks, embeds each chunk, and exports the embeddings to a Postgres vector index.
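After the flow has been set up and updated (see the Quick Start Guide for running it), the collected rows live in an ordinary Postgres table with a pgvector column, so you can query them with plain SQL. The snippet below is an illustrative sketch, not part of CocoIndex's API: it assumes the exported data is reachable in a table named `doc_embeddings` (CocoIndex derives the actual table name from the flow and target names, so check your database) and reuses the same embedding model as the flow so query and document vectors share the same space.

```python
import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

# Same model as the flow, so query and document embeddings live in the same space.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Connection URL and table name are assumptions; check your database for the
# actual table CocoIndex created for the "doc_embeddings" target.
DATABASE_URL = "postgresql://cocoindex:cocoindex@localhost:5432/cocoindex"
TABLE = "doc_embeddings"

def search(query: str, top_k: int = 5):
    query_vec = model.encode(query)
    with psycopg.connect(DATABASE_URL) as conn:
        register_vector(conn)  # lets psycopg bind numpy arrays as pgvector values
        with conn.cursor() as cur:
            # "<=>" is pgvector's cosine-distance operator.
            cur.execute(
                f"SELECT filename, text, embedding <=> %s AS distance "
                f"FROM {TABLE} ORDER BY distance LIMIT %s",
                (query_vec, top_k),
            )
            return cur.fetchall()

for filename, text, distance in search("how to define an indexing flow"):
    print(f"{distance:.3f}  {filename}: {text[:80]}")
```

Using pgvector's `<=>` cosine-distance operator matches the `COSINE_SIMILARITY` metric declared in the export above; smaller distances mean more similar chunks.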
Go to the examples directory to try out any of the examples, following the instructions in each example's directory.
| Example | Description |
|---|---|
| Text Embedding | Index text documents with embeddings for semantic search |
| Code Embedding | Index code embeddings for semantic search |
| PDF Embedding | Parse PDFs and index text embeddings for semantic search |
| Manual Extraction | Extract structured information from a manual using an LLM |
More examples are coming, so stay tuned! If there are any specific examples you would like to see, please let us know in our Discord community.
For detailed documentation, visit the CocoIndex Documentation, which includes a Quickstart guide.
We love contributions from our community. For details on contributing or running the project for development, check out our contributing guide.
Welcome with a huge coconut hug 🥥. We are super excited about community contributions of all kinds: code improvements, documentation updates, issue reports, feature requests, and discussions in our Discord.
Join our community here:
- Star us on GitHub
- Start a GitHub Discussion
- Join our Discord community
- Follow us on X
- Follow us on LinkedIn
- Subscribe to our YouTube channel
- Read our blog posts
CocoIndex is Apache 2.0 licensed.