CocoIndex is the world's first open-source engine that supports both custom transformation logic and incremental updates specialized for data indexing.
With CocoIndex, users declare the transformation; CocoIndex creates and maintains an index, and keeps the derived index up to date as the sources change, with minimal recomputation. If you're new to CocoIndex, we recommend checking out the Documentation and the Quick Start Guide.
- Install the CocoIndex Python library:

  ```sh
  pip install cocoindex
  ```
- Set up Postgres with the pgvector extension, or bring up a Postgres database using Docker Compose:
  - Make sure Docker Compose is installed: docs
  - Start a Postgres database for CocoIndex using our Docker Compose config (a quick connectivity check is sketched after these steps):

    ```sh
    docker compose -f <(curl -L https://raw.githubusercontent.com/cocoindex-io/cocoindex/refs/heads/main/dev/postgres.yaml) up -d
    ```
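If you want to sanity-check the database before building a flow, a small script like the one below can help. It is a minimal sketch, assuming the default credentials from the Compose config above (user, password, and database all `cocoindex` on `localhost:5432`); adjust the URL if your setup differs.

```python
import psycopg  # pip install "psycopg[binary]"

# Assumed connection URL for the Compose setup above; adjust if your setup differs.
DATABASE_URL = "postgresql://cocoindex:cocoindex@localhost:5432/cocoindex"

with psycopg.connect(DATABASE_URL) as conn, conn.cursor() as cur:
    # Confirm the server is reachable and the pgvector extension is available.
    cur.execute("SELECT count(*) FROM pg_available_extensions WHERE name = 'vector'")
    (available,) = cur.fetchone()
    print("Postgres reachable; pgvector available:", bool(available))
```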
Follow the Quick Start Guide to define your first indexing flow. A common indexing flow looks like this:
@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
# Add a data source to read files from a directory
data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="markdown_files"))
# Add a collector for data to be exported to the vector index
doc_embeddings = data_scope.add_collector()
# Transform data of each document
with data_scope["documents"].row() as doc:
# Split the document into chunks, put into `chunks` field
doc["chunks"] = doc["content"].transform(
cocoindex.functions.SplitRecursively(
language="markdown", chunk_size=300, chunk_overlap=100))
# Transform data of each chunk
with doc["chunks"].row() as chunk:
# Embed the chunk, put into `embedding` field
chunk["embedding"] = chunk["text"].transform(
cocoindex.functions.SentenceTransformerEmbed(
model="sentence-transformers/all-MiniLM-L6-v2"))
# Collect the chunk into the collector.
doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
text=chunk["text"], embedding=chunk["embedding"])
# Export collected data to a vector index.
doc_embeddings.export(
"doc_embeddings",
cocoindex.storages.Postgres(),
primary_key_fields=["filename", "location"],
vector_index=[("embedding", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
This defines an indexing flow that reads markdown files from a local directory, splits each document into chunks, embeds each chunk, and exports the embeddings to a Postgres vector index.
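After the flow has been set up and updated (see the Quick Start Guide for running it), the collected rows live in an ordinary Postgres table with a pgvector column, so you can query them with plain SQL. The snippet below is an illustrative sketch, not part of CocoIndex's API: it assumes the exported data is reachable in a table named `doc_embeddings` (CocoIndex derives the actual table name from the flow and target names, so check your database) and reuses the same embedding model as the flow so query and document vectors share the same space.

```python
import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

# Same model as the flow, so query and document embeddings live in the same space.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Connection URL and table name are assumptions; check your database for the
# actual table CocoIndex created for the "doc_embeddings" target.
DATABASE_URL = "postgresql://cocoindex:cocoindex@localhost:5432/cocoindex"
TABLE = "doc_embeddings"

def search(query: str, top_k: int = 5):
    query_vec = model.encode(query)
    with psycopg.connect(DATABASE_URL) as conn:
        register_vector(conn)  # lets psycopg bind numpy arrays as pgvector values
        with conn.cursor() as cur:
            # "<=>" is pgvector's cosine-distance operator.
            cur.execute(
                f"SELECT filename, text, embedding <=> %s AS distance "
                f"FROM {TABLE} ORDER BY distance LIMIT %s",
                (query_vec, top_k),
            )
            return cur.fetchall()

for filename, text, distance in search("how to define an indexing flow"):
    print(f"{distance:.3f}  {filename}: {text[:80]}")
```

Using pgvector's `<=>` cosine-distance operator matches the `COSINE_SIMILARITY` metric declared in the export above; smaller distances mean more similar chunks.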
Go to the examples directory to try out any of the examples, following the instructions in each example's directory.
| Example | Description |
|---|---|
| Text Embedding | Index text documents with embeddings for semantic search |
| Code Embedding | Index code embeddings for semantic search |
| PDF Embedding | Parse PDFs and index text embeddings for semantic search |
| Manual Extraction | Extract structured information from a manual using an LLM |
More examples are coming, so stay tuned! If there are any specific examples you would like to see, please let us know in our Discord community.
For detailed documentation, visit the CocoIndex Documentation, which includes a Quickstart guide.
We love contributions from our community. For details on contributing or running the project for development, check out our contributing guide.
Welcome with a huge coconut hug 🥥. We are super excited about community contributions of all kinds: code improvements, documentation updates, issue reports, feature requests, and discussions in our Discord.
Join our community here:
- Star us on GitHub
- Start a GitHub Discussion
- Join our Discord community
- Follow us on X
- Follow us on LinkedIn
- Subscribe to our YouTube channel
- Read our blog posts
CocoIndex is Apache 2.0 licensed.