Launch HN: Chonkie (YC X25) – Open-Source Library for Advanced Chunking
mritchie712 22 days ago [-]
We (https://www.definite.app/) have a use case I'd imagine is common for people building agents.

When a user works with our agent, they may end up with a large conversation thread (e.g. 200k+ tokens) with many SQL snippets, query results and database metadata (e.g. table and column info).

For example, they might ask "show me any companies that were heavily engaged at one point, but I haven't talked to in the last 90 days". This will pull in their schema (e.g. Hubspot), run a bunch of SQL, show them results, etc.

I want to allow the agent to search previous threads for answers so they don't need to have the conversation again, but chunking up the existing thread is non-trivial (e.g. you don't want to separate the question and answer, you may want to remove errors while retaining the correction, etc.).

Do you have any plans to support "auto chunking" for AI message[0] threads?

0 - e.g. https://platform.openai.com/docs/api-reference/messages/crea...

snyy 22 days ago [-]
> you may want to remove errors while retaining the correction

Double clicking on this, are these messages you’d want to drop from memory because they’re not part of the actual content (e.g. execution errors or warnings)? That kind of cleanup is something Chonkie can help with as a pre-processing step.

If you can share an example structure of your message threads, I can give more specific guidance. We've seen folks use Chonkie to chunk and embed AI chat threads — treating the resulting vector store as long-term memory. That way, you can RAG over past threads to recover context without redoing the conversation.
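
Here's a rough sketch of what that pre-processing step could look like (the thread shape and the "error" flag are assumptions about your data; the chunker usage follows our Python SDK):

    from chonkie import RecursiveChunker  # pip install chonkie

    # Hypothetical thread shape, loosely modeled on the messages API
    # linked above; the "error" flag is an assumed marker on failed turns.
    thread = [
        {"role": "user", "content": "Which companies went quiet in the last 90 days?"},
        {"role": "assistant", "content": "SELECT ...;  -- failed attempt", "error": True},
        {"role": "assistant", "content": "Corrected SQL plus a summary of the results."},
    ]

    # 1. Drop failed attempts, keep the corrections.
    kept = [m for m in thread if not m.get("error")]

    # 2. Flatten to one string so adjacent turns (question + answer)
    #    can land in the same chunk.
    text = "\n\n".join(f"{m['role']}: {m['content']}" for m in kept)

    # 3. Recursive chunking merges small pieces up to chunk_size, so
    #    short Q/A pairs stay together; embed the chunks afterwards.
    chunker = RecursiveChunker(chunk_size=512)
    for chunk in chunker.chunk(text):
        print(chunk.token_count, chunk.text[:60])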

P.S. If HN isn’t ideal for going back and forth, feel free to send me an email at shreyash@chonkie.ai.

mritchie712 22 days ago [-]
> We've seen folks use Chonkie to chunk and embed AI chat threads

yep, that's what we're looking for. We'll give it a shot!

I think it's worth creating a guide for this use case. Seems like something many people would want to do and the input should be very similar across your users.

mbeissinger 21 days ago [-]
You might want to check out the conversational chunking from this paper:

On Memory Construction and Retrieval for Personalized Conversational Agents https://arxiv.org/abs/2502.05589

yawnxyz 22 days ago [-]
I'm curious whether chunking is different for embeddings vs. "agentic retrieval", where an AI or a person operates like a librarian: they consult an index to decide which resources to look up, get the relevant bits, then piece them together into a cohesive narrative whole. Would we do any chunking at all for this, or does it rely purely on the way the DB is set up? I think for certain use cases even a single DB record could be too large for context windows, so maybe chunking might need to be done to the record? (e.g. a DB of research papers)
snyy 22 days ago [-]
Great questions!

Chunking fundamentals remain the same whether you're doing traditional semantic search or agentic retrieval. The key difference lies in the retrieval strategy, not the chunking approach itself.

For quality agentic retrieval, you still need to create a knowledge base by chunking documents, generating embeddings, and storing them in a vector database. You can add organizational structure here—like creating separate collections for different document categories (Physics papers, Biology papers, etc.)—though the importance of this organization depends on the size and diversity of your source data.

The agent then operates exactly as you described: it queries the vector database, retrieves relevant chunks, and synthesizes them into a coherent response. The chunking strategy should still optimize for semantic coherence and appropriate context window usage.

Regarding your concern about large DB records: you're absolutely right. Even individual research papers often exceed context windows, so you'd still need to chunk them into smaller, semantically meaningful pieces (perhaps by section, abstract, methodology, etc.). The agent can then retrieve and combine multiple chunks from the same paper or across papers as needed.

The main advantage of agentic retrieval is that the agent can make multiple queries, refine its search strategy, and iteratively build context—but it still relies on well-chunked, embedded content in the underlying vector database.
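
To make the pipeline concrete, here's a minimal sketch. The sentence-transformers model and the in-memory "collections" dict are stand-ins for whatever embedding model and vector DB you actually use:

    import numpy as np
    from sentence_transformers import SentenceTransformer
    from chonkie import RecursiveChunker

    model = SentenceTransformer("all-MiniLM-L6-v2")
    chunker = RecursiveChunker(chunk_size=512)

    # Placeholder documents; in practice these would be full papers.
    docs = {"physics": "Abstract: ... Methods: ...",
            "biology": "Abstract: ... Results: ..."}

    # Build the knowledge base: chunk, embed, organize by collection.
    collections = {}
    for category, doc in docs.items():
        texts = [c.text for c in chunker.chunk(doc)]
        embs = model.encode(texts, normalize_embeddings=True)
        collections[category] = (texts, embs)

    def retrieve(query, category, k=3):
        texts, embs = collections[category]
        q = model.encode([query], normalize_embeddings=True)[0]
        top = np.argsort(embs @ q)[::-1][:k]  # cosine sim (vectors are normalized)
        return [texts[i] for i in top]

    # An agent can call retrieve() repeatedly, refining its query each time.
    print(retrieve("experimental methodology", "physics"))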

elpalek 22 days ago [-]
Do you have a benchmark for comparing different chunking methods? Your existing benchmark compares different libraries.
snyy 22 days ago [-]
We don't yet, but our library comes with a visualization tool that you can use to compare chunkers directly. https://docs.chonkie.ai/python-sdk/utils/visualizer
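
For instance, a minimal sketch (assuming the Visualizer usage shown on the linked docs page):

    from chonkie import Visualizer, TokenChunker, SentenceChunker

    text = open("sample.md").read()
    viz = Visualizer()

    # Print each chunking in the terminal to compare boundaries...
    viz.print(TokenChunker(chunk_size=256).chunk(text))
    viz.print(SentenceChunker(chunk_size=256).chunk(text))

    # ...or save one as a standalone HTML report.
    viz.save("token_chunks.html", TokenChunker(chunk_size=256).chunk(text))
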
zackify 22 days ago [-]
You guys should steal the ideas I had in mind and partially implemented on https://github.com/zackify/revect

Similar to you, I saw a lot of bloated projects out there. Mine is a 90 MB container.

I want to do what your project does, but in addition have extensions for everyday apps that index into a DB.

Your private database for all AI interactions.

I also have a cloud version using the mcp auth spec, but it’s all for fun and probably not worth releasing.

Do you have any plans to support further use cases such as this?

snyy 22 days ago [-]
We want to be the platform that connects documents to AI for all applications. Consequently, we want to cover all use cases, including the ones you mentioned :)
ChromaticPanic 21 days ago [-]
When I was looking at your library last week, it didn't look like there was a direct way to use my own embedding model endpoints. For example, I run Snowflake Arctic Embed in vLLM, and it would be good to be able to use it with Chonkie's semantic chunkers.
amir_karbasi 22 days ago [-]
Looks great! I had looked at Chonkie a few months back, but didn't need it in our pipelines. I was just writing a POC for an agentic chunker this week to handle various formatting and chunking requirements. I'll give Chonkie a shot!
snyy 22 days ago [-]
Awesome! Keep us posted :)
whoaanni2 21 days ago [-]
Congrats on the launch and all the best. For PDF, does it convert directly to Markdown using deterministic approaches, or is it compatible with Reducto/Unstructured/LlamaParse? How does it fit with these players?
pj_mukh 22 days ago [-]
Super cool!

It looks like size and speed are your major advantages. In our RAG pipeline we run the chunking process async as part of onboarding. Is Chonkie primarily for people looking to process documents in some sort of real-time scenario?

snyy 22 days ago [-]
In addition to size and speed, we also offer the widest variety of chunking strategies!

Typically, our current users fall into one of two categories:

- People who run async chunking but need a strategy not supported in LangChain/LlamaIndex. Sometimes speed matters here too, especially for users with a high volume of documents.

- People who need real-time chunking. Super useful for apps like codegen/code review tools.

greymalik 22 days ago [-]
You’re part of YC but this is open source - how do you plan to make money off of it?
snyy 22 days ago [-]
As mentioned in the other reply, we have a cloud/on-prem offering that comes with a managed ETL pipeline built on top of our OSS offering.
tevon 22 days ago [-]
Looks like they will have a cloud offering; on-prem and managed offerings are also mentioned in this post.
Andugal 22 days ago [-]
Congratulations for the launch!

You said that Chonkie works with multiple vector stores. I was wondering what RAG database HN uses. Do you need a specialized one (like Chroma), or is Postgres just fine?

gavmor 22 days ago [-]
Does HN even use a RAG database? What for? They don't even maintain their own search[0].

0. https://hn.algolia.com/

snyy 22 days ago [-]
Not sure what HN uses :)

If you want agents/LLMs to be able to find relevant data based on similarity to queries, vectorDBs like Chroma (or even pgVector) are great.
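
A minimal Chroma sketch (the sample text is a placeholder; Chroma's default embedding function handles the vectors):

    import chromadb
    from chonkie import RecursiveChunker

    some_text = "user: which companies went quiet?\n\nassistant: ..."
    chunks = RecursiveChunker(chunk_size=512).chunk(some_text)

    client = chromadb.Client()  # in-memory; use PersistentClient for disk
    collection = client.create_collection("threads")
    collection.add(
        ids=[f"chunk-{i}" for i in range(len(chunks))],
        documents=[c.text for c in chunks],  # embedded by Chroma's default model
    )

    hits = collection.query(query_texts=["companies gone quiet"], n_results=2)
    print(hits["documents"])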

ketzo 22 days ago [-]
I’m building out a side project where I need to ingest + chunk a lot of HTML — wrote my own(terrible) hunker naively thinking that would be easy :’)

Definitely gonna give this a try!

_epps_ 22 days ago [-]
Excited to try this out! Also +1 for the Moo Deng-ish mascot.
hweller 22 days ago [-]
Congratulations on the launch! Would be awesome to see support for MongoDB Atlas as one of the vector stores and Voyage AI as an embedding provider, if you're interested. I can imagine quite a few customers who would prefer a lightweight interface for chunking; lmk how I can help make that happen from the Mongo side!
snyy 22 days ago [-]
We're working on Mongo integrations!
hweller 13 days ago [-]
Awesome! Let me know if I can be helpful in any way in connecting with Mongo resources.
elliot07 22 days ago [-]
Chonkie is great software. Congrats on the launch! Has been a pleasure to use so far.
snyy 22 days ago [-]
Thank you :)
blef 22 days ago [-]
> - Code Chunking: Chunks code files by creating an AST and finding ideal split points.

I'd be interested in using it for SQL. Did you try it? Does it work well? I'm not familiar with the tree-sitter library.

snyy 22 days ago [-]
Yes :) The code chunker is fantastic for SQL.
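
For example, a sketch (language="sql" is an assumption here: the code chunker resolves grammars via tree-sitter, so this relies on a SQL grammar being available):

    from chonkie import CodeChunker

    sql = """
    CREATE TABLE companies (id bigint, name text, last_contacted date);

    SELECT name FROM companies
    WHERE last_contacted < now() - interval '90 days';
    """

    # Splits at AST boundaries (whole statements) rather than mid-query.
    chunker = CodeChunker(language="sql", chunk_size=256)
    for chunk in chunker.chunk(sql):
        print("---")
        print(chunk.text.strip())
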
petesergeant 21 days ago [-]
It would be cool to have examples of what the distinct chunks each approach produces look like. They should essentially just be paragraphs, right?
gazagoal 21 days ago [-]
Is it easily extensible? For instance, when chunking PDF-converted texts, is it possible to apply transformations or attach metadata to chunks?
tevon 22 days ago [-]
Was just looking into chunking strategies today, this looks great! Will update with any feedback.
snyy 22 days ago [-]
Awesome! Keep us posted!
pzo 22 days ago [-]
Is this only for Node (how about Bun/Deno)? Has it been tested with React Native?
snyy 22 days ago [-]
Node and Bun should work. Haven't tested on Deno yet.

We rely on the huggingface/transformers library, which might be too heavy for a React Native app.

olavfosse 22 days ago [-]
Very cool!

What's the story for chunking PDFs?

We've been using Marker and handling markdown->chunks manually.

snyy 22 days ago [-]
Pretty much what you described: convert the PDF to Markdown, join content across pages so that it's all one string, then chunk it. Our evals show this approach works best.
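
In code, roughly (the pages list stands in for the output of any PDF-to-Markdown converter, e.g. Marker):

    from chonkie import RecursiveChunker

    # One Markdown string per page, from your converter of choice.
    pages = ["# Title\n\nA sentence that runs across the",
             "page break and should stay in one chunk."]

    # Join pages first so chunks can span page boundaries,
    # then chunk the single string.
    document = "\n".join(pages)
    chunks = RecursiveChunker(chunk_size=512).chunk(document)
    print(len(chunks), chunks[0].text[:80])
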
dbworku 22 days ago [-]
Very cool. Dope maintainers and project!
babuloseo 22 days ago [-]
I like the mascot.
esafak 22 days ago [-]
I can't help but think of the SNL sketch. https://www.youtube.com/watch?v=hRGKSwsD7ac