Node Parsers / Text Splitters
Learn how to use Node Parsers and Text Splitters to extract data from documents.
Node parsers are a simple abstraction that takes a list of Document objects and chunks them into Node objects, such that each node is a specific chunk of the parent document. When a document is broken into nodes, all of its attributes are inherited by the child nodes (e.g. metadata, text and metadata templates). You can read more about Node and Document properties here.
By default, we will use Settings.nodeParser to split the document into nodes. You can also assign a custom NodeParser to the Settings object.
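To make the idea concrete, here is a minimal, self-contained sketch of the pattern described above: documents are chunked into nodes that inherit the parent's metadata, and a settings object holds a default parser that can be replaced. All names here (`Doc`, `ChunkNode`, `Parser`, `FixedSizeParser`, `settingsSketch`) are hypothetical stand-ins rather than the LlamaIndex classes, and fixed-size character chunking is only an assumed placeholder strategy.

```typescript
// Hypothetical, simplified types for illustration; not the LlamaIndex classes.
interface Doc {
  id: string;
  text: string;
  metadata: Record<string, unknown>;
}

interface ChunkNode {
  text: string;
  metadata: Record<string, unknown>; // copied from the parent document
  parentId: string;
}

interface Parser {
  getNodes(docs: Doc[]): ChunkNode[];
}

// Splits text into fixed-size character chunks (a placeholder strategy).
class FixedSizeParser implements Parser {
  private readonly chunkSize: number;

  constructor(chunkSize: number) {
    this.chunkSize = chunkSize;
  }

  getNodes(docs: Doc[]): ChunkNode[] {
    const nodes: ChunkNode[] = [];
    for (const doc of docs) {
      for (let i = 0; i < doc.text.length; i += this.chunkSize) {
        nodes.push({
          text: doc.text.slice(i, i + this.chunkSize),
          metadata: { ...doc.metadata }, // each node inherits the parent's metadata
          parentId: doc.id,
        });
      }
    }
    return nodes;
  }
}

// A settings-like object holding the default parser, which can be swapped out.
const settingsSketch: { parser: Parser } = { parser: new FixedSizeParser(10) };

const exampleDoc: Doc = {
  id: "doc-1",
  text: "abcdefghijklmnopqrstuvwxy",
  metadata: { source: "notes.txt" },
};

// Use the default parser.
const defaultNodes = settingsSketch.parser.getNodes([exampleDoc]);

// Assign a custom parser, then parse again.
settingsSketch.parser = new FixedSizeParser(5);
const customNodes = settingsSketch.parser.getNodes([exampleDoc]);
```

Copying the metadata per node, rather than sharing one object, means a later change to one node's metadata does not leak into its siblings.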
SentenceSplitter
The SentenceSplitter is the default NodeParser in LlamaIndex. It will split the text from a Document into sentences.
The underlying text splitter will split text by sentences. It can also be used as a standalone module for splitting raw text.
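As a rough illustration of sentence-based splitting on raw text, the sketch below splits at sentence-ending punctuation and then packs whole sentences into chunks. The functions `splitIntoSentences` and `chunkBySentence` and the character-based `maxChars` limit are assumptions made for this example; the library's splitter has its own options and rules, which this sketch does not attempt to reproduce.

```typescript
// Splits raw text into sentences at '.', '!' or '?' followed by whitespace.
function splitIntoSentences(text: string): string[] {
  return text
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Greedily packs whole sentences into chunks of at most maxChars characters.
// A single sentence longer than maxChars becomes its own oversized chunk.
function chunkBySentence(text: string, maxChars: number): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const sentence of splitIntoSentences(text)) {
    const candidate = current === "" ? sentence : `${current} ${sentence}`;
    if (candidate.length <= maxChars || current === "") {
      current = candidate;
    } else {
      chunks.push(current);
      current = sentence;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

const sentenceChunks = chunkBySentence(
  "Hello world. This is a test! Is it working? Yes.",
  30,
);
// sentenceChunks: ["Hello world. This is a test!", "Is it working? Yes."]
```

Keeping sentences whole means a chunk never ends mid-sentence, at the cost of chunks that vary in length.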
MarkdownNodeParser
The MarkdownNodeParser is a more advanced NodeParser that can handle markdown documents. It will split the markdown into nodes and then parse the nodes into a Document object.
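One way to picture markdown-aware splitting is to cut the text at each heading and record the trail of headings above each section. The sketch below assumes that approach; `splitMarkdownByHeadings` and the `headingPath` metadata key are hypothetical names chosen for this example and are not the parser's actual API or output format. For brevity it also treats any line starting with `#` as a heading, including lines inside fenced code.

```typescript
interface MarkdownSection {
  text: string;
  metadata: { headingPath: string[] }; // hypothetical metadata key
}

// Splits markdown at ATX headings ("#" to "######") and tags each section
// with the trail of headings it sits under.
function splitMarkdownByHeadings(markdown: string): MarkdownSection[] {
  const sections: MarkdownSection[] = [];
  const path: string[] = []; // current heading trail
  let buffer: string[] = [];

  const flush = (): void => {
    const text = buffer.join("\n").trim();
    if (text !== "") {
      sections.push({ text, metadata: { headingPath: [...path] } });
    }
    buffer = [];
  };

  for (const line of markdown.split("\n")) {
    const match = /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      flush(); // close the previous section before the trail changes
      const [, hashes = "", title = ""] = match;
      path.splice(hashes.length - 1); // drop headings at this level and deeper
      path.push(title.trim());
    }
    buffer.push(line);
  }
  flush();
  return sections;
}

const mdSample = [
  "# Main Header",
  "Main content",
  "## Sub-header",
  "Sub content",
  "# Another Header",
  "More content",
].join("\n");

const mdSections = splitMarkdownByHeadings(mdSample);
// mdSections[1].metadata.headingPath: ["Main Header", "Sub-header"]
```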
The output metadata will be something like:
CodeSplitter
The CodeSplitter is a more advanced NodeParser that can handle code documents. It will split the code by AST nodes and then parse the nodes into a Document object.
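Splitting on AST nodes needs a real parser, which is beyond a short example. As a deliberately simpler stand-in, the sketch below approximates "one chunk per top-level declaration" by tracking curly-brace depth. It does not build an AST, the function `splitTopLevelBlocks` is a hypothetical name, and braces inside strings or comments will confuse it.

```typescript
// Splits source code into top-level blocks by tracking curly-brace depth.
// This is a heuristic stand-in for AST-based splitting, not an AST walk.
function splitTopLevelBlocks(source: string): string[] {
  const blocks: string[] = [];
  let buffer: string[] = [];
  let depth = 0;

  for (const line of source.split("\n")) {
    if (line.trim() === "" && depth === 0) continue; // skip blank lines between blocks
    buffer.push(line);
    for (const ch of line) {
      if (ch === "{") depth += 1;
      else if (ch === "}") depth -= 1;
    }
    if (depth === 0) {
      blocks.push(buffer.join("\n")); // a top-level statement or declaration ended
      buffer = [];
    }
  }
  if (buffer.length > 0) blocks.push(buffer.join("\n")); // unbalanced remainder
  return blocks;
}

const codeSample = [
  "const x = 1;",
  "",
  "function add(a, b) {",
  "  return a + b;",
  "}",
  "",
  "class Greeter {",
  "  greet() {",
  "    return 'hi';",
  "  }",
  "}",
].join("\n");

const codeBlocks = splitTopLevelBlocks(codeSample);
// codeBlocks has three entries: the constant, the function, and the class.
```

The point of splitting along syntactic units is that each chunk stays a complete, readable piece of code instead of being cut mid-function.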