Node Parsers / Text Splitters

Learn how to use Node Parsers and Text Splitters to extract data from documents.

Node parsers are a simple abstraction that takes a list of Document objects and chunks them into Node objects, so that each node is a specific chunk of its parent document. When a document is broken into nodes, all of its attributes (metadata, text and metadata templates, etc.) are inherited by the child nodes. You can read more about Node and Document properties here.
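To make the chunking-and-inheritance idea concrete, here is a minimal, library-free sketch. The `SimpleDoc` and `SimpleNode` types and the `toNodes` helper are hypothetical names invented for illustration; they are not the LlamaIndex classes, and fixed-width slicing stands in for whatever splitting strategy a real node parser uses.

```typescript
// Hypothetical, simplified stand-ins for Document and Node (not the LlamaIndex types).
interface SimpleDoc {
  id: string;
  text: string;
  metadata: Record<string, unknown>;
}

interface SimpleNode {
  text: string;
  metadata: Record<string, unknown>; // copied from the parent document
  parentId: string; // link back to the parent document
}

// Slice the document text into fixed-size chunks; every chunk inherits the metadata.
function toNodes(doc: SimpleDoc, chunkSize: number): SimpleNode[] {
  const result: SimpleNode[] = [];
  for (let start = 0; start < doc.text.length; start += chunkSize) {
    result.push({
      text: doc.text.slice(start, start + chunkSize),
      metadata: { ...doc.metadata },
      parentId: doc.id,
    });
  }
  return result;
}

const exampleDoc: SimpleDoc = {
  id: "doc-1",
  text: "abcdefghij",
  metadata: { source: "example.txt" },
};
const exampleNodes = toNodes(exampleDoc, 4);
console.log(exampleNodes.map((n) => n.text)); // [ 'abcd', 'efgh', 'ij' ]
```

Each resulting node carries its own copy of the parent's metadata, so changing one node's metadata does not affect its siblings.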

By default, we will use Settings.nodeParser to split the document into nodes. You can also assign a custom NodeParser to the Settings object.

SentenceSplitter

The SentenceSplitter is the default NodeParser in LlamaIndex. It will split the text from a Document into sentences.

import { TextFileReader } from '@llamaindex/readers/text'
import { SentenceSplitter } from 'llamaindex';
import { Settings } from 'llamaindex';

// Create a sentence splitter and register it as the global node parser.
const nodeParser = new SentenceSplitter();
Settings.nodeParser = nodeParser;

The underlying text splitter will split text by sentences. It can also be used as a standalone module for splitting raw text.

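import { SentenceSplitter } from "llamaindex";

const splitter = new SentenceSplitter({ chunkSize: 1 });

// splitText works on a raw string and returns an array of text chunks.
const texts = splitter.splitText("Hello World");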
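As a rough illustration of what splitting by sentences means, the sketch below breaks raw text at sentence-ending punctuation. It is a deliberately naive stand-in, not the SentenceSplitter algorithm: the `splitIntoSentences` helper is a hypothetical name, and it ignores chunk sizes, abbreviations, and other edge cases.

```typescript
// Naive sentence splitting: a run of non-terminator characters followed by
// any trailing '.', '!' or '?' characters counts as one sentence.
function splitIntoSentences(text: string): string[] {
  const matches: string[] = text.match(/[^.!?]+[.!?]*/g) ?? [];
  return matches.map((s) => s.trim()).filter((s) => s.length > 0);
}

const sentences = splitIntoSentences("Hello World. How are you? Fine!");
console.log(sentences); // [ 'Hello World.', 'How are you?', 'Fine!' ]
```

Text without any terminating punctuation, such as "Hello World", comes back as a single sentence.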

MarkdownNodeParser

The MarkdownNodeParser is a more advanced NodeParser that can handle markdown documents. It splits the markdown at its headers and returns a list of TextNode objects, each carrying the headers it sits under in its metadata.

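import { MarkdownNodeParser } from "llamaindex";
import { MarkdownReader } from '@llamaindex/readers/markdown'

const reader = new MarkdownReader();
const markdownNodeParser = new MarkdownNodeParser();

// Load the markdown file as documents, then split them into nodes.
const documents = await reader.loadData('path/to/file.md');
const parsedDocuments = markdownNodeParser(documents);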

The output will look something like this, with the header hierarchy recorded in each node's metadata:

[
  TextNode {
    id_: '008e41a8-b097-487c-bee8-bd88b9455844',
    metadata: { 'Header 1': 'Main Header' },
    excludedEmbedMetadataKeys: [],
    excludedLlmMetadataKeys: [],
    relationships: { PARENT: [Array] },
    hash: 'KJ5e/um/RkHaNR6bonj9ormtZY7I8i4XBPVYHXv1A5M=',
    text: 'Main Header\nMain content',
    textTemplate: '',
    metadataSeparator: '\n'
  },
  TextNode {
    id_: '0f5679b3-ba63-4aff-aedc-830c4208d0b5',
    metadata: { 'Header 1': 'Header 2' },
    excludedEmbedMetadataKeys: [],
    excludedLlmMetadataKeys: [],
    relationships: { PARENT: [Array] },
    hash: 'IP/g/dIld3DcbK+uHzDpyeZ9IdOXY4brxhOIe7wc488=',
    text: 'Header 2\nHeader 2 content',
    textTemplate: '',
    metadataSeparator: '\n'
  },
  TextNode {
    id_: 'e81e9bd0-121c-4ead-8ca7-1639d65fdf90',
    metadata: { 'Header 1': 'Header 2', 'Header 2': 'Sub-header' },
    excludedEmbedMetadataKeys: [],
    excludedLlmMetadataKeys: [],
    relationships: { PARENT: [Array] },
    hash: 'B3kYNnxaYi9ghtAgwza0ZEVKF4MozobkNUlcekDL7JQ=',
    text: 'Sub-header\nSub-header content',
    textTemplate: '',
    metadataSeparator: '\n'
  }
]
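The header metadata in the output above can be understood through a small, library-free sketch of header-based splitting. This is not the MarkdownNodeParser implementation: `splitMarkdownByHeaders` and `MdNode` are hypothetical names, the sample input is an assumption reconstructed from the output shown, and the sketch ignores fenced code blocks, where a line starting with `#` would be misread as a header.

```typescript
// Hypothetical, simplified node shape (the real TextNode has many more fields).
interface MdNode {
  text: string;
  metadata: Record<string, string>;
}

function splitMarkdownByHeaders(markdown: string): MdNode[] {
  const result: MdNode[] = [];
  const headerStack: { level: number; title: string }[] = [];
  let currentLines: string[] = [];
  let currentMeta: Record<string, string> = {};

  // Emit the section collected so far, if it has any text.
  const flush = (): void => {
    const text = currentLines.join("\n").trim();
    if (text.length > 0) result.push({ text, metadata: currentMeta });
    currentLines = [];
  };

  for (const line of markdown.split("\n")) {
    const match = /^(#{1,6})\s+(.*)$/.exec(line);
    if (match === null) {
      currentLines.push(line);
      continue;
    }
    flush();
    const level = (match[1] ?? "").length;
    const title = (match[2] ?? "").trim();
    // Drop headers at the same or a deeper level: they are no longer ancestors.
    let top = headerStack[headerStack.length - 1];
    while (top !== undefined && top.level >= level) {
      headerStack.pop();
      top = headerStack[headerStack.length - 1];
    }
    headerStack.push({ level, title });
    // Record the current header path, keyed by header level.
    currentMeta = {};
    for (const h of headerStack) currentMeta[`Header ${h.level}`] = h.title;
    currentLines.push(title);
  }
  flush();
  return result;
}

// Assumed input, reconstructed from the sample output.
const sampleMarkdown = [
  "# Main Header",
  "Main content",
  "# Header 2",
  "Header 2 content",
  "## Sub-header",
  "Sub-header content",
].join("\n");

const mdNodes = splitMarkdownByHeaders(sampleMarkdown);
console.log(mdNodes.map((n) => n.metadata));
```

The stack of open headers is what lets the third node keep its parent header (`'Header 1': 'Header 2'`) alongside its own (`'Header 2': 'Sub-header'`).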

CodeSplitter

The CodeSplitter is a more advanced NodeParser that can handle code documents. It parses the code into an AST, splits it along AST node boundaries, and returns the resulting chunks as TextNode objects.

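import { TextFileReader } from '@llamaindex/readers/text'
import { CodeSplitter } from '@llamaindex/node-parser/code'
import Parser from "tree-sitter";
import TS from "tree-sitter-typescript";

// Configure a tree-sitter parser for TypeScript and hand it to the splitter.
const parser = new Parser();
parser.setLanguage(TS.typescript as Parser.Language);
const codeSplitter = new CodeSplitter({
	getParser: () => parser,
});
const reader = new TextFileReader();
const documents = await reader.loadData('path/to/file.ts');

// Split the loaded source file into nodes along AST boundaries.
const parsedDocuments = codeSplitter(documents);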
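To give an intuition for splitting along AST boundaries, the sketch below greedily packs sibling syntax nodes into chunks and descends into any node that is too large on its own. It is an assumption-laden toy, not the CodeSplitter implementation: `ToyNode` is a hand-built tree rather than a tree-sitter tree, and `chunkNode` and the `maxChars` limit are hypothetical names.

```typescript
// Toy syntax node: its source text plus its child nodes (not a tree-sitter node).
interface ToyNode {
  text: string;
  children: ToyNode[];
}

// Greedily group sibling nodes into chunks of at most maxChars characters.
// A leaf that is larger than maxChars is kept whole, since it cannot be split further.
function chunkNode(node: ToyNode, maxChars: number): string[] {
  if (node.text.length <= maxChars || node.children.length === 0) {
    return [node.text];
  }
  const chunks: string[] = [];
  let current = "";
  for (const child of node.children) {
    if (child.text.length > maxChars) {
      // Too big on its own: close the open chunk and recurse into the child.
      if (current.length > 0) {
        chunks.push(current);
        current = "";
      }
      chunks.push(...chunkNode(child, maxChars));
    } else if (current.length > 0 && current.length + 1 + child.text.length > maxChars) {
      // Adding this child (plus a newline) would overflow: start a new chunk.
      chunks.push(current);
      current = child.text;
    } else {
      current = current.length > 0 ? current + "\n" + child.text : child.text;
    }
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}

const statements = ["import a;", "function f() {}", "function g() {}"];
const program: ToyNode = {
  text: statements.join("\n"),
  children: statements.map((text) => ({ text, children: [] })),
};

console.log(chunkNode(program, 30));
// [ 'import a;\nfunction f() {}', 'function g() {}' ]
```

Because chunk boundaries always fall between syntax nodes, a function or statement is never cut in half unless it is itself larger than the limit and has children to descend into.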
