Loading Data
Loading data using Readers into Documents
Before you can start indexing your documents, you need to load them into memory.
A reader is a module that loads data from a file into a Document object.
To install the readers, run npm install @llamaindex/readers in your project.
We offer readers for different file formats.
Additionally, the following loaders exist without separate documentation:
- AssemblyAIReader transcribes audio using AssemblyAI.
  - AudioTranscriptReader: loads the entire transcript as a single document.
  - AudioTranscriptParagraphsReader: creates a document per paragraph.
  - AudioTranscriptSentencesReader: creates a document per sentence.
  - AudioSubtitlesReader: creates a document containing the subtitles of a transcript.
- NotionReader loads Notion pages.
- SimpleMongoReader loads data from MongoDB.
Check the LlamaIndexTS GitHub for the most up-to-date overview of integrations.
SimpleDirectoryReader
LlamaIndex.TS supports easy loading of files from folders using the SimpleDirectoryReader class.
It is a simple reader that reads all files from a directory and its subdirectories and delegates the actual reading to the reader specified in the fileExtToReader map.
Currently, the following readers are mapped to specific file types:
- TextFileReader: .txt
- PDFReader: .pdf
- CSVReader: .csv
- MarkdownReader: .md
- DocxReader: .docx
- HTMLReader: .htm, .html
- ImageReader: .jpg, .jpeg, .png, .gif
You can modify the reader in three different ways:
- overrideReader overrides the reader for all file types, including unsupported ones.
- fileExtToReader maps a reader to a specific file type. It can override the reader for existing file types or add support for new file types.
- defaultReader sets a fallback reader for files with unsupported extensions. By default it is TextFileReader.
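The way these three options interact can be sketched as a small lookup function. This is a simplified model and not the library's implementation: ReaderLike, ReaderOptions, builtInReaders and pickReader are hypothetical names, the built-in map is abbreviated, and the choices to key the map by extension without the dot and to lowercase the extension are assumptions of the sketch.

```typescript
// Hypothetical stand-in for a reader instance; real readers load file contents.
interface ReaderLike {
  readonly name: string;
}

// Option names follow the list above; the shapes are assumptions of this sketch.
interface ReaderOptions {
  overrideReader?: ReaderLike;
  fileExtToReader?: Record<string, ReaderLike>; // keyed by extension without the dot (assumed)
  defaultReader?: ReaderLike | null; // null means "no fallback" in this sketch
}

const textFileReader: ReaderLike = { name: "TextFileReader" };

// Abbreviated version of the built-in extension mapping.
const builtInReaders: Record<string, ReaderLike> = {
  txt: textFileReader,
  pdf: { name: "PDFReader" },
  md: { name: "MarkdownReader" },
};

function pickReader(fileName: string, options: ReaderOptions = {}): ReaderLike | null {
  // 1. overrideReader wins for every file, supported or not.
  if (options.overrideReader) return options.overrideReader;

  // 2. User-supplied mappings replace or extend the built-in ones.
  const readers = { ...builtInReaders, ...(options.fileExtToReader ?? {}) };
  const dot = fileName.lastIndexOf(".");
  const ext = dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : "";
  if (Object.prototype.hasOwnProperty.call(readers, ext)) return readers[ext];

  // 3. Unsupported extension: use defaultReader, or TextFileReader if it is not set.
  return options.defaultReader === undefined ? textFileReader : options.defaultReader;
}
```

For instance, pickReader("report.pdf", { fileExtToReader: { pdf: myReader } }) returns myReader in this model, while a file with an unmapped extension falls through to the default reader.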
SimpleDirectoryReader supports up to 9 concurrent requests. Use the numWorkers option to set the number of concurrent requests. By default it runs in sequential mode, i.e. numWorkers is set to 1.
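To illustrate what a bounded number of workers means, here is a minimal sketch of a worker pool that keeps at most numWorkers tasks in flight. It is a generic concurrency pattern, not the code SimpleDirectoryReader uses; mapWithWorkers is a hypothetical helper, and the 1 to 9 range check simply mirrors the limit stated above.

```typescript
// Hypothetical helper: runs `task` over `items` with at most `numWorkers` in flight.
async function mapWithWorkers<T, R>(
  items: T[],
  numWorkers: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(numWorkers) || numWorkers < 1 || numWorkers > 9) {
    throw new RangeError("numWorkers must be an integer between 1 and 9");
  }

  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item, shared by all workers

  // Each worker repeatedly claims the next unprocessed index until none remain.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index]);
    }
  }

  // Never start more workers than there are items.
  const workerCount = Math.min(numWorkers, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results; // results keep the order of `items`
}
```

With numWorkers set to 1, the single worker processes items one after another, which corresponds to the sequential default described above.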
Example
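The following is a minimal sketch of loading a folder with the options described above. To keep it self-contained, the reader is passed in through a structural type (DirectoryReaderLike) instead of being imported; LoadDataParams, DocumentLike and loadFolder are hypothetical names, and the object-style loadData call with directoryPath is an assumption about the library's API that should be checked against the installed version.

```typescript
// Hypothetical minimal shape of a loaded document.
interface DocumentLike {
  text: string;
}

// Option names follow the section above; directoryPath is an assumed parameter name.
interface LoadDataParams {
  directoryPath: string;
  numWorkers?: number;
  fileExtToReader?: Record<string, unknown>;
  defaultReader?: unknown;
  overrideReader?: unknown;
}

// Only the part of a directory reader that this sketch relies on.
interface DirectoryReaderLike {
  loadData(params: LoadDataParams): Promise<DocumentLike[]>;
}

// Loads every file below `directoryPath`, reading .mdx files with `mdxReader`
// (adding support for a new file type) and using four concurrent workers.
async function loadFolder(
  reader: DirectoryReaderLike,
  directoryPath: string,
  mdxReader: unknown,
): Promise<DocumentLike[]> {
  return reader.loadData({
    directoryPath,
    fileExtToReader: { mdx: mdxReader },
    numWorkers: 4,
  });
}
```

In an application, the reader argument would be a SimpleDirectoryReader instance and mdxReader any reader suited to that file type, for example a MarkdownReader.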
Tips when using in non-Node.js environments
When using @llamaindex/readers in a non-Node.js environment (such as Vercel Edge, Cloudflare Workers, etc.), some classes are not exported from the top-level entry file.
The reason is that some classes (e.g. PDFReader) are only compatible with the Node.js runtime because they use Node.js-specific APIs (like fs, child_process, crypto).
If you need any of those classes, you have to import them directly through their file path in the package.
Because PDFReader does not work with the Edge runtime, you can use SimpleDirectoryReader together with LlamaParseReader to load PDFs there.
Note: Reader classes have to be added explicitly to the fileExtToReader map in the Edge version of the SimpleDirectoryReader.
You'll find a complete example with LlamaIndexTS here: https://github.com/run-llama/create_llama_projects/tree/main/nextjs-edge-llamaparse
Load file natively using Node.js Customization Hooks
We provide a helper utility that lets you import a file in a Node.js script.