Preprocessing Documents

sieves provides several preprocessing tasks to prepare your documents for downstream processing. These tasks handle common operations like:

  • Parsing various document formats (PDF, DOCX, etc.)
  • Chunking long documents into manageable pieces

Document Parsing

Note: Ingestion libraries are optional and not installed by default. To use document ingestion, install them manually or install the ingestion extra:

pip install "sieves[ingestion]"

You can also install individual libraries directly (e.g., pip install docling).

Using Ingestion

The Ingestion task uses the docling library (or, alternatively, marker) to parse various document formats:

Basic document ingestion
from sieves import Pipeline, tasks, Doc

# Create a document parser
parser = tasks.preprocessing.Ingestion()

# Create a pipeline with the parser
pipeline = Pipeline([parser])

# Process documents (requires actual PDF/DOCX files)
docs = [
    Doc(uri="path/to/document.pdf"),
    Doc(uri="path/to/another.docx"),
]
parsed_docs = list(pipeline(docs))

# Note: Ingestion requires actual files and optional dependencies.
# Install with: pip install "sieves[ingestion]"

You can choose one of the supported output formats (Markdown, HTML, JSON) and pass a custom Docling or Marker converter via the converter parameter:

Custom converter with export format
from sieves import Pipeline, tasks, Doc

# Create a document parser with custom export format
parser = tasks.preprocessing.Ingestion(export_format="html")

# Create a pipeline with the parser
pipeline = Pipeline([parser])

# Process documents (requires actual PDF/DOCX files)
docs = [
    Doc(uri="path/to/document.pdf"),
    Doc(uri="path/to/another.docx"),
]
parsed_docs = list(pipeline(docs))

# Note: Ingestion requires actual files and optional dependencies.
# Install with: pip install "sieves[ingestion]"

Document Chunking

Long documents often need to be split into smaller chunks for processing by language models. sieves provides chunking through the Chunking task:

Using Chunking

The Chunking task uses the chonkie library for intelligent document chunking:

Token-based chunking with Chonkie
import chonkie
import tokenizers
from sieves import Pipeline, tasks, Doc

# Create a tokenizer for chunking
tokenizer = tokenizers.Tokenizer.from_pretrained("bert-base-uncased")

# Create a token-based chunker
chunker = tasks.Chunking(
    chunker=chonkie.TokenChunker(tokenizer, chunk_size=512, chunk_overlap=50)
)

# Create and run the pipeline
pipeline = Pipeline([chunker])
doc = Doc(text="Your long document text here...")
chunked_docs = list(pipeline([doc]))

# Access the chunks
for chunk in chunked_docs[0].chunks:
    print(f"Chunk: {chunk}")
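To make the chunk_size and chunk_overlap parameters concrete, here is a minimal sketch of the sliding-window logic a token chunker applies (plain Python, independent of chonkie and sieves; the function name is illustrative): each chunk holds at most chunk_size tokens, and consecutive chunks share chunk_overlap tokens.

```python
def window_tokens(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Split a token list into overlapping windows of at most chunk_size tokens."""
    step = chunk_size - chunk_overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window reached the end of the token list
    return chunks

tokens = [f"tok{i}" for i in range(10)]
chunks = window_tokens(tokens, chunk_size=4, chunk_overlap=1)
print(chunks)
# Each window advances by 3 tokens, so neighboring chunks share 1 token
```

With chunk_size=512 and chunk_overlap=50, as in the example above, each chunk would cover up to 512 tokens and repeat the last 50 tokens of its predecessor, which helps preserve context across chunk boundaries.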

Combining Preprocessing Tasks

You can combine multiple preprocessing tasks in a pipeline. Here's an example that parses a PDF with the Ingestion task (Docling is the default) and then chunks the result:

Combined preprocessing pipeline
from sieves import tasks, Doc, Pipeline
import chonkie
import tokenizers

# Create a tokenizer and chunker
tokenizer = tokenizers.Tokenizer.from_pretrained("bert-base-uncased")
chunker = tasks.Chunking(
    chunker=chonkie.TokenChunker(tokenizer, chunk_size=512, chunk_overlap=50)
)

# Create a pipeline that parses documents, then chunks them
parser = tasks.preprocessing.Ingestion()
pipeline = Pipeline([parser, chunker])

# Process a PDF document (requires an actual file and the ingestion extra)
doc = Doc(uri="path/to/document.pdf")
processed_doc = list(pipeline([doc]))[0]

# Access the chunks
print(f"Number of chunks: {len(processed_doc.chunks)}")
for i, chunk in enumerate(processed_doc.chunks):
    print(f"Chunk {i}: {chunk[:100]}...")  # Print first 100 chars of each chunk
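The pipeline pattern above, where each task consumes the previous task's output, can be sketched in plain Python as chained generators (this is independent of sieves; the dict-based documents and task names here are purely illustrative):

```python
from typing import Callable, Iterable

def make_pipeline(tasks: list[Callable[[Iterable[dict]], Iterable[dict]]]):
    """Chain tasks so each consumes the documents the previous task yields."""
    def run(docs: Iterable[dict]) -> Iterable[dict]:
        for task in tasks:
            docs = task(docs)  # each task wraps the stream of the one before it
        return docs
    return run

def parse(docs):
    # Toy "parser": derives text from each document's URI.
    for doc in docs:
        doc["text"] = f"parsed:{doc['uri']}"
        yield doc

def chunk(docs):
    # Toy "chunker": splits the parsed text into pieces.
    for doc in docs:
        doc["chunks"] = doc["text"].split(":")
        yield doc

pipeline = make_pipeline([parse, chunk])
result = list(pipeline([{"uri": "a.pdf"}]))
print(result[0]["chunks"])  # chunks derived from the parsed text
```

Because the tasks are chained lazily, documents stream through the pipeline one at a time rather than being materialized at every stage.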

Customizing Preprocessing

Progress

Progress bars are shown at the pipeline level. Tasks do not expose progress options.

Metadata

Tasks can include metadata about their processing. Enable this with include_meta:

Enable metadata inclusion
from sieves import tasks

parser = tasks.preprocessing.Ingestion(include_meta=True)

Access the metadata in the document's meta field:

Access preprocessing metadata
doc = processed_docs[0]
print(doc.meta["Ingestion"])  # Access parser metadata
print(doc.meta["Chunker"])  # Access chunker metadata