# Preprocessing Documents

`sieves` provides several preprocessing tasks to prepare your documents for downstream processing. These tasks handle common operations like:
- Parsing various document formats (PDF, DOCX, etc.)
- Chunking long documents into manageable pieces
## Document Parsing

Note: Ingestion libraries are optional and not installed by default. To use document ingestion, install them manually or install the `ingestion` extra:

```bash
pip install "sieves[ingestion]"
```

You can also install individual libraries directly (e.g., `pip install docling`).
### Using Ingestion

The `Ingestion` task uses the `docling` library (or, alternatively, `marker`) to parse various document formats:
```python
from sieves import Pipeline, tasks, Doc

# Create a document parser.
parser = tasks.preprocessing.Ingestion()

# Create a pipeline with the parser.
pipeline = Pipeline([parser])

# Process documents (requires actual PDF/DOCX files).
docs = [
    Doc(uri="path/to/document.pdf"),
    Doc(uri="path/to/another.docx"),
]
parsed_docs = list(pipeline(docs))

# Note: Ingestion requires actual files and the optional dependencies.
# Install with: pip install "sieves[ingestion]"
```
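After parsing, each document carries the extracted content. A minimal sketch for inspecting the results, assuming the parsed text is stored on `Doc.text`:

```python
# A sketch: inspect parsed output. Assumes parsed content lands on Doc.text.
for doc in parsed_docs:
    print(doc.uri, (doc.text or "")[:200])  # First 200 chars of each parsed doc.
```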
You can choose a specific output format from the supported ones (Markdown, HTML, JSON) and pass a custom Docling or Marker converter via the `converter` parameter:
```python
from sieves import Pipeline, tasks, Doc

# Create a document parser with a custom export format.
parser = tasks.preprocessing.Ingestion(export_format="html")

# Create a pipeline with the parser.
pipeline = Pipeline([parser])

# Process documents (requires actual PDF/DOCX files).
docs = [
    Doc(uri="path/to/document.pdf"),
    Doc(uri="path/to/another.docx"),
]
parsed_docs = list(pipeline(docs))

# Note: Ingestion requires actual files and the optional dependencies.
# Install with: pip install "sieves[ingestion]"
```
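To use a custom converter, construct it yourself and hand it to the task. A sketch assuming a stock Docling `DocumentConverter`; configure it as needed before passing it in:

```python
from docling.document_converter import DocumentConverter

from sieves import tasks

# A sketch: pass a custom (here: default-configured) Docling converter.
converter = DocumentConverter()
parser = tasks.preprocessing.Ingestion(converter=converter, export_format="markdown")
```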
## Document Chunking

Long documents often need to be split into smaller chunks before they can be processed by language models. `sieves` provides a dedicated chunking task for this:

### Using Chunking

The `Chunking` task uses the `chonkie` library for intelligent document chunking:
```python
import chonkie
import tokenizers

from sieves import Pipeline, tasks, Doc

# Create a tokenizer for chunking.
tokenizer = tokenizers.Tokenizer.from_pretrained("bert-base-uncased")

# Create a token-based chunker: 512-token chunks with a 50-token overlap.
chunker = tasks.Chunking(
    chunker=chonkie.TokenChunker(tokenizer, chunk_size=512, chunk_overlap=50)
)

# Create and run the pipeline.
pipeline = Pipeline([chunker])
doc = Doc(text="Your long document text here...")
chunked_docs = list(pipeline([doc]))

# Access the chunks.
for chunk in chunked_docs[0].chunks:
    print(f"Chunk: {chunk}")
```
## Combining Preprocessing Tasks

You can combine multiple preprocessing tasks in a pipeline. Here's an example that parses a PDF with the `Ingestion` task (using Docling by default) and then chunks the result:
```python
import chonkie
import tokenizers

from sieves import Pipeline, tasks, Doc

# Create the parser.
parser = tasks.preprocessing.Ingestion()

# Create a tokenizer and chunker.
tokenizer = tokenizers.Tokenizer.from_pretrained("bert-base-uncased")
chunker = tasks.Chunking(
    chunker=chonkie.TokenChunker(tokenizer, chunk_size=512, chunk_overlap=50)
)

# Create a pipeline that first parses, then chunks.
pipeline = Pipeline([parser, chunker])

# Process a document (requires an actual PDF file).
doc = Doc(uri="path/to/document.pdf")
processed_doc = list(pipeline([doc]))[0]

# Access the chunks.
print(f"Number of chunks: {len(processed_doc.chunks)}")
for i, chunk in enumerate(processed_doc.chunks):
    print(f"Chunk {i}: {chunk[:100]}...")  # Print first 100 chars of each chunk.
```
## Customizing Preprocessing

### Progress

Progress bars are shown at the pipeline level; tasks do not expose their own progress options.
### Metadata

Tasks can include metadata about their processing. Enable this with `include_meta`:

```python
from sieves import tasks

parser = tasks.preprocessing.Ingestion(include_meta=True)
```
Access the metadata in the document's `meta` field:

```python
doc = processed_docs[0]
print(doc.meta["Ingestion"])  # Access parser metadata.
print(doc.meta["Chunking"])   # Access chunker metadata.
```
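When several tasks run in one pipeline, each can record its own metadata. A sketch, assuming `Chunking` accepts `include_meta` just like `Ingestion` does:

```python
import chonkie
import tokenizers

from sieves import Pipeline, tasks

# A sketch: enable metadata on both tasks. Assumes Chunking also accepts
# include_meta, mirroring Ingestion above.
tokenizer = tokenizers.Tokenizer.from_pretrained("bert-base-uncased")
parser = tasks.preprocessing.Ingestion(include_meta=True)
chunker = tasks.Chunking(
    chunker=chonkie.TokenChunker(tokenizer, chunk_size=512, chunk_overlap=50),
    include_meta=True,
)
pipeline = Pipeline([parser, chunker])
```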