# Preprocessing Documents

`sieves` provides several preprocessing tasks to prepare your documents for downstream processing. These tasks handle common operations like:
- Parsing various document formats (PDF, DOCX, etc.)
- Chunking long documents into manageable pieces
- Cleaning and standardizing text

## Document Parsing

Note: Ingestion libraries are optional and not installed by default. To use document ingestion, install them manually or install the `ingestion` extra:

```bash
pip install "sieves[ingestion]"
```

You can also install individual libraries directly (e.g., `pip install docling`).

### Using Ingestion

The `Ingestion` task uses the `docling` library (or, alternatively, `marker`) to parse various document formats:

```python
from sieves import Pipeline, tasks, Doc

# Create a document parser
parser = tasks.preprocessing.Ingestion()

# Create a pipeline with the parser
pipeline = Pipeline([parser])

# Process documents
docs = [
    Doc(uri="path/to/document.pdf"),
    Doc(uri="path/to/another.docx"),
]
processed_docs = list(pipeline(docs))

# Access the parsed text
for doc in processed_docs:
    print(doc.text)
```

You can choose one of the supported output formats (Markdown, HTML, JSON) and pass a custom Docling or Marker converter via the `converter` parameter:

```python
from sieves import Pipeline, tasks, Doc
from docling.document_converter import DocumentConverter

# Create a document parser with a custom converter and HTML output
parser = tasks.preprocessing.Ingestion(converter=DocumentConverter(), export_format="html")

# Create a pipeline with the parser
pipeline = Pipeline([parser])

# Process documents
docs = [
    Doc(uri="path/to/document.pdf"),
    Doc(uri="path/to/another.docx"),
]
processed_docs = list(pipeline(docs))

# Access the parsed text
for doc in processed_docs:
    print(doc.text)
```
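
A Marker-based converter can be passed the same way. The snippet below is a minimal sketch assuming a recent `marker-pdf` release; the `PdfConverter` and `create_model_dict` import paths come from that library and may differ in your installed version:

```python
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict

from sieves import Pipeline, tasks, Doc

# Build a Marker converter (import paths assume a recent marker-pdf release)
# and hand it to the Ingestion task via the converter parameter.
parser = tasks.preprocessing.Ingestion(converter=PdfConverter(artifact_dict=create_model_dict()))

pipeline = Pipeline([parser])
docs = list(pipeline([Doc(uri="path/to/document.pdf")]))
print(docs[0].text)
```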

## Document Chunking

Long documents often need to be split into smaller chunks for processing by language models. `sieves` provides two chunking options:

### Using Chonkie

The `Chonkie` task uses the `chonkie` library for intelligent document chunking:

```python
import chonkie
import tokenizers

from sieves import Pipeline, tasks, Doc

# Create a tokenizer for chunking
tokenizer = tokenizers.Tokenizer.from_pretrained("bert-base-uncased")

# Create a token-based chunker
chunker = tasks.preprocessing.Chonkie(
    chunker=chonkie.TokenChunker(tokenizer, chunk_size=512, chunk_overlap=50)
)

# Create and run the pipeline
pipeline = Pipeline([chunker])
doc = Doc(text="Your long document text here...")
chunked_docs = list(pipeline([doc]))

# Access the chunks
for chunk in chunked_docs[0].chunks:
    print(f"Chunk: {chunk}")
```

## Combining Preprocessing Tasks

You can combine multiple preprocessing tasks in a pipeline. Here's an example that parses a PDF with the `Ingestion` task (using Docling by default) and then chunks it with `Chonkie`:

```python
from sieves import Pipeline, tasks, Doc
import chonkie
import tokenizers

# Create the preprocessing tasks
parser = tasks.preprocessing.Ingestion()
tokenizer = tokenizers.Tokenizer.from_pretrained("bert-base-uncased")
chunker = tasks.preprocessing.Chonkie(
    chunker=chonkie.TokenChunker(tokenizer, chunk_size=512, chunk_overlap=50)
)

# Create a pipeline with both tasks
pipeline = Pipeline([parser, chunker])

# Process a document
doc = Doc(uri="path/to/document.pdf")
processed_doc = list(pipeline([doc]))[0]

# Access the chunks
print(f"Number of chunks: {len(processed_doc.chunks)}")
for i, chunk in enumerate(processed_doc.chunks):
    print(f"Chunk {i}: {chunk[:100]}...")  # Print first 100 chars of each chunk
```

## Customizing Preprocessing

### Progress

Progress bars are shown at the pipeline level. Tasks do not expose progress options.

### Metadata

Tasks can include metadata about their processing. Enable this with `include_meta`:

```python
parser = tasks.preprocessing.Ingestion(include_meta=True)
```

Access the metadata in the document's `meta` field:

```python
doc = processed_docs[0]
print(doc.meta["Ingestion"])  # Access parser metadata
print(doc.meta["Chunker"])    # Access chunker metadata
```

## Best Practices

- Document Size: When working with large documents:
  - Always use chunking to break them into manageable pieces
  - Consider the chunk size based on your model's context window
  - Use appropriate chunk overlap to maintain context across chunks
- Error Handling: When parsing documents (a sketch follows after this list):
  - Handle potential parsing errors gracefully
  - Verify that documents were parsed successfully before chunking
  - Check that the chunked text maintains document integrity
- Pipeline Order: When combining tasks:
  - Always parse documents before chunking
  - Consider adding cleaning steps between parsing and chunking
  - Validate the output of each step before proceeding
- Text Cleaning:
  - Choose cleaning functions based on your document types
  - Apply cleaning functions in a logical order (e.g., remove bullets before normalizing whitespace)
  - Test cleaning functions on sample documents to ensure they don't remove important content
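
For the error-handling advice above, a minimal sketch could look like the following. The file paths are placeholders, and whether a failed parse raises an exception or simply leaves `doc.text` empty depends on the ingestion backend you use:

```python
from sieves import Pipeline, tasks, Doc

parser = tasks.preprocessing.Ingestion()
pipeline = Pipeline([parser])

docs = [
    Doc(uri="path/to/document.pdf"),
    Doc(uri="path/to/possibly_corrupt.pdf"),
]

try:
    parsed = list(pipeline(docs))
except Exception as err:
    # Parsing backends can fail on corrupt or unsupported files.
    print(f"Parsing failed: {err}")
    parsed = []

# Only keep documents that actually produced text before chunking them.
parsed = [doc for doc in parsed if doc.text]
print(f"{len(parsed)} of {len(docs)} documents parsed successfully")
```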