Preprocessing Documents

sieves provides several preprocessing tasks to prepare your documents for downstream processing. These tasks handle common operations like:

- Parsing various document formats (PDF, DOCX, etc.)
- Chunking long documents into manageable pieces
- Cleaning and standardizing text

Document Parsing

Using Docling

The Docling task uses the docling library to parse various document formats:

from sieves import Pipeline, tasks, Doc

# Create a document parser
parser = tasks.preprocessing.Docling()

# Create a pipeline with the parser
pipeline = Pipeline([parser])

# Process documents
docs = [
    Doc(uri="path/to/document.pdf"),
    Doc(uri="path/to/another.docx")
]
processed_docs = list(pipeline(docs))

# Access the parsed text
for doc in processed_docs:
    print(doc.text)

Using Unstructured

The Unstructured task uses the unstructured library, which provides robust document parsing capabilities:

from sieves import Pipeline, tasks, Doc
from unstructured.cleaners.core import (
    clean_extra_whitespace
)

# Create an unstructured parser with cleaning functions
parser = tasks.preprocessing.Unstructured(
    # Add cleaning functions to process the text
    cleaners=(clean_extra_whitespace,)  # Normalize whitespace
)

# Create and run the pipeline
pipeline = Pipeline([parser])
docs = [Doc(text="● This is a dummy   document.®")]
processed_docs = list(pipeline(docs))
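
As with the Docling example, the parsed and cleaned text is available on each document's text attribute:

# Inspect the cleaned text; the repeated whitespace should now be normalized
for doc in processed_docs:
    print(doc.text)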

Text Cleaning Functions

The Unstructured library provides several cleaning functions that you can combine:

from unstructured.cleaners.core import (
    clean,                    # Combined cleaning function
    clean_bullets,           # Remove bullet points (•, -, etc.)
    clean_ordered_bullets,   # Remove numbered bullets (1., a., etc.)
    clean_extra_whitespace,  # Normalize whitespace
    clean_dashes,           # Remove dashes
    clean_non_ascii_chars,  # Remove non-ASCII characters
    clean_trailing_punctuation,  # Remove trailing punctuation
    bytes_string_to_string  # Convert byte strings to normal strings
)
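
As a sketch of combining several cleaners, the example below removes bullet characters and then normalizes the remaining whitespace, passing both functions to the Unstructured task exactly as in the example above (the tuple order is assumed to be the order in which the cleaners are applied):

from sieves import Pipeline, tasks, Doc
from unstructured.cleaners.core import clean_bullets, clean_extra_whitespace

# Remove bullet characters first, then collapse repeated whitespace
parser = tasks.preprocessing.Unstructured(
    cleaners=(clean_bullets, clean_extra_whitespace)
)

pipeline = Pipeline([parser])
docs = list(pipeline([Doc(text="● A bulleted   line with   extra spaces")]))
print(docs[0].text)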

Document Chunking

Long documents often need to be split into smaller chunks so that they fit within a language model's context window. sieves provides chunking tasks for this:

Using Chonkie

The Chonkie task uses the chonkie library for intelligent document chunking:

import chonkie
import tokenizers
from sieves import Pipeline, tasks, Doc

# Create a tokenizer for chunking
tokenizer = tokenizers.Tokenizer.from_pretrained("bert-base-uncased")

# Create a token-based chunker
chunker = tasks.preprocessing.Chonkie(
    chunker=chonkie.TokenChunker(tokenizer, chunk_size=512, chunk_overlap=50)
)

# Create and run the pipeline
pipeline = Pipeline([chunker])
doc = Doc(text="Your long document text here...")
chunked_docs = list(pipeline([doc]))

# Access the chunks
for chunk in chunked_docs[0].chunks:
    print(f"Chunk: {chunk}")

Combining Preprocessing Tasks

You can combine multiple preprocessing tasks in a pipeline. Here's an example that parses a PDF using Docling and then chunks it with Chonkie:

from sieves import Pipeline, tasks, Doc
import chonkie
import tokenizers

# Create the preprocessing tasks
parser = tasks.preprocessing.Docling()
tokenizer = tokenizers.Tokenizer.from_pretrained("bert-base-uncased")
chunker = tasks.preprocessing.Chonkie(
    chunker=chonkie.TokenChunker(tokenizer, chunk_size=512, chunk_overlap=50)
)

# Create a pipeline with both tasks
pipeline = Pipeline([parser, chunker])

# Process a document
doc = Doc(uri="path/to/document.pdf")
processed_doc = list(pipeline([doc]))[0]

# Access the chunks
print(f"Number of chunks: {len(processed_doc.chunks)}")
for i, chunk in enumerate(processed_doc.chunks):
    print(f"Chunk {i}: {chunk[:100]}...")  # Print first 100 chars of each chunk

Customizing Preprocessing

Progress Bars

All preprocessing tasks support progress bars. You can enable or disable them with the show_progress argument:

parser = tasks.preprocessing.Docling(show_progress=True)
chunker = tasks.preprocessing.Chonkie(
    chunker=chonkie.TokenChunker(tokenizer),
    show_progress=True
)

Metadata

Tasks can include metadata about their processing. Enable this with include_meta:

parser = tasks.preprocessing.Docling(include_meta=True)

Access the metadata in the document's meta field:

doc = processed_docs[0]
print(doc.meta["Docling"])  # Access parser metadata
print(doc.meta["Chunker"])  # Access chunker metadata

Best Practices

  1. Document Size: When working with large documents:
     - Always use chunking to break them into manageable pieces
     - Consider the chunk size based on your model's context window
     - Use appropriate chunk overlap to maintain context across chunks

  2. Error Handling: When parsing documents:
     - Handle potential parsing errors gracefully
     - Verify that documents were parsed successfully before chunking (see the sketch after this list)
     - Check that the chunked text maintains document integrity

  3. Pipeline Order: When combining tasks:
     - Always parse documents before chunking
     - Consider adding cleaning steps between parsing and chunking
     - Validate the output of each step before proceeding

  4. Text Cleaning:
     - Choose cleaning functions based on your document types
     - Apply cleaning functions in a logical order (e.g., remove bullets before normalizing whitespace)
     - Test cleaning functions on sample documents to ensure they don't remove important content
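
The error-handling points above can be implemented as a validation step between parsing and chunking. This is a minimal sketch, not part of the sieves API; treating a missing or empty text attribute as a failed parse is an assumption about how parsing failures surface:

import chonkie
import tokenizers
from sieves import Pipeline, tasks, Doc

parser = tasks.preprocessing.Docling()
tokenizer = tokenizers.Tokenizer.from_pretrained("bert-base-uncased")
chunker = tasks.preprocessing.Chonkie(
    chunker=chonkie.TokenChunker(tokenizer, chunk_size=512, chunk_overlap=50)
)

docs = [
    Doc(uri="path/to/document.pdf"),
    Doc(uri="path/to/another.docx")
]

# Run parsing on its own so failed documents can be filtered out
parsed_docs = []
for doc in Pipeline([parser])(docs):
    # Assumption: a failed or empty parse shows up as missing or blank text
    if not doc.text or not doc.text.strip():
        print(f"Skipping document that could not be parsed: {doc.uri}")
        continue
    parsed_docs.append(doc)

# Chunk only the documents that parsed successfully
chunked_docs = list(Pipeline([chunker])(parsed_docs))
for doc in chunked_docs:
    print(f"{doc.uri}: {len(doc.chunks)} chunks")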