
Preprocessing Documents

sieves provides several preprocessing tasks to prepare your documents for downstream processing. These tasks handle common operations like:

- Parsing various document formats (PDF, DOCX, etc.)
- Chunking long documents into manageable pieces
- Cleaning and standardizing text

Document Parsing

Note: Ingestion libraries are optional and not installed by default. To use document ingestion, install them manually or install the ingestion extra:

pip install "sieves[ingestion]"

You can also install individual libraries directly (e.g., pip install docling).

Using Ingestion

The Ingestion task uses the docling library (or, alternatively, marker) to parse various document formats:

from sieves import Pipeline, tasks, Doc

# Create a document parser
parser = tasks.preprocessing.Ingestion()

# Create a pipeline with the parser
pipeline = Pipeline([parser])

# Process documents
docs = [
    Doc(uri="path/to/document.pdf"),
    Doc(uri="path/to/another.docx")
]
processed_docs = list(pipeline(docs))

# Access the parsed text
for doc in processed_docs:
    print(doc.text)

You can choose one of the supported output formats (Markdown, HTML, JSON) and pass a custom Docling or Marker converter via the converter parameter:

from sieves import Pipeline, tasks, Doc

from docling.document_converter import DocumentConverter

# Create a document parser
parser = tasks.preprocessing.Ingestion(converter=DocumentConverter(), export_format="html")

# Create a pipeline with the parser
pipeline = Pipeline([parser])

# Process documents
docs = [
    Doc(uri="path/to/document.pdf"),
    Doc(uri="path/to/another.docx")
]
processed_docs = list(pipeline(docs))

# Access the parsed text
for doc in processed_docs:
    print(doc.text)
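If you prefer Marker as the parsing backend, pass a Marker converter in the same way. The following is a minimal sketch assuming the marker library is installed; PdfConverter and create_model_dict come from marker's recent API, so check the import paths against your installed version:

from sieves import Pipeline, tasks, Doc

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict

# Create a Marker-backed document parser (Markdown is one of the supported output formats)
parser = tasks.preprocessing.Ingestion(
    converter=PdfConverter(artifact_dict=create_model_dict()),
    export_format="markdown",
)

# Process a PDF as before
pipeline = Pipeline([parser])
processed_docs = list(pipeline([Doc(uri="path/to/document.pdf")]))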

Document Chunking

Long documents often need to be split into smaller chunks so they fit within a language model's context window. sieves provides chunking via the Chonkie task:

Using Chonkie

The Chonkie task uses the chonkie library for intelligent document chunking:

import chonkie
import tokenizers
from sieves import Pipeline, tasks, Doc

# Create a tokenizer for chunking
tokenizer = tokenizers.Tokenizer.from_pretrained("bert-base-uncased")

# Create a token-based chunker
chunker = tasks.preprocessing.Chonkie(
    chunker=chonkie.TokenChunker(tokenizer, chunk_size=512, chunk_overlap=50)
)

# Create and run the pipeline
pipeline = Pipeline([chunker])
doc = Doc(text="Your long document text here...")
chunked_docs = list(pipeline([doc]))

# Access the chunks
for chunk in chunked_docs[0].chunks:
    print(f"Chunk: {chunk}")

Combining Preprocessing Tasks

You can combine multiple preprocessing tasks in a pipeline. Here's an example that parses a PDF with the Ingestion task (which uses Docling by default) and then chunks it with Chonkie:

from sieves import Pipeline, tasks, Doc
import chonkie
import tokenizers

# Create the preprocessing tasks
parser = tasks.preprocessing.Ingestion()
tokenizer = tokenizers.Tokenizer.from_pretrained("bert-base-uncased")
chunker = tasks.preprocessing.Chonkie(
    chunker=chonkie.TokenChunker(tokenizer, chunk_size=512, chunk_overlap=50)
)

# Create a pipeline with both tasks
pipeline = Pipeline([parser, chunker])

# Process a document
doc = Doc(uri="path/to/document.pdf")
processed_doc = list(pipeline([doc]))[0]

# Access the chunks
print(f"Number of chunks: {len(processed_doc.chunks)}")
for i, chunk in enumerate(processed_doc.chunks):
    print(f"Chunk {i}: {chunk[:100]}...")  # Print first 100 chars of each chunk

Customizing Preprocessing

Progress

Progress bars are shown at the pipeline level. Tasks do not expose progress options.

Metadata

Tasks can include metadata about their processing. Enable this with include_meta:

parser = tasks.preprocessing.Ingestion(include_meta=True)

Access the metadata in the document's meta field:

doc = processed_docs[0]
print(doc.meta["Ingestion"])  # Access parser metadata
print(doc.meta["Chunker"])  # Access chunker metadata

Best Practices

  1. Document Size: When working with large documents:
     - Always use chunking to break them into manageable pieces
     - Consider the chunk size based on your model's context window
     - Use appropriate chunk overlap to maintain context across chunks

  2. Error Handling: When parsing documents (see the sketch after this list):
     - Handle potential parsing errors gracefully
     - Verify that documents were parsed successfully before chunking
     - Check that the chunked text maintains document integrity

  3. Pipeline Order: When combining tasks:
     - Always parse documents before chunking
     - Consider adding cleaning steps between parsing and chunking
     - Validate the output of each step before proceeding

  4. Text Cleaning:
     - Choose cleaning functions based on your document types
     - Apply cleaning functions in a logical order (e.g., remove bullets before normalizing whitespace)
     - Test cleaning functions on sample documents to ensure they don't remove important content
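As referenced in the error-handling item above, here is a minimal sketch of validating parser output before chunking. It uses only the Doc, Pipeline, and task APIs shown earlier on this page; the empty-text check is an assumption about how a failed parse surfaces, so adapt it to how your sieves version reports failures:

from sieves import Pipeline, tasks, Doc

parser = tasks.preprocessing.Ingestion()
pipeline = Pipeline([parser])

docs = [Doc(uri="path/to/document.pdf")]

try:
    parsed_docs = list(pipeline(docs))
except Exception as err:
    # Handle parsing errors gracefully rather than crashing the whole run.
    print(f"Parsing failed: {err}")
    parsed_docs = []

# Verify documents were parsed before handing them to a chunking task.
# Assumption: a failed parse leaves doc.text empty or None.
valid_docs = [doc for doc in parsed_docs if doc.text]
print(f"{len(valid_docs)} of {len(docs)} documents parsed successfully")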