# Getting Started

This guide will help you get started with using `sieves` for zero-shot and few-shot NLP tasks with structured generation.
## Basic Concepts

`sieves` is built around four main concepts:

- Documents (`Doc`): The basic unit of text you want to process. A document can be created from text or a URI.
- Engines: Components responsible for generating structured output using various LLM backends (Outlines, DSPy, LangChain, etc.).
- Tasks: NLP operations you want to perform on your documents (classification, information extraction, etc.).
- Pipeline: A sequence of tasks that processes your documents.
## Quick Start Example

Here's a simple example that performs text classification:
```python
import outlines

from sieves import Pipeline, engines, tasks, Doc

# Create a document
doc = Doc(text="Special relativity applies to all physical phenomena in the absence of gravity.")

# Initialize the engine (using a small but capable model)
engine = engines.outlines_.Outlines(
    model=outlines.models.transformers("HuggingFaceTB/SmolLM-135M-Instruct")
)

# Create and run the pipeline
pipeline = Pipeline([tasks.predictive.Classification(labels=["science", "politics"], engine=engine)])

# Print the classification result
for doc in pipeline([doc]):
    print(doc.results)
```
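The `results` attribute holds one entry per task. To read a single task's output, index it by task id; here's a minimal sketch, assuming the default task id is the task's class name (mirroring `results["InformationExtraction"]` in the extraction example later in this guide):

```python
for doc in pipeline([doc]):
    # "Classification" is assumed to be the default task id (the task's
    # class name), mirroring the extraction example below.
    print(doc.results["Classification"])
```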
## Working with Documents

Documents can be created in several ways:
```python
# From text
doc = Doc(text="Your text here")

# From a file (requires docling)
doc = Doc(uri="path/to/your/file.pdf")

# With metadata
doc = Doc(
    text="Your text here",
    meta={"source": "example", "date": "2025-01-31"}
)
```
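Once created, a document's fields can be inspected directly. A minimal sketch, assuming `Doc` exposes its constructor arguments (`text`, `uri`, `meta`) as attributes:

```python
doc = Doc(text="Your text here", meta={"source": "example"})

# Attribute access is assumed to mirror the constructor arguments above.
print(doc.text)  # "Your text here"
print(doc.meta)  # {"source": "example"}
```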
## Advanced Example: PDF Processing Pipeline

Here's a more involved example that:

- Chunks a document into smaller pieces
- Performs information extraction on each chunk

(The document is created from inline text to keep the example self-contained; an actual PDF would be loaded via `Doc(uri=...)` as shown above.)
```python
import outlines
import chonkie
import tokenizers
import pydantic

from sieves import Pipeline, engines, tasks, Doc

# Create a tokenizer for chunking
tokenizer = tokenizers.Tokenizer.from_pretrained("bert-base-uncased")

# Initialize the chunking task
chunker = tasks.preprocessing.Chonkie(
    chunker=chonkie.TokenChunker(tokenizer, chunk_size=512, chunk_overlap=50)
)

# Initialize an engine for information extraction
engine = engines.outlines_.Outlines(
    model=outlines.models.transformers("HuggingFaceTB/SmolLM-135M-Instruct")
)

# Define the structure of information you want to extract
class PersonInfo(pydantic.BaseModel):
    name: str
    age: int | None = None
    occupation: str | None = None

# Create an information extraction task
extractor = tasks.predictive.InformationExtraction(
    entity_type=PersonInfo,
    engine=engine
)

# Create the pipeline
pipeline = Pipeline([
    chunker,
    extractor
])

# Process a document (inline text here; for an actual PDF, create the Doc
# from a URI as shown in "Working with Documents")
doc = Doc(text="Marie Curie died at the age of 66 years.")
results = list(pipeline([doc]))

# Access the extracted information
for result in results:
    print(result.results["InformationExtraction"])
```
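Chunking keeps each prompt within the model's context window, and the 50-token overlap reduces the chance that an entity spanning a chunk boundary is missed. Since a pipeline accepts a list of documents, the same instance can also process several at once; a minimal sketch reusing the objects defined above (the second text is illustrative):

```python
docs = [
    Doc(text="Marie Curie died at the age of 66 years."),
    Doc(text="Alan Turing was a mathematician and computer scientist."),
]

# The pipeline yields one processed document per input document.
for d in pipeline(docs):
    print(d.results["InformationExtraction"])
```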
## Supported Engines

`sieves` supports multiple engines for structured generation, including Outlines, DSPy, and LangChain. Each engine supports a different set of models and comes with its own trade-offs. Choose the engine that best fits your use case and model requirements.
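Swapping models within an engine is straightforward. A minimal sketch for the Outlines engine, using a different Hugging Face model id (the id below is purely illustrative):

```python
import outlines

from sieves import engines

# Same engine class as in the examples above; only the underlying
# model changes (the model id here is illustrative).
engine = engines.outlines_.Outlines(
    model=outlines.models.transformers("Qwen/Qwen2.5-0.5B-Instruct")
)
```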