Chunking
Tasks for splitting documents into chunks.
Chonkie
Bases: Task
Chunker wrapping the chonkie library.
Source code in sieves/tasks/preprocessing/chunkers.py
id
property
Returns task ID. Used by pipeline for results and dependency management.
Returns:
Type | Description |
---|---|
str | Task ID. |
__call__(docs)
Split documents into chunks.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
docs | Iterable[Doc] | Documents to split. | required |
Returns:
Type | Description |
---|---|
Iterable[Doc] | Split documents. |
Source code in sieves/tasks/preprocessing/chunkers.py
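The wrapping pattern described above can be sketched as follows. This is a minimal illustration only, not the library's implementation: `Doc` here is a stand-in dataclass with an assumed `chunks` field, `ChunkerTask` and `fixed_width` are hypothetical names, and the wrapped chunker is assumed to be any callable that maps a string to a list of strings.

```python
from collections.abc import Callable, Iterable
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Doc:
    """Hypothetical stand-in for the library's Doc type."""

    text: str
    chunks: list[str] = field(default_factory=list)  # assumed field name


class ChunkerTask:
    """Hypothetical task that delegates splitting to a wrapped chunker."""

    def __init__(
        self,
        chunker: Callable[[str], list[str]],
        task_id: Optional[str] = None,
    ) -> None:
        self._chunker = chunker
        # Fall back to the class name when no explicit ID is given.
        self._task_id = task_id or type(self).__name__

    @property
    def id(self) -> str:
        """Task ID, usable as a key for results and dependencies."""
        return self._task_id

    def __call__(self, docs: Iterable[Doc]) -> Iterable[Doc]:
        """Attach chunks to every document and return the documents."""
        docs = list(docs)
        for doc in docs:
            doc.chunks = self._chunker(doc.text)
        return docs


def fixed_width(text: str, width: int = 10) -> list[str]:
    """Toy chunker: cut the text into pieces of at most `width` characters."""
    return [text[i : i + width] for i in range(0, len(text), width)]


task = ChunkerTask(fixed_width, task_id="chunker")
result = list(task([Doc("abcdefghijklmnopqrstuvwxy")]))
print(result[0].chunks)
```

Keeping the splitting logic behind a plain callable means the task itself stays unaware of which chunking strategy is in use.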
__init__(chunker, task_id=None, show_progress=True, include_meta=False)
Initialize chunker.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
task_id | str \| None | Task ID. | None |
show_progress | bool | Whether to show progress bar for processed documents. | True |
include_meta | bool | Whether to include meta information generated by the task. | False |
Source code in sieves/tasks/preprocessing/chunkers.py
deserialize(config, **kwargs)
classmethod
Generate Task instance from config.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
config | Config | Config to generate instance from. | required |
kwargs | dict[str, Any] | Values to inject into loaded config. | {} |
Returns:
Type | Description |
---|---|
Task | Deserialized Task instance. |
Source code in sieves/tasks/core.py
serialize()
Serializes task.
Returns:
Type | Description |
---|---|
Config | Config instance. |
Source code in sieves/tasks/core.py
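A serialize/deserialize round trip with injected values might look like the sketch below. Everything in it is an assumption made for illustration: a plain `dict` stands in for the `Config` type, `SketchTask` is a hypothetical class, and the chunker object is assumed to be something that cannot be stored as plain data and therefore has to be supplied again through `kwargs`.

```python
from typing import Any


class SketchTask:
    """Hypothetical task; not the library's Task class."""

    def __init__(
        self,
        chunker: Any,
        task_id: str = "chunker",
        show_progress: bool = True,
        include_meta: bool = False,
    ) -> None:
        self.chunker = chunker
        self.task_id = task_id
        self.show_progress = show_progress
        self.include_meta = include_meta

    def serialize(self) -> dict[str, Any]:
        """Return plain-data settings; the chunker object is left out."""
        return {
            "task_id": self.task_id,
            "show_progress": self.show_progress,
            "include_meta": self.include_meta,
        }

    @classmethod
    def deserialize(cls, config: dict[str, Any], **kwargs: Any) -> "SketchTask":
        """Rebuild a task; kwargs are injected into (and override) the config."""
        return cls(**{**config, **kwargs})


def toy_chunker(text: str) -> list[str]:
    """Toy chunker standing in for a real chunker object."""
    return text.split()


original = SketchTask(toy_chunker, task_id="split", show_progress=False)
config = original.serialize()
restored = SketchTask.deserialize(config, chunker=toy_chunker)
print(config)
```

Merging `kwargs` last lets the caller both supply objects missing from the stored config and override stored values.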
NaiveChunker
Bases: Task
Chunks documents by sentence count. Intended for testing only.
Source code in sieves/tasks/preprocessing/chunkers.py
id
property
Returns task ID. Used by pipeline for results and dependency management.
Returns:
Type | Description |
---|---|
str | Task ID. |
__call__(docs)
Split documents into chunks.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
docs | Iterable[Doc] | Documents to split. | required |
Returns:
Type | Description |
---|---|
Iterable[Doc] | Split documents. |
Source code in sieves/tasks/preprocessing/chunkers.py
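Chunking by sentence count can be sketched roughly as below. This is an illustrative approximation under stated assumptions, not the code in `chunkers.py`: the function name `chunk_by_sentences` is hypothetical, sentences are assumed to end in `.`, `!`, or `?` followed by whitespace, and `interval` is taken to mean the number of sentences per chunk.

```python
import re


def chunk_by_sentences(text: str, interval: int) -> list[str]:
    """Group every `interval` consecutive sentences into one chunk."""
    if interval < 1:
        raise ValueError("interval must be >= 1")
    # Split on whitespace that follows sentence-ending punctuation,
    # keeping the punctuation attached to its sentence.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i : i + interval])
        for i in range(0, len(sentences), interval)
    ]


chunks = chunk_by_sentences("One. Two! Three? Four.", interval=2)
print(chunks)  # ['One. Two!', 'Three? Four.']
```

A regular-expression split like this mishandles abbreviations and decimal numbers, which fits a chunker meant only for tests.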
__init__(interval, task_id=None, show_progress=True, include_meta=False)
Initialize chunker.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
interval | int | Token count interval for chunks. | required |
task_id | str \| None | Task ID. | None |
show_progress | bool | Whether to show progress bar for processed documents. | True |
include_meta | bool | Whether to include meta information generated by the task. | False |
|
Source code in sieves/tasks/preprocessing/chunkers.py
deserialize(config, **kwargs)
classmethod
Generate Task instance from config.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
config | Config | Config to generate instance from. | required |
kwargs | dict[str, Any] | Values to inject into loaded config. | {} |
Returns:
Type | Description |
---|---|
Task | Deserialized Task instance. |
Source code in sieves/tasks/core.py
serialize()
Serializes task.
Returns:
Type | Description |
---|---|
Config | Config instance. |
Source code in sieves/tasks/core.py