Labels: enhancement, help wanted, p2 (backlog)
Description
Is your feature request related to a problem?
Daft currently lacks native support for consuming data directly from Apache Kafka, which significantly limits its use in real-time data processing. Users working with streaming data pipelines are forced into workarounds such as:
- Writing Kafka data to intermediate storage (e.g., Parquet files) before loading it into Daft
- Creating custom Python consumers with `kafka-python`/`confluent-kafka`, followed by manual DataFrame conversion (sketched below)
- Relying on external stream-processing engines before feeding data to Daft

These approaches introduce unnecessary latency, complexity, and potential data-consistency issues into streaming workflows.
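For illustration, here is a minimal sketch of the second workaround, assuming a reachable broker at `kafka:9092` and a hypothetical `iot-sensors` topic (requires `confluent-kafka`):

```python
import daft
from confluent_kafka import Consumer

# Hypothetical broker address, group ID, and topic name, for illustration only.
consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "daft-workaround",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["iot-sensors"])

# Poll a fixed-size batch of raw messages by hand.
values = []
while len(values) < 1000:
    msg = consumer.poll(1.0)
    if msg is None:
        break  # no more messages within the timeout
    if msg.error():
        continue
    values.append(msg.value())
consumer.close()

# Manual DataFrame conversion: payloads arrive as opaque bytes, so
# decoding, schemas, and offset handling are all the user's problem.
df = daft.from_pydict({"value": values})
```

Every step here (polling, batching, decoding, offset management) is hand-rolled and easy to get wrong; a native `read_kafka` would absorb exactly this boilerplate.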
Describe the solution you'd like
We propose implementing first-class Kafka support in Daft with:
- Native Kafka DataSource integration supporting both batch and streaming modes
- Structured Streaming capabilities, including:
  - Offset management (automatic checkpointing)
  - Consumer group support
  - Exactly-once processing semantics
- Schema inference from:
  - Kafka message headers
  - Embedded schemas (Avro/Protobuf via Schema Registry; a hand-rolled version is sketched after the example below)
- Integration with the existing DataFrame API:
```python
df = (
    daft.read_kafka(
        bootstrap_servers="kafka:9092",
        topics=["iot-sensors"],
        consumer_group="daft-processor",
        starting_offsets="earliest",
    )
    .where(col("sensor_type") == "temperature")
)
```
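For the Schema Registry bullet above, this is roughly what users must hand-roll today with `confluent-kafka`'s Avro support (the registry URL and topic name are assumptions); a native integration could perform the same registry lookup once and map the writer schema to a Daft schema automatically:

```python
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroDeserializer
from confluent_kafka.serialization import SerializationContext, MessageField

# Hypothetical registry URL, for illustration only.
registry = SchemaRegistryClient({"url": "http://schema-registry:8081"})

# With no reader schema supplied, the deserializer resolves the writer
# schema from the registry via the schema ID embedded in each message.
deserializer = AvroDeserializer(registry)

def decode_value(raw: bytes, topic: str) -> dict:
    """Decode one Avro-encoded Kafka message value into a plain dict."""
    return deserializer(raw, SerializationContext(topic, MessageField.VALUE))
```

Folding this lookup into `read_kafka` would let Daft surface typed columns directly instead of opaque value bytes.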
Describe alternatives you've considered
No response
Additional Context
No response
Would you like to implement a fix?
Yes