Add Kafka Data Source Support for Streaming Data Processing #4603

@huleilei

Description

Is your feature request related to a problem?

Currently, Daft lacks native support for consuming data directly from Apache Kafka, which significantly limits real-time data processing scenarios. Users working with streaming data pipelines are forced into workarounds such as:

  1. Writing Kafka data to intermediate storage (e.g., Parquet files) before loading to Daft
  2. Creating custom Python consumers with kafka-python/confluent-kafka, followed by manual DataFrame conversions (see the sketch below)
  3. Relying on external stream processing engines before feeding data to Daft

These approaches introduce unnecessary latency, complexity, and potential data-consistency issues into streaming workflows.
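
For concreteness, workaround (2) today looks roughly like the following sketch, assuming JSON-encoded messages on an "iot-sensors" topic (the broker address, topic, group id, and batch size are illustrative):

    import json

    import daft
    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "kafka:9092",
        "group.id": "daft-workaround",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["iot-sensors"])

    # Poll a bounded batch of messages, skipping transport errors.
    records = []
    for _ in range(1000):
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            break
        if msg.error():
            continue
        records.append(json.loads(msg.value()))
    consumer.close()

    # Manually pivot row-oriented records into columns for daft.from_pydict.
    columns = {k: [r.get(k) for r in records] for k in records[0]} if records else {}
    df = daft.from_pydict(columns)

Every step here (polling, error handling, batching, and the row-to-column pivot) is boilerplate that a native Kafka reader would absorb.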

Describe the solution you'd like

We propose implementing first-class Kafka support in Daft with:

  • Native Kafka DataSource integration supporting both batch and streaming modes
  • Structured Streaming capabilities including:
    • Offset management (automatic checkpointing)
    • Consumer group support
    • Exactly-once processing semantics
  • Schema inference from:
    • Kafka message headers
    • Embedded schemas (Avro/Protobuf via Schema Registry; see the decoding sketch after the API example)
  • Integration with the existing DataFrame API, for example:

        import daft
        from daft import col

        df = daft.read_kafka(
            bootstrap_servers="kafka:9092",
            topics=["iot-sensors"],
            consumer_group="daft-processor",
            starting_offsets="earliest",
        ).where(col("sensor_type") == "temperature")
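
For the Schema Registry case, decoding could build on the deserializers that confluent-kafka already ships. A minimal sketch of resolving a message's Avro writer schema (the registry URL, topic, and group id below are assumptions for illustration, not part of the proposed API):

    from confluent_kafka import Consumer
    from confluent_kafka.schema_registry import SchemaRegistryClient
    from confluent_kafka.schema_registry.avro import AvroDeserializer
    from confluent_kafka.serialization import MessageField, SerializationContext

    registry = SchemaRegistryClient({"url": "http://schema-registry:8081"})
    # Without an explicit schema string, the deserializer looks up the
    # writer schema registered for each message.
    deserialize = AvroDeserializer(registry)

    consumer = Consumer({
        "bootstrap.servers": "kafka:9092",
        "group.id": "daft-avro-probe",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["iot-sensors"])

    msg = consumer.poll(timeout=5.0)
    if msg is not None and msg.error() is None:
        record = deserialize(msg.value(), SerializationContext(msg.topic(), MessageField.VALUE))
        print(record)  # a dict shaped by the registered Avro schema
    consumer.close()

The same registry lookup would let read_kafka infer a Daft schema up front instead of requiring users to declare one.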

Describe alternatives you've considered

No response

Additional Context

No response

Would you like to implement a fix?

Yes

Metadata

Assignees: No one assigned

Labels: enhancement (New feature or request), help wanted (Extra attention is needed), p2 (backlog) (Nice to have features)
