Closed
Labels
bug: Something isn't working
Description
Describe the bug
Calling count() on a DataFrame whose source is a registered pyarrow.dataset raises an error instead of returning the row count.
To Reproduce
Point the snippet below at any Parquet file.
from datafusion import SessionContext
import pyarrow.dataset as ds
ctx = SessionContext()
file_path = "/some-path/datafusion-python/examples/tpch/data/lineitem.parquet"
pyarrow_dataset = ds.dataset([file_path])
ctx.register_dataset("pyarrow_dataset", pyarrow_dataset)
df = ctx.table("pyarrow_dataset").select("l_orderkey", "l_partkey", "l_linenumber")
df.limit(3).show()
df.count()
This generates the following output. The show() call is included to demonstrate that the file is read correctly; only the count() call fails.
DataFrame()
+------------+-----------+--------------+
| l_orderkey | l_partkey | l_linenumber |
+------------+-----------+--------------+
| 1 | 155190 | 1 |
| 1 | 67310 | 2 |
| 1 | 63700 | 3 |
+------------+-----------+--------------+
Traceback (most recent call last):
File "/Users/tsaucer/src/personal/arrow_rs_dataset_read/count_dataset_read.py", line 16, in <module>
df.count()
File "/Users/tsaucer/src/personal/datafusion-python/python/datafusion/dataframe.py", line 507, in count
return self.df.count()
^^^^^^^^^^^^^^^
Exception: External error: Arrow error: External error: ArrowException: Invalid argument error: must either specify a row count or at least one column
Expected behavior
count() should return the number of rows in this dataset.
A workaround is to aggregate and count a column explicitly:
from datafusion import col, functions as f
df.aggregate([], [f.count(col("l_orderkey"))]).show()
Additional context
In my investigation, I found that we register arrow datasets by creating a TableProvider in src/dataset.rs and then the execution calls happen in src/dataset_exec.rs.