Member-only story
Query Any Data, Anywhere-Simplify Workflows with DuckDB
Engineering is all about simplicity. We often use lightweight technologies to solve problems instead of overly complex ones, wherever we can replace them.
DuckDB is a very good example. It doesn’t need any complex setup. It’s like SQLite’s big brother.
SQLite is used for OLTP, but DuckDB is designed for OLAP.
DuckDB is especially useful for data processing, particularly with file systems like CSV and Parquet. The best part is that we can directly query cloud files without downloading them. We can run SQL queries on these files for easy data transformations.
Official Documentation: https://duckdb.org/docs/api/python/overview.html
Usage of DuckDB
- Embedded Analytics: Load a CSV and query directly using SQL.
- Local Analytics: Run queries on Parquet files.
- Data Science Workflows: Use SQL for data preprocessing and model preparation.
- Ad-Hoc Queries: Perform quick analytics on large datasets.
import duckdb
# Query a CSV file directly
query = """
SELECT *
FROM 'sample.csv'
WHERE column1 > 100
"""
result = duckdb.query(query).to_df()
print(result)
# Query a Parquet file locally
query = """
SELECT *…