Chat with your database or your datalake (SQL, CSV, parquet). PandasAI makes data analysis conversational using LLMs and RAG.
-
Updated
Oct 28, 2025 - Python
Chat with your database or your datalake (SQL, CSV, parquet). PandasAI makes data analysis conversational using LLMs and RAG.
Continuously updated paper list on advancements in Data Agents. Companion repo to our paper "A Survey of Data Agents: Emerging Paradigm or Overstated Hype?"
An IDE and translation engine for detection engineers and threat hunters. Be faster, write smarter, keep 100% privacy.
A Data Platform built for AWS, powered by Kubernetes.
Visual Asset Management System (VAMS) is a purpose-built, AWS native solution for the management and distribution of traditional to specialized visual assets used in physical AI and spatial computing.
This repository will help you to learn about databricks concept with the help of examples. It will include all the important topics which we need in our real life experience as a data engineer. We will be using pyspark & sparksql for the development. At the end of the course we also cover few case studies.
Simplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
A library to accelerate ML and ETL pipeline by connecting all data sources
Python idiomatic SDK for Cortex™ Data Lake.
This solution helps you deploy ETL processes and data storage resources to create an Insurance Lake using Amazon S3 buckets for storage, AWS Glue for data transformation, and AWS CDK Pipelines. It is originally based on the AWS blog Deploy data lake ETL jobs using CDK Pipelines, and complements the InsuranceLake Infrastructure project
An AWS based solution using AWS CloudWatch and AWS Lambda based on Python to automatically terminate AWS EMR clusters that have been idle for a specified period of time.
This solution helps you deploy ETL processes and data storage resources to create an Insurance Lake using Amazon S3 buckets for storage, AWS Glue for data transformation, and AWS CDK Pipelines. It is originally based on the AWS blog Deploy data lake ETL jobs using CDK Pipelines, and complements the InsuranceLake ETL with CDK Pipelines project.
OEDI Data Lake Access
Building a poor man's data lake: Exploring the Power of Polars and Delta Lake
Pipelines de dados da Secretaria Municipal de Saúde
Sample data store project to be hosted on a remote server or cluster. CICD using GitHub actions for SSH Deploy to remote server for docker compose.
Add a description, image, and links to the datalake topic page so that developers can more easily learn about it.
To associate your repository with the datalake topic, visit your repo's landing page and select "manage topics."