sinaptik-ai / pandas-ai

Chat with your database or your datalake (SQL, CSV, parquet). PandasAI makes data analysis conversational using LLMs and RAG.

data-science data csv sql database ai pandas data-visualization data-analysis datalake text-to-sql gpt-4 llm

Updated Oct 28, 2025
Python

HKUSTDial / awesome-data-agents

Continuously updated paper list on advancements in Data Agents. Companion repo to our paper "A Survey of Data Agents: Emerging Paradigm or Overstated Hype?"

data-science database data-discovery data-analytics data-analysis data-integration data-management data-preparation data-cleaning datalake unstructured-data text-to-sql nl2sql report-generation tableqa data-storytelling nl2vis llm dataagent

Updated Dec 23, 2025
Python

UncoderIO / Uncoder_IO

An IDE and translation engine for detection engineers and threat hunters. Be faster, write smarter, keep 100% privacy.

translation xdr siem sigma datalake edr threathunting roota uncoder uncoderio

Updated Dec 2, 2025
Python

neuralinkcorp / datarepo

data-warehouse datawarehouse datalake delta-lake

Updated Sep 25, 2025
Python

awslabs / aws-orbit-workbench

A Data Platform built for AWS, powered by Kubernetes.

kubernetes aws jupyter analytics gpu jupyterhub data-analysis redshift mach workbench datalake dataengineering eks eks-cluster orbit-workbench

Updated Jul 24, 2023
Python

awslabs / visual-asset-management-system

Visual Asset Management System (VAMS) is a purpose-built, AWS native solution for the management and distribution of traditional to specialized visual assets used in physical AI and spatial computing.

metadata pipelines spatial-data 3d 2d datalake digital-asset-management spatial-computing extended-reality physical-ai

Updated Dec 6, 2025
Python

martandsingh / ApacheSpark

This repository helps you learn Databricks concepts through examples. It covers the important topics a data engineer needs in real-world work, using PySpark and Spark SQL for development. The course ends with a few case studies.

sql database spark hive hadoop etl pyspark data-engineering spark-streaming data-analysis databricks datalake spark-sql timetravel apachespark etl-pipeline deltalake

Updated Sep 26, 2025
Python

vim89 / datapipelines-essentials-python

Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake, along with SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.

python big-data spark apache-spark hadoop etl xml python3 xml-parsing pyspark data-pipeline datalake hadoop-mapreduce spark-sql etl-framework hadoop-hdfs etl-pipeline etl-components

Updated May 6, 2023
Python

hifxit / dataligo

A library to accelerate ML and ETL pipelines by connecting all data sources.

python database nosql datawarehouse datalake etl-pipeline ml-pipeline

Updated May 3, 2023
Python

PaloAltoNetworks / pan-cortex-data-lake-python

Idiomatic Python SDK for Cortex™ Data Lake.

Updated Mar 24, 2025
Python

aws-solutions-library-samples / aws-insurancelake-etl

This solution helps you deploy ETL processes and data storage resources to create an Insurance Lake using Amazon S3 buckets for storage, AWS Glue for data transformation, and AWS CDK Pipelines. It is originally based on the AWS blog post "Deploy data lake ETL jobs using CDK Pipelines" and complements the InsuranceLake Infrastructure project.

aws insurance glue datalake cdk

Updated Nov 18, 2025
Python

abdullahkhawer / aws-auto-terminate-idle-emr

An AWS-based solution that uses AWS CloudWatch and Python-based AWS Lambda to automatically terminate AWS EMR clusters that have been idle for a specified period of time.

Updated Jun 5, 2024
Python

aws-solutions-library-samples / aws-insurancelake-infrastructure

This solution helps you deploy ETL processes and data storage resources to create an Insurance Lake using Amazon S3 buckets for storage, AWS Glue for data transformation, and AWS CDK Pipelines. It is originally based on the AWS blog post "Deploy data lake ETL jobs using CDK Pipelines" and complements the InsuranceLake ETL with CDK Pipelines project.

aws insurance datalake cdk