data-deduplication

Here are 10 public repositories matching this topic...

sail-sg / sailcraft

🚢 Data Toolkit for Sailor Language Models

data-deduplication data-cleaning

Updated Feb 24, 2025
Python

gagan3012 / PolyDeDupe

PolyDeDupe: Multi-Lingual Data Deduplication

multilingual nlp data-deduplication

Updated Dec 22, 2025
Python

JochiRaider / sievio

Sievio turns GitHub, local repos, and web PDFs into clean JSONL for LLM pretraining, fine-tuning, and RAG. It offers structure-aware chunking, reliable Unicode decoding, pluggable QC and safety checks, plus optional dataset cards and deduplication.

python data-deduplication dataset-creation data-pipelines repository-mining jsonl github-repos rag text-preprocessing quality-filtering code-mining llm llm-training llm-datasets

Updated Dec 27, 2025
Python

dffdgdg / FindDuplicates

This project is a powerful tool for finding and analyzing duplicate files in a specified directory. The program efficiently identifies identical files based on their content, using the SHA-256 hashing algorithm. It supports configurable parameters, such as a minimum file size to check and ignoring certain…

python hashing productivity multithreading data-deduplication file-system sha256 file-management system-utility cli-tool dev-tools file-deduplication file-comparison disk-cleanup command-line-utility duplicate-file-finder

Updated Feb 14, 2025
Python

Anveshika06 / VIT-VTAS-TY-2022

data-deduplication hashing-algorithm

Updated Jan 7, 2023
Python

anirudh-69 / Financial-Data-ETL-Workflow

ETL workflow for stock data processing using Mage and PostgreSQL

python etl docker-compose postgresql data-deduplication data-engineering stock-market data-processing financial-data data-modeling data-cleaning data-aggregation api-integration alpha-vantage mage-ai

Updated Jan 17, 2025
Python

RayanGAtech / HR-Roster-Change-Data-Capture-Pipeline

The HR Roster Change Detection Pipeline is an automated solution for processing HR roster data. Leveraging Apache Airflow and PostgreSQL, it enables seamless data ingestion, deduplication, and change detection, streamlining HR operations.

python open-source automation etl postgresql data-deduplication data-engineering data-pipelines apache-airflow roster-management workforce-analytics scalable-solutions hr-technology delta-detection hr-data-processing

Updated Dec 4, 2024
Python

SHAILY24 / business-record-matcher

Automated business record matching using fuzzy algorithms (RapidFuzz) and browser automation (Playwright)

python record-linkage pandas fuzzy-matching data-deduplication data-cleaning web-automation playwright rapidfuzz

Updated Nov 18, 2025
Python

fabriziosalmi / text-boundaries

A Python-based tool for preprocessing, cleaning, and analyzing text datasets, designed to filter, deduplicate, sort data, and generate statistical insights.

machine-learning natural-language-processing data-validation data-deduplication data-preprocessing data-sorting data-automation dataset-cleaning text-data-analysis dataset-boundaries data-statistics-generation

Updated Oct 31, 2025
Python

ammons-datalabs / Evaluate-SCAIL

SCAIL/P-SCAIL: Petabyte-scale encrypted deduplication with segment chunks and sorted indexing

python metadata performance backup encryption storage cython data-deduplication deduplication scail petabyte-scale

Updated Dec 22, 2025
Python
