🚢 Data Toolkit for Sailor Language Models
Updated Feb 24, 2025 - Python
PolyDeDupe: Multi-Lingual Data Deduplication
Sievio turns GitHub, local repos, and web PDFs into clean JSONL for LLM pretraining, fine-tuning, and RAG. It offers structure-aware chunking, reliable Unicode decoding, pluggable QC and safety checks, plus optional dataset cards and deduplication.
This project is a tool for finding and analyzing duplicate files in a specified directory. It identifies identical files by their content, using the SHA-256 hashing algorithm. It supports configurable parameters, such as a minimum file size to check and ignoring certain…
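As a rough illustration of the content-hash approach this description mentions, here is a minimal sketch that groups files by SHA-256 digest. It is not taken from that project: the function names `file_sha256` and `find_duplicates` and the `min_size` parameter are hypothetical, chosen only to mirror the "minimum file size" option described above.

```python
import hashlib
import os
from collections import defaultdict


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_duplicates(root, min_size=1):
    """Map each digest to the sorted paths under `root` that share it.

    `min_size` (hypothetical parameter) skips files smaller than that many bytes.
    Only digests with more than one path are returned.
    """
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and files below the size threshold.
            if os.path.islink(path) or os.path.getsize(path) < min_size:
                continue
            by_hash[file_sha256(path)].append(path)
    return {h: sorted(paths) for h, paths in by_hash.items() if len(paths) > 1}
```

Calling `find_duplicates("some/directory", min_size=1024)` would return only groups of two or more files of at least 1 KiB with identical content. A real tool would likely group by file size first so that only files with matching sizes are hashed.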
ETL workflow for stock data processing using Mage and PostgreSQL
The HR Roster Change Detection Pipeline is an automated solution for processing HR roster data. Leveraging Apache Airflow and PostgreSQL, it enables seamless data ingestion, deduplication, and change detection, streamlining HR operations.
Automated business record matching using fuzzy algorithms (RapidFuzz) and browser automation (Playwright)
A Python-based tool for preprocessing, cleaning, and analyzing text datasets, designed to filter, deduplicate, sort data, and generate statistical insights.
SCAIL/P-SCAIL: Petabyte-scale encrypted deduplication with segment chunks and sorted indexing