God, Love, News, Event, Entertainment, Amebo... All about bringing out the best in you.
Show HN: SemHash – Fast Semantic Text Deduplication for Cleaner Datasets https://ift.tt/5g8p3ju
We’ve just open-sourced SemHash, a lightweight package for semantic text deduplication. It lets you effortlessly clean up your datasets and avoid pitfalls caused by duplicate samples in semantic search, RAG, and machine learning.

Main Features:
- Fast and hardware friendly: Deduplicate datasets with millions of records in minutes, on a CPU.
- Flexible: Works on single or multiple datasets (e.g., train/test deduplication), and multi-column data (e.g., Question-Answering datasets).
- Lightweight: Minimal dependencies (largest is NumPy).
- Explainable: Easily inspect duplicates and what caused them, and view the lowest similarity duplicates to adjust the threshold based on your dataset.

We found that text deduplication is more complex than it appears, so we built SemHash to simplify the process. Duplicate samples can skew model training, reduce generalization, and cause train-test leakage, leading to unreliable results. Techniques like MinHash handle exact or near-exact duplicates, but semantic deduplication also catches semantically redundant samples, which we believe is an important aspect of deduplication. Furthermore, it’s not trivial to see why something was removed with MinHash, and we believe that kind of explainability is important too.

We have already found some interesting results on well-known datasets in our benchmarks, which are included in the repo. We are curious to hear your feedback! Do you currently deduplicate your datasets before training, and what techniques do you use?

https://ift.tt/oe4PViO

January 12, 2025 at 06:20AM
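To make the idea of threshold-based deduplication over embeddings concrete, here is a minimal sketch. It is not SemHash's API or implementation: the function names, the greedy strategy, and the hand-made 3-dimensional vectors are all assumptions for illustration, and a real pipeline would get its vectors from an embedding model. It keeps a record unless it is too similar to one already kept, and it records which kept record caused each removal so the decision can be inspected.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def deduplicate(records, vectors, threshold=0.9):
    """Greedy deduplication (hypothetical helper, not a library function).

    Keeps a record unless its best match among the already kept records
    reaches the threshold. Returns (kept, removed), where each removed
    entry is (record, matching kept record, similarity).
    """
    kept, kept_vectors, removed = [], [], []
    for record, vector in zip(records, vectors):
        best_score, best_match = 0.0, None
        for other, other_vector in zip(kept, kept_vectors):
            score = cosine(vector, other_vector)
            if score > best_score:
                best_score, best_match = score, other
        if best_match is not None and best_score >= threshold:
            removed.append((record, best_match, best_score))
        else:
            kept.append(record)
            kept_vectors.append(vector)
    return kept, removed


# Made-up records and vectors, for illustration only.
records = [
    "How do I reset my password?",
    "How can I change my password?",
    "What is the refund policy?",
]
vectors = [[0.9, 0.1, 0.0], [0.85, 0.2, 0.0], [0.0, 0.1, 0.95]]

kept, removed = deduplicate(records, vectors, threshold=0.9)

# Listing removals from lowest to highest similarity puts the borderline
# cases first, which is where a threshold is most likely to need tuning.
for record, match, score in sorted(removed, key=lambda item: item[2]):
    print(f"{score:.3f}  {record!r} duplicates {match!r}")
```

The nested loop is quadratic in the number of records, so it only suits small examples; at the scale of millions of records mentioned above, some form of nearest-neighbor index would be needed instead.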