Arabic Machine Learning

non-profit

https://github.com/ARBML

arabicml2

arbml


AI & ML interests

Arabic NLP, computer vision, etc.

Recent Activity

abdeljalilELmajjodi

authored a paper about 2 months ago

AtlasOCR: Building the First Open-Source Darija OCR Model with Vision Language Models

Paper • 2604.08070 • Published Apr 9 • 3

KhloudJ

authored 3 papers 2 months ago

ASCAT: An Arabic Scientific Corpus and Benchmark for Advanced Translation Evaluation

Paper • 2604.00015 • Published Mar 10

Abjad-Kids: An Arabic Speech Classification Dataset for Primary Education

Paper • 2603.20255 • Published Mar 11

SyriSign: A Parallel Corpus for Arabic Text to Syrian Arabic Sign Language Translation

Paper • 2603.29219 • Published Mar 31

KhloudJ

authored a paper 3 months ago

AraModernBERT: Transtokenized Initialization and Long-Context Encoder Modeling for Arabic

Paper • 2603.09982 • Published Feb 10

BounharAbdelaziz

authored a paper 5 months ago

YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation

Paper • 2601.08441 • Published Jan 13 • 8

BounharAbdelaziz

submitted a paper to Daily Papers 5 months ago

YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation

Paper • 2601.08441 • Published Jan 13 • 8

obadx

authored a paper 5 months ago

Automatic Pronunciation Error Detection and Correction of the Holy Quran's Learners Using Deep Learning

Paper • 2509.00094 • Published Aug 27, 2025 • 2

AymanMansour

authored 2 papers 5 months ago

End-to-End Automatic Speech Recognition model for the Sudanese Dialect

Paper • 2212.10826 • Published Dec 21, 2022

Doing More with Less: Data Augmentation for Sudanese Dialect Automatic Speech Recognition

Paper • 2601.06802 • Published Jan 11

Ruqiya

authored a paper 6 months ago

MMTEB: Massive Multilingual Text Embedding Benchmark

Paper • 2502.13595 • Published Feb 19, 2025 • 49

adamm-hf

posted an update 7 months ago

Post

1419

The #1 trending AI/ML dataset today 🏆

Massive scale, diversity and end-to-end potential from NVIDIA!
nvidia/PhysicalAI-Autonomous-Vehicles

adamm-hf

posted an update 7 months ago

Post

838

The new King 👑 has arrived!

Moonshot AI now has the top model on Hugging Face 🔥
moonshotai/Kimi-K2-Thinking

adamm-hf

posted an update 7 months ago

Post

2889

💸🤑 You don’t need 100 GPUs to train something amazing!

Our Smol Training Playbook teaches you a better path to world-class LLMs, for free!

Check out the #1 trending space on 🤗:
HuggingFaceTB/smol-training-playbook

BounharAbdelaziz

authored a paper 7 months ago

Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR

Paper • 2511.01937 • Published Nov 2, 2025 • 16

nouamanetazi

posted an update 7 months ago

Post

4897

After training SmolLM3 on 384 H100s for nearly a month, I've come to realize something most people overlook: infrastructure is the make-or-break factor in LLM training. 🔥

Everyone talks about model architecture and data quality. And yes, those matter immensely. But here's what nobody tells you: when your training run fails at 2 AM because of mysterious NCCL errors, or when your expensive GPU cluster is running at 60% efficiency, the problem isn't your model. It's most probably a misuse of the hardware. 🛠️

Questions that seemed simple but had no clear answers: Why is MoE training slower than dense models? Which NCCL flags should we actually set? How often should we checkpoint without killing throughput?

That's why we built The Smol Training Playbook 📖: a complete guide covering everything from model architecture and data curation to the SmolLM3 training marathon, post-training techniques, and crucially, the infrastructure layer that most teams get wrong.

We validated real vs theoretical bandwidth across the entire stack: HBM3 hitting 3 TB/s, NVLink 4.0 reaching 786 GB/s, PCIe Gen4 at 14.2 GB/s. Then we ran collective operations across 128 GPUs (16 nodes, 8xH100s each) and measured how performance degrades at scale: all-reduce drops from 480 GB/s on a single node to 320-350 GB/s across 16 nodes.

If you've ever wondered why your training runs are slower than they should be, or you're planning to scale up and want to avoid expensive mistakes, this guide might save you weeks of debugging.

The Smol Training Playbook: https://lnkd.in/e5MKXUHS

Shared with ❤️ by the Hugging Face team
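
The all-reduce figures quoted in the post above can be turned into a rough scaling ratio. The short sketch below uses only the numbers stated in the post (480 GB/s on a single node, 320-350 GB/s across 16 nodes of 8 GPUs each); the constant and function names are illustrative assumptions and do not come from the playbook itself.

```python
# Figures quoted in the post; all names here are hypothetical, for illustration only.
SINGLE_NODE_GBPS = 480.0                  # all-reduce bandwidth on one node
MULTI_NODE_GBPS_RANGE = (320.0, 350.0)    # all-reduce bandwidth across 16 nodes
NODES, GPUS_PER_NODE = 16, 8


def retained_fraction(multi_node_gbps: float, single_node_gbps: float) -> float:
    """Fraction of single-node all-reduce bandwidth that survives at scale."""
    return multi_node_gbps / single_node_gbps


total_gpus = NODES * GPUS_PER_NODE
low, high = (retained_fraction(b, SINGLE_NODE_GBPS) for b in MULTI_NODE_GBPS_RANGE)

print(f"GPUs in the benchmark: {total_gpus}")
print(f"bandwidth retained: {low:.0%} to {high:.0%}")
print(f"bandwidth lost: {1 - high:.0%} to {1 - low:.0%}")
```

Under these numbers, the multi-node all-reduce keeps roughly two thirds to three quarters of the single-node bandwidth, which is the degradation the post describes.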