This is the official UI for UrduGPT — a custom-built English → Urdu translator powered by a Transformer-based LLM trained from scratch using PyTorch.

UrduGPT is a research- and production-friendly language model built step by step using:
- Raw dataset from Hugging Face (English–Urdu parallel corpus)
- Byte-Pair Encoding (BPE) tokenizers trained from scratch
- Transformer architecture inspired by "Attention Is All You Need"
- PyTorch for model building & training
- Streamlit for the live translation web app

The web app offers:
- Sentence translation with beam or greedy decoding
- Token-by-token display with confidence scores
- Session translation history & export (CSV, PDF, Word)
- Local branding (logo, favicon)
- Deployment on Streamlit Cloud or Hugging Face Spaces
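
As a rough illustration of greedy decoding with per-token confidence scores, the sketch below picks the most probable token at each step and records its softmax probability. The `next_logits` callable, the token ids, and the toy score table are all hypothetical stand-ins for the real decoder; this is not the code in `urdugpt_translate.py`.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_decode(next_logits, bos_id, eos_id, max_len=20):
    """Pick the most probable token at each step until EOS or max_len.

    next_logits(prefix) is a hypothetical stand-in for the decoder: it takes
    the token ids generated so far and returns one score per vocabulary entry.
    Returns a list of (token_id, confidence) pairs.
    """
    prefix = [bos_id]
    output = []
    for _ in range(max_len):
        probs = softmax(next_logits(prefix))
        token = max(range(len(probs)), key=probs.__getitem__)
        if token == eos_id:
            break
        output.append((token, probs[token]))
        prefix.append(token)
    return output

# Toy "model" with a 4-entry vocabulary: 0 = BOS, 1 = EOS (assumed ids).
def toy_logits(prefix):
    table = {1: [0.0, 0.0, 3.0, 1.0], 2: [0.0, 0.0, 1.0, 3.0]}
    return table.get(len(prefix), [0.0, 5.0, 0.0, 0.0])

result = greedy_decode(toy_logits, bos_id=0, eos_id=1)
```

Beam search differs in that it keeps several candidate prefixes per step instead of one, at a higher compute cost.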
python urdugpt_step1_dataset.py

Loads and trims the English–Urdu parallel corpus from Hugging Face.
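
One plausible reading of "trims" is dropping empty or overly long sentence pairs and capping the corpus size. The sketch below shows that idea on in-memory data; the field names `en` and `ur`, the `max_words` threshold, and the `limit` parameter are assumptions, not values taken from the script.

```python
def trim_pairs(pairs, max_words=50, limit=None):
    """Keep pairs where both sides are non-empty and at most max_words long."""
    kept = []
    for pair in pairs:
        en, ur = pair["en"].strip(), pair["ur"].strip()
        if not en or not ur:
            continue  # skip pairs with a missing side
        if len(en.split()) > max_words or len(ur.split()) > max_words:
            continue  # skip pairs that are too long for the model
        kept.append({"en": en, "ur": ur})
        if limit is not None and len(kept) >= limit:
            break  # cap the corpus size
    return kept

sample = [
    {"en": "Hello", "ur": "ہیلو"},
    {"en": "", "ur": "خالی"},
    {"en": "How are you?", "ur": "آپ کیسے ہیں؟"},
]
trimmed = trim_pairs(sample, max_words=5)
```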
python urdugpt_step2_tokenizer.py

Trains BPE tokenizers for both the source (English) and target (Urdu) languages.
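
To show what BPE training does, here is a deliberately simplified merge learner in plain Python: it repeatedly merges the most frequent adjacent symbol pair. It omits word-boundary markers, special tokens, and byte-level handling, so treat it as a teaching sketch rather than the tokenizer the script actually trains.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges from a list of words.

    Each word starts as a tuple of characters; on every round the most
    frequent adjacent pair of symbols is merged into a single symbol.
    """
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the chosen pair merged.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
```

Ties between equally frequent pairs are broken here by first occurrence; real tokenizer libraries may break them differently.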
python urdugpt_step2_dataloader.py

Creates a PyTorch-compatible dataset & dataloaders.
Model code is inside urdugpt_step8_transformer.py, built from scratch:
- Embedding + Positional Encoding
- Multi-head Attention
- FeedForward + AddNorm
- Encoder and Decoder stacks
- Final Projection layer
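
As one concrete piece of the list above, the sketch below computes the sinusoidal positional encoding from "Attention Is All You Need" in plain Python. It assumes the model uses that fixed sinusoidal scheme; the project may instead use learned positions, and the real code would build this as a PyTorch tensor.

```python
import math

def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings.

    Even columns hold sin(pos / 10000^(i / d_model)) and the following odd
    column holds the matching cosine, where i is the even column index.
    """
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(seq_len=4, d_model=6)
```

The table is added to the token embeddings so that attention layers, which are otherwise order-agnostic, can tell positions apart.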
python urdugpt_step9_train.py

- Uses cross-entropy loss
- Trains for N epochs and saves checkpoints (./urdugpt/model_{epoch}.pt)
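
For intuition, cross-entropy for a single target token is the negative log-probability the model assigns to it. The plain-Python sketch below computes that value and formats a checkpoint path following the `model_{epoch}.pt` pattern; the helper names are hypothetical, and whether the script zero-pads the epoch number is not stated here.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    m = max(logits)  # log-sum-exp trick for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

def checkpoint_path(epoch, folder="./urdugpt"):
    """Build a checkpoint file name in the model_{epoch}.pt pattern."""
    return f"{folder}/model_{epoch}.pt"

# With uniform scores the loss equals log(vocabulary size).
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)
# A strongly favored correct token gives a loss near zero.
confident_loss = cross_entropy([0.0, 0.0, 10.0, 0.0], target=2)
path = checkpoint_path(3)
```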
python urdugpt_translate.py

Interactive terminal-based translation using the latest model checkpoint.
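
Picking the "latest" checkpoint could be done by parsing the epoch number out of each file name, as in the sketch below. The `latest_checkpoint` helper is hypothetical; the script may select the checkpoint another way, for example from `config.yaml`.

```python
import re

def latest_checkpoint(filenames):
    """Return the model_{epoch}.pt name with the highest epoch, or None."""
    pattern = re.compile(r"model_(\d+)\.pt$")
    best_name, best_epoch = None, -1
    for name in filenames:
        match = pattern.search(name)
        if match and int(match.group(1)) > best_epoch:
            best_name, best_epoch = name, int(match.group(1))
    return best_name

latest = latest_checkpoint(["model_2.pt", "model_10.pt", "model_9.pt", "notes.txt"])
```

Comparing epochs as integers matters: a plain string sort would rank `model_9.pt` above `model_10.pt`.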
streamlit run urdugpt_web_app.py

Clean web-based frontend with history, export, and visual confidence scores.
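
The CSV export of the translation history might look roughly like this. The column names `english`, `urdu`, and `confidence` are assumptions about what the app stores per translation, and the function name is hypothetical.

```python
import csv
import io

def history_to_csv(history):
    """Serialize a list of translation records to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["english", "urdu", "confidence"])
    writer.writeheader()
    for row in history:
        writer.writerow(row)
    return buffer.getvalue()

history = [{"english": "Hello", "urdu": "ہیلو", "confidence": 0.91}]
csv_text = history_to_csv(history)
```

In a Streamlit app, text like `csv_text` can be offered to the user through `st.download_button`.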
pip install -r requirements.txt

Make sure the model is trained and the tokenizer files exist. Then run:

streamlit run urdugpt_web_app.py

- Clone this to a public GitHub repository
- Go to https://streamlit.io/cloud
- Click New App → select your repo → urdugpt_web_app.py
- Set the Python version and add requirements.txt
- Hit Deploy 🎉
This project can be used:
- To demonstrate building LLMs from scratch
- As a template for multilingual translation apps
- To support fine-tuning for Urdu/Indic NLP research
We invite contributors to:
- Extend to other language pairs (e.g., English → Pashto, Hindi, Bengali, Punjabi)
- Improve UI/UX (add voice input, transliteration)
- Add a dataset upload & training interface

With UrduGPT you can:
- Learn Transformer internals end-to-end
- Translate with your own trained model (no API needed)
- Run entirely offline or host on open platforms
- Extend the code to many other NLP tasks
urdugpt_web_app.py # Streamlit UI
urdugpt_utils.py # Config loader (YAML)
urdugpt_step1_dataset.py
urdugpt_step2_tokenizer.py
urdugpt_step2_dataloader.py
urdugpt_step8_transformer.py
urdugpt_step9_train.py
urdugpt_translate.py
config.yaml # All hyperparameters & paths
favicon.ico # UI icon
urdugpt.png # Logo for UI
This project is proudly created and maintained by Fayaz Khan.
MIT License. Contributions welcome.
Made with ❤️ for Urdu speakers & NLP builders.