Urdu Word List Sources

Jul 4, 2008 at 8:12 PM
We will discuss different sources for Urdu words here. The following are some possibilities:
  1. Word List from CRULP
  2. Word List Compiled at Urdu Mehfil
  3. Start from Scratch by Crawling Urdu Web Sites
Each source has pros and cons.

Word List from CRULP

A huge list (100,000+ words) already compiled by a Government of Pakistan institution, which gives us a good head start.

On manual inspection, I found that the list has a lot of mistakes that need to be corrected.
Despite being so large, the list is missing many words that are in common usage.
The list has been published under a "Creative Commons 3" license, and if we use it, we will also have to keep everything under the same license.

Word List Compiled at Urdu Mehfil

A reasonably sized list (11,000+ words) already compiled by volunteers at Urdu Mehfil.
The list seems to have been manually verified and corrected.
We can choose whatever license we want.

Some volunteers at Urdu Mehfil are using Arabic / Persian keyboard layouts. This results in wrong Unicode code points for some characters that are shared between Urdu, Arabic and Persian.
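
To illustrate the keyboard layout problem, here is a minimal Python sketch that flags and repairs two well-known look-alike characters. The mapping table and the function names are my own assumptions for illustration, not something the Urdu Mehfil list or this project defines, and the table is deliberately incomplete (heh variants are left out because the right replacement can depend on the word).

```python
# Illustrative, incomplete mapping (an assumption, not a project standard):
# Arabic code points that look like, but are not, the usual Urdu letters.
ARABIC_TO_URDU = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}

_TABLE = str.maketrans(ARABIC_TO_URDU)


def normalize_urdu(word: str) -> str:
    """Replace the look-alike Arabic code points with their Urdu counterparts."""
    return word.translate(_TABLE)


def find_suspect_words(words):
    """Return the words that contain at least one character from the mapping."""
    return [w for w in words if any(ch in ARABIC_TO_URDU for ch in w)]
```

Running find_suspect_words over the list first would let a volunteer look at the affected entries before normalize_urdu rewrites them.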

Start from Scratch by Crawling Urdu Web Sites

We can choose whatever license we want.

Huge task. The automatically generated lists will have to be verified manually.
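
As a rough idea of what the automatic part could look like, the sketch below pulls candidate words out of page text that has already been downloaded and counts them, so that manual verification can start with the most frequent words. The character ranges in the pattern, the function names and the min_count threshold are assumptions chosen for illustration; fetching the pages and stripping the HTML are not shown.

```python
import re
from collections import Counter

# Runs of Arabic-script letters and diacritics. Punctuation such as the
# Urdu full stop (U+06D4), the Arabic comma (U+060C) and digits are
# excluded. The exact ranges are an assumption and may need tuning.
URDU_WORD = re.compile(r"[\u0621-\u063A\u0641-\u0652\u0670-\u06D3\u06D5]+")


def candidate_words(text: str) -> Counter:
    """Count every run of Urdu letters found in the text."""
    return Counter(URDU_WORD.findall(text))


def review_queue(counts: Counter, min_count: int = 2):
    """Words seen at least min_count times, most frequent first."""
    return [word for word, n in counts.most_common() if n >= min_count]
```

Words that appear only once are more likely to be typos, so a threshold like min_count can shrink the list a human has to check, at the cost of missing rare but valid words.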

Jul 4, 2008 at 8:27 PM
I have started to scan pages from two web sites and I am verifying the word list manually as I go. If we decide to merge with the CRULP list or the Urdu Mehfil list later, we can do that. I am using the following sources:

Columns from Jang:
As I verify manually, I remove some of the English words transliterated into Urdu, especially the ones for which good Urdu alternatives are available and widely accepted. I am leaving in English words which do not have widely accepted Urdu alternatives. Coming up with good alternatives for these words is a separate effort in my understanding, and I do not want to slow down this spell checker project for that.

Featured Articles from Urdu Wikipedia:
I am using these articles on the assumption that, as they are featured, the administrators have already reviewed and manually spell checked them. So these articles are most likely to contain correctly spelled, good quality Urdu words.

Another good source of Urdu vocabulary that comes to mind is BBC Urdu. I have not started on this yet, but I am very inclined to include this source unless someone points out any major issues with it.

Jul 20, 2008 at 9:24 AM
Greetings,
To expand the word list, please also keep Urdu forums in mind so that words from everyday Urdu can be included as well.
In this regard, if you make a post on Urdu Mehfil itself, then by working together we can, God willing, generate a list of close to one hundred thousand words. The new words will be merged with the old ones and similar (duplicate) entries will be removed, and later on the help of several members can also be obtained for review.
Regarding English words, my request is that it would be better to keep the commonly used ones in the list. Which word to use is for the speaker of the language alone to decide; if a word is in use, it should be included in the dictionary as well. Otherwise, the dictionary will give the impression that it is incomplete.
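
The merge step suggested above, combining new words with the existing ones and dropping duplicates, could be sketched roughly as follows. The function name is hypothetical, and treating two words as the same when their NFC-normalized forms match is my assumption about what counts as a duplicate; near-duplicates such as spelling variants would still need human review.

```python
import unicodedata


def merge_word_lists(*lists):
    """Merge word lists in order, dropping blanks and exact duplicates.

    Words are compared after stripping whitespace and applying Unicode
    NFC normalization, so alef + maddah (U+0627 U+0653) matches the
    precomposed alef with madda (U+0622).
    """
    seen = set()
    merged = []
    for words in lists:
        for word in words:
            key = unicodedata.normalize("NFC", word.strip())
            if key and key not in seen:
                seen.add(key)
                merged.append(key)
    return merged
```

Keeping the first occurrence preserves the order of the older list, so entries that have already been verified stay where reviewers expect them.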