Technical R&D Projects  
   

 

Funded Projects

Urdu Nastalique Optical Character Recognition System
 
Principal Investigator:
Al-Khawarizmi Institute of Computer Science, UET, Lahore
www.kics.edu.pk
Project Director:
Dr. Sarmad Hussain sarmad@cantab.net
Project Details:
Start Date: December 2011
Duration: 15 months
Project Cost: PKR 29.14 million
Project Funding: PKR 29.14 million
Project Status: In progress.
Technical Progress Reports Submitted:
NA
Pending Reports:
None.
Deliverables Submitted:
NA
Pending Deliverables:
None.
Financial Audit Report: NA
Project URL: http://www.cle.org.pk
 

 

Executive Summary

With English-language literacy below 10%, current online content is largely inaccessible to most Pakistanis because of the language barrier. Moreover, even for those who can understand the English-language content generally available online, much of it is culturally irrelevant. A significant amount of content has already been published in Urdu, the lingua franca of Pakistan, in the form of books, magazines, etc. Much of this content is written in the Nastalique writing style. An Optical Character Recognition (OCR) system will allow this content to be scanned and quickly converted for online publishing in an editable and searchable format. Thus, a Nastalique OCR for Urdu, which can eventually be extended to other languages of Pakistan, would provide the impetus needed to bring much-needed, culturally relevant indigenous content online. The OCR system will also make it possible to search through existing scanned text posted online. Developing this technology will also give print-disabled Pakistanis (blind and illiterate) access to published material, as book readers and similar tools can be developed by integrating this OCR with an Urdu Text-to-Speech system. Therefore, this technology holds much promise both commercially and for the socio-economic benefit of Pakistani citizens. The project will also train human resources in Human Language Technology (HLT), an emerging area worldwide that integrates research from the speech, script and language processing domains for the benefit of people.

   
 
 
 

Copyrights (C) National ICT R&D Fund