|
Executive Summary
With English-language literacy below 10%, most online content is inaccessible to the majority of Pakistanis because of the
language barrier. Moreover, even for those who can understand the English-language content generally available online,
much of it is culturally irrelevant. A significant amount of content has already been published in Urdu, the lingua
franca of Pakistan, in the form of books, magazines, etc. Much of this content is written in the Nastalique writing style.
An Optical Character Recognition (OCR) system would allow this content to be scanned and quickly converted for online publishing
in an editable and searchable format. Thus, a Nastalique OCR for Urdu, which can eventually be extended to other languages of Pakistan,
would provide the impetus needed to bring much-needed, culturally relevant indigenous content
online. An OCR system would also make it possible to search existing scanned text already posted online. Developing this technology will
also give print-disabled Pakistanis (blind and illiterate) access to published material, because tools such as book readers can be developed
by integrating this OCR with an Urdu Text-to-Speech system. Therefore, this technology has much promise both commercially
and for the socio-economic benefit of Pakistani citizens. The project will also train personnel in Human Language Technology
(HLT), an emerging area worldwide that integrates research from the speech, script, and language processing domains for the benefit of people.
|