In a painstaking process, Wikipedians are digitizing out-of-copyright Indian language texts, trying to address the comparative paucity of Indic language texts online. Wikisource is a repository of documents and archived material that serves as a reference source for Wikipedia and as a means of improving access to information sources. Of the 64 languages Wikisource is available in, 8 are Indian: Tamil, Malayalam, Telugu, Kannada, Sanskrit, Marathi, Bengali and Gujarati. What’s particularly notable about this digitization is that the texts are being typed out by volunteers on their own time, one word at a time.

How It Began

Users were adding bhajans of Mirabai to Wikipedia, but according to Wikipedia’s policies, recipes, poems and song lyrics belong to Wikibooks or Wikisource, Noopur Raval, Communications Consultant (India Program) at the Wikimedia Foundation, told MediaNama. One user raised this issue, and following discussions, it was decided to create a Wikisource for Gujarati. The first text to be digitized, though, was Rachnatmak Karyakram, a book by Mahatma Gandhi. The project, involving the digitization of 60 pages, took six volunteers a week. This was followed by another project, the digitization of Gandhi’s autobiography, with a group of 13 people typing out the book over a month.

Identification & Prioritization Of Texts For Digitization

Selection of texts for digitization is entirely community driven: the community decides what is important. Editors put up a notice for the project, and user participation is sought. For example, the Gujarati Wikisource editors chose a text by Mahatma Gandhi. The community has an intensive process for checking whether a book is out of copyright, using the publication date, and there are mailing lists that discuss when books go out of copyright. “It’s not as if there is a shortage of texts that are out of copyright,” Hisham Mundol, Consultant (India Program) at the Wikimedia Foundation, said, adding that “The kind of projects that the community is undertaking (at present) involves iconic books, where you know the author and the publisher.”

Overcoming Technological Challenges

Mundol points out that the process of digitization is brutal, compounded by the fact that there is no reasonably functional OCR (Optical Character Recognition) for Indic languages. Texts are thus manually typed out, followed by a phase of correction and proofreading. In comparison, English texts can be scanned, uploaded and OCR’ed. The lack of tools points to a broader issue that Wikipedia faces with Indic languages. “If a MediaWiki tool comes to an English language project, the possibility of implementing it, the kind of people using it, all of that happens very quickly, because most of this is written [in] English. It takes time to localize it. For a bug to be filed for a local language project takes a lot more time. That gap makes for a lot of difference: how many people (use it), how easily is the work done, the kind of ease, at every step you need people who know the language to work with people who know the technology,” says Raval.

Still, the situation with Indic language fonts has improved over the past year, according to Mundol: “The font input problem is no longer the burning issue. There’s been an increase in the volumes on Indic language scripts, emails, mobiles. We’re seeing a doubling of readership of our Indic languages.” One reason for the increase, according to Raval, has been the implementation of a multiple input tool called Narayam, which integrates both Inscript and transliteration.

Reducing Entry Barrier & Involving Schools

The Wikisource project is really small in India right now, but it plays an important role: “It allows people to enter the Wikimedia world of projects in a much easier manner than editing a Wikipedia article. Wikisource is much more accessible,” Mundol says.

In Kerala, community members involved schools in the process of digitizing Ramchandra Vilasam. “As a part of the 7th or 8th standard, the school curriculum encourages typing in Malayalam. So the community members work with the teacher, and instead of 40 students typing out the same two pages that they would have done in a class assignment, they split a book between them, and each types out a separate page. It’s great because if everyone gave in the same page, it would go to the recycle bin quite promptly,” Mundol said, adding that “We are looking at involving more schools, and discussions are on with Malayalam schools and colleges.”

The Culture Of Knowledge & The Importance Of Community For Wikimedia

“It’s interesting to see how a culture develops, not just editors and technology, but the whole interaction that builds up the identity of a community,” Raval says. “When Gujarati Wikisource or Marathi Wikisource comes to your mind, you’re actually thinking of a bunch of people you don’t necessarily know, and their attitudes towards knowledge, and why they would go out of their way, spend hours, just to make sure that the knowledge that they think is important in a language should survive and be digitized, and they’ll go through the pain to make it available.” While each project has an individual taking responsibility as a project manager, and a group gets created around each project, since it is volunteer work, whenever someone has exams or has other work, someone else compensates.

The focus on fostering an involved community often determines Wikimedia’s approach: “The temptation could be to take a bunch of the 4 million articles on the English Wikipedia, and run it through a translator. Very quickly, you can build a huge content base (in Indic languages), but it does nothing for that community. We’ve seen that it not only does no good, it does a great deal of harm because they no longer feel that this is actually their project. It’s about ‘I wrote this paragraph’, or ‘I contributed to this article’,” Mundol says. “Anything that Wikimedia does, we encourage the participation of an individual member as much, much more important than anything else because community members edit, contribute, and no technology solution can get you that.”

