Reviving Telugu Folklore with AI: How Swecha is Creating a Telugu Chatbot

A software community in the Telugu-speaking regions of India is working towards creating a Telugu alternative to ChatGPT. Here’s a closer look at the various aspects of training a Large Language Model (LLM) and the issues such an exercise wrestles with.

This image was created using DALL-E.

“O King, answer my question truthfully. If you don’t talk despite knowing the answer, your head will split. If you answer, then I will escape.” For those who grew up listening to folklore from their elders, especially in the southern regions of India, these lines may stir the dormant memories of Vikram-Betaal stories. These were tales where an evil spirit posed questions of morals, logic and ethics before a king hell-bent on completing a promised task. In Telugu-speaking regions especially, elders used these stories, published in the Chandamama magazine, to teach young children about values and reasoning. Now in 2023, the Telugu youth use these stories again as a means to teach not children but a large language model (LLM) chatbot, Ganesh, General Secretary of the Swecha community in Telangana, told MediaNama.

Taking its name from the Telugu word for “freedom,” Swecha is a free and open-source software movement that was started in 2005 by Kiran Chandra to create a local operating system for the Telugu community. Years later, the community now has an ambitious goal to create a Telugu alternative for popular LLMs like ChatGPT.

According to C. Chaitanya, a Swecha member driving the LLM project, “The idea [to create an Indic language LLM] germinated from the comment of Sam Altman that it’s hopeless to build a ChatGPT kind of LLM in India with $10 million dollars. I took that as a challenge and threw the challenge to the Swecha community. The advantage India has is that we have 1.4 billion human minds. Using those, can’t we create one AI?” Swecha’s chatbot would be among the first Indic language models created by an Indian start-up to compete with big players like OpenAI.

Regional alternatives can make computing accessible

In the early days of the movement, Swecha leaders like Chandra realised that language and affordability can pose hurdles to accessing computing. These hurdles kept the Telugu working class from accessing the new technology. Moreover, most computers at that time ran Microsoft Windows, which wasn’t concerned with creating Telugu interfaces or similar regional features.

Now with a membership strength of over 10,000 in Telangana, Swecha is working on creating a Telugu chatbot called ‘Vemana,’ named after the popular Telugu philosopher and poet. However, to achieve this goal, the organisation first needed to create a dataset. That is where the Vikram-Betaal and other Chandamama stories came in.

“We took the inspiration from a project called Tiny Stories. This is done in English. Tiny Stories are basically these three-line, four-line stories; very small, very simple, coherent. When you run this story through the LLM, you train an LLM with this as a target and it comes up with new stories. That’s the basic idea. So, because there’s no such thing [in Telugu culture] and we have to start somewhere, we thought it’s best to start with the Chandamama stories,” said Ganesh.

Indic language LLMs can open doors for the local population

Ganesh said that Swecha saw the challenge to create a Telugu LLM as a chance to give even a remote, Telugu-speaking farmer an opportunity to do something as unique as prompt engineering.

“Today anybody can use ChatGPT… there is a huge section of the job market that has recently opened up – prompt engineering, which is nothing [but prompting] ChatGPT into giving you the answer that you want. The only skill that you need to do prompt engineering is [speaking] English. Now, why can’t some farmer who is in a remote area of Telangana, become a prompt engineer? We would like to imagine a future where his whole farm is [connected to the] Internet of Things network. He uses the technology to make his yield more productive and whatnot. Let’s say there is cattle on the other side and the farmer needs to know where a particular cattle is or control when the water enters the field. Why can’t you do that with the voice-enabled Telugu speech, which is a prompt? The only thing that is stopping him is that he doesn’t know English,” said Ganesh.

Building a dataset for the Telugu LLM

Swecha gathered the PDFs of the Chandamama stories from the publication’s archive website. However, these were old scanned images. Even after running these through optical character recognition (OCR) processes, Ganesh said the text still required cleaning and proofreading. So to complete this mammoth task, Swecha organised a datathon on November 16, 2023, in which around 7,500 people from 30 colleges, organisations, and institutions gathered to clean the dataset.

“40,000-plus pages were corrected, proofread. That means all of Chandamama stories were proofread and done in a single day,” said Ganesh, adding that the organisation will now be working on building the LLM using this data which will be called ‘Chandamama Kathalu.’

According to Chaitanya, the datathon showed that it’s possible to collect high-quality datasets at a very low cost. “The whole planning to execution took only 2 weeks and entirely driven by volunteers,” he said. The datasets were uploaded on the Hugging Face platform along with the open-source toolkit for data collection. This toolkit can now be used by others to create similar datasets, said Chaitanya.
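The proofreading described above was done by volunteers, but a simple automated pass can help decide which OCR lines most need human attention. The following is a minimal sketch, not Swecha's toolkit: the function names and the 0.8 threshold are assumptions made for illustration. It normalises each line and flags those where only a small share of characters fall in the Unicode Telugu block (U+0C00 to U+0C7F).

```python
import unicodedata

# The Telugu block in Unicode spans U+0C00 to U+0C7F.
TELUGU_START, TELUGU_END = 0x0C00, 0x0C7F


def normalize_line(line: str) -> str:
    """NFC-normalise the text and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", line).split())


def telugu_ratio(line: str) -> float:
    """Share of non-space characters that fall in the Telugu block."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(TELUGU_START <= ord(c) <= TELUGU_END for c in chars)
    return hits / len(chars)


def needs_review(line: str, threshold: float = 0.8) -> bool:
    """Flag non-empty lines where OCR likely produced stray glyphs.

    The threshold is a hypothetical value, not one taken from the article.
    """
    cleaned = normalize_line(line)
    return bool(cleaned) and telugu_ratio(cleaned) < threshold


# Four Telugu code points followed by OCR-style noise.
sample = "\u0C35\u0C47\u0C2E\u0C28 #@!x%&"
print(needs_review(sample))
```

Punctuation and digits also count against the ratio here, so a real pipeline would likely whitelist them before flagging a line.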

Larger hurdles in the LLM ecosystem

On hearing about the Telugu LLM, Varshul Gupta, Co-Founder of DubVerse, an AI-driven video dubbing and content creation platform, said, “the creation of a Telugu GPT or Indic GPT is very, very necessary, is very much in the need of the hour.” This is because current LLMs are extremely costly to use for non-English languages. Explaining how an Indic language LLM can solve this, Gupta told MediaNama, “Even if you were to type your name in English versus in Telugu or Marathi, your token size increases casually by about 8-10X. Your costing increases 8-10X. 500 words in English versus 500 words in Telugu will be 8-10X costly. That’s the technical limitation that this Telugu LLM will end up solving. That’s the prime value proposition that we bring the primary cost down.”
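Part of the gap Gupta describes can be illustrated at the byte level. Telugu characters each take three bytes in UTF-8 while basic Latin letters take one, so a tokenizer that falls back to bytes for a script it rarely saw in training starts at a disadvantage. The sketch below only counts bytes; it does not run a real tokenizer and does not reproduce the 8-10X figure, which depends on the specific model's vocabulary. The helper names are hypothetical.

```python
def utf8_bytes(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def byte_inflation(latin: str, telugu: str) -> float:
    """How many times more bytes the Telugu spelling needs than the Latin one."""
    return utf8_bytes(telugu) / utf8_bytes(latin)


# "Vemana" in Latin script and in Telugu script (4 code points, written
# with escapes so the exact characters are unambiguous).
latin_name = "Vemana"
telugu_name = "\u0C35\u0C47\u0C2E\u0C28"

print(utf8_bytes(latin_name))    # 6 bytes: one per Latin letter
print(utf8_bytes(telugu_name))   # 12 bytes: three per Telugu code point
print(byte_inflation(latin_name, telugu_name))
```

A tokenizer trained on plenty of Telugu text can merge those bytes into a handful of tokens, which is the saving a dedicated Telugu model would aim for.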

However, Ankit Prasad, Founder of Bobble AI, was more sceptical about the regional LLM’s success. Speaking to MediaNama, he said, “I feel that Indic language specific LLM is not a moat at all because LLM by its very nature is trained on a large set of data, which includes all language data. That’s good enough for them to generate responses in any language. Try asking a Tamil question to ChatGPT, it will answer. Yes, today the answer might be a little weaker given probably I don’t know their configuration of data set, but probably the Tamil data set within their overall training data set might have a smaller percentage. But with time it will increase, they are continuously operating themselves. For a new startup to come do the LLM play, it doesn’t make sense for me unless they already have hundreds of millions of dollars in investment.”

Can Swecha compete with the big players in the AI sector? When asked how DubVerse would interact with such a regional LLM created by a local community, Gupta said that the LLM will have to be extremely competitive for businesses to switch from existing chatbots like ChatGPT.

“[In the case of] Open AI, we can trust the quality because they have this red teaming done, they have this security done, they have a huge amount of human workforce that have done cycles of RLHF [Reinforcement Learning from Human Feedback] on those models. I’m not sure if there was a startup from India who was to come up with their own version of even telco LLM, that will be this much secure, this much red teaming, this hate speech, and all of those things would have been removed. That’s hard for me to imagine. Even if I was to use them, it will not be like a flip switch for me, for us. We will have to do some testing at our end,” said Gupta. Still, Gupta applauded Swecha for trying to come up with a regional LLM at a time when only four countries have managed to ship their models so far – US, China, France, and Saudi Arabia.

Big tech companies already have a relationship with the developer community: Prasad pointed out that for an LLM, an important business model is attracting developers to build use-cases and generate revenue. Here, big tech companies have a significant advantage since they have access to large developer communities.

“There are so many developers who have been already using different developer APIs of Microsoft, Google, or Facebook to build their applications historically, and therefore they have a relationship.”

Further, Prasad said that start-ups may struggle with the computing aspect of the LLM models since “very few companies who have access to that humongous compute power.”

“In India, startups do not have that luxury of compute, do not have those extraordinary investments to build horizontal LLMs, but do have that talent to create an amazing user experience that is the workflow combined with fine-tuning of the horizontal base model into a specific vertical. Most of the startups are doing that,” said Prasad.

Another pressing issue: Navigating copyright regulations

The Chandamama magazine stopped the publication of its physical copies in 2013. It was bought by Geodesic, a Mumbai-based software services company, in 2007 to usher Chandamama stories into the digital era. However, the company was found to have defaulted on outstanding loans. The Bombay High Court ordered it to be wound up in 2014 and ordered the sale of the magazine’s intellectual property (IP) rights in 2019. So, while the Chandamama stories may still be available on the archive website, the ownership status of the magazine raises questions on whether Swecha may face copyright lawsuits in future.

When speaking to MediaNama, Swecha said that they found copies of the Chandamama stories publicly available on an archived website. But this does not mean that the stories are not protected under copyright since an author holds copyright over their creation for the duration of their lifetime and 60 years after their death.

Such allegations of copyright infringement against chatbots have been cropping up in recent years. Writers like Julian Sancton and others in the Authors Guild have argued that companies like OpenAI use their writings to build their LLMs and teach the machine how to generate stories with similar syntax, style and theme. Writers argue that this should make such chatbots liable to a licensing fee. Meanwhile, the companies argue that the writers misconceive the scope of copyright.

Copyright infringement depends on the final product: According to Rahul Ajatshatru, a copyright lawyer, the scope of a copyright infringement in this case would depend on the final product generated by the LLM. “For assessing copyright infringement, how the impugned work is created is not really important, but what is the final expression and what impression it creates in the mind of the audience, is most relevant. Is that substantially similar to the original? If yes, then copyright can be claimed to be infringed… For anybody to succeed in an infringement action, the claimant must prove that the work that is challenged is substantially so similar to the original work that any common man comparing the two works will have little doubt that one is taken from the other…  whether it was done directly (with clear intention) or indirectly, through software that employs AI, may not be a tenable defence.”

He explained that this is based on the idea-expression dichotomy principle in India’s copyright law that states that a person cannot have copyright over an idea but they can claim copyright over the expression of an idea.

AI hallucinations may help avoid copyright infringement: AI hallucinations occur when AI (like an LLM) generates incorrect or nonsensical outputs that don’t align with reality or logical expectations. However, since the current Swecha LLM focuses on creating stories, this AI malfunction ends up becoming an advantage.

According to Ganesh, the goal with the current model is: “Tomorrow, if I run this machine and say ‘once upon a time’ [in Telugu, it] should actually give me a new story or make up a new moral conundrum. Now the interesting part is in the domain of stories, it’s okay if the model makes up new information. That is what you would call creativity.”

It may be noted that while copyright gives protection to humans, the same may not extend to AI. Ajatshatru doubts that copyright can be claimed for AI-generated content as it is neither created by humans nor a form of human expression. Not having any copyright, such work can be shared, copied and used by anyone without any infringement or copyright violation.

The road ahead for Swecha: GPUs pose a hurdle for start-ups in creating an effective LLM

Ganesh told MediaNama that after dataset creation, computing was the second task for Swecha. He agreed that this would be a bigger challenge for the community due to a piece of hardware called a Graphics Processing Unit (GPU). These GPUs were originally meant for games, videos and animations but eventually became crucial for building machine learning models and AI models such as LLMs.

“Unless you are willing to wait for three days or two days for ChatGPT to come up with an answer, you would want to actually run this and do this in a GPU. There is a virtual monopoly on these GPUs, which is mostly from the NVIDIA Corporation [in the US]. Right now, even if I have [Rs.] six crores in my pocket, I cannot replicate this in scale because I am not able to even access these GPUs on the market. They’re simply not available. They are all gobbled up by America, China and so forth,” said Ganesh.

Moreover, the waitlist for these GPUs is three to six months, costing precious time for Swecha. As such, rather than depending on GPUs, the movement is currently exploring alternatives.

“We personally believe it’s a matter of national security as well because tomorrow, NVIDIA or the US, which [is] basically where the NVIDIA’s intellectual property lies in, can decide we won’t sell these to India… if you build your bricks with the NVIDIA’s GPUs, your house might fall down because you’ll be dependent upon them. It’s high time that we invest in our own clusters,” said Ganesh, stressing the need for the Indian government to invest in such clusters.

Swecha looking for alternatives to GPU chips: As an alternative to GPUs, Swecha is currently considering using available devices like phones and laptops with good RAM, processors and even GPUs within them. Ganesh told MediaNama that these devices in their idle state can be used to create a network that can then potentially rival GPUs and train LLM-like machines.

“Now the question is, can I use these phones, for example, are idle whenever we sleep? Six hours a day, four hours a day, five hours a day. When you sleep, you put it in charging, they’re pretty much idle. They’re not doing anything. So why don’t we use these phones, computers, laptops, that are sitting idly by, network them all together and make a huge machine that can rival the GPUs. That can then, maybe not individually as a node but as a group crack the problem of training these huge machines,” said Ganesh.
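One common way to pool many small machines is synchronous data-parallel training: each device computes a gradient on its own slice of the data, and a coordinator averages those gradients before updating the shared model. The article does not say which approach Swecha will use, so the following is only a toy sketch under that assumption. The devices are simulated as in-memory lists, and the model is a single weight fitted to y = w * x.

```python
from typing import List, Tuple

# One device's local data: (x, y) pairs. Real devices would hold text shards.
Shard = List[Tuple[float, float]]


def local_gradient(w: float, shard: Shard) -> float:
    """Gradient of mean squared error for y = w * x on one device's data."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def train(shards: List[Shard], steps: int = 200, lr: float = 0.01) -> float:
    """Simulate a coordinator averaging gradients from every device."""
    w = 0.0
    for _ in range(steps):
        # On real hardware each gradient would be computed in parallel.
        grads = [local_gradient(w, shard) for shard in shards]
        # The coordinator averages the gradients and updates the shared weight.
        w -= lr * sum(grads) / len(grads)
    return w


# Three simulated devices, each holding samples of y = 3 * x.
shards = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0)],
    [(4.0, 12.0), (5.0, 15.0)],
]
w = train(shards)
print(round(w, 4))
```

In practice the hard parts are the ones this sketch leaves out: slow or unreliable network links between phones, devices dropping out mid-step, and models too large to fit on any single node.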

If Swecha manages to pull this off, the feat may inspire similar regional LLMs in different parts of India. There have already been attempts by locals to help bring regional aspects to the LLM sector. In Karnataka, Harish Garg, a software engineer, developed a Kannada Gottu GPT to help people learn Kannada, reported TS2. Although Garg had to use ChatGPT to create the tool, it highlights how Indians are already seeking to bridge the language barriers in AI. In Dubai, the QX Lab AI introduced an ASK QX application that incorporates over 100 languages including Indic languages like Hindi, Urdu, etc., to offer better language support.

All in all, Swecha appears confident that it’ll be able to have at least the Chandamama version of the LLM working in the coming weeks. The organisation is also working on an international conference on digital ecosystems in March 2024 to discuss challenges and technologies in this field and encourage more people to join their initiative. Like Betaal, the project appears to be posing one conundrum after another before Swecha, and like King Vikramaditya, the Telugu software movement continues to work on its goal.


Written By

I'm interested in the shaping and strengthening of rights in the digital space. I cover cybersecurity, platform regulation, gig worker economy. In my free time, I'm either binge-watching an anime or off on a hike.

MediaNama is the premier source of information and analysis on Technology Policy in India. More about MediaNama, and contact information, here.

© 2008-2021 Mixed Bag Media Pvt. Ltd. Developed By PixelVJ
