In a significant step towards Indianising Artificial Intelligence (AI), earlier this month, Bengaluru-based Sarvam AI launched a set of products aimed at changing the AI landscape in the country. One of them is Sarvam 2B – an open-source large language model (LLM) that is proficient in 10 Indian languages.
The ultimate goal is to democratize AI, making it accessible to every Indian regardless of linguistic and socio-economic background, and to bridge the digital divide, one of the company’s founders posted after the launch.
LLMs went viral in the news last year with the launch of the American company OpenAI’s GPT-4 – an advanced version of the model powering the company’s ChatGPT, which is said to be able to understand human emotions and respond accordingly. While ChatGPT, at present, allows interactions in Indian languages, including Tamil, Malayalam, and Hindi, it leaves much to be desired — especially in terms of understanding the nuances, dialects, idioms, and cultural references of Indian languages.
To a question about this, the chatbot, for its part, said that such nuances would be better handled by an LLM designed from scratch than by an AI model developed from publicly available data.
No lack of initiative
Developing an LLM in a language other than English is difficult, says Neha*, who has been working on LLMs for over four years. “There is a lot of data and digital content available in English, which forms the basis for training the machine. For Indian languages, the data is very limited. It will take a lot of time, and a lot of training, to get to where English is placed today (for LLMs),” she added.
However, efforts have been made for some time in the field: by companies such as Sarvam, which released the ‘OpenHathi’ Hindi language model last year; by Central government initiatives such as Bhashini, which provides a variety of AI tools allowing access in Indian languages of choice; and by AI4Bharat, an initiative of the Indian Institute of Technology-Madras, among others.
Most recently, Thiagarajar College of Engineering (TCE) in Madurai launched a research center, Tamarai, for AI in Tamil.
This process is difficult because designing an efficient Indian language LLM requires accurate and authentic data. “(In most cases), the company or university working on this reaches out to the best university where the language in question is taught, contacts the faculty, and builds some literature on the language. The help of non-governmental organizations (NGOs) is sought to collect (language) data in the field. Agents are deployed in remote areas where the language is still spoken without the influence of other languages (such as English). The NGO arranges meetings with residents, asks people to talk about some topic or domain, and records the conversation,” Neha said.
Transcription of the collected data is quite challenging, says Janki Nawale, a linguist at AI4Bharat, IIT-M, noting the problems faced in designing the ‘IndicVoices’ dataset, using which IndicASR — the first automatic speech recognition model that supports all 22 languages in the Eighth Schedule — has been built.
“Projects such as IndicTrans and IndicVoices at AI4Bharat provide an opportunity for translators, linguists, native speakers, NGOs and local partners to participate in various linguistic tasks. It is difficult to translate and annotate data for machines because, in most cases, annotations are done at the level of individual sentences, without longer semantic context. Sometimes, sentences can be difficult to translate because of script-related constraints of Indian languages, such as the right-to-left script of Urdu or the distinct script of Manipuri, or because the standard script cannot represent words from everyday speech. Therefore, from a scientific perspective, certain annotation rules should be established to maintain data consistency across languages, while allowing the freedom to capture the authenticity of the language without being restricted by these ‘rules’ for different applications,” she said.
Technical issues are also a challenge. The graphics processing unit (GPU) is as important as the data for an LLM, to process the huge amount of information the machine is trained on. “LLMs deal with billions of parameters and work with petabytes of data. To train them, one needs H100 chips (produced by NVIDIA) to crunch large volumes of data and run machine learning models. Besides the expensive rates, there is a need for specialized RAM, motherboards and other resources; putting the assembly together, and using it efficiently to train an LLM, requires highly technical expertise,” said Ranjith Melarkode, founder of The Neural.ai.
Computation of tokens
Generally, AI models break down the sentences or words they are given into ‘tokens’, and such models are known to generate fewer tokens for English than for languages like Hindi or Tamil. Mr. Ranjith said, “Higher tokenization allows the model to better capture language nuances and handle diverse inputs – which is much needed in Indic languages, where words often share common roots across languages. This fidelity and flexibility often comes at a computational and resource cost. It is important to find the right balance between model efficiency and fidelity (and cost).”
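To illustrate the point, here is a minimal sketch of how token counts can be compared across languages. It uses OpenAI’s open-source tiktoken tokenizer purely as an illustrative stand-in; the models discussed in this article use their own tokenizers.

```python
# Minimal sketch: comparing token counts for English and Indic text using
# the open-source tiktoken library (an illustrative choice, not the
# tokenizers used by the models mentioned in this article).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4-era models

samples = {
    "English": "Good morning, how are you?",
    "Hindi":   "सुप्रभात, आप कैसे हैं?",
    "Tamil":   "காலை வணக்கம், நீங்கள் எப்படி இருக்கிறீர்கள்?",
}

for language, text in samples.items():
    tokens = enc.encode(text)
    # Indic scripts typically split into many more tokens per word,
    # which raises the compute cost of training and inference.
    print(f"{language}: {len(tokens)} tokens for {len(text)} characters")
```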
Ms. Nawale said that engaging many people in such detailed tasks is difficult. “To get 20 minutes of content from people, one has to work with them for three to four hours, which some do not agree to,” she said, while also emphasizing that many people, as well as organizations, extend their cooperation when they understand that their efforts help promote, digitize and preserve their language, which may be in decline.
The process is also lengthy, says Sanjay Suryanarayanan, research engineer, AI4Bharat, IIT-M. Various factors and domains (topics) must be looked at before the data is fed to the machine, so that the final product is more efficient. For example, to evaluate translation models (AI models designed to translate text-based content from one language to another), developers look to ‘gold standard parallel data’ — content translated by humans, not machines — to train their models. “The translator manually translates the text from English to the Indian language and feeds it to the machine. Once the translated text is created, the machine is asked to translate it back to English. This process is called back translation,” he said.
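The back-translation check Mr. Sanjay describes can be sketched roughly as below. The Helsinki-NLP models and the BLEU comparison are illustrative stand-ins, not the IndicTrans models his team works with.

```python
# Rough sketch of a back-translation check, using publicly available
# Helsinki-NLP machine-translation models as illustrative stand-ins.
from transformers import pipeline
import sacrebleu

en_to_hi = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi")
hi_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-hi-en")

source = "The farmers gathered at the village square before sunrise."

# Forward pass: English -> Hindi (in practice, a human translator produces
# the gold-standard parallel sentence at this step).
hindi = en_to_hi(source)[0]["translation_text"]

# Back translation: Hindi -> English, then compare against the original.
round_trip = hi_to_en(hindi)[0]["translation_text"]
score = sacrebleu.sentence_bleu(round_trip, [source])

print("Hindi:", hindi)
print("Back-translated:", round_trip)
print(f"Round-trip BLEU: {score.score:.1f}")  # higher = closer to the original
```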
This is just one cog in a much larger wheel. In addition, Mr. Sanjay said, prompt engineering (where instructions are structured in such a way that the AI model can carry out the requests made) should be focused on. “The ultimate goal is to make AI models sophisticated,” he said.
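In practice, prompt engineering amounts to laying out the instruction, the context and the expected output format before handing them to a model. The template and field names in this small sketch are hypothetical, for illustration only.

```python
# Hypothetical sketch of a structured prompt for an Indic-language task.
def build_prompt(instruction: str, context: str, output_language: str) -> str:
    # Separate the instruction, context and output constraints so the
    # model can follow the request more reliably.
    return (
        f"### Instruction:\n{instruction}\n\n"
        f"### Context:\n{context}\n\n"
        f"### Respond in {output_language}, in no more than three sentences.\n"
    )

prompt = build_prompt(
    instruction="Summarise the weather advisory for farmers.",
    context="Heavy rain is expected in the delta districts for the next two days.",
    output_language="Tamil",
)
print(prompt)
```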
Also, when an LLM is built without the right vision, team and operational experience, there is a chance of inadvertently releasing “bias (that the team) has” into the system, Mr. Ranjith said. “One team may not understand the work of another. The data team may not interact with the user experience or legal compliance team. They may speak at a superficial level, but not at a deeper level, and there is a concern that bias is introduced (in the built model),” he said.
In addition, LLMs must be continuously trained and fine-tuned; the process never really ends. “Only then will we achieve the accuracy we aim for,” he said.
Benefits
An Indian LLM, if holistically designed, can have multiple applications. Neha* says that everything from interactive learning courses to chatbots to the revitalization of languages such as Dogri can be achieved through it.
Hari Thiagarajan, chairman, TCE, said: “There is a lot of potential in (Indianising AI) because the country has more than 20 languages (in the Eighth Schedule of the Constitution). (For Tamarai), Tamil is a classical language, and the Tamil diaspora is spread all over the world. Therefore, using Tamil will be a big benefit — which has not been done before. Also, it is a way to promote the language. Tomorrow, if a Tamil LLM can do what ChatGPT does in English, the industry will benefit, and the language will be preserved.”