Cut through clutter: Tamil dataset to train AI models

Netizens engaging with AI models in Tamil or any other regional language often come across incoherent translations, jumbled sentences, bizarre word choices and poor grammar, but dismiss them as the babble of a budding ecosystem. Not so Raju Kandaswamy, a senior IT professional, who believes such errors throw a spanner in the works of Tamil’s linguistic integrity. “Current training datasets are heavily distributed in English language and therefore do not accurately represent Tamil language or its cultural context. Users over a period of time absorb and internalise these biases leading to slow erosion of cultural values,” he said.

An increasing number of people, including senior citizens, rarely pick up books and consume Tamil content only through the internet or through speech. In a world increasingly mediated by large language models (LLMs), which now figure in web searches, shopping and education, this creates a problem.

Kandaswamy is principal consultant at Thoughtworks in Coimbatore and part of AI Tamil Nadu, a non-profit community aiming to improve how AI models work in Tamil. The team is building a large-scale Tamil language dataset to train AI models, and is collaborating with authors and other organisations to curate large, high-quality datasets. Their plan is to use these data repositories to fine-tune open-source models such as Meta’s Llama and make them available for anyone to build Tamil-specific models. He believes that these models can be used to deliver government services, communicate welfare schemes, and enable vernacular education for the masses, especially the rural population.

Abinaya Mahendiran, a natural language processing (NLP) expert and member of AI Tamil Nadu, is leading the initiative, named Vidhai. She too thinks it is crucial for preserving Tamil culture. “Access to high-quality datasets in less represented languages is limited, and Tamil is no exception. Machine-translated content is often inaccurate. So, we collect original Tamil texts such as books, essays and articles from various sources, clean them, and annotate them with the help of volunteers, students, linguists, retirees, and teachers. A trove of Tamil books and printed material is yet to be digitised,” she said.

Today, many independent researchers and language enthusiasts are spending their own money to improve Tamil AI models. But as Abinaya notes, lack of computing resources and difficulty in mobilising volunteers are major barriers.

The Tamil Virtual Academy (TVA) has a digital library of more than 1 lakh books containing around 1.5 crore pages, spanning subjects from science to history. It is also developing tools such as syntactic parsers, morphological analysers, and part-of-speech taggers, resources critical for NLP research. Yet fragmented efforts, siloed developments, and fuzzy copyright guidelines hinder collaboration. A senior official confirmed that TVA could collaborate with AI technologists, but ambiguity around copyright and fair use remains a bottleneck.

Navaneeth Malingan, founder of AI Tamil Nadu, is attempting to bridge the ecosystem by bringing together its various elements, from scouting for student volunteers and linguists to getting access to computing resources through corporate sponsorship. He says these kinds of models are crucial for delivering government services to locals, while commercial AI models will be useful for most business cases. “The government can use it to fill forms through voice, give instructions to farmers and teach Tamil to the younger generation. Various stakeholders including government and companies should be brought together to build these models suitable for the use cases,” he asserted.

The community is currently fine-tuning existing AI models to improve their performance in Tamil, but has ambitions to build one from scratch, albeit a small, domain-focused model.
The planned model will use a tokenisation method inspired by Nannul, the 13th-century Tamil grammar treatise, to better reflect the language’s morphological structure, instead of the widely used Byte-Pair Encoding (BPE) method. Tokenisation refers to the process of breaking down text into smaller units called tokens.

From adopting the printing press in the 1500s (the first language in India to do so) to adopting Unicode for the internet, Tamil has consistently been an early adopter of new communication technologies. Now, it should be able to find its place in the AI age.
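
To make the tokenisation point concrete, the short Python sketch below counts the units in the word தமிழ் (Tamil) at three levels: raw UTF-8 bytes (the starting point for byte-level BPE variants), Unicode code points, and written letters. It is only an illustration. The function name `split_into_letters` is hypothetical, the rule of attaching combining marks to the preceding consonant is a simplification, and none of this represents the Nannul-inspired method, whose details the article does not describe.

```python
import unicodedata


def split_into_letters(text):
    """Group each consonant with the vowel signs or pulli that follow it.

    Simplified rule (an assumption for illustration): any code point whose
    Unicode general category starts with "M" (a combining mark) is attached
    to the letter before it.
    """
    letters = []
    for ch in text:
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch  # vowel sign or pulli joins the previous letter
        else:
            letters.append(ch)  # a new base letter starts here
    return letters


# The word "Tamil" written in Tamil script, spelled out as code points
# so the example does not depend on how the text was copied.
word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

utf8_bytes = word.encode("utf-8")   # what a byte-level tokeniser starts from
code_points = list(word)            # individual Unicode characters
letters = split_into_letters(word)  # units a reader would recognise

print(len(utf8_bytes), len(code_points), len(letters))  # 15 5 3
```

The same five-character word is 15 bytes but only three letters to a reader, which suggests why a tokeniser that is unaware of the script can end up splitting Tamil text into pieces that carry no linguistic meaning.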
