

Guwahati: By turning mathematics into a multilingual error-detection tool, IIT-Guwahati has created a formula that can quietly strengthen the foundations of digital knowledge, ensuring that in an AI-driven world both humans and machines can place greater trust in Wikipedia, the world's largest encyclopedia.

Wikipedia is maintained by volunteers worldwide, and the new method can help editors identify hidden typos and linking errors that might otherwise remain unnoticed for years. The solution is a multilingual method that uses mathematical frequency patterns to detect and correct subtle errors in Wikipedia, ensuring more reliable information for both human readers and the artificial intelligence systems trained on it. The work was showcased at the India AI Impact Summit 2026.

Wikipedia is a cornerstone of digital knowledge, but it isn't flawless. A study by the IIT-Guwahati team found that 3–6% of all links contain mistakes: typos, misspellings, or extra words in the text that links one page to another.

These "Surface Name Errors" may look trivial, but they quietly erode trust. For readers, they reduce credibility. For AI systems, which often use Wikipedia as a training dataset, they can distort learning and weaken performance.

To address this challenge, Prof Amit Awekar, an associate professor in the Department of Computer Science and Engineering, along with an M Tech student, Anuj Khare of the batch of 2022, built a method that relies on mathematical frequency patterns, making it adaptable across languages.

The team built a three-step process. First, every link is broken down into four parts: the page it appears on, the page it points to, the word used as the link, and the surrounding text. Next, the method applies a frequency test: a name is considered valid only if it appears at least 10 times and makes up at least 5% of all links to that page. Finally, flagged errors are classified as either simple typos, such as "Gawahati" instead of "Guwahati", or span errors, where extra or wrong words creep in. (A simplified sketch of the frequency test appears at the end of this article.)

Speaking about the real-world application of the method, Prof Awekar said, "This work shows us that we should not be trusting the data from the web blindly, both for human use and training AI models. Good data is the beginning of any good AI model and downstream application."

The method was tested on eight languages (English, Hindi, Sanskrit, Urdu, German, Italian, Marathi, and Gujarati) and proved accurate across all. When the team compared English Wikipedia snapshots from 2018 and 2022, they found that 30% of the errors flagged by their method had already been corrected by editors, validating its effectiveness. Even more striking, the Wikipedia community accepted over 99% of the manual corrections suggested by the researchers, the institute said.

For everyday users, this means cleaner articles. For AI, it means stronger models built on trustworthy data. And for Wikipedia's volunteer editors, it offers a scalable way to catch mistakes that might otherwise remain hidden for years.
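
The report does not include the team's code, but the frequency test it describes can be illustrated with a short sketch. The Python snippet below is a minimal, hypothetical illustration rather than the researchers' implementation: the function name find_surface_name_errors and the (source page, target page, anchor text) representation of a link are assumptions, while the 10-occurrence and 5% thresholds come from the article.

```python
from collections import Counter

def find_surface_name_errors(links, min_count=10, min_share=0.05):
    """Flag surface names (link anchor texts) that look too rare to be genuine.

    links: iterable of (source_page, target_page, anchor_text) tuples.
    A surface name is accepted for a target page only if it is used at least
    `min_count` times and accounts for at least `min_share` of all links
    pointing to that page; anything else is flagged for review.
    """
    links_per_target = Counter()   # total links pointing to each target page
    links_per_anchor = Counter()   # links per (target page, anchor text) pair

    for _source, target, anchor in links:
        links_per_target[target] += 1
        links_per_anchor[(target, anchor)] += 1

    flagged = []
    for (target, anchor), count in links_per_anchor.items():
        share = count / links_per_target[target]
        if count < min_count or share < min_share:
            # A later step would classify these as simple typos or span errors.
            flagged.append((target, anchor, count, share))
    return flagged


# Toy example: "Gawahati" is a rare, misspelled anchor for the Guwahati page.
sample_links = [("Assam", "Guwahati", "Guwahati")] * 40
sample_links.append(("Brahmaputra", "Guwahati", "Gawahati"))

for target, anchor, count, share in find_surface_name_errors(sample_links):
    print(f"Suspicious anchor {anchor!r} -> {target!r}: {count} use(s), {share:.1%} of links")
```

On the sample data, the common anchor "Guwahati" passes both thresholds, while the one-off "Gawahati" falls below them and is flagged, mirroring the typo example given in the article.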


