MUMBAI, India, June 22 -- Intellectual Property India has published a patent application (202621047997 A) filed by Moresh Madhukar Mukhedkar; Anish Sahu; Vishwajeet Pawar; Ishaan Shaikh; Gayatri Patil; Vivek Patil; Vishal Patil; and Shakil Tamboli on April 15, 2026, for Visual Sense.
Inventors include Moresh Madhukar Mukhedkar; Anish Sahu; Vishwajeet Pawar; Ishaan Shaikh; Gayatri Patil; Vivek Patil; Vishal Patil; and Shakil Tamboli.
The application for the patent was published on June 12, 2026, under issue no. 24/2026.
Abstract: In today's world, the Internet has become an essential part of everyday life. But for people who are visually impaired (either partially or completely), navigating it independently is still a huge challenge. Much of the Internet, such as some websites, is not fully accessible to such users, which creates a barrier for this community. Our project, Visual Sense, is an AI-based assistive tool designed to help visually impaired users navigate and experience the Internet more freely and independently, narrowing the accessibility gap. The system employs a pre-trained LLaVA (Large Language and Vision Assistant) model, an open-source vision-language model fine-tuned on GPT-generated multimodal instruction-following data, to visually interpret on-screen content such as images, buttons, and web page elements and describe them in simple, understandable language. Designed specifically to serve visually impaired users, the model delivers these descriptions through a voice output module, effectively bridging the gap between visual digital content and the user's ability to comprehend it without sight. The main idea behind Visual Sense is to go beyond what traditional screen readers can offer. Instead of just reading out raw text on a page, our system actually “understands” the visual context and gives meaningful, human-like descriptions. This makes it much more helpful in real-world situations where images or visual layouts carry important information. The tool operates through a four-step pipeline: capturing images, understanding their visual content, generating a natural language description, and delivering that description as speech. To produce accurate descriptions, the system combines a CNN-based visual encoder with a Transformer-based language decoder in an encoder-decoder architecture, supported by a visual attention mechanism that directs the model's focus toward the most relevant regions of an image.
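The visual attention mechanism mentioned in the abstract can be illustrated with a minimal sketch. The application does not say which form of attention is used; this example assumes scaled dot-product attention over per-region feature vectors, and the function names, the query vector, and the toy numbers are all hypothetical.

```python
import math

def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, regions):
    """Weight each image-region feature vector by its similarity to the query.

    query:   a decoder state as a list of floats (hypothetical)
    regions: one feature vector per image region, same length as query
    Returns (weights, context); context is the weighted sum of the regions.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * r for q, r in zip(query, region)) / scale
              for region in regions]
    weights = softmax(scores)
    context = [sum(w * region[i] for w, region in zip(weights, regions))
               for i in range(len(query))]
    return weights, context

# Toy input (assumed): three regions with 2-dimensional features.
weights, context = attend([1.0, 0.0], [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]])
```

In this toy case the first region is most similar to the query, so it receives the largest weight and dominates the context vector passed on to the language decoder.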
The resulting textual descriptions are converted to speech using text-to-speech technology, providing visually impaired users with an accessible and intuitive means of receiving visual information. By integrating computer vision, natural language processing, and speech synthesis within a unified multimodal framework, the system empowers visually impaired individuals with greater digital independence and an improved quality of life.
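The four-step pipeline described in the abstract (capture, understand, describe, speak) can be sketched as a chain of interchangeable components. This is an illustrative outline only, not the applicants' implementation: the class name, field names, and stand-in components below are assumptions, and a real system would plug in a screen-capture routine, a vision-language model, and a text-to-speech engine.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AssistivePipeline:
    """Hypothetical wiring of the four steps named in the abstract."""
    capture: Callable[[], bytes]            # step 1: grab an image or screenshot
    encode: Callable[[bytes], List[float]]  # step 2: understand visual content
    decode: Callable[[List[float]], str]    # step 3: generate a description
    speak: Callable[[str], None]            # step 4: deliver it as speech

    def run(self) -> str:
        image = self.capture()
        features = self.encode(image)
        description = self.decode(features)
        self.speak(description)
        return description

# Stand-in components (assumed) so the sketch runs without a model or audio device.
spoken: List[str] = []
pipeline = AssistivePipeline(
    capture=lambda: b"placeholder image bytes",
    encode=lambda image: [float(len(image))],
    decode=lambda features: f"An image with {len(features)} feature value(s).",
    speak=spoken.append,  # records the text instead of synthesizing audio
)
result = pipeline.run()
```

Keeping each step behind a plain callable means any one stage, for example the speech output, could be swapped without touching the others.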
Disclaimer: Curated by HT Syndication.