SIVO

Sign to speech & speech to sign

Backend

Python, Flask, TensorFlow

Frontend

React Native

Source Code

View on GitHub

01 / Overview

For my Final Year Project, I built an app called SIVO. Basically, I wanted to solve a real, everyday problem: the communication gap between the deaf community and hearing people. Instead of relying on a human interpreter, I built a two-way translator that fits right in your pocket.

Here is how it works: you point your phone’s camera at someone using Pakistan Sign Language (PSL). The app tracks their hand and body movements, figures out what they are signing, and instantly speaks the words out loud. It also works in reverse: you can talk into the phone, and the app will translate your voice into sign language on the screen.

03 / The Hardest Path

Getting a heavy AI model to run in real time on a mobile phone was honestly the hardest part. I built the mobile app using React Native, but I offloaded all the heavy AI processing (TensorFlow, MediaPipe, and OpenCV) to a custom Python cloud server.
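To make that split concrete, here is a minimal sketch of the server side, assuming the client sends a 30-frame window of extracted landmarks as JSON. The route name, payload shape, and the commented-out model call are illustrative assumptions, not SIVO's actual API:

```python
# Sketch of a Flask endpoint that receives a landmark window from the
# mobile client and returns a predicted sign. The /predict route and
# "landmarks" field are hypothetical names for illustration.
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)
# model = tf.keras.models.load_model("psl_sign_model.h5")  # loaded once at startup

@app.route("/predict", methods=["POST"])
def predict():
    # The client posts a flat list of landmark coordinates for 30 frames.
    landmarks = np.array(request.get_json()["landmarks"], dtype=np.float32)
    window = landmarks.reshape(1, 30, -1)  # (batch, frames, features_per_frame)
    # probs = model.predict(window)[0]
    # sign = LABELS[int(np.argmax(probs))]
    sign = "hello"  # placeholder so the sketch runs without the trained model
    return jsonify({"sign": sign})
```

Keeping the model server-side means the phone only does camera capture and landmark upload, which is what makes real-time performance feasible on mid-range hardware.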

04 / Challenges

If I’m being honest, getting an AI to recognize a single, isolated sign—like 'Hello'—is actually pretty straightforward. You show the model the gesture, it memorizes the hand landmarks, and you’re good to go. But real life isn’t a flashcard app. In the real world, people sign continuously and fluidly.

When a user transitions from signing the word 'Boss' to the word 'Send,' their hands move through the air in a very unpredictable way. To a trained AI, that blurry, halfway-there hand movement looks like a completely different, random word. Instead of predicting the sentence, my model was initially spitting out absolute gibberish like 'Boss... Make... Apple... Send.' The 'noise' between the signs was ruining the entire sentence.

05 / Solutions

I spent weeks banging my head against the wall trying to fix this. I dug through dozens of academic research papers on sign language recognition, and I spent hours prompting every AI tool I could find (ChatGPT, Claude, etc.), begging for a solution. But almost everything I found was either too theoretical or just flat-out didn't work for a real-time mobile app.

What the AI Suggested: The AI models usually gave me standard, math-heavy computer science answers. The most common suggestion was to calculate hand speed (velocity). The logic was: if the hands slow down or stop, the person is making a sign; if the hands are moving super fast, they are just transitioning between words, so the app should ignore those frames.

Why It Failed in My Case: It sounded great in theory, but it was a disaster in practice. Every single person signs at a different speed. Some people sign incredibly fast and fluidly, while others take their time. The velocity thresholds were way too fragile: if someone was just naturally a fast signer, the app thought everything was a transition and predicted nothing.

What the Academic Papers Suggested: When I turned to university research papers, I found a different trend. A lot of researchers "solved" this problem by introducing a "null" or "neutral" sign. They basically trained their AI to recognize when a person’s hands were resting at their sides, and they forced the user to drop their hands back to a resting position between every single word.

Why It Failed in My Case: Technically, it worked, but practically? It was terrible. Imagine trying to have a normal, emotional conversation, but having to pause and drop your hands like a robot after every single word. It completely destroyed the natural flow of Pakistan Sign Language. My goal was to build an app for fluid human conversation, not a robotic lab experiment.

Engineering My Own Solution: I realized that the standard internet advice and academic tricks weren't going to cut it.
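For the curious, the velocity filter I abandoned boils down to just a few lines, which is part of why it was so tempting. The threshold and the synthetic landmark arrays here are made up for illustration:

```python
import numpy as np

def velocity_filter(frames, threshold=0.05):
    """Keep frames where the mean landmark displacement since the previous
    frame is below `threshold`; faster frames are assumed to be transitions
    and dropped. This is the approach that failed for SIVO."""
    kept = [frames[0]]
    for prev, cur in zip(frames, frames[1:]):
        speed = np.linalg.norm(cur - prev, axis=-1).mean()
        if speed < threshold:
            kept.append(cur)
    return kept

# A naturally fast signer: every frame-to-frame displacement exceeds the
# fixed threshold, so the filter discards the whole sign, not just the
# transitions between signs.
fast_signer = np.cumsum(np.full((10, 21, 2), 0.1), axis=0)
print(len(velocity_filter(fast_signer)))  # 1 -- only the first frame survives
```

With a single fixed threshold, a fast signer's genuine signs are indistinguishable from transitions, which is exactly the failure mode described above.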
I couldn't force deaf users to change how they talk just to make my app's job easier. The app had to adapt to them. I had to stop trying to predict single frames perfectly and start looking at the bigger picture. That is when I threw out the standard playbook and built my own Customized Sliding Window approach. Instead of looking at a single moment in time, I built a dynamic buffer that captures overlapping 30-frame blocks of video. The model evaluates the contextual flow of the entire movement, filtering out the transition noise naturally.

But I didn't stop there. Knowing that even the best sliding window might occasionally output a messy string of words, I engineered a Server-Side Smart Matcher. If the sliding window caught the messy keywords "Boss... send... word," the server instantly cross-references them against a database of valid, grammatically correct sentences and outputs: "Boss send project report."

By ignoring the standard "hand speed" and "null sign" advice and building a custom AI pipeline, I finally achieved a system that understands the intent of the user, reaching a 95% accuracy rate without ever asking them to slow down.
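A rough sketch of those two pieces: a sliding window that only emits a sign once consecutive overlapping windows agree, and a keyword-to-sentence matcher. The 30-frame window comes from the text above; the stride, voting scheme, and tiny sentence database are my own illustrative assumptions, not SIVO's exact logic:

```python
from collections import deque

class SlidingWindow:
    """Overlapping 30-frame buffer: a sign is only emitted once the same
    label wins several consecutive windows, filtering transition noise."""
    def __init__(self, size=30, stride=5, votes=3):
        self.frames = deque(maxlen=size)   # rolling buffer of recent frames
        self.size, self.stride, self.votes = size, stride, votes
        self.recent = deque(maxlen=votes)  # labels from the last few windows
        self.count = 0

    def push(self, frame, classify):
        self.frames.append(frame)
        self.count += 1
        # Wait until the buffer is full, then classify every `stride` frames.
        if len(self.frames) < self.size or self.count % self.stride:
            return None
        self.recent.append(classify(list(self.frames)))  # model call on window
        # Emit only when every recent window agrees on the same sign.
        if len(self.recent) == self.votes and len(set(self.recent)) == 1:
            return self.recent[-1]
        return None

def smart_match(keywords, sentences):
    """Server-side matcher: pick the valid sentence that covers the most
    recognized keywords (toy scoring; a real database would be far larger)."""
    def score(sentence):
        words = sentence.lower().split()
        return sum(k.lower() in words for k in keywords)
    return max(sentences, key=score)

sentences = ["Boss send project report", "Boss make tea", "Send apple home"]
print(smart_match(["Boss", "send", "word"], sentences))
# -> Boss send project report
```

The two stages are complementary: the window suppresses transition noise frame-side, and the matcher repairs whatever noise still slips through sentence-side.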

06 / Demo Video

Back to Projects