I was recently doing some development for an upcoming Building Jarvis article which involved sending audio captured with my laptop microphone to a 3rd-party API. I would send spoken audio to the service, and it would send back a text transcription. Cool stuff. But then, in one of my tests, I accidentally sent some audio where I hadn’t said anything. Surprisingly, I received a response back with transcribed text. The service had heard my TV, which was playing Star Trek from across the room at my sound system’s lowest volume setting, transcribed the episode’s dialog, and sent it back to me. Now keep in mind that I hadn’t just turned on the TV. I had been watching this show in the background throughout my testing. Up until this point, none of the transcriptions I received from the service had come back with TV dialog included. This leads me to believe that an algorithm was being applied to differentiate my voice from the background audio based on relative volume. Which is neat, but also troubling, because the background audio is still in there even if it doesn’t interfere with the task at hand. And what might that background audio be used for?
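
I don’t know what the service is actually doing, and whatever it is, it’s almost certainly more sophisticated than this. But the simplest version of volume-based filtering is an energy gate: measure the loudness of each short audio frame and keep only the frames above a threshold, on the theory that the near-field speaker is louder than the TV across the room. A minimal Python sketch, assuming 16 kHz mono samples as floats in [-1.0, 1.0] and a made-up threshold:

```python
import numpy as np

def energy_gate(samples: np.ndarray, sample_rate: int = 16000,
                frame_ms: int = 30, threshold_db: float = -35.0) -> np.ndarray:
    """Keep only frames whose RMS level clears a loudness threshold.

    A crude stand-in for whatever voice isolation the real service
    uses; the threshold value here is a guess, not a known setting.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    kept = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2))
        level_db = 20 * np.log10(rms + 1e-12)  # +1e-12 avoids log(0)
        if level_db > threshold_db:
            kept.append(frame)
    return np.concatenate(kept) if kept else np.array([], dtype=samples.dtype)
```

Note that a gate like this only decides which frames get used. The quieter background audio still went over the wire, which is exactly the troubling part.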

I was watching a Twitch stream recently where the streamer announced that they would be muting themselves while they typed in their password. I thought to myself, “That’s strange. I wonder why they said that.” Well, it turns out that AI can figure out your passwords based on keyboard acoustics alone. And while this has long been possible, AI is increasing the accuracy. Based on a paper from researchers at Durham University, University of Surrey, and Royal Holloway University of London:

When trained on keystrokes recorded by a nearby phone, the classifier achieved an accuracy of 95%, the highest accuracy seen without the use of a language model. When trained on keystrokes recorded using the video-conferencing software Zoom, an accuracy of 93% was achieved.

Yikes! No typing passwords near microphones or while on Zoom.
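
To get a feel for how an attack like this works mechanically, here’s a heavily simplified sketch. The researchers’ pipeline is far more capable (isolated keystrokes turned into spectrogram images and fed to a deep learning classifier); this toy version swaps that for mean MFCC feature vectors and a nearest-neighbor model, and the clip file names and labels are entirely hypothetical:

```python
import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier

def keystroke_features(path: str) -> np.ndarray:
    """Summarize one recorded keystroke as a mean MFCC vector."""
    audio, sr = librosa.load(path, sr=16000, mono=True)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20)
    return mfcc.mean(axis=1)  # one 20-dim vector per keystroke

# Hypothetical training data: one short clip per labeled keypress.
clips = [("press_a_01.wav", "a"), ("press_a_02.wav", "a"),
         ("press_s_01.wav", "s"), ("press_s_02.wav", "s")]

X = np.stack([keystroke_features(path) for path, _ in clips])
y = [label for _, label in clips]

model = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(model.predict([keystroke_features("unknown_press.wav")]))
```

The point isn’t the specific model; it’s that each key produces a repeatable acoustic signature, and a classifier only needs enough labeled examples to match new keystrokes against it.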

Next time you type in one of your passwords, listen to how it sounds. The keys on your keyboard do not have uniform acoustics. The spacebar sounds different from the backspace key. The home row keys sound different from the number keys. Shift keys sound different. Which side of the keyboard you are typing on sounds different depending on where the listener is. And that’s not all: depending on where the specific letters or numbers sit on the keyboard, you likely press them at different rates, with home-row sequences being faster than keys that are more of a stretch.
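
So the timing leaks too, not just the tone of each key. As a sketch of what an eavesdropper could extract, here are a few lines using librosa’s onset detector to pull keystroke timings out of a recording; the file name is hypothetical:

```python
import numpy as np
import librosa

# Hypothetical recording of someone typing a password.
audio, sr = librosa.load("typing.wav", sr=16000, mono=True)

# Each keypress is a sharp transient, so onset detection finds them.
onset_times = librosa.onset.onset_detect(y=audio, sr=sr, units="time")

# The gaps between keystrokes form a timing fingerprint: home-row
# sequences tend to be quicker than awkward stretches.
intervals = np.diff(onset_times)
print("keystrokes detected:", len(onset_times))
print("inter-key intervals (s):", np.round(intervals, 3))
```

Those intervals, combined with the per-key acoustics above, narrow down what was typed considerably.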

As I pay attention to the sound of my own passwords, I’ve noticed that each one definitely sounds different. Each one has its own little musical rhythm announcing what it is. I’m sure, with a little practice, I could identify which site I’m logging into just by how it sounds. So it’s not a stretch of the imagination that AI can do the same. Keep in mind, though, that these issues are not limited to keyboards. Acoustic attacks can also listen to your finger sliding around the screen of your smartphone.

File this under reasons why always-on microphones may be an issue. Be safe.

