Paper - Privacy-Preserving Speech Processing via STPC and TEEs

  • bayerlPrivacyPreservingSpeechProcessing
    • author: Sebastian P Bayerl, Ferdinand Brasser, Christoph Busch, Tommaso Frassetto, Patrick Jauernig, Jascha Kolberg, Andreas Nautsch, Korbinian Riedhammer, Ahmad-Reza Sadeghi, Thomas Schneider, Emmanuel Stapf, Amos Treiber, Christian Weinert
    • title: Privacy-Preserving Speech Processing via STPC and TEEs
    • year: 2020

  • Blog Article:
    • What tech companies don't tell you: Siri and Alexa can work without invading your privacy.

    • "Hey Google, how's the weather today?".
    • These are the first words I mumble each morning, directed towards a speech recognition model trained by Google. As weird as that sounds, I am probably not alone here. According to statista.com, 3.25 billion devices with built-in digital voice assistants were in use worldwide as of 2019. Voice interfaces seem to have become "the next big thing".
    • However, by using these services we are giving tech companies unrestrained access to our speech data. While this thought alone makes some people uneasy, it also creates serious security risks in the event of a data leak. Attackers can use stolen voice data to obtain sensitive information about you and even perform impersonation attacks.
    • On top of that, banks like HSBC provide services that use a person's voice to confirm their identity. This makes data leaks even more critical. Replacing a stolen password is easy, but what about replacing your voice?
    • It seems then that if we are concerned about our data security, we have to wave goodbye to Alexa and friends.
    • Well, not quite! You see, with some clever engineering one can design systems that do not expose your sensitive data, while still providing you with the voice-activated weather forecasts you so desperately need.
    • In this article we will go over some of the technologies that can protect your speech data while using voice assistants. The concepts presented largely stem from the paper "Privacy-Preserving Speech Processing via STPC and TEEs" published by Bayerl et al. in 2019.
    • Let’s first go over two basic concepts that will recur in the following paragraphs:
    • Models. Modern speech processing makes heavy use of machine learning models. A model describes an algorithm that can learn to identify patterns in data. In our case, they are trained with speech data provided by real humans. Afterward, they can recognize words from previously unseen voice samples. The finished models are most commonly located on the servers of service providers (Apple, Google, etc.), where speech processing happens.
    • Encryption. Simply put, encryption allows you to scramble data using a passphrase or "key". Doing so makes the data look like nonsense. The scrambled data can then only be "decrypted" and read by someone possessing the matching decryption key. This key can either be the same one used to encrypt the message (symmetric encryption) or a separate, associated one (asymmetric encryption). In the latter case, the encryption key is made public, while the one for decryption is kept private. In that setup, anyone can encrypt a message, but only the intended recipient can read it. The separation into a "private" and a "public" key also allows for digital signing. Here, the private key is used to generate a signature for a given message, which can then be verified using the public key. The signature proves the identity of the sender and that the message was not tampered with. Let’s see what we can do with these concepts!
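    • To make these concepts a bit more tangible, here is a minimal sketch in Python (my own illustration, not taken from the paper), using the widely available `cryptography` package. It shows symmetric encryption, asymmetric encryption, and digital signing in a handful of lines:

```python
# pip install cryptography
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

# Symmetric encryption: the same key encrypts and decrypts.
secret_key = Fernet.generate_key()
box = Fernet(secret_key)
token = box.encrypt(b"hey google, how's the weather today?")
assert box.decrypt(token) == b"hey google, how's the weather today?"

# Asymmetric encryption: the public key encrypts, only the private key decrypts.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()
oaep = padding.OAEP(mgf=padding.MGF1(hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
sealed = public_key.encrypt(b"only the key owner can read this", oaep)
assert private_key.decrypt(sealed, oaep) == b"only the key owner can read this"

# Digital signature: the private key signs, anyone with the public key verifies.
pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                  salt_length=padding.PSS.MAX_LENGTH)
signature = private_key.sign(b"this really comes from me", pss, hashes.SHA256())
public_key.verify(signature, b"this really comes from me", pss, hashes.SHA256())
```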
    • Homomorphic Encryption

    • A first idea for privacy preservation would be to simply encrypt your speech data before transmitting it to the company. "But if encryption makes data unreadable without the key, how can the model gain any information from it?", I hear you ask. Great question! This is where Homomorphic Encryption (HE) comes into play. HE is a special kind of encryption that allows you to perform mathematical operations directly on encrypted data. When you later decrypt the result, it looks as if the operations had been performed on the plaintext. One can then transfer this concept to the calculations performed by speech processing models. And just like that, voice recognition that fully protects your privacy is made possible! …Unfortunately, it is not that easy.
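    • To give you a feel for what "computing on encrypted data" means, below is a tiny, deliberately insecure sketch of the Paillier cryptosystem (my own choice of scheme for illustration; the prototype mentioned next used different and far heavier machinery). Multiplying two ciphertexts yields an encryption of the sum of the hidden plaintexts, exactly the kind of arithmetic a model's computations are built from:

```python
import math
import random

# Toy Paillier cryptosystem: additively homomorphic encryption.
# Tiny primes for readability only -- NOT secure; real keys use ~2048-bit moduli.
p, q = 61, 53
n = p * q                                            # public modulus
n_sq = n * n
g = n + 1                                            # standard choice of generator
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)    # lcm(p-1, q-1), private

def L(x: int) -> int:
    return (x - 1) // n

mu = pow(L(pow(g, lam, n_sq)), -1, n)                # private decryption helper

def encrypt(m: int) -> int:
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c: int) -> int:
    return (L(pow(c, lam, n_sq)) * mu) % n

# The homomorphic property: multiplying ciphertexts adds the hidden plaintexts.
c1, c2 = encrypt(20), encrypt(22)
assert decrypt((c1 * c2) % n_sq) == 42               # computed without ever seeing 20 or 22
```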
    • A prototype implementation using HE needed more than three hours to analyze a single second of audio (mentioned here). So while this approach works in theory, it would end up transforming your voice assistant into more of a "pen pal assistant".
    • It seems like we have to think of something more sophisticated. Luckily, researchers found another way to keep your voice data entirely private with the help of a technology called "Trusted Execution Environments". And this time, Alexa will respond before you've had time to rewatch the entire LotR trilogy.
    • Trusted Execution Environments

    • So, what exactly is a "Trusted Execution Environment" (TEE)?
    • In general, TEEs provide a secure environment on a computer in which software can be run. They do so by isolating applications from other programs via specialized hardware and by encrypting all critical operations. As a result, no other program can directly see or manipulate the computations performed in the TEE. (further reading regarding TEEs here).
    • What makes this interesting for our use case is the generation of a private asymmetric encryption key inside the TEE. No other software on the system can access this key. If we encrypt data with the corresponding public key before sending it to the secure environment, it can only be read and used **inside**. Not even the owner of the physical device can decipher our data!
    • Well, TEEs sound like just the right tool for protecting our privacy in interactions with our beloved voice assistants! Let's take a look at how we take this concept of secure environments and apply it to our problem of protecting speech data in practice.
    • VoiceGuard 

    • "VoiceGuard" (also discussed in this paper) is the result of researchers applying TEEs to the problem of privacy-preserving speech processing. It uses a variant of TEEs that is available on Intel processors, called "Intel SGX". This architecture refers to programs in isolation as "enclaves" and has one critical feature that will soon come in handy.
    • Note that in the case of Siri and Alexa, both the model and the processing server belong to the same company. VoiceGuard, however, allows the cloud server provider and the vendor owning the speech processing model to be separate parties: it protects not only the user's speech data from the server owner but also the vendor's model. And even if the server owner and the vendor are the same entity, your voice data remains protected.
    • VoiceGuard can be described as working in three distinct phases. We will carefully review each of them to give you a better feeling for how the trusted environment is used here.
    • In the following exchange, both the user and the vendor possess symmetric encryption keys, which they intend to share with the enclave.
    • The Preparation phase happens during the setup of your voice-powered app. The vendor encrypts his speech processing models symmetrically and saves them to the unsecured storage on the server (1). Afterward, he sends the code that will be run in the enclave (3) to a party trusted by the user (2). This party ensures that the code does not leak any private data of the user. Such a check is necessary because, while the user can be sure that his input (his voice) is encrypted, he cannot control how the enclave handles his data afterwards. For example, the trusted third party could verify that the output is exclusively sent back to the user via a secure channel.
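    • As a rough sketch of the vendor's side of step (1), the preparation could look like this (the file name, the model bytes, and the use of AES-GCM are illustrative assumptions on my part; the paper does not prescribe these details):

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# The vendor's long-term symmetric key; it is later provisioned to the enclave.
vendor_key = AESGCM.generate_key(bit_length=256)

def seal_model(model_bytes: bytes, path: str) -> None:
    """Encrypt the speech model and store it on the *untrusted* server storage."""
    nonce = os.urandom(12)
    ciphertext = AESGCM(vendor_key).encrypt(nonce, model_bytes, None)
    with open(path, "wb") as f:
        f.write(nonce + ciphertext)

# Hypothetical blob; in reality this would be the trained model's serialized weights.
seal_model(b"...serialized speech model weights...", "encrypted_model.bin")
```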
    • The Initialization phase is the most critical part of securing the privacy of the parties involved. First, the vendor's code is used to create the secure enclave. The TEE then creates an asymmetric key pair, as discussed a few paragraphs ago. The resulting public key needs to be shared (1) with both the user and the vendor so that they can safely send their respective symmetric keys to the enclave.
    • This key exchange is required so that both the user and vendor can communicate with the enclave in a secure manner.
    • Here, we are faced with a problem: how can we be sure that the public key we just received really comes from the enclave? An impersonator could have provided a different key and hijacked the communication. This is where an essential feature of Intel SGX comes into play: Remote Attestation. Intel builds a so-called "platform key", yet another asymmetric key, directly into each of its CPUs. In our case, this key is used to attach a digital signature to the enclave's public key. The user and vendor can then check the authenticity of the public key and its signature using Intel's provided infrastructure (2). Problem solved!
    • After verification, the user and the vendor encrypt their own symmetric keys and send them to the enclave (3). This point is important to understand: they encrypt their respective symmetric keys with the enclave's asymmetric public key, so only the enclave can recover them. Once the enclave knows both symmetric keys, our setup is complete.
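    • Real SGX remote attestation involves enclave measurements, quotes, and Intel's attestation service, but the core idea of the Initialization phase can be sketched like this (all names and the use of RSA, Ed25519, and AES-GCM here are illustrative assumptions of mine, not the paper's actual protocol). The vendor wraps his own symmetric key for the enclave in exactly the same way:

```python
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Inside the enclave: generate the key pair that never leaves the TEE (1).
enclave_priv = rsa.generate_private_key(public_exponent=65537, key_size=2048)
enclave_pub_pem = enclave_priv.public_key().public_bytes(
    serialization.Encoding.PEM, serialization.PublicFormat.SubjectPublicKeyInfo)

# The CPU's built-in platform key signs the enclave's public key (stand-in key here).
platform_key = Ed25519PrivateKey.generate()
attestation_sig = platform_key.sign(enclave_pub_pem)

# User side: verify the attestation (2), then wrap the symmetric key for the enclave (3).
platform_key.public_key().verify(attestation_sig, enclave_pub_pem)   # raises if forged
enclave_pub = serialization.load_pem_public_key(enclave_pub_pem)

user_key = AESGCM.generate_key(bit_length=256)       # the user's symmetric key
oaep = padding.OAEP(mgf=padding.MGF1(hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
wrapped_user_key = enclave_pub.encrypt(user_key, oaep)

# Back inside the enclave: only the enclave's private key can unwrap the user's key.
assert enclave_priv.decrypt(wrapped_user_key, oaep) == user_key
```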
    • After the preparation is done, we can move on to **Operation**. Let's say you ask Siri to search "Cute Cat Images" for you. Your voice data is encrypted with your key and sent to the server (1). There, only the isolated code inside the enclave can access it, using the key you provided earlier. The enclave then uses the vendor's key to load the encrypted speech processing model from storage (2). Seconds later, the model in the enclave translates your voice into the string "cute cat images" (3), without anyone at Apple knowing what you just said. The enclave now encrypts the result and sends it back to you (4). Finally, your phone puts it into your favorite search engine and your day is delighted by hundreds of adorable kittens.
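    • Put together, the Operation phase boils down to a few symmetric encrypt/decrypt steps around the model, with the actual speech recognition replaced by a stub. This is a self-contained sketch under the same illustrative assumptions as above (the keys stand in for the ones exchanged during Initialization, and the "model" is a placeholder byte string):

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Stand-ins for the keys exchanged in the earlier phases (see sketches above).
user_key = AESGCM.generate_key(bit_length=256)
vendor_key = AESGCM.generate_key(bit_length=256)

def gcm_encrypt(key: bytes, data: bytes) -> bytes:
    nonce = os.urandom(12)
    return nonce + AESGCM(key).encrypt(nonce, data, None)

def gcm_decrypt(key: bytes, blob: bytes) -> bytes:
    return AESGCM(key).decrypt(blob[:12], blob[12:], None)

# Sealed model sitting on untrusted storage, as produced in the Preparation phase.
sealed_model = gcm_encrypt(vendor_key, b"...serialized speech model weights...")

# (1) On the phone: encrypt the recorded audio with the user's key and send it off.
request = gcm_encrypt(user_key, b"...raw audio samples...")

# --- Inside the enclave ---
audio = gcm_decrypt(user_key, request)            # only the enclave holds user_key
model = gcm_decrypt(vendor_key, sealed_model)     # (2) unseal the model with the vendor's key
transcript = b"cute cat images"                   # (3) stand-in for actually running the model
response = gcm_encrypt(user_key, transcript)      # (4) send the result back, encrypted

# Back on the phone: decrypt the transcript and feed it to your search engine.
assert gcm_decrypt(user_key, response) == b"cute cat images"
```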
    • Just like that, we have created a privacy-preserving voice assistant! (for real this time.)
    • Performance-wise, this approach is significantly better than HE: VoiceGuard only increases the computation time of the models by a factor of roughly 1.5 to 2. While not perfect, the results show that privacy-preserving speech processing on untrusted servers is achievable in real time.
    • However, one question remains that I want to take a look at in the next section.
    • Offline Model Guard

    • "If the main risk factor is exposing your speech data to the company's server, why not just process the voice data offline on the user's device?"
    • Interesting question! For one, the models often require a lot of processing power. Mobile phones or voice assistants might therefore be unsuitable platforms for evaluating the voice data. But even when processing power is not a problem, the model itself can be considered sensitive data: the trained model is the intellectual property of the company providing the service and should, therefore, not be leaked. Furthermore, information about the training data can be extracted with so-called "membership inference attacks". These endanger the privacy of the people whose voices were used to train the model!
    • But wait, didn't we just discuss a technology that could allow computation on a system without exposing sensitive data (like the model) to the device owner? How about just putting the TEE onto the user's phone? This is the exact principle behind what the researchers coined "Offline Model Guard" (OMG).
    • OMG, at its core, is an adaptation of VoiceGuard with a few tweaks. As it runs on the user's phone, only the vendor needs to perform a key exchange. Additionally, smartphones are equipped with ARM processors, so instead of Intel SGX, OMG relies on Sanctuary (based on ARM TrustZone) for its TEE. Sanctuary comes with two cool features: for one, it can connect directly to the device's microphone, so users can feed voice data straight into the secure environment. Furthermore, unlike Intel SGX, computation in Sanctuary adds virtually no performance penalty. So even with OMG's added security, you don't have to wait for your voice assistant's response any longer than usual!
    • So if your phone can handle these models, OMG is a great alternative to VoiceGuard and another interesting use of trusted execution environments.
    • Wrapping things up

    • Today we saw how TEEs can make Siri and Alexa respect your privacy. Of course, this article was only a brief introduction to the world of privacy-preserving speech processing. If you are now interested in the technical side of things, I suggest you go and give the paper a read. There, the authors discuss another technology used for secure speaker verification, called "Secure Two-Party Computation" (STPC). While we didn't have time to go over it here, it presents a very different take on protecting your privacy. Instead of using a single server with a TEE, they simply fix the problem with two servers and some crazy mathematics.
    • But even if you are not the type for technical details, you should now have some material for your upcoming data privacy debates. Just lie low and wait for the next time you hear, "I know my privacy is important, but I just couldn't live without Alexa". Because once that moment comes, you can chime in and ~~annoy~~ amaze everyone with your vast knowledge of trusted execution environments. Thanks a lot for reading!