What is Voice to Speech?
Voice to Speech is a client-side, in-browser service that converts your voice into synthesized speech in real time using speech recognition, with optional language translation.
It's not working!
Ensure that you have allowed microphone access for this website in your browser. If nothing is working, your browser may not be supported. Please try a known supported browser, such as Chrome, Edge, or Safari.
I can't hear anything!
Make sure you have selected the correct Output Device, and make sure that it is being routed properly. Ensure the volume is turned on in both the website and in the device itself. If you are still having issues, it could be that your selected Output Voice is not currently working. If so, try selecting a different voice.
What kinds of voices are there?
Currently, there are 6 sets of voices:
- Voice Set A [Default]: A default set of voices.
- Voice Set B [Special]: This set of voices supports pitch and rate adjustment.
- Voice Set C [StreamElements]: Voices from StreamElements.
- Voice Set D [WaveNet]: Voices generated using the WaveNet model.
- Voice Set E [TikTok]: Voices from TikTok, including some singing voices for your amusement.
- Built-in Voices: Voices built into your browser and/or computer, with support for pitch and rate adjustment. May include custom voices installed on your computer, if they are supported.
How can I adjust the pitch and rate of a voice?
Currently, only voices under Voice Set B and Built-in Voices support pitch and rate customization.
How do I change which microphone is used for speech recognition?
To change which microphone is used for speech recognition, look in your browser window for an option to select your microphone. Most commonly, you will find this option in a button that will show up in your address bar. Look for an icon that looks like a microphone or video camera.
How can I route this website's audio output to another program as a microphone input?
To route audio from this website to another program, you will need a virtual audio router. You can download one for free here: VB-CABLE Virtual Audio Device. Once you have one installed, reload this page and select the new device as the output device from the Settings menu. You can then use the corresponding microphone from other programs.
If you are still not getting audio through the right device, you may have to instead route the browser's audio from your computer's audio settings. On Windows, this can be found under Settings > Sound Settings > App volume and device preferences. From there, set your browser's Output device to your preferred device.
Please note that Built-in Voices cannot directly use the selected audio device option, and will need the above method to route your browser's audio.
How can I connect the output text to the VRChat chatbox?
Download and run OSCChatbox, then go to Socket Setup in the Settings menu and make sure Socket is enabled. For the default settings, set the Address to [localhost] and the Port to [3000]. The program will then automatically route the output text from this website to the VRChat chatbox.
If you would like to make your own program to perform this connection, you will need to listen to the socket output and send OSC messages to the VRChat chatbox. For more information on how to set up the socket, see the "How do I connect to the socket from another program?" section.
For documentation on VRChat's OSC system, take a look at the OSC Overview. For information on the chatbox input specifically, visit OSC as Input Controller.
What is Low-Latency mode?
Low-Latency mode will try to speak as soon as you pause while talking. To adjust how quickly it speaks, set the Latency in milliseconds to any value between [1] and [2000] (a maximum of 2 seconds). Increase this time if you feel you need more time to finish talking. If Low-Latency mode is disabled, speech recognition will wait until you have completely finished talking before speaking.
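Conceptually, Low-Latency mode acts as a pause detector: an interim result is spoken once no new recognition input has arrived for the configured latency. The sketch below illustrates that idea only; it passes timestamps in explicitly rather than using timers, and all names are illustrative:

```javascript
// Sketch of the idea behind Low-Latency mode: speak an interim result once
// no new input has arrived for `latencyMs`. Timestamps are passed in
// explicitly to keep the logic easy to follow; a real implementation
// would use timers. All names are illustrative.
function createPauseDetector(latencyMs, speak) {
  let pending = null;   // interim text not yet spoken
  let lastInputMs = 0;  // time of the most recent recognition result

  return {
    // Called whenever speech recognition produces an interim result.
    onInterim(text, nowMs) {
      pending = text;
      lastInputMs = nowMs;
    },
    // Called periodically; speaks once the pause exceeds the latency.
    tick(nowMs) {
      if (pending !== null && nowMs - lastInputMs >= latencyMs) {
        speak(pending);
        pending = null;
      }
    },
  };
}
```

Raising the latency simply widens the silent gap required before `speak` fires, which is why a higher value gives you more time to finish a sentence.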
What are Replacements?
Replacements are a way to replace words or phrases with different text. If your input speech contains a given Phrase, it will be swapped with the corresponding Replacement text. If you have translation enabled, replacements are applied before translation. Replacements are applied one by one, from top to bottom, in priority order. Keep this in mind if you have replacements that may conflict with each other.
Some things you can use this for include:
- Replacing text with punctuation
- Replacing text with emojis or other special symbols
- Correcting non-standard names that are being detected incorrectly as another word or words
- Sending an empty message through socket, which can be useful for things like clearing out a connected VRChat chatbox
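The top-to-bottom priority described above is equivalent to running each rule over the text in sequence, so an earlier rule can change what a later rule sees. A minimal sketch (the function and field names are illustrative, not the site's actual code):

```javascript
// Sketch of how ordered Replacements behave: each { phrase, replacement }
// rule is applied to the text one at a time, top to bottom, so earlier
// rules affect what later rules see. Names are illustrative.
function applyReplacements(text, rules) {
  for (const { phrase, replacement } of rules) {
    // Replace every occurrence of the phrase, not just the first.
    text = text.split(phrase).join(replacement);
  }
  return text;
}
```

For example, with the rules `a → b` followed by `bb → x`, the input "ab" first becomes "bb" and then "x", which is the kind of interaction to watch for when rules conflict.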
How can I set up automatic translations?
There are a few options to customize how translations are handled. If you enable Translation from the Settings menu, your speech will be automatically translated to the specified Output Language before being spoken.
If you would like to use a different language than the default for an Output Voice, you can click the button above the output language to disable Sync to Voice.
If you would like the website to directly speak your untranslated input speech instead of the translated text, turn on Speak Input. The translated text will still be generated and logged in the transcript and emitted through the socket.
How do I connect to the socket from another program?
To enable the socket connection, go to Socket Setup in the Settings menu and make sure Socket is enabled. For the default settings, set the Address to [localhost] and the Port to [3000]. You can then hook the output from this website into other services by implementing a socket listener in your choice of programming language.
The following events are received:
[text]
- [0] text (str): The text to speak
The following events are emitted:
[status]
- [0] status (str): Current speech recognition status
[speech]
- [0] speech (str): The text that was spoken
- [1] untranslatedSpeech (str): The spoken text before translation
- [2] translatedSpeech (str): The spoken text after translation
- [3] inputLang (str): The input language code
- [4] outputLang (str): The output language code
- [5] translateEnabled (bool): Whether translation is enabled
- [6] lowlatencyEnabled (bool): Whether Low-Latency mode is enabled
- [7] ttsEnabled (bool): Whether the text came from Text-to-Speech input instead of voice recognition
- [8] interimAddition (bool): Whether this text is being appended to a previous Low-Latency result
- [9] padSpacing (bool): If this is a Low-Latency interim addition, whether this language may need a space padded before connecting to previous interim text
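However your socket library delivers these events, the speech event's positional arguments map onto named fields in the order listed above. A small helper like the following (the function name, and the assumption that your listener receives the arguments as an array, are both illustrative) keeps the indices out of the rest of your code:

```javascript
// Sketch: turn the positional arguments of a `speech` event into a named
// object, following the field order documented above. How the arguments
// arrive depends on your socket library; this assumes an array.
function parseSpeechEvent(args) {
  const [
    speech, untranslatedSpeech, translatedSpeech,
    inputLang, outputLang,
    translateEnabled, lowlatencyEnabled, ttsEnabled,
    interimAddition, padSpacing,
  ] = args;
  return {
    speech, untranslatedSpeech, translatedSpeech,
    inputLang, outputLang,
    translateEnabled, lowlatencyEnabled, ttsEnabled,
    interimAddition, padSpacing,
  };
}
```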
I have a suggestion or bug to report!
Please open a new issue on GitHub.