Introduction
This document helps developers implement Speech-to-Text (STT) / Automatic Speech Recognition (ASR) in their applications. The API is platform-agnostic, which means any device capable of recording speech can use it.
The API is organized around WebSocket. All requests must be sent over a secure WebSocket (wss://) connection. All responses, including errors, are delivered in JSON format.
Speech-to-Text API
The Speech-to-Text API accurately converts speech into text, powered by Floatbot's AI technology. The solution transcribes speech in real time.
The solution is fully managed and continually trained. It leverages machine learning to combine knowledge of grammar, language structure, and the composition of audio and voice signals to accurately transcribe speech.
Prerequisites
The prerequisites to set up and use the STT API are:
| Setting | Value |
| --- | --- |
| WSS Port | Default |
| Speech Recognition Type | Continuous Recognition |
| No. of Channels | 1 |
Key features
The Speech-to-Text solution offers robust features that help you deliver a better user experience in your products through voice commands:
- Accurate & Live Streaming Transcription - Receive real-time speech recognition results as the API processes audio input from the application's microphone. It can decode speech with high accuracy and confidence, even from lower-quality audio input.
- Personalized Speech Model - Tailor speech recognition to transcribe domain-specific terms and boost transcription accuracy for specific words or phrases.
- Noise Resistance - The solution decodes moderately noisy audio recorded in various environments without requiring additional noise cancellation.
- Content Filtering - The obscenity filter detects inappropriate or unprofessional content in your audio data and filters out profane words from the text output.
- Flexible Deployment - The API is platform-agnostic and supports both deployment models: cloud-based and on-premises.
Getting the BOT ID and API key
All connections to the Streaming STT API start as a WebSocket request. On successful authorization, the client can start sending binary WebSocket messages containing audio data in WAV format, provided "auto_start_speech" in the connection string is set to 1.
If "auto_start_speech" is not defined or is set to 0 in the connection string, the client must send "startSpeechData" as a JSON object; in response, the API returns a status of ready.
As speech is detected, the API returns the recognized speech content as text.
The "bot_id" query parameter identifies the API request. The API resolves the default settings, such as transcription language and model, from the "bot_id" and "api_key" of each customer account.
Open the link below and enter the details:

After signing up, verify your email ID with us to avoid any issues later.

Note: On receiving your request, the "bot_id" and "api_key" will be sent to your email ID so you can test the API.
Authentication
The Streaming STT API uses the bot ID and API key to authenticate requests. If the API key is invalid or the query parameter is not present, the WebSocket connection is closed.
Key notes on API details and architecture
The process to transcribe continuous audio input (a client sketch follows this list):
- Open the connection to the STT service by defining "bot_id", "api_key", "language", and "model".
- In the API response, if cause = "ready", the connection is successfully established.
- Write the speech data into the upstream and continuously receive the transcribed data. (Note: In the response, if final = false, the audio is partially transcribed and the service is still processing the input data.)
- Write --EOF-- into the upstream to stop the recognition process. (Note: If you fail to write --EOF-- into the upstream, the STT service will automatically terminate the recognition process.)
- In the API response, if final = true, the text received is the final transcript.
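The minimal Python sketch below walks through this flow end to end, assuming the third-party websockets package (pip install websockets). The placeholder credentials, the GENERAL model, the 16000 Hz sample rate, the local file speech.wav, and the choice to send --EOF-- as a text frame are illustrative assumptions, not confirmed details of the service.

import asyncio
import json

import websockets  # third-party: pip install websockets

URI = (
    "wss://us.floatbot.ai/speech/streaming"
    "?language=en-US&bot_id=<bot_id>&model=GENERAL&api_key=<api_key>"
)

async def transcribe(path: str) -> None:
    async with websockets.connect(URI) as ws:
        # auto_start_speech was not set, so announce the audio stream explicitly.
        await ws.send(json.dumps({"sockType": "startSpeechData", "sampleRate": 16000}))

        # Consume messages (e.g. the login confirmation) until the service is ready.
        while json.loads(await ws.recv()).get("cause") != "ready":
            pass

        # Stream the audio as binary WebSocket messages, then signal end of input.
        with open(path, "rb") as audio:
            while chunk := audio.read(4096):
                await ws.send(chunk)
        await ws.send("--EOF--")  # assumption: the marker is sent as a text frame

        # Read partial transcripts until final = true arrives.
        while True:
            msg = json.loads(await ws.recv())
            print("final" if msg.get("final") else "partial", msg.get("text"))
            if msg.get("final"):
                break

asyncio.run(transcribe("speech.wav"))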
Sample request and response
Initiating Speech Service:
wss://us.floatbot.ai/speech/streaming?language=en-US&bot_id=xxxxxxxxxxxxxxxxxxxxxxxxx&model=xxxxxxxx&api_key=xxxxxxxxxxxxx xxxxxxxxxxxxx
Note: 1) Floatbot will provide the "bot_id" and "api_key".
2) The model name is case-sensitive, so type it in capital letters only.
On successfully validating and establishing the connection:
{
  "sender": "bot",
  "cause": "login"
}
Send the JSON object below before streaming audio data (if "auto_start_speech"=1 was not passed in the connection string):
{
  "sockType": "startSpeechData",
  "sampleRate": 44100
}
The client should send the audio buffer after receiving the response below:
{
  "request_id": "07387dde-da88-4874-8ad4-6fe6eade7a23",
  "success": true,
  "final": false,
  "cause": "ready"
}
Partial utterance - in the middle of an utterance:
{
  "request_id": "07387dde-da88-4874-8ad4-6fe6eade7a23",
  "final": false,
  "text": "hello",
  "cause": "partial"
}
The Final Successful Response:
{
  "request_id": "07387dde-da88-4874-8ad4-6fe6eade7a23",
  "final": true,
  "text": "hello",
  "cause": "EOF received"
}
Error response: The user will get the error below when sending an invalid "bot_id" or "api_key":
{
  "success": false,
  "cause": "Authentication failure: Invalid credentials."
}
The user will get the error below for an unsupported language:
{
  "success": false,
  "cause": "Unsupported language"
}
The user will get the error below for an unsupported model:
{
  "success": false,
  "cause": "Unsupported model"
}
API References
Request URL
wss://us.floatbot.ai/speech/streaming?language=<language>&bot_id=<bot_id>&model=<model>&api_key=<api_key>&auto_start_speech=<0 or 1>&sample_rate=<audio sample rate>
Attribute details
Query parameters
| Parameter | Type | Is Mandatory? | Description |
| --- | --- | --- | --- |
| bot_id | String | Yes | A unique bot_id that identifies the user and the default account settings. |
| api_key | String | Yes | A unique api_key provided by Floatbot to identify the user of the STT API. |
| language | String | Yes | Indicates the language in which the audio is spoken. For supported languages, click here. |
| model | String | No | Specifies the model to be used for transcribing the speech. For supported models, click here. |
| auto_start_speech | Integer | No | auto_start_speech = 1 starts speech recognition as soon as the connection is established; auto_start_speech = 0 (the default) requires the client to send "startSpeechData" after connecting. |
| sample_rate | Integer | No | Sample rate of the streamed audio. The default is 16000. |
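For convenience, the query string can be assembled programmatically. The sketch below is a minimal helper using Python's standard urllib.parse; the function name and default values are illustrative (the defaults mirror the table above).

from urllib.parse import urlencode

def build_stt_url(bot_id: str, api_key: str, language: str = "en-US",
                  model: str = "GENERAL", auto_start_speech: int = 0,
                  sample_rate: int = 16000) -> str:
    """Build the wss:// connection URL from the query parameters above."""
    params = {
        "language": language,
        "bot_id": bot_id,
        "model": model,  # case-sensitive: capital letters only
        "api_key": api_key,
        "auto_start_speech": auto_start_speech,
        "sample_rate": sample_rate,
    }
    return "wss://us.floatbot.ai/speech/streaming?" + urlencode(params)

# Example: build_stt_url("my_bot_id", "my_api_key", auto_start_speech=1)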
Request parameters
| Parameter | Type | Is Mandatory? | Description |
| --- | --- | --- | --- |
| Streaming audio | Binary | Yes | The audio streamed from the input device. |
Response parameters
| Parameter | Type | Description |
| --- | --- | --- |
| request_id | String | A unique identification number auto-assigned by the API to each request. |
| success | Boolean | Indicates the functional status of the API: true on success, false on failure. |
| final | Boolean | Reports whether the received output is partial (false) or final (true). |
| text | String | The streaming audio input converted into text in the requested language. |
| cause | String | Appears for both successful and failed requests and gives the status or the reason for failure. |
Handling Errors
The Streaming API raises exceptions for many reasons, such as a failed connection, invalid parameters, and authentication errors. Error responses include specific, human-readable messages so that users can react to errors more easily. In the WebSocket response, if success = false, the cause field displays the reason for the error. The domain and API key will be shared by the Floatbot team.
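As a minimal sketch of this convention, a client can branch on the success flag; the handle_message helper below is an illustrative name, not part of the Floatbot API.

import json

def handle_message(raw: str) -> str | None:
    """Return transcribed text, or None if the message reported an error."""
    msg = json.loads(raw)
    if msg.get("success") is False:
        # cause carries the human-readable reason, e.g.
        # "Authentication failure: Invalid credentials." or "Unsupported language".
        print("STT error:", msg.get("cause"))
        return None
    return msg.get("text")

Because the service also closes the WebSocket when credentials are invalid or missing, production code should additionally catch the connection-closed exception of its WebSocket library.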
Supported Languages
| Language | Language Code |
| --- | --- |
| English (United States) | en-US |
| Spanish (Mexico) | es-MX |
Supported Models
| Model Name | Description |
| --- | --- |
| INSURANCE | Insurance-related query audio is accurately transcribed using insurance-specific vocabulary. |
| BANKING | Accurately transcribes audio containing banking terminology. |
| GENERAL | Trained to continuously transcribe speech irrespective of industry type. |
Note: The model name is case-sensitive. Therefore, type it in capital letters only.