Introduction
This document helps developers implement Speech-to-Text (STT) / Automatic Speech Recognition (ASR) in their applications. The API is platform-agnostic, which means any device capable of recording speech can use it.
The API is organized around WebSocket. All requests must be sent over a secure WebSocket (wss://) connection. All responses, including errors, are delivered in JSON format.
Speech-to-Text API
The Speech-to-Text API accurately converts speech into text, powered by Floatbot's AI technology. The solution transcribes speech in real time.
The solution is fully managed and continually trained. It leverages machine learning to combine knowledge of grammar, language structure, and the composition of audio and voice signals to accurately transcribe speech.
Prerequisites
The prerequisites to set up and use the STT API are:
| Setting | Value |
| --- | --- |
| WSS Port | Default |
| Speech Recognition Type | Continuous Recognition |
| No. of Channels | 1 |
Key features
The Speech-to-Text solution offers robust features that help you deliver a better user experience in your products through voice commands:
- Accurate & Live Streaming Transcription - Receive real-time speech recognition results as the API processes audio input from the application's microphone. It can decode speech with high accuracy and confidence, even from lower-quality audio input.
- Personalized Speech Model - Tailor speech recognition to transcribe domain-specific terms and boost transcription accuracy for specific words or phrases.
- Noise Resistance - The solution decodes moderately noisy audio recorded in various environments without requiring additional noise cancellation.
- Content Filtering - The obscenity filter detects inappropriate or unprofessional content in your audio data and filters out profane words from the text output.
- Flexible Deployment - The API is platform-agnostic and supports both deployment models: cloud-based and on-premises.
Getting the BOT ID and API key
All connections to the Streaming STT API start as a WebSocket request. On successful authorization, the client can start sending binary WebSocket messages containing audio data in WAV format, provided "auto_start_speech" in the connection string is set to 1.
If "auto_start_speech" is not defined or is set to 0 in the connection string, the client must send "startSpeechData" as a JSON object; in response, the API returns a status of ready.
As speech is detected, the API returns the recognized speech content as text.
The "bot_id" query parameter identifies the API request. The API resolves the default settings, such as transcription language and model, from the "bot_id" and "api_key" of each customer account.
Open the link below and enter the details:

After signing up, verify your email ID with us to avoid any issues later.

Note: On receiving your request, the "bot_id" and "api_key" will be sent to your email ID so you can test the API.
Authentication
The Streaming STT API uses the bot ID and API key to authenticate requests. If the API key is invalid or the query parameter is not present, the WebSocket connection is closed.
Key notes on API details and architecture
The process to transcribe continuous audio input (a client sketch follows this list):
- Open the connection to the STT service by defining "bot_id", "api_key", "language", and "model".
- In the API response, if cause = "ready", the connection is successfully established.
- Write the speech data into the upstream and continuously receive the transcribed data. (Note: In the response, if final = false, the audio is partially transcribed and the service is still processing the input data.)
- Write --EOF-- into the upstream to stop the recognition process. (Note: If you fail to write --EOF-- into the upstream, the STT service will automatically terminate the recognition process.)
- In the API response, if final = true, the text received is the final transcript.
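The minimal Python sketch below walks through this flow end to end, assuming the third-party websockets package (pip install websockets). The placeholder credentials, the GENERAL model, the 16000 Hz sample rate, the local file speech.wav, and the choice to send --EOF-- as a text frame are illustrative assumptions, not confirmed details of the service.

import asyncio
import json

import websockets  # third-party: pip install websockets

URI = (
    "wss://us.floatbot.ai/speech/streaming"
    "?language=en-US&bot_id=<bot_id>&model=GENERAL&api_key=<api_key>"
)

async def transcribe(path: str) -> None:
    async with websockets.connect(URI) as ws:
        # auto_start_speech was not set, so announce the audio stream explicitly.
        await ws.send(json.dumps({"sockType": "startSpeechData", "sampleRate": 16000}))

        # Consume messages (e.g. the login confirmation) until the service is ready.
        while json.loads(await ws.recv()).get("cause") != "ready":
            pass

        # Stream the audio as binary WebSocket messages, then signal end of input.
        with open(path, "rb") as audio:
            while chunk := audio.read(4096):
                await ws.send(chunk)
        await ws.send("--EOF--")  # assumption: the marker is sent as a text frame

        # Read partial transcripts until final = true arrives.
        while True:
            msg = json.loads(await ws.recv())
            print("final" if msg.get("final") else "partial", msg.get("text"))
            if msg.get("final"):
                break

asyncio.run(transcribe("speech.wav"))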
Sample request and response
Initiating Speech Service:
wss://us.floatbot.ai/speech/streaming?language=en-US&bot_id=xxxxxxxxxxxxxxxxxxxxxxxxx&model=xxxxxxxx&api_key=xxxxxxxxxxxxx xxxxxxxxxxxxx
Note: 1) Floatbot will provide the "bot_id" and "api_key".
2) The model name is case-sensitive, so type it in capital letters only.
On successfully validating and establishing the connection:
{
  "sender": "bot",
  "cause": "login"
}
Send the JSON object below before streaming audio data (if "auto_start_speech"=1 was not passed in the connection string):
{
  "sockType": "startSpeechData",
  "sampleRate": 44100
}
The client should send the audio buffer after receiving the response below:
{
  "request_id": "07387dde-da88-4874-8ad4-6fe6eade7a23",
  "success": true,
  "final": false,
  "cause": "ready"
}
Partial utterance - in the middle of an utterance:
{
  "request_id": "07387dde-da88-4874-8ad4-6fe6eade7a23",
  "final": false,
  "text": "hello",
  "cause": "partial"
}
The Final Successful Response:
{
  "request_id": "07387dde-da88-4874-8ad4-6fe6eade7a23",
  "final": true,
  "text": "hello",
  "cause": "EOF received"
}
Error response: The user will get the error below when sending an invalid "bot_id" or "api_key":
{
  "success": false,
  "cause": "Authentication failure: Invalid credentials."
}
The user will get the error below for an unsupported language:
{
  "success": false,
  "cause": "Unsupported language"
}
The user will get the error below for an unsupported model:
{
  "success": false,
  "cause": "Unsupported model"
}
API References
Request URL
wss://us.floatbot.ai/speech/streaming?language=<language>&bot_id=<bot_id>&model=<model>&api_key=<api_key>&auto_start_speech=<0 or 1>&sample_rate=<audio sample rate>
Attribute details
Query parameters
| Parameter | Type | Is Mandatory? | Description |
| --- | --- | --- | --- |
| bot_id | String | Yes | A unique bot_id that identifies the user and the default account settings. |
| api_key | String | Yes | A unique api_key provided by Floatbot to identify the user of the STT API. |
| language | String | Yes | Indicates the language in which the audio is spoken. For supported languages, click here. |
| model | String | No | Specifies the model to be used for transcribing the speech. For supported models, click here. |
| auto_start_speech | Integer | No | auto_start_speech = 1 starts speech recognition as soon as the connection is established; auto_start_speech = 0 (the default) requires the client to send "startSpeechData" after connecting. |
| sample_rate | Integer | No | Sample rate of the streamed audio. The default is 16000. |
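For convenience, the query string can be assembled programmatically. The sketch below is a minimal helper using Python's standard urllib.parse; the function name and default values are illustrative (the defaults mirror the table above).

from urllib.parse import urlencode

def build_stt_url(bot_id: str, api_key: str, language: str = "en-US",
                  model: str = "GENERAL", auto_start_speech: int = 0,
                  sample_rate: int = 16000) -> str:
    """Build the wss:// connection URL from the query parameters above."""
    params = {
        "language": language,
        "bot_id": bot_id,
        "model": model,  # case-sensitive: capital letters only
        "api_key": api_key,
        "auto_start_speech": auto_start_speech,
        "sample_rate": sample_rate,
    }
    return "wss://us.floatbot.ai/speech/streaming?" + urlencode(params)

# Example: build_stt_url("my_bot_id", "my_api_key", auto_start_speech=1)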
Request parameters
| Parameter | Type | Is Mandatory? | Description |
| --- | --- | --- | --- |
| Streaming audio | Binary | Yes | The audio streamed from the input device. |
Response parameters
| Parameter | Type | Description |
| --- | --- | --- |
| request_id | String | A unique identification number auto-assigned by the API to each request. |
| success | Boolean | Indicates the functional status of the API: true on success, false on failure. |
| final | Boolean | Reports whether the received output is partial (false) or final (true). |
| text | String | The streaming audio input converted into text in the requested language. |
| cause | String | Appears for both successful and failed requests and gives the status or the reason for failure. |
Handling Errors
The Streaming API raises exceptions for many reasons, such as a failed connection, invalid parameters, and authentication errors. Error responses include specific, human-readable messages so that users can react to errors more easily. In the WebSocket response, if success = false, the cause field displays the reason for the error. The domain and API key will be shared by the Floatbot team.
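As a minimal sketch of this convention, a client can branch on the success flag; the handle_message helper below is an illustrative name, not part of the Floatbot API.

import json

def handle_message(raw: str) -> str | None:
    """Return transcribed text, or None if the message reported an error."""
    msg = json.loads(raw)
    if msg.get("success") is False:
        # cause carries the human-readable reason, e.g.
        # "Authentication failure: Invalid credentials." or "Unsupported language".
        print("STT error:", msg.get("cause"))
        return None
    return msg.get("text")

Because the service also closes the WebSocket when credentials are invalid or missing, production code should additionally catch the connection-closed exception of its WebSocket library.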
Supported Languages
| Language | Language Code |
| --- | --- |
| English (United States) | en-US |
| Spanish (Mexico) | es-MX |
Supported Models
| Model Name | Description |
| --- | --- |
| INSURANCE | Insurance-related query audio is accurately transcribed using insurance-specific vocabulary. |
| BANKING | Accurately transcribes audio containing banking terminology. |
| GENERAL | Trained to continuously transcribe speech irrespective of industry type. |
Note: The model name is case-sensitive. Therefore, type it in capital letters only.