This is a key feature of voice interaction systems.
Many online services offer this functionality today, but they are not suitable everywhere: they can raise security concerns (data leaks), add latency to the interaction (slow responses), and so on.
When I design systems that require voice interaction (automatic call processing, voice request processing, etc.), a frequent requirement is to use a local Speech-to-Text system.
This solution was written to address these problems.
The server is written in C and is based on wstk and the alphacephei framework with its models.
It runs on regular servers and responds fast enough for building real-time dialog systems.
This is a commercial product. If you are interested in purchasing it or have any questions,
please visit the contact page.
An evaluation period is available, with installation on your own servers (Ubuntu 22.04 x64 preferred).
Lets you keep your personal data safe
There are open models for various languages
There are tools for it
Lets you define (in the request) a dictionary of available words
Can generate a vector for speaker identification
You don't need to purchase or rent expensive hardware
There's a module for integration with FreeSWITCH (see mod_sivr_asr)
Lets you use asr-api from the dialplan or from scripts
Saves memory and improves performance by sharing models
This lets you integrate the STT service into various applications (see the examples below)
With various sample rates and channel counts.
Example #1 (simple request)
Request:
curl http://127.0.0.1:8801/v1/transcriptions -X POST -H "Authorization: Bearer secret" -H "Content-Type: multipart/form-data" -F language="en" -F smodel="small" -F file="@test.mp3"
Response (json):
{
"text" : "hello world"
}
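The same request can also be sent from application code. The following is a minimal Python sketch using only the standard library. It reuses the URL, token, and form fields (language, smodel, file) from the curl example above. The helper names build_multipart and transcribe are hypothetical, and the sketch assumes the server accepts a standard multipart/form-data body with a generated boundary and always returns a "text" field on success.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(fields, file_field, file_name, file_bytes):
    """Encode text fields plus one file as multipart/form-data.

    Returns a (body_bytes, content_type_header_value) tuple.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode("utf-8")
        )
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{file_field}"; '
            f'filename="{file_name}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode("utf-8")
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path, url="http://127.0.0.1:8801/v1/transcriptions",
               token="secret", language="en", smodel="small"):
    """POST an audio file to the server and return the recognized text."""
    with open(path, "rb") as fh:
        audio = fh.read()
    body, content_type = build_multipart(
        {"language": language, "smodel": smodel},
        "file",
        os.path.basename(path),
        audio,
    )
    request = urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": content_type,
        },
    )
    with urllib.request.urlopen(request) as response:
        # Assumption: a successful response is JSON with a "text" field,
        # as in the example response above.
        return json.loads(response.read().decode("utf-8"))["text"]
```

Calling transcribe("test.mp3") against a running server should return the recognized text as a string. Error handling (HTTP errors, timeouts) is left out to keep the sketch short.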
Example #2 (with speaker identification)
Request:
curl http://127.0.0.1:8801/v1/transcriptions -X POST -H "Authorization: Bearer secret" -H "Content-Type: multipart/form-data" -F language="en" -F smodel="small" -F vmodel="default" -F file="@test.mp3"
Response (json):
{
"spk" : [-0.644623, 1.023342, 2.575434, 0.623447, -0.602342, 1.0234234, -1.4824234, -0.021242, 0.824297, -0.152424, ... ],
"spk_frames" : 81,
"text" : "hello world"
}
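The "spk" vector can be compared with a vector stored for a known speaker. The sketch below uses cosine similarity, a common way to compare speaker vectors; it is an assumption here and not necessarily the comparison this product intends. The function names and the 0.7 threshold are hypothetical, and the threshold would need tuning on your own recordings.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


def same_speaker(a, b, threshold=0.7):
    """Guess whether two "spk" vectors belong to the same speaker.

    The threshold is an illustrative assumption, not a documented value.
    """
    return cosine_similarity(a, b) >= threshold
```

A typical use would be to store the "spk" vector from an enrollment recording and call same_speaker(stored_vector, response["spk"]) on later requests.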