Speech-to-Text server

Speech-to-text is a key feature of voice interaction systems.
There are many online services with this functionality, but they are not suitable everywhere: they can raise security concerns (data leaks), introduce delays in interaction (slow responses), and so on.
When I design systems that require voice interaction (automatic call processing, voice request processing, etc.), a frequent requirement is to use a local speech-to-text system.
This server was written to solve these problems.


The server is written in C and is based on wstk and the alphacephei framework and its models.
It runs on regular servers and responds fast enough to build real-time dialog systems.


This is a commercial product. If you are interested in purchasing it or have any questions, please visit the contact page.
An evaluation period is available, with installation on your own servers (Ubuntu 22.04 x64 preferred).



Basic features:


--- Examples ---

Example #1 (simple request)

Request:
curl http://127.0.0.1:8801/v1/transcriptions -X POST -H "Authorization: Bearer secret" -H "Content-Type: multipart/form-data" -F language="en" -F smodel="small" -F file="@test.mp3"

Response (json):
{
 "text" : "hello world"
}
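
The same request can also be sent from a script. The following is a minimal Python sketch using only the standard library, not an official client. The URL, token, and form fields (language, smodel, vmodel, file) are taken from the curl examples in this document; the function names build_multipart and transcribe are hypothetical and chosen only for illustration.

```python
import json
import os
import uuid
import urllib.request


def build_multipart(fields, file_field, file_name, file_bytes):
    """Encode text fields plus one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode("utf-8")
        )
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{file_field}"; '
            f'filename="{file_name}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode("utf-8")
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path,
               url="http://127.0.0.1:8801/v1/transcriptions",  # from the curl example
               token="secret",                                  # from the curl example
               language="en", smodel="small", vmodel=None):
    """POST an audio file to the server and return the decoded JSON response."""
    fields = {"language": language, "smodel": smodel}
    if vmodel is not None:
        fields["vmodel"] = vmodel  # only sent when speaker data is wanted
    with open(path, "rb") as f:
        body, content_type = build_multipart(
            fields, "file", os.path.basename(path), f.read())
    request = urllib.request.Request(
        url, data=body, method="POST",
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": content_type})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires a running server and a local test.mp3 file.
    print(transcribe("test.mp3")["text"])
```

Building the multipart body by hand keeps the sketch free of third-party dependencies; a library such as requests would shorten it considerably.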
        


Example #2 (with speaker identification)

Request:
curl http://127.0.0.1:8801/v1/transcriptions -X POST -H "Authorization: Bearer secret" -H "Content-Type: multipart/form-data" -F language="en" -F smodel="small" -F vmodel="default" -F file="@test.mp3"

Response (json):
{
 "spk" : [-0.644623, 1.023342, 2.575434, 0.623447, -0.602342, 1.0234234, -1.4824234, -0.021242, 0.824297, -0.152424, ... ],
 "spk_frames" : 81,
 "text" : "hello world"
}
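
This document does not describe how the "spk" vector should be used. A common way to compare speaker embeddings is cosine distance, so the following Python sketch assumes that "spk" is such an embedding and that vectors from two responses have the same length. The function name cosine_distance is hypothetical, and any decision threshold has to be tuned on your own recordings.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors.

    0.0 means the vectors point the same way, 2.0 means opposite directions.
    Assumption: smaller distance suggests the same speaker.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-length vector")
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy vectors for illustration only; real "spk" arrays come from the server.
    enrolled = [0.5, 1.0, -0.25]
    candidate = [0.4, 1.1, -0.2]
    print(cosine_distance(enrolled, candidate))
```

In practice, "enrolled" would be the "spk" array saved from an earlier response for a known speaker, and "candidate" the array from a new request.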