How it works

Process of transcribing audio

The transcription process is illustrated in the figure below.

[Figure: offline decoding pipeline]

The figure shows how the offline decoding system works. The audio input is processed through the following steps:

  • Step 1: Resample the audio file

    • The audio needs to be downmixed to a single mono channel and resampled to the sample rate that matches the trained model (see the resampling sketch after this list).

    • Tools used: Soxi/ffmpeg

  • Step 2: Detect the speech in the input

    • Speaker diarisation (or diarization) is the process of partitioning an input audio stream into homogeneous segments according to the speaker identity.

    • The output of this process is a segment file (.seg) listing, for each speech segment, the speaker ID and the start/end times (see the segment-parsing sketch after this list).

    • Tools used: LIUM 8.4.1

  • Step 3: Convert the audio to proper format (kaldi format)

    • For further processing by the Kaldi toolkit, the audio data and segment file are passed to Kaldi scripts, which convert them into the data format Kaldi expects (see the data-directory sketch after this list).

    • Tools used: Kaldi scripts

  • Step 4: Extract features from the input

    • Extract MFCC and i-vector features from the Kaldi-format data (see the feature-extraction sketch after this list).

    • Tools used: Kaldi scripts

  • Step 5: Decode/Generate the transcription

    • The features extracted in the previous step are passed to the Kaldi decoder, together with our trained model, to generate the transcription in CTM/STM format (see the decoding sketch after this list).

    • The transcription is also converted to other formats to support different user requests, such as TextGrid, CSV, and plain text.
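
To make Step 1 concrete, here is a minimal sketch of resampling from Python by shelling out to ffmpeg. The file names are placeholders, and the 16 kHz/16-bit/mono target matches the model requirements described at the end of this page; the actual pipeline may invoke sox or ffmpeg differently.

import subprocess

def resample_to_16k_mono(src, dst):
    """Convert any supported input to 16 kHz, 16-bit, mono WAV via ffmpeg."""
    subprocess.run(
        [
            "ffmpeg", "-y",          # overwrite the output if it exists
            "-i", src,               # input file (.mp3, .mp4, .wav, ...)
            "-ac", "1",              # downmix to a single (mono) channel
            "-ar", "16000",          # resample to 16 kHz
            "-acodec", "pcm_s16le",  # 16-bit signed PCM
            dst,
        ],
        check=True,
    )

resample_to_16k_mono("meeting.mp3", "meeting.wav")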
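
For Step 2, downstream code has to read the .seg file back. The sketch below assumes the LIUM segment layout, where each non-comment line carries the show name, channel, start and length in 10 ms frames, gender, band, environment, and speaker label; verify the field order against the actual LIUM 8.4.1 output.

from typing import List, NamedTuple

class Segment(NamedTuple):
    start: float   # seconds
    end: float     # seconds
    speaker: str

def parse_lium_seg(path):
    """Parse a LIUM .seg file into (start, end, speaker) segments.

    Assumed line layout (one segment per line):
      <show> <channel> <start-frame> <length> <gender> <band> <env> <speaker>
    with frames of 10 ms.
    """
    segments: List[Segment] = []
    with open(path) as f:
        for line in f:
            if not line.strip() or line.startswith(";;"):
                continue  # skip blank lines and LIUM comments
            fields = line.split()
            start = int(fields[2]) / 100.0        # frames -> seconds
            end = start + int(fields[3]) / 100.0
            segments.append(Segment(start, end, fields[7]))
    return segments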
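
For Step 3, in a standard Kaldi recipe the "proper format" is a data directory containing wav.scp, segments, and utt2spk files. A hedged sketch follows, reusing parse_lium_seg from above; the utterance-ID scheme here is an assumption, not necessarily what our scripts use.

import os

def write_kaldi_data_dir(data_dir, file_id, wav_path, segments):
    """Write a minimal Kaldi data directory: wav.scp, segments, utt2spk."""
    os.makedirs(data_dir, exist_ok=True)
    with open(os.path.join(data_dir, "wav.scp"), "w") as wav_scp, \
         open(os.path.join(data_dir, "segments"), "w") as seg_file, \
         open(os.path.join(data_dir, "utt2spk"), "w") as utt2spk:
        # wav.scp: <recording-id> <path-to-audio>
        wav_scp.write(f"{file_id} {wav_path}\n")
        for seg in segments:
            # Utterance-ID scheme (speaker_fileid_start_end) is illustrative.
            utt_id = (f"{seg.speaker}_{file_id}_"
                      f"{int(seg.start * 100):06d}_{int(seg.end * 100):06d}")
            # segments: <utt-id> <recording-id> <start-sec> <end-sec>
            seg_file.write(f"{utt_id} {file_id} {seg.start:.2f} {seg.end:.2f}\n")
            # utt2spk: <utt-id> <speaker-id>
            utt2spk.write(f"{utt_id} {seg.speaker}\n")

Kaldi also expects these files sorted by key; recipes normally run utils/fix_data_dir.sh on the directory afterwards.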
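
For Step 4, a typical Kaldi recipe extracts MFCCs with steps/make_mfcc.sh and online i-vectors with steps/online/nnet2/extract_ivectors_online.sh, run from an egs directory with path.sh sourced. The directory names below (conf/mfcc.conf, exp/extractor, exp/ivectors) are placeholders rather than this system's actual configuration.

import subprocess

def extract_features(data_dir):
    """Run Kaldi's standard MFCC and online i-vector extraction scripts."""
    # MFCC features; the config must match the one used to train the model
    subprocess.run(
        ["steps/make_mfcc.sh", "--nj", "4",
         "--mfcc-config", "conf/mfcc.conf",
         data_dir, "exp/make_mfcc", "mfcc"],
        check=True,
    )
    subprocess.run(["steps/compute_cmvn_stats.sh", data_dir], check=True)
    # Online i-vectors, computed with a pre-trained i-vector extractor
    subprocess.run(
        ["steps/online/nnet2/extract_ivectors_online.sh", "--nj", "4",
         data_dir, "exp/extractor", "exp/ivectors"],
        check=True,
    )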
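
For Step 5, one common way to decode with a trained Kaldi nnet3 model and dump a word-level CTM is sketched below. The graph, language, and decode directory names are placeholders, and the exact scripts our system wraps may differ.

import subprocess

def decode_to_ctm(data_dir, graph_dir, decode_dir):
    """Decode a Kaldi data directory and convert lattices to a CTM."""
    subprocess.run(
        ["steps/nnet3/decode.sh", "--nj", "4",
         "--online-ivector-dir", "exp/ivectors",
         graph_dir, data_dir, decode_dir],
        check=True,
    )
    # Turn the decoding lattices into a word-level CTM
    subprocess.run(
        ["steps/get_ctm.sh", data_dir, "data/lang", decode_dir],
        check=True,
    )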

The output files are written to a public folder, where users can apply further post-processing, such as converting to their required format or sending the output to other modules (language understanding, sentence-unit insertion, etc.).

The system processes input files sequentially. The ‘file_name’ of each input audio file is normalized into a ‘file_id’ (one plausible normalization is sketched after the directory tree below). The output folder has the following structure:


/path/to/the/output/folder/
.
├── <file-id-1>
│   ├── <file-id-1>.<model_name>.ctm
│   ├── <file-id-1>.<model_name>.srt
│   ├── <file-id-1>.<model_name>.stm
│   ├── <file-id-1>.<model_name>.TextGrid
│   └── <file-id-1>.<model_name>.txt
├── <file-id-2>
│   ├── <file-id-2>.<model_name>.ctm
│   ├── <file-id-2>.<model_name>.srt
│   ├── <file-id-2>.<model_name>.stm
│   ├── <file-id-2>.<model_name>.TextGrid
│   └── <file-id-2>.<model_name>.txt
└── <file-id-3> (e.g. 8khz-testfile)
    ├── 8khz-testfile.<model_name>.ctm
    ├── 8khz-testfile.<model_name>.srt
    ├── 8khz-testfile.<model_name>.stm
    ├── 8khz-testfile.<model_name>.TextGrid
    └── 8khz-testfile.<model_name>.txt

*Other file types may also be present.
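
The exact ‘file_name’ to ‘file_id’ normalization rule is not spelled out here; the sketch below shows one plausible scheme (dropping the extension, lowercasing, and replacing path-unfriendly characters), purely as an illustration.

import re

def normalize_file_id(file_name):
    """Illustrative only: derive a file_id from a file name.

    The real system's normalization rule may differ.
    """
    stem = file_name.rsplit(".", 1)[0]              # drop the extension
    return re.sub(r"[^a-z0-9]+", "-", stem.lower()).strip("-")

print(normalize_file_id("My Meeting (8kHz).wav"))   # -> "my-meeting-8khz"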

Information extraction from the output:

  • *.ctm: each word with its start and end time (see the parsing sketch below).

  • *.srt, *.stm, *.TextGrid: segments (sentences) with start and end times.

  • *.txt: the whole transcription.
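
Because the .ctm file carries per-word timing, it is a convenient input for custom post-processing. The sketch below assumes the standard NIST CTM column layout (<file> <channel> <start> <duration> <word> [confidence]); verify the column order against the actual output. The file name is a placeholder.

def parse_ctm(path):
    """Parse a CTM file into (word, start, end) tuples.

    Assumes standard CTM columns: file channel start duration word [conf].
    """
    words = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 5:
                continue
            start = float(fields[2])
            duration = float(fields[3])
            words.append((fields[4], start, start + duration))
    return words

for word, start, end in parse_ctm("8khz-testfile.model.ctm"):
    print(f"{start:7.2f} {end:7.2f}  {word}")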

File types and languages supported

Currently our offline system supports the following file types and language models:

Languages

  • Singapore Code Switch

  • Mandarin

  • Singapore English

Users can upload files up to 500 MB in size each time.

For all language models, we support the following file types: .mp3, .mp4, .wav

Regardless of the input audio format, our offline system will down/up-sample the audio to a 16 kHz sampling rate, 16-bit depth, and a mono channel before processing. The system performs best with 16 kHz, 16-bit audio from close-talk or telephony recordings with clear, clean speech.
