The transcription tool is based on the OpenAI Whisper model, a cutting-edge neural network architecture that is specifically designed for transcribing speech in a wide range of languages with exceptional accuracy. The model is trained on vast amounts of data, which enables it to capture even subtle nuances in speech patterns, accents, and intonations, resulting in highly accurate transcriptions.
In addition to providing accurate and efficient transcriptions, the tool also serves as an annotation tool for the creation of datasets. This is particularly important for less well-covered languages, where the availability of high-quality transcription data is often limited. By using the tool to create annotated datasets, researchers and language experts can help to improve transcription accuracy for these languages, as well as enhance the model’s understanding of their unique linguistic features and patterns.
Furthermore, the transcription of sound and video files enables their indexation by text content, making the content of the files easily searchable and accessible. This is particularly useful for investigators and analysts who need to quickly locate and analyse specific segments of audio or video content, as it saves them the time and effort of manually listening to or watching hours of footage.
Moreover, the tool is a REST server that can be accessed by other legacy systems, providing seamless integration between the transcription tool and other systems. This enables LEAs to increase the capacity of their existing systems by taking advantage of the powerful transcription capabilities of the OpenAI Whisper model on premises, without the need for keys or licences, and without transmitting confidential data to a third-party company.
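As an illustration of how a legacy system could submit a file to the service over HTTP, consider the sketch below. The host, port, route and field names are placeholders chosen for the example; the actual routes and parameters exposed by the REST server may differ and should be taken from the server configuration.

```python
# Minimal sketch of a legacy system submitting a file to the transcription
# REST service. The address, endpoint and field names are assumptions for
# illustration only; consult the server configuration for the real values.
import requests

SERVER = "http://transcription-server.local:8000"  # hypothetical address

with open("interview.mp3", "rb") as audio:
    response = requests.post(
        f"{SERVER}/transcribe",            # hypothetical endpoint
        files={"file": audio},
        data={"language": "auto", "diarisation": "true"},
        timeout=600,
    )

response.raise_for_status()
job = response.json()
print("Job submitted:", job)
```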
The OpenAI Whisper model is a state-of-the-art neural network architecture that is specifically designed for speech transcription tasks. It is based on a transformer architecture, a type of neural network that has been particularly successful in natural language processing tasks. The transformer architecture allows the model to capture the complex temporal relationships between different parts of the speech signal, resulting in highly accurate transcriptions.
One of the key features of the OpenAI Whisper model is its ability to adapt to new languages and dialects with minimal training data. This is particularly important for languages that are not well covered by existing transcription models, as it allows researchers and language experts to quickly create accurate transcriptions without the need for extensive training data. The model is trained on vast amounts of audio and text data from a wide range of languages and dialects, which enables it to capture even subtle nuances in speech patterns, accents, and intonations. Another advantage of the OpenAI Whisper model is its ability to handle multiple speakers. However, the model's ability to handle overlapping speech (i.e. where multiple people are speaking at the same time, such as in group discussions or conference calls) is limited. In these situations the results are relatively poor, even if they are considerably better than those of previous technologies.
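For reference, the model underlying the tool is available as the open-source whisper Python package, and the tool wraps this kind of call behind its web interface and REST API. A minimal transcription call looks roughly like the following (the file name is an example):

```python
# Minimal example of transcribing a file with the open-source whisper package.
import whisper

model = whisper.load_model("medium")          # sub-model size, see the sub-model section below
result = model.transcribe("interview.mp3")    # language is auto-detected by default

print(result["language"])                     # detected language code
for segment in result["segments"]:
    print(f"[{segment['start']:.1f}s - {segment['end']:.1f}s] {segment['text']}")
```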
The model was trained on a dataset of 680,000 hours of audio, which included 117,000 hours covering 96 languages other than English. The dataset also included 125,000 hours of X → en translation data. One of the main focuses of the work was to broaden the scope of weakly supervised pre-training beyond English-only speech recognition, making it both multilingual and multitask. The authors found that, for sufficiently large models, there is no drawback and even some benefit to joint multilingual and multitask training.
Overall, the OpenAI Whisper model is a highly advanced and flexible tool that provides exceptional accuracy and efficiency in speech transcription tasks. Its ability to adapt to new languages and dialects with minimal training data, handle multiple speakers and capture subtle nuances in speech patterns makes it a powerful solution for a wide range of transcription tasks useful for Law Enforcement Agencies. However, it is important to note that, as with any artificial intelligence-based method, errors occur. The accuracy of the results should be checked if they are used in operational situations. There are no guarantees on the accuracy of the transcriptions, especially for the less represented languages.
1.2 Limitations
· Data loss when window is closed
· Quality of the audio
· Hesitations, stuttering
· Overlapping speech
· Language coverage
· Multilingual sound files
· Sub models
· Input audio/video files
· Hallucinations
The initial screen of the transcription tool displays a series of parameters that can be adjusted to optimize the transcription process. The interface is simple but gives users control over the most important options for processing the audio/video file.

Figure 1 – Initial screen for uploading audio/video files
The first combo box sets the language of the file. In general, the framework is smart enough to correctly guess the language, so in most cases the file language can be left on automatic. However, if the speaker has a heavy accent, or if more than one language is spoken in the input file, it is possible to force the file to be processed as one specific language.

Figure 2 – Language choice, diarisation
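In terms of the underlying Whisper library, leaving the language on automatic corresponds to omitting the language parameter, while forcing a language corresponds to passing an explicit language code. A minimal sketch (the file name and language code are illustrative):

```python
import whisper

model = whisper.load_model("medium")

# Automatic language detection (default behaviour).
auto_result = model.transcribe("call.wav")

# Force the file to be treated as Portuguese, e.g. for heavily accented speech.
forced_result = model.transcribe("call.wav", language="pt")
```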
The user can choose whether or not to use diarisation, a process that separates out the speech of individual speakers in the audio. Additionally, users can choose to apply filters that remove non-speech segments and noise, which, in some cases, can improve transcription accuracy. The use of filters is not mandatory, but in some specific cases it can improve the precision of the results. In different ways, the two filters help the tool focus on the parts of the audio that contain voice and work only on those parts. For long and quiet sound files, this may even increase the speed of the whole process. However, since these filters discard the parts considered silent, they can also cause data loss, for example when the speaker is too far from the recording device.
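The exact filter implementations used by the tool are not detailed here, but the principle of voice-activity detection, keeping frames that contain speech and dropping silent ones, can be sketched with a generic library such as webrtcvad. The file name is illustrative and the audio is assumed to be 16-bit mono PCM at a supported sample rate.

```python
# Generic sketch of voice-activity detection (VAD): keep only frames that
# contain speech. This is not necessarily the filter used inside the tool.
import wave
import webrtcvad

vad = webrtcvad.Vad(2)                        # aggressiveness 0 (permissive) to 3 (strict)

# Assumes a 16-bit mono PCM WAV at 8, 16, 32 or 48 kHz.
with wave.open("call_16k_mono.wav", "rb") as wf:
    sample_rate = wf.getframerate()
    frame_ms = 30                             # webrtcvad accepts 10/20/30 ms frames
    samples_per_frame = int(sample_rate * frame_ms / 1000)
    speech_frames = []
    while True:
        frame = wf.readframes(samples_per_frame)
        if len(frame) < samples_per_frame * 2:  # 2 bytes per 16-bit sample
            break
        if vad.is_speech(frame, sample_rate):
            speech_frames.append(frame)

print(f"Kept {len(speech_frames)} speech frames from the file")
```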
If the user knows the number of speakers in the audio, they can input this information into the tool. By default, the number of speakers is inferred from the audio, but providing this value in advance can greatly improve diarisation performance and produce more accurate speaker attributions.

Figure 3 – Filter screen and number of speakers; by default the number of speakers is automatic
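Speaker diarisation is typically handled by a dedicated pipeline. As a hedged illustration only, the widely used pyannote.audio library accepts the number of speakers as an optional hint, which mirrors the behaviour described above; whether the tool uses this exact library, model name or token is not specified here.

```python
# Sketch of speaker diarisation with pyannote.audio, where a known number of
# speakers can be passed as a hint. The model name and token are placeholders;
# the tool itself may rely on a different diarisation backend.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HF_TOKEN",            # placeholder access token
)

# If the number of speakers is known in advance, passing it improves accuracy.
diarization = pipeline("interview.wav", num_speakers=2)

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s - {turn.end:.1f}s")
```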
The user can also choose the quality of the sub-model to be used in the transcription process. This can impact the accuracy and speed of the transcription, so users should choose the sub-model that best fits their needs. If no sub-model is chosen, the Medium model is used by default. Once the parameters are set, the user can upload the audio or video file to be transcribed. The tool then begins the transcription process with the selected parameters.

Figure 4 – Submodel selection and upload file button
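In the underlying library, the sub-model corresponds to the model size passed when loading Whisper; smaller models are faster but less accurate. The file name below is illustrative.

```python
import whisper

# Available sizes trade speed for accuracy: "tiny", "base", "small",
# "medium" (the tool's default) and "large".
fast_model = whisper.load_model("small")      # quicker, lower accuracy
default_model = whisper.load_model("medium")  # the tool's default choice

result = default_model.transcribe("interview.mp3")
print(result["text"])
```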
Users can also check the status of the waiting queue to see which files are currently being transcribed, as well as which files are available for annotation. The waiting queue is a First In First Out (FIFO) system: files are processed in the order they are received by the service. Besides the number of files being processed, the queue also shows the status of each file. Processing means the file is currently being transcribed, Queued means the file is waiting to be processed, and Converting is a local operation on the client side that converts the file into a supported format and strips out only the audio track. For long videos, for example, this significantly reduces the size of the transferred data.

Figure 5 – Waiting queue with five audio files: one being processed, three waiting to be transcribed and one being converted by the client before its transmission to the server.
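As a rough sketch of the FIFO behaviour described above, and not the actual server implementation, a single worker can consume jobs strictly in arrival order from a queue:

```python
# Minimal sketch of a FIFO job queue: files are transcribed in the order they
# arrive. This only illustrates the queueing behaviour; the real service may
# be implemented differently.
import queue
import threading

jobs = queue.Queue()                  # FIFO by construction

def worker():
    while True:
        filename = jobs.get()         # blocks until a job is available
        print(f"Processing {filename}")
        # ... run the transcription here ...
        jobs.task_done()
        print(f"Finished {filename}")

threading.Thread(target=worker, daemon=True).start()

for f in ["a.mp3", "b.wav", "c.mp4"]:
    print(f"Queued {f}")
    jobs.put(f)

jobs.join()                           # wait until all queued files are processed
```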
The last button is linked to the visualisation of the saved data. It is important to note that any files saved for use as annotation data will be visible to all users who have access to the tool. This can help improve the accuracy and quality of the transcription models by allowing multiple users to annotate and train the models. However, this feature must be used with discretion; it is not meant for confidential files.
After the file is processed, the transcription screen is presented to the user. This is the most important screen of the tool, as it presents the results of the transcription and enables the user to listen to the audio, verify the transcription, correct the model's mistakes and download the results. From the top down, it presents a menu for file-related operations, with the audio waveform just below. After that come the audio/video controls to play and stop the audio/video file, followed by the transcription editing window. Finally, the transcription is presented with an indication of the speaker, the time and the text that was spoken.

Figure 6 – Transcription screen
In the top menu, the download transcription combo box allows users to download the transcription in four different formats: text, .srt, JSON and CSV. The first format is suited for copying and pasting into reports; the second is used to generate subtitles for the uploaded video; and the two other formats are useful as data formats for post-processing.

Figure 7 – Download transcription options
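As an illustration of the subtitle export, transcription segments map naturally onto .srt entries. The sketch below follows the segment structure of the open-source whisper output; the tool's own export may differ in minor formatting details, and the example segments are made up for demonstration.

```python
# Sketch of turning transcription segments into .srt subtitles.
def to_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    lines = []
    for i, seg in enumerate(segments, start=1):
        lines.append(str(i))
        lines.append(f"{to_timestamp(seg['start'])} --> {to_timestamp(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")
    return "\n".join(lines)

# Illustrative segments only:
example = [
    {"start": 0.0, "end": 2.5, "text": "Hello, this is speaker one."},
    {"start": 2.5, "end": 5.0, "text": "And this is speaker two."},
]
print(segments_to_srt(example))
```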
The user may not always be proficient in the language of the audio that was transcribed, and it can be convenient to translate it into their own language. We therefore added to the tool the possibility of translating the transcribed text. The translation is made phrase by phrase, so part of the context may be missed. The intention is not to provide a court-proof translation, but rather to give users, on a best-effort basis, an idea of what is being said when they do not speak the language of the file. Please note that the translation tool has to be installed on the machine and the .env file of the front-end has to be configured accordingly.
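The translation back-end is a separate, locally installed component, so only the overall phrase-by-phrase flow is sketched here. The translate_phrase function below is a hypothetical placeholder for whatever translation tool is configured in the front-end .env file.

```python
# Sketch of phrase-by-phrase translation of the transcribed segments.
# translate_phrase() is a placeholder for the locally installed translation
# back-end; it is not part of the transcription tool's documented API.
def translate_phrase(text: str, source_lang: str, target_lang: str) -> str:
    raise NotImplementedError("plug in the locally installed translation tool")

def translate_segments(segments, source_lang: str, target_lang: str):
    translated = []
    for seg in segments:
        # Each phrase is translated in isolation, so cross-phrase context
        # (pronouns, ellipses, ...) can be lost, as noted above.
        translated.append({
            "start": seg["start"],
            "end": seg["end"],
            "text": translate_phrase(seg["text"], source_lang, target_lang),
        })
    return translated
```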