How does the speech-to-text feature in Samsung’s Voice Recorder store the text and sync it with the audio?

I am using a Samsung Galaxy S7 and recently used the speech-to-text feature in the default Voice Recorder application. When I play an audio file that was recorded with that feature, the application shows the text in sync with the audio. I found the text stored in a separate text file alongside the audio file, but that file contains only the raw text. I’m trying to figure out how the syncing works.

Is the information for syncing the text with the audio stored in the audio file itself? All the recordings are M4A files, so I tried to find out what M4A files can store, in particular whether they can hold subtitles, since subtitles seem like a plausible way to get this kind of syncing. However, I can’t find any information on this. Alternatively, the application might keep the sync information in its own storage or cache, which would mean that if I moved the files over to my computer, the text would no longer be synced with the audio.
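
One way I could check the first possibility myself: as far as I understand, M4A uses the MP4 container layout, where the file is a sequence of "boxes" (also called atoms), each starting with a 4-byte big-endian size and a 4-byte type code. Below is a minimal Python sketch that lists the top-level boxes of a file, so that any unexpected box stands out. The function name `list_top_level_boxes` is my own, and it only looks at the top level, not at boxes nested inside others, so it is a starting point rather than a full parser.

```python
import struct

def list_top_level_boxes(path):
    """Return (type, offset, size) for each top-level box in an MP4/M4A file."""
    boxes = []
    with open(path, "rb") as f:
        f.seek(0, 2)                      # jump to the end to learn the file size
        file_size = f.tell()
        offset = 0
        while offset + 8 <= file_size:
            f.seek(offset)
            size, box_type = struct.unpack(">I4s", f.read(8))
            header = 8
            if size == 1:                 # 64-bit size follows the type field
                if offset + 16 > file_size:
                    break
                size = struct.unpack(">Q", f.read(8))[0]
                header = 16
            elif size == 0:               # box extends to the end of the file
                size = file_size - offset
            if size < header:             # malformed box, stop instead of looping
                break
            boxes.append((box_type.decode("latin-1"), offset, size))
            offset += size
    return boxes
```

Running this on one of the recordings (for example `list_top_level_boxes("recording.m4a")`, where the file name is a placeholder) should show common boxes such as `ftyp`, `moov`, `mdat` and `free`. Any other type code would be a candidate for where extra data is kept, though this alone would not prove that it holds the text timing.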