//Saurabh Shrivastava

Google Summer of Code, Week 9 & 10 : Let's Zoom In!

2017-08-23T00:00:00+05:30

Almost two weeks have passed since the second evaluations and my last blog post. This blog post lays down the summary of work I’ve done in past two weeks.

The task checklist with deliverables is located here : https://saurabhshri.github.io/gsoc/ .

The major work during these two weeks were to implement phoneme recognition. This involved generating phonetic language model with the help of corpus. The challenging task here was to convert text corpus into phonetic corpus.

<s> princeton </s>
<s> a fine institution </s>


IL P R IH N S IY T AH N SIL
SIL AH F AY N IH N S T AY T UW SH AH N SIL

While creating dictionary, I am using seq2seq to generate the phonemes from words which is trained on CMUDict. But this process is extremely slow and prooved not fesiable for large corpus. So, the alternative was to use fixed sets of rules to perform the conversion. Of course this isn’t going to be as accurate as the one generated from seq2seq, but since this corpus’s primary aim is to generate language model, it was the way to go.

Thankfully, these rules were compiled into C++ rules by Daniel S. Wolf in his project Rhubarb Lip Sync, and I modified it’s G2P file to perfrom corpus to phonetic corpus conversion.

On the left are the result of the tool that uses tensorflow to learn the rules using a dictionary (very slow) and on the right are the results obtained using the script I made which follows the rules I found above.

america AH M EH R AH K AH AE M EY R AY K AE

hypocrit HH IH P IH K R IH T HH AY P AA K R IH T

trump T R AH M P T R AH M P

saurabh S AO R AE B S OW R AE B HH

This corpus is then used to generate phonetic language model which is fed to phonetic decoder to perform phoneme recognition. Performing phoneme recognition slows down the process relatively as it adds an extra cost to compute those terms.

I also encountered an interesting bug during this. If the corpus contained dual whitespaces, the language model generated was in corrupt form. But the error was not properly reported by SphinxBase and upon reporting, a fix was pushed for the same.

I then proceeded to add deep level logging throughout the program which should aid in debugging. The logging is implemented through a variadic function which is invoked through a defined macro. Similarly, I implemented a function and macro for handling cases for fatal cases with defined error codes. These can be found in /src/lib_ccaligner/commons.h.

[INFO] /Users/saurabhshri/Desktop/try/ccaligner/src/lib_ccaligner/read_wav_file.cpp : 249 | Wave File chunkID verification successful
[INFO] /Users/saurabhshri/Desktop/try/ccaligner/src/lib_ccaligner/read_wav_file.cpp : 253 | Begin decoding wave file
[INFO] /Users/saurabhshri/Desktop/try/ccaligner/src/lib_ccaligner/read_wav_file.cpp : 108 | File format is identified as WAV
[INFO] /Users/saurabhshri/Desktop/try/ccaligner/src/lib_ccaligner/read_wav_file.cpp : 115 | Finding FMT and DATA subchunks
[INFO] /Users/saurabhshri/Desktop/try/ccaligner/src/lib_ccaligner/read_wav_file.cpp : 130 | FMT index : 12 , DATA index :70
[INFO] /Users/saurabhshri/Desktop/try/ccaligner/src/lib_ccaligner/read_wav_file.cpp : 153 | PCM : True
[INFO] /Users/saurabhshri/Desktop/try/ccaligner/src/lib_ccaligner/read_wav_file.cpp : 162 | MONO : True
[INFO] /Users/saurabhshri/Desktop/try/ccaligner/src/lib_ccaligner/read_wav_file.cpp : 171 | Sample Rate 16KHz : True
[INFO] /Users/saurabhshri/Desktop/try/ccaligner/src/lib_ccaligner/read_wav_file.cpp : 184 | BitRate 16 bits/sec : True
[INFO] /Users/saurabhshri/Desktop/try/ccaligner/src/lib_ccaligner/read_wav_file.cpp : 202 | Number of samples : 34543104
[INFO] /Users/saurabhshri/Desktop/try/ccaligner/src/lib_ccaligner/read_wav_file.cpp : 203 | Reading samples
[INFO] /Users/saurabhshri/Desktop/try/ccaligner/src/lib_ccaligner/read_wav_file.cpp : 257 | File decoded successfully

and

[ERROR] /Users/saurabhshri/Desktop/try/ccaligner/src/lib_ccaligner/params.cpp : 127 
		 -oFormat requires a valid output format!

In the following weeks I’ll implement the unified output handler for all possible situtations - continuous mode, complete mode or transcribing mode. I’ll also remove deprecated code, organise and refactor the remaining code while completing the documentation. Memory leak fixes, further optimisation et cetera shall also be done next week.

Hoping everyone’s having fun! See you in the next one! 🙏🏻

Google Summer of Code, Week 9 & 10 : Let's Zoom In! was originally published by Saurabh Shrivastava at //Saurabh Shrivastava on August 23, 2017.

Google Summer of Code : Mid Term Evaluations!✍️

2017-08-22T00:00:00+05:30

With the ending of eight week of Google Summer of Code, second phase of conding is almost complete. This meant that it’s time for mid term evaluations, which would judge whether I will be continuing the program or fail.

I successfuly passed my mid term evaluations, thankfully, leaving my mentors happy with my work. Though initially I hit a bump, but later on, I passed with flying colours. 😊

This was the last week of second phase of coding period. In the previous 4 weeks I began full blown ASR work and integrated it with the work built during the first phase. That involved couple of things like using PocketSphinx to perform speech recognition, generating language model, grammars, dictionaries et cetera from the subtitles followed by actually performing alignement using fuzzy comparision and window based search. The fourth week was reserved as buffer week to meet the missed milestones, complete the documentation and all the remaining things. By the end of second phase, I met the following deliverables :

Word recognition and timed transcription.
Tuned language models and dictionaries.
Adaptation script and implementation for custom models and dictionaries based on subtitles.
Exporting result with colored identification of recognised/ non recognised words. (Just for demonstration purpose)

When we began evaluation, with the first sample, output of ccaligner was mostly rubbish with no relevance to the actual subtitles. The sample was from one the daily soap operas. Turns out that the subtitle etracted from that vides were roll-up subtitles and had delayed timing, resulting in different recognition then expected, as the expected format is a SubRip.

Next we tried another sample from the same genere, and this time, the program kind of froze. It kept on trying to understand the utterance, but kept failing. After debugging, I found out that the subtitle had some emojis in it, which wern’t being handled correctly. This resulted in corrupted grammar and language model leading to the freeze. It was a rather quick fix, but Carlos ultimately decide to shift the actual evaluation later, so that I can do a bit more testing. Wise decision, as I found out that when the audio had no speech as opposed to the indictaion present in subtitle and flase positive from VAD, the decoder crashed.

After handling non ASCII characters and fixing the crash, I added the transcribe option in which the decoder does the alignment irrespective of subtitle timings. It still uses subtitle file to create language model, but marks the utterances on it’s own and gives word by word timed transcription along with confidence score.

Two days later, at night, I found out that Carlos is free and we re-attempted the evaluations. This time it worked flawlessly and I think I met all his expectations! 😊 He was happy with the output, and he proceeded to tell me that I passed, and said, that I did a good job! I was very happy to hear this after hitting that bump before, and kind of not meeting his expectations in the first evaluations.

Few days later I received the official email from Google Open Source Office, declaring me passed in the mid term evaluations. In the official feedback he mentioned about completing documentation as well and giving credits everywhere (which, I thankfully did throughout)!

Looks like the hardest parts are done and working reasonably well. Important missing thing is documentation (good one, both technical and informative). Your blogs are well written though so I don’t think this is going to be a problem. When you document, remember to mention all dependencies, too. Also if code was taken from any other project, give credit.

Here’s the screenshot of the same :

I’ll continue to improve myself and learn more, as I have learned from this evaluation.

Thank you Carlos and Alex for being my mentor, and for passing me in the second evaluations! 😊

What’s next?

Now I will begin working on implementing phoneme recognition in the tool. This involves using phoneme decoder, generating phonetic language model and implementing it in CCAligner. This will be followed by all sorts of refining and handling output. Logging, error handling, external documentation shall be of utmost importance as well. The next month will be super busy as it’s high time in college too.

I hope other students have passed their evaluations as well and are enjoying working. I can not express how amazing the experience has been so far. It’s almost surreal. Hope to complete the project within the timeline. See you in the next one! 🍀

Google Summer of Code : Mid Term Evaluations!✍️ was originally published by Saurabh Shrivastava at //Saurabh Shrivastava on August 22, 2017.

Google Summer of Code, Week 7 & 8 : Let's Karaoke! 🎤

2017-08-06T00:00:00+05:30

This blog post covers the work done in the last two weeks of the second phase of coding. I have updated the checklist of the deliverable for the second phase at https://saurabhshri.github.io/gsoc/ .

If you are wondering about the delay - I am not being lazy, blogging is one of the things I love the most. I am just a little busier than ever. 🙂

In the last blog post I mentioned that one of the tasks of this period would be to figuring out a proper way of handling FSGs so that they consider the garbage loop as well. So, after a lot of experimenting with different scenarios and different weights, I finally figured out a good enough approach with a consistent efficiency irrespective of the audio. Creating FSGs dynamically should not be considered as the most accurate approach as devising them manually based on audio could substantially increase the accuracy.

Here’s an example of one such FSG :

FSG_BEGIN CUSTOM_FSG
NUM_STATES 20
START_STATE 0
FINAL_STATE 19

# Transitions
TRANSITION 0 1 0.0909
TRANSITION 0 2 0.0909
TRANSITION 0 3 0.0909
TRANSITION 0 4 0.0909
TRANSITION 0 5 0.0909
TRANSITION 0 6 0.0909
TRANSITION 0 7 0.0909
TRANSITION 0 8 0.0909
TRANSITION 0 9 0.0909
TRANSITION 1 10 1.0 i
TRANSITION 2 11 1.0 was
TRANSITION 3 12 1.0 offered
TRANSITION 4 13 1.0 a
TRANSITION 5 14 1.0 summer
TRANSITION 6 15 1.0 research
TRANSITION 7 16 1.0 fellowship
TRANSITION 8 17 1.0 at
TRANSITION 9 18 1.0 princeton
TRANSITION 10 19 0.0909
TRANSITION 11 19 0.0909
TRANSITION 12 19 0.0909
TRANSITION 13 19 0.0909
TRANSITION 14 19 0.0909
TRANSITION 15 19 0.0909
TRANSITION 16 19 0.0909
TRANSITION 17 19 0.0909
TRANSITION 18 19 0.0909
TRANSITION 19 0 0.0909
FSG_END

This makes the grammar flexible and makes a room for allowing the cases where recognition doesn’t match the expected output!

The next part was to actually align the recognised words with the audio. Prior to this, the recognition and time detection based on frames was working. The very first step in this was to reset the time stream, so that the frame count can be with respect to zero. This eliminated the chances of error as we are not processing all samples, but only those samples which have utterances.

For each word, I found out the exact timing by dividing the frame count with the frame rate and converting it into milliseconds. This value was then added to the beginning timestamp of the samples for which they were recognised.

//the time when utterance was marked, the times are w.r.t. to this
long int startTime = sub->getStartTime();
long int endTime = startTime;

/*
* Finding start time and end time of each word.
*
* 1 sec = 1000 ms, thus time in second = 1000 / frame rate.
*
*/

startTime += sf * 1000 / frame_rate;
endTime += ef * 1000 / frame_rate;

They are also stored in an object of class recognisedBlock for later uses.

For the alignment with subtitle, the recognised words needed to be matched with subtitle text. A simple linear search would not be a preferred choice in this case, as it will provide erroneous results. For example, consider the case :

Actual      : [Why] would you use tomato just why
Recognised  : would you use tomato just [why]

So, if we search whole recognised sentence for actual words one by one, then Why[1] of Actual will get associated with with why[7] of recognised. This will not only result in incorrect tagging of word, but also the words in square bracket matches, sets the lastWordFoundAtIndex at 6, and the search stops.

To prevent this I am using a window based search approach where a word is searched only in the defined window thus, limiting the number of words it can look ahead, in order to prevent error of mismatching. This also enables user to define a window to look into.

searchWindowSize = 3;

Recognised  : so have you can you've brought seven
                   |
            ---------------
            |               |
Actual      : I think you've brought with you

Recognised  : so have you can you've brought seven
                        |
                -------------------
                |                  |
Actual      : I think you've brought with you

But since the recognition is not perfect, I perform a fuzzy search to find the match instead of directly comparing the words. The fuzzy search is performed by calculating the Levenshtein distance between the words. I then use this distance to find out how much are the words similar. If the words have similarity of 75% or above, I consider them a match! This percentage of course is user configurable.

Once the match is found, the word is marked as recognised and it’s starting and ending timestamps are initialised.

There are various output options added in which the results can be visualised. One of those options is addition of output as Karaoke mode. In this, the words that is being spoken is written in SRT format with a <font> tag. So, when the word is being spoken,it gets highlighted. Here’s a gif with the excerpt from the respective karaoke subtitle file. This subtitle file was the result of karaoke output from ccaligner.

Output Visualised as in Karaoke format --print-as-karaoke yes .

00:00:12,780 --> 00:00:12,911
<font color='#A1E4D3'> I</font> was offered a summer research fellowship at Princeton 

00:00:12,810 --> 00:00:13,100
I <font color='#0000FF'> was</font> offered a summer research fellowship at Princeton 

00:00:13,180 --> 00:00:13,540
I was <font color='#0000FF'> offered</font> a summer research fellowship at Princeton 

00:00:13,550 --> 00:00:13,900
I was offered <font color='#0000FF'> a</font> summer research fellowship at Princeton 

00:00:13,910 --> 00:00:14,290
I was offered a <font color='#0000FF'> summer</font> research fellowship at Princeton 

00:00:14,300 --> 00:00:14,680
I was offered a summer <font color='#0000FF'> research</font> fellowship at Princeton 

00:00:14,690 --> 00:00:15,130
I was offered a summer research <font color='#0000FF'> fellowship</font> at Princeton 

00:00:15,140 --> 00:00:15,250
I was offered a summer research fellowship <font color='#0000FF'> at</font> Princeton 

00:00:15,260 --> 00:00:15,940
I was offered a summer research fellowship at <font color='#0000FF'> Princeton</font> 

Then I worked on creating helpful demonstration for my evaluation. In my first evaluation, proper demo were missing and it impacted my evaluation. So, I made sure, it was not the case this time. This involved creating a simple user interface so that arguments could be issued to check the developed functionalities so far. Of course this UI was just for testing and was not very polished. I also created a simple batch script for dependency installation. If you want to try out the work so far, you may find detailed installation usages in the readme file in the repository. As always, the recent commits are in development branch. To try, simply do

./ccaligner --print-aligned color -in input.wav -srt input.srt

I would like to apologise for delayed posts and documentations. I am already in the college as I can not afford to miss it more. I am trying to balance all the aspects of the project, college and placement and I hope to come on track soon! The next post shall cover my mid term evaluation and the changes I made during it. It pretty interesting and climatic I would say. 😉 I hope other participants are having fun, and are doing good! See you in the next one! 🖖🏻

Google Summer of Code, Week 7 & 8 : Let's Karaoke! 🎤 was originally published by Saurabh Shrivastava at //Saurabh Shrivastava on August 06, 2017.

Google Summer of Code, Week 5 & 6 : What'd You Say?👂

2017-07-12T00:00:00+05:30

Almost two weeks have passed since the first evaluations and my last blog post. This blog post lays down the summary of work I’ve done in past two weeks, and is the first time I am writing a combined blog post of more than one week.

If you’re wondering why I haven’t updated the checklist till now, the answer is simple - I am discovering things on the go. Determining and forcing rigid checklists would only make my work restrictive. I will continue uploading the blog posts with the updates as always. I have major deliverables in my mind as per the proposal, and will be listed in the checklist before evluations. If you are interested in reading my project proposal, it could be found here.

I started working on implementing ASR (Automatic Speech Recognition) in the project as soon as I passed my first evaluations. In the phase 1 of coding period, I built the foundation of the tool, upon which the core alignment has to be built. I also made an approx aligner to aid in the task. By the end of the first phase, the project has met following milestones :

Tool for subtitle processing and basic testing architecture.
Sample repository.
Algorithmic and Probability based word - audio matching.
Audio processing.
VAD implementation.

In the past two weeks, I have started the work of analysing audio to recognize words. This involves using ASR. I am currently using CMU’s PocketSphinx for this task. Maybe in the future I’ll add more ASRs like Kaldi or Google Speech Recognition et cetera. PocketSphinx is a (relatively) lightweight, speech recognition engine, specifically tuned for handheld and mobile devices, though it works equally well on the desktop. It’s written natively in C and has wrappers in huge number of languages such as Python or even JS.

Since, the project needs to be - not just an academic tool , the trade off needs to be among resource requirement and accuracy. PocketSphinx kind of stand in the middle of the niche. Plus, thanks to Nickolay V. Shmyrev’s for his tireless work at answering questions on forums and groups, the help is few messages away.

The very first step was to get PocketSphinx compile using CMake, since that is what I was using to build the tool. This also meant ensuring that it compiles across all platforms. The tutorial recommends installing PocketSphinx but there was no tutorial for compilation to be used as API. After lots of ‘trial and error’, I ended up collecting all the .c by recursively going through both PocketSphinx and SphinxBase libraries and then creating object files using them. Since, I am not compiling them the ‘recommended way’, there are certain flags, that remain unset. So, I had to manually set them. If you ever need to compile PocketSphinx in the same way, you may look at the CMakeLists.txt file in the main directory the project. If you have some better alternatives, PR are most certainly welcomed! 😊

After getting it to compile, I tried incorporating it with existing audio pipeline, and it works just fine. I first tried to use it directly on the approx locations, but soon realised that all it yielded me was garbage. It makes sense because we need to mark that starting and ending of “utterances” properly in order for successful recognition. It would need more than just taking some arbitrary window and expecting it to work.

Upon processing one sub at a time, I found out that the accuracy was very bad, it indeed recognizes some words, but for rest, it was out of bounds.

Recognised  : an exalt are so i'm
Actual      : So, in the next half hour or so,

This was expected as it does not make use of the transcription we already have.

So, the next step was to generate a custom language model based on available transcription. The ngram model is generated using CMUCLMTK toolkit. I generate the dictionary containing phonemes using the g2p-seq2seq tool. Currently they are invoked using system commands. I have also started using a new, better acoustic model which has significantly increased the accuracy of the recognition.

Recognised  : exploring your vision for one exciting future might look like
Actual      : exploring your vision for what an exciting future might look like

Recognised  : so this is makes the first question a little ironic
Actual      : which I guess makes the first question a little ironic

So, basically, I process chunks of samples based on timings present in subtitles. I mark utterance using the beginning time of subtitle and process samples that fall in the duration of the dialogue. I mark end of utterance and find hypothesis. Then I find the times of the words using the segment iterator. In case you’re looking to implement the same, the function looks something like this :

bool Aligner::printWordTimes(cmd_ln_t *config, ps_decoder_t *ps)
{
    ps_start_stream(ps);
    int frame_rate = cmd_ln_int32_r(config, "-frate");
    ps_seg_t *iter = ps_seg_iter(ps);
    while (iter != NULL) {
        int32 sf, ef, pprob;
        float conf;

        ps_seg_frames(iter, &sf, &ef);
        pprob = ps_seg_prob(iter, NULL, NULL, NULL);
        conf = logmath_exp(ps_get_logmath(ps), pprob);
        printf(">>> %s \t %.3f \t %.3f\n", ps_seg_word(iter), ((float)sf / frame_rate),
               ((float) ef / frame_rate));
        iter = ps_seg_next(iter);
    }
}

So far so good. :) This works with good accuracy, but not very good.

I really thought having subtitles will make the work easier as we already have some time information and order in which the words are spoken.

I tried creating FSG as :

If subtitle was :

00:00:04,960 --> 00:00:06,536
Thanks for having me.

Resultant FSG file was

FSG_BEGIN SAURABH
NUM_STATES 5
START_STATE 0
FINAL_STATE 4

# Transitions
TRANSITION 0 1 1.0 thanks
TRANSITION 1 2 1.0 for
TRANSITION 2 3 1.0 having
TRANSITION 3 4 1.0 me
FSG_END

In few cases it helps recognize the words perfectly, but this makes it too restrictive. For the parts where the recognition is something else or there were some other utterance, I get

ERROR: "fsg_search.c", line 940: Final result does not match the grammar in frame 44

I am hoping to make it flexible so that it still knows that the next word to recognize is ‘X’, and it becomes easier for it, but at the same time not fail.

Nikolay, suggested to include the garbage loop in the FSG to make it a little flexible. He was very kind to draw me a state diagram depicting the hints to what to do.

The next task will be to convert this into FSG programatically for all subs. Also, now since I have the timestamps of the words being spoken, it’s time to do the alignment. The difficult part will be for the words which are not recognised, and based on the type of speech in audio, sometimes they can be large. One way is to assign approx timestamps where the words couldn’t be recognised. I’ll let user enable/disable this with an argument.

On the personal front, my college is resuming quite earlier than I anticipated. Let’s see how it goes. I am still trying to figure out, if I should join right now or not. Since I am stepping into final year, the pressure for attendance should be low (as is usually the case). Let us see how it goes. I hope everything goes fine, because managing both college and full time GSoC project won’t come easy. In worst case, I might have to join college soon. In that scenario, I’ll try to cover-up by working extra on weekends. Also, I need to stress less probably. It’s been a long time I haven’t watched any movie, and ever since Silicon Valley’s season ended, my Monday entertainment ended as well. I loved the new spiderman in Captain America : Civil War, and I hope I get to able to watch the third reboot : Spiderman Homecoming soon. I have to do shopping as well for my college. Since, it’s placement season this semester, I will need formal clothes. I honestly don’t like shopping, specially if it’s related to clothes. I am already nervous and tensed about placements and all these formalities do not help at all.

Do you guys have any suggestions or tips? Feel free to comment / message me about it! 🙂 I hope other participants are having fun working on their projects. I can feel the stress sometimes when things don’t work as I planned them. I am really lucky to be in such an encouraging community who always somehow is able to cheer me and motivate me. Hopefully, everything shall work out good! 🕊

Google Summer of Code, Week 5 & 6 : What'd You Say?👂 was originally published by Saurabh Shrivastava at //Saurabh Shrivastava on July 12, 2017.

Google Summer of Code, Week 4 : The Evaluations! ⚰💀

2017-07-03T00:00:00+05:30

The fourth week ended on 29th of June marking the end of Phase 1 of coding period. The week also comprised of the first evaluations, which would judge whether I will be continuing the program or fail. 💀

I passed my evaluations! My mentor already told me about it during the evaluations, I received the official email from GSoC team on 30th declaring that I am eligible to continue through the next rounds.

This was the last week of first phase of coding period. In the previous 3 weeks I spent time building a the nitty-gritty of the tool upon which now I will continue the work. That involved building a robust subtitle parser, creating approx aligner, building test environment (which included collecting samples), ability to read and process audio, voice activity detection and a lot of reading. The fourth week was reserved as buffer week to meet the missed milestones, complete the documentation and all the remaining things.

In the fourth week I added the capability to read wave files from stream/pipe. Till now the files that were present on the disk were read. Now it is possible to simply pipe the wave file into the program. It was comparatively challenging as the specifications needed to be decoded on the go. It would have been easy if I were only reading raw samples, but reading the wave file and verifying to make sure the wave file is of proper format needed a lot of precision.

There are 3 modes in which wave files can be read.

File is present on disk. This is the most common usage scenario when the file is present in the disk. If filename is passed to WaveFileReader constructor then it uses this method.

    WaveFileData * file = new WaveFileData(argv[1]);	//supply filename
    file->read();
    std::vector<int16_t> samples = file->getSamples();	//return samples

For example : ./ccaligner input.wav

Data is piped/streamed. This is helpful when the wave file is not present butis being generated. This helps in making the tool capable of fitting into pipelines.

    WaveFileData * file = new WaveFileData();	//will read from pipe or stream
    file->read();
    std::vector<int16_t> samples = file->getSamples();	//return samples

For example : ffmpeg [arguments] | ./ccaligner

Data is piped/streamed and is first stored in buffer and then processed. This is helpful when we need to ensure that we have complete data before proceeding. This too helps in making the tool capable of fitting into pipelines.

    //readStreamIntoBuffer is an enum decalared in read_wave_file.h
    WaveFileData * file = new WaveFileData(readStreamIntoBuffer); 
    file->read();
    std::vector<int16_t> samples = file->getSamples();  //return samples

For example : ffmpeg [arguments] | ./ccaligner -useBuffer

The interface was modified to bring this change in place. You can try the various modes of reading in the VAD demo present in the /demo directory.

Though I try to document the code as I write it, there was still a way to improve it even more. I spent some time documenting the code, as well as the repository. I hope it is now even more easier to read and comprehend my code.

How did the evaluations go?

The highlight of this week were the very first evaluations of my Google Summer of Code project. The evaluations window opened on 26th and were to remain open till 30th. Carlos (CCExtractor Org Admin, my mentor) alloted 28th as the day for my evaluations. My another mentor Alex is on his GCI trip to Google Office, SF.

This was the first time I was going to experience something like this, and no matter how much I researched or read about it, there’s no telling how the evaluations are gonna proceed and what is it going to involve. It basically just bottles down to mentor slash project slash work.

So, my evaluation began on 29th past Midnight (IST), i.e. 28th Morning at my mentor’s place. I must confess, I was super-nervous (which totally got reflected in my evaluations). So, Carlos sent me a DM on slack asking if I am ready for the evaluations, and I replied that I need few mins to push the final documentation that I wrote prior that day.

Within a span of 2 or 3 minutes I replied him back and the evaluation began. Carlos began by asking where can he obtain the binaries. I told him that he can compile his own (as I made sure it’s buildable across all platforms) or I can send him. He chose the first option and I told him how can he compile his own binaries, which basically boiled down to clonning the repo, and using make to build the ccaligner executable.

Now, I had a bit different picture about evaluations then what happened next. Sure, I expected him to test the code but I expected that more time would be spent on checking how the code actually is. Meaning, I was naive enough to expect that he’ll go through code file by file and see how things work, and is it good or not. I spent an awful amount of time making the code as flexible and adaptable as possible.

After building the tool, he chose a random sample from the HD he sent to us (with all the video samples), and run it against it. The tool worked as expected. But since it did not involve any audio processing (note : only approx aligner is implemented till this time in the interface), he asked me about it. I told him that the audio analysis part is being worked upon and he can try the components stored separetely (VAD) in the demo dir. He built the VAD demo and fed it the wave file obtained from the video file using FFmpeg and it gave the output in stdout with time frames and binary values - 0 for voice absent and 1 for voice present.

Then he proceeded to ask me if I have covered all the milestones I listed in my proposal and the checklist (http://saurabhshri.github.io/gsoc/). I gave him a brief overview against all the listed tasks in checklist and assured him that I have met them all. I asked him to be brutally honest because I wanted, rather I needed to hear what my mentor thinks about it.

This is what my mentor responded :

I expected to see a bit more to be honest, but don’t be worried about it. If all the “cool previews” happen at stage 2 that’s fine. Your code quality is good, blog is good, communication is good… so no problems. OK, so eval done. you passed, so just continue working 🙂

So, as you can see, though I passed my evaluations, I need to work even harder. I am happy that my mentor expects a lot from me and I hope in next evaluations I work upto his expectations.

In the official feedback, my mentor wrote :

Code quality is good, however it would be useful to build in a way that allow to have good demos as work progresses. Building “completely horizontally” doesn’t allow to preview functionality. We’re betting on things working well at the end. Love the blog.

I’ll continue to improve myself and learn more, as I have learned from this evaluation.

Thank you Carlos and Alex for being my mentor, and for passing me in the first evaluations! 😊 Let’s have another fun month of some open-source goodiness.

How about the stipend?

Ah well, looks like I was not very lucky at the stipend front 😛. Google was super-quick to release the stipend on 30th itslef. But for some reason, the payment for me got cancelled with the status “Transfer rejected by processor.” . Since it was weekend already, their support ass unavailable. Also, since it’s US holiday on 3rd and 4th, I can only hope that their Indian support is available on Monday. I emailed them about the issue and also sent and email to the GSoC team.

Looks like they were available on Monday, as the previous transaction was totally cancelled and a new transaction was made. I received an email that the money is sent to the bank, and shall be deposited within 4 to 5 days. Let’s hope for the best! 🙂

What’s next?

Now I will begin working on implementing ASR in the tool. I am using CMU’s Pocketsphinx as it is light, portable and has great and active community. Plus, it’s in C, so should be easier to integrate it with my tool. Another possibility was to use Kaldi, but Kaldi in general is pretty resource demanding. Maybe I’ll add Kaldi support post GSoC. I have already started working on using Pocketsphinx’s API. I was successfully able to compile it using CMake with my tool and supply samples obtained from read_wave_file.cpp .

This time I might not make a very extensive task list as I made last time because this time I need to figure things on the go. I’ll keep posting the updates in this very blog.

I hope other students have passed their evaluations as well and are enjoying working. Remember guys, it can be frustrating at times and there will be a lot of factors trying to make your morale down. Fight it and you’ll emerge a winner. I hope I am able to do so as well. See you in the next one! 🎭

Google Summer of Code, Week 4 : The Evaluations! ⚰💀 was originally published by Saurabh Shrivastava at //Saurabh Shrivastava on July 03, 2017.

Google Summer of Code, Week 3 : printUsage(); 🖨

2017-06-23T00:00:00+05:30

Today marks the end of three of thirteen weeks of Google Summer of Code’s coding period. This week I implemented a simple user interface, added various output formats, fixed some bugs, fixed Unix build and processed samples.

All the latest commits could be found in the development branch. I will merge them to master once the first phase completes.

The code written till now is arranged systematically and is in the form of library so that anyone can use it in her code and make use of the available functionalities.

Say, for example, you want to make use of Approx Aligner (to perform approximate word by word audio subtitle synchronization) in your own cool project. All you need to do is, clone the CCAligner repository and put it in your project directory. Then simply include the “generate_approx_timestamp.h” file in your project by providing appropriate path. Also, include all the other files in your CMake file, and you are good to go! You can then simply do ApproxAligner * aligner = new ApproxAligner(input_filename); and use appropriate functions for your use. 😎

I am trying to include all the examples in the /demo directory in the repository. You can find it here. They have their own CMakeLists.txt in the respective directories so that you can build them individually if you want to. I will add a demo install script later in which you should be able to pass the demo you want to see as a parameter and it’ll build it for you.

While it’s great for developers to use this, there was no way to use it just out of the box. So, I have built a small user interface so that it should be easy to use and try and maybe use it as a base to develop your application. This enables using the tool directly after cloning it and then building it. The interface is pretty minimal and is command line only. You can run the tool without any arguments to see the available options. I will be updating the readme file with the instructions once it is complete.

Here’s a snippet of the same :

  ApproxAligner : ccaligner -a input.srt
  				  ccaligner -a input.srt -of <output_format>
  										  (srt/xml/json/stdout)
  e.g. ccaligner -a input.srt -of xml

As evident from the above usage examples, there are now options available to have output in more than just SRT format. The output can currently be obtained in the form of SRT, XML, JSON format and also on stdout. The structure of all of them is shown below. The respective functions can be edited and extended to user desired scheme.

/* printToXML() */

 ├───Subtitle
           ├───Start Time
           ├───End Time
           ├───Start Time in ms
           ├───Start Time in ms
           ├───Word
           ├───Style
           |   ├───Style Tag 1
           |   ├───Style Tag 2
           |   └───Style Tag
           
Example : 

<subtitle>
	<time start=3560 end=3805></time>
	<srtTime start=00:00:03,560 end=00:00:03,805></srtTime>
	<text>It's</text>
</subtitle>

<subtitle>
	<time start=3805 end=4099></time>
	<srtTime start=00:00:03,805 end=00:00:04,099></srtTime>
	<text>great</text>
</subtitle>

/* printToConsole() */

START
word 	: {word here}
start	: {start time in ms}
end 	: {end time in ms}
END
           
Example : 

START
word    : It's
start   : 3560 ms
end     : 3805 ms
END

START
word    : great
start   : 3805 ms
end     : 4099 ms
END

🔥

Also the Unix build (Linux and MacOS) was failing ever since I added the VAD. The webRTC library failed to link. Turns out, webRTC uses WEBRTC_POSIX variable to determine if the system is POSIX. But for some reason, this variable did not get set. It wasn’t apparent at first that this is the problem. After a lot of digging deep I found that this was the culprit. I had to add a manual check for posix systems so that correct multithreading code executes in respective environment.

Once this was fixed, it turned out while the pthread library successfully links in MacOS, it wasn’t linking it in Linux. Supplying -lpthread as a flag while linking solved the problem. The CMakelist.txt was modified to incorporate all these changes. The same was reflected in the VAD demo as well.

Also, there’s a special requirement of type of wave files that are needed by ASR and VAD library. The wave file must be 16 bit PCM, Mono, 16Khz or 8Khz sampled. So, I processed the samples to obtain the wave files.

What’s next?

I originally planned to do the video chunking, i.e. splitting the videos such that they only contain the parts where audio is present. But for now I am planning to postpone it as it totally relies on the way that will be best suited as per ASR’s requirement. So, I will do it at that time.

I will now add the capability to read the wave file from stream i.e. in a way that data can be piped through other programs such as ffmpeg. So that doing something like ffmpeg -i video <other params> | ccaligner will be possible. I will also be working on adding sliding window approch to VAD to get better results. It should be interesting to see if it improves the accuracy.

I will be preparing a full report of the progress so far, complete the documentation, update the same on ccextractor.org et cetera.

Also, 28th is the date when Carlos has decided to read the report so that he can submit the evaluation. The evaluations start on 26th and are due before 30th. I am a bit nervous (bit >= 64) as this is the first evaluation ever I am facing in my entire life. I hope everything goes good. I am very happy with my mentors, they are super nice and super fun to work with. I am glad that they trusted me enough to be in the driver’s seat for this project.

All the very best to all the GSoC participants for their very first evaluations. I hope you all (inlcuding me) pass with flying colours! 🌈

Google Summer of Code, Week 3 : printUsage(); 🖨 was originally published by Saurabh Shrivastava at //Saurabh Shrivastava on June 23, 2017.

Google Summer of Code, Week 2 : Valar Researchis! ⚔

2017-06-18T00:00:00+05:30

The second week of Google Summer of Code just wrapped up. This week was spent on adding support for managing timestamps for each word, improving Approx Aligner, printing result as SRT, revising Game of Thrones and doing a lot of reading and researching. The first evaluations are due in two weeks.

I had imagined that the Approx Aligner would be far from accurate, the results are definitely not bad. I have attached the demo later on in this post.

The first evaluations are even more close as the third week of coding period begins. Since the beginning of the program I am properly (rather ahead atm) on the timeline I proposed. Hence, I did not write much code in this week, but instead I spent the time learning few things which will be helpful for the project later on. Another reason for that was the fact that I was sick (fever) for a while. 🤒

Here’s the summary of things I did in the second week :

Added support for managing the timestamps of individual words.

The parser was modified to add the support of storing the start-time and end-time of each word ad accessing it. The time values are stored as a vector of long int and there are functions to get these values both as vector or individual values by index.

    
 //Relevent Data Members : 
    
 std::vector<std::string> _word;         //list of words in dialogue
 std::vector<long int> _wordStartTime;   //start time of each word in dialogue
 std::vector<long int> _wordEndTime;     //end time of each word in dialogue
 std::vector<long int> _wordDuration;    //actual duration of each word without silence
    
 //Relevent Member Functions : 
    
 std::vector<std::string> getIndividualWords(); //return string vector of individual words
 std::string getWordByIndex(int index);       //return word stored at 'index'
 std::vector<long int> getWordStartTimes();   //return long int vector of start time of individual words
 std::vector<long int> getWordEndTimes();     //return long int vector of end time of individual words
 long int getWordStartTimeByIndex(int index); //return the start time of a word based on index
 long int getWordEndTimeByIndex (int index);  //return the end time of a word based on index

This makes things very easy. For example printing the words and their timestamps in SRT format is now as easy as doing :

 //converting STARTING timestamp (in milliseconds) to hh:ss:mm,ms
 ms_to_srt_time(_sub->getWordStartTimeByIndex(i),&hh1,&mm1,&ss1,&ms1);
    
 //converting ENDING timestamp (in milliseconds) to hh:ss:mm,ms
 ms_to_srt_time(_sub->getWordEndTimeByIndex(i),&hh2,&mm2,&ss2,&ms2); 
    
 //arranging the timestamps in SRT style
 sprintf(timeline, "%02d:%02d:%02d,%03d --> %02d:%02d:%02d,%03d\n", 
 		hh1, mm1, ss1, ms1, hh2, mm2, ss2, ms2);
            
 out<<timeline;
 out<<_sub->getWordByIndex(i)<<"\n\n";

Improving the Approx Aligner and making output as SRT

Using the word weight to find it’s approximate timeframe is bound to have errors. I spent the time in figuring out how can I still improve it and if its even possible. After a lot of experimentation it hit me - apart from the words, the one other thing that makes a difference in actual speech is the “silence” between those words which I was not considering at all. So, using word weight and finding the duration of each word, I split the remaining time into the silence between each word being spoken. One more addition could be a more emphasis on silence in presence of punctuation, but this is difficult to calculate as it hugely varies from speaker to speaker and situation to situation (well, so does other things).

Anyway, after making changes, I also added the option to print the result as an SRT file so that I can actually see the results and verify them. I will make a report on it’s accuracy once I run it across all the samples. The disk containing samples will reach to me soon. The results right now, surprisingly, are not bad at all!

I tested it on couple of samples, and it was pretty great given that the method uses 0% audio processing and how fast it is and that it’s entirely based on probability and approximation! Here’s a quick screengrab demostrating the result :

You can see the results yourself too. Here are the video sample and it’s word by word synced subtitles generated using Approx Aligner. :)
- Video File : ElonMusk2017.mp4
- Subtile File : ElonMusk2017.srt
This is actually pretty good, because now we will be able to have a window (say +- {ms} of this timestamp) where the word will actually be present and hence we can focus our ASR to detect the word in this range rather than trying to guess it over a period of time.
Loads of reading and researching 📑

I did not want to waste any time just because I was sick. So, I got few research papers in the field of Forced Aligners and ASR printed and spent time reading them. Also, I read the code of the programs using these ASRs to see in what ways people are using them to accomplish their tasks.

I now have an even better picture of how and what I am going to do in phase two. Also, I already have the capability to perform VAD and read audio samples implemented ahead of time which is a plus.

What’s next?

I will be implementing the various other output options, such as JSON and XML. Also, I’ll refactor somme code and write documentation on using the Approx Aligner. The tool is in the form of libraray currently and it makes sense to also provide an interface so that it can be used directly as well (i.e. without writing code), after all - it’s a tool. I should have the disk containing samples soon - so I’ll have to process it to get audio in the required format from them and also work on some batch scripts to automatically run test over all of them.

This time of summer is quite daunting here in India. It’s hot, but worse, it’s humid. My eyes need a checkup as well - I am often having headaches and blurry vision (not a good sign I know). I hate spectacles and definitely don’t want them. The last time I had check-up, the doctor pointed out that my right eye needs a glass of 0.5 power. But given my hatred towards them I did not get my glasses made. I sincerely hope it’s not worsened.

Anyway, I hope everyone’s having fun and the other participants are having great time building their projects. ⛱

Google Summer of Code, Week 2 : Valar Researchis! ⚔ was originally published by Saurabh Shrivastava at //Saurabh Shrivastava on June 18, 2017.

Google Summer of Code, Week 1 : The Beginning! 🕺

2017-06-09T19:30:00+05:30

It has been two days since Google Summer of Code’s first of thirteen weeks ended. I have started working on my project CCAligner - Word by Word Audio Subtitle Synchronization with CCExtractor Development.

My progress could be tracked through weekly checklist for milestones and tasks, which can be accessed here : https://saurabhshri.github.io/gsoc/ .

With the end of Community Bonding Period, the coding period officially began. This is how I spent the first week of coding period :

Setting-up the new server. 💻

As I mentioned in the last post, I was already done with setting-up my development environment. But I had made a request for a dedicated server of my own with root access and Carlos (CCExtractor Org admin, my co-mentor) got me one! :) It’s a nice little server running Ubuntu Server. This was the first time I used such a vanilla version of linux. It’s super light out of the box and has absolutely nothing (just the linux) installed. It was so much fun setting up the server according to my preferences.

I also installed x2go in it so that I can also “visually” see the changes I am making. It’s super-helpful while debugging.
Started building basic skeleton of tool.

CCAligner can be found on Github at : https://github.com/saurabhshri/CCAligner/

I made a rough sketch of how the general hierarchy will be in the project. I will try to make my tool as modular as possible, and also in the way that it can be easily used as a library in other projects. It’s often difficult to “librarise” the code after creating, so even if it consumes some extra time, it’s better in my belief to begin doing the same.

In general, there are two main directories inside the source directory -
- lib_ext : This shall contain all the external libraries that I’ll use.
- lib_ccaligner : This shall contain the CCAligner library.
```
 ├───demo
 │   ├───ApproxAligner
 |	└─── ...
 ├───src
 │   ├───lib_ccaligner
 |	|	├─── ...
 │   |   └─── ...
 │   └───lib_ext
 │       ├─── ...
 │       └─── ...
 └───tests
  	├─── ...
 	└─── ...
```
Begin writing the implementation of approximation based word tagging.

Using the subtitle parser I created, I began writing the implementation of approximating the timestamp of a word based on it’s weight (calculated as a function of ratio of word length to sentence length). This is a super fast method to calculate the timestamp of each word and doesn’t require any type of audio processing. But obviously, this has very poor accuracy. Nonetheless this shall come in handy where audio - analysis is not feasiable and we require the sync super-fast without caring much about accuracy. Not to forget, this will provide us with a better window to analyse the word when audio analysis will come into the picture.

Since, the code is in the form of library, it’s very easy to use. The demo can be found in the demo directory which is linked here.
```
 #include "generate_approx_timestamp.h"

 int main()
 {
     std::cout<<"Enter path to the subtitle file : ";

     std::string filename;
     std::cin>>filename;

     ApproxAligner * aligner = new ApproxAligner(filename);	// that's it :) More customization to come!
     aligner->align();

     return 0;
 }
    
```
It has room for ton of improvement, which I definitely will do with the course of time. Everything is currently pretty raw, as you’ll expect a software to be in it’s primary stages. And yes, I’ll begin naming my header files in a better way! :P
Set-up a small testing environment. ✔️

As mentioned in the last post, I have collected quite a lot of samples, a mojority of which are Ted talks (as they have clear speech as well as good subtitles). Previously I worked on improving CCExtractor Sample-Platform’s difference showing library. I extended this so that I can compare the output of the tool with the actual results, and also to compare results across various techniques.

I have also set-up Travis-CI to check builds across linux and OSX with every commit to the repository. Soon, I’ll add the test scripts to it as well. This will be a step towards test driven development.
Start implementation for audio based processing.

The very first step in this region was being able to read audio files and extract data. For this my code should be able to take wave file as input, check if the file is valid or not, and then decode it to extract information such as SampleRate, BitRate, etc. and the samples (which contain the audio data) itself. This was an intresting job. While I could find some code online that did the reading, but they were not utilising the object oriented approach and also, did not read the data in the manner desired for my use.

So, I searched for the wave file specifications online. The official Microsoft’s document was not the example of best specification-documentation I have seen, but was OK. Fortunately, I found another one which was very well written, and was precise. If you are interested, it is located here.

I did encounter few bugs while decoding wave files. Though the bugs were pretty small and easily fixable, they took some time to get spotted.

1.When I read the wave files in buffer, and used it to create a new wave file, for no apparent reason, the output was a very scattered and noisy version. Now, it felt pretty ridiculous to me that how can the output be different if all I am doing is reading it as it is and sending it as output.

Difference in input and output wave files.

After spending some time on it, I ended up opening both the input and output wave files in a hex viewer and search for the differences. After few bytes, I could see that the hex code shifted by one, i.e.
```
   632e ca30 0a2d 582f 4c2b 8c2d 032a 2b2c
   632e ca30 0d0a 2d58 2f4c 2b8c 2d03 2a2b	
```
Now, it became clear. Everywhere in the file, 0a was replaced by 0a0d. Now this immediately raised a question in my mind. Could it be..? I Googled it, and yes! It was the endline characters. All the LF were converted to CRLF leading to the distortion in the output file. Turn out, while I was reading the file in binary format, I was writing it in the text format, which led to this fiasco. Once the bug found, setting the ios::binary flag was all it took.

2.Another intresting thing that happened was, everything was being decoded perfectly, except the SampleRate. This was strange because it was in the middle of various value - which were decoded perfectly. I had to read the detailed specification to find out that the data is in the form of unsigned values. I had not considered it and I was storing it in signed char. So, where the values should be, say for example, 255 it stored -128.

3.The specs which I used for reference were incomplete. It missed the fact that the wave file is not necessarily 44 Bytes and that it may contain some meta-data. So, to find the data subchunk, I manually searched for them in the stream.

At the end, the wave files are now being correctly read, and the the samples are collected as a vector of 16 bit integers (unsigned short int) : vector<int16_t> _samples.
Read about various available VAD techniques.

I researched about various available VAD techniques, compared their pros and cons and at the end decided to go with Google’s webRTC. They have one of the best VADs out there and the code is already in C. It’s documentation to use the native C code is almost non-existant. While I was able to locate the function that performs VAD by peeking into the files myself, it was not very clear about the arguments that needed to be supplied.

I did search the discuss-webRTC Google group, but there was nothing to be found in the forum. I even asked them on their IRC channel, but I was instructed to ask the question on StackOverflow, which I did. I was not very sure that someone would answer and when I saw the stats for webRTC related question, that increased my suspicion.

68.4% Questions Unanswered in last 30 days.

Fortuantely I found out that there is a python wrapper of webRTC’s VAD in the Github. I immediately sent an email to John Wiseman, who created the awesome py-webrtcvad and asked him if he could help. Now, I wasn’t sure that I would find a reply. So, I looked into the SO stats to see the “most active” members on the webRTC tag and also emailed two people from there.

I was surprised to find that all the three people replied and all were willing to help. This is what I love about the open source community, everyone is eager to help. ❤️ Sometimes this comes rather surprising to me. Thank you John, Ajay and Gilad. :)

So ultimately John answered my question on SO, and I have a small implementation of VAD ready. In the coming weeks I’ll improve it and bring it to use.
Fixed the CMake build script of CCExtractor for windows.

CCExtractor had broken CMakeLists.txt since a long time. Due to this, the only way to compile it on Windows was using Visual Studio. I prefer to work in CLion so I spent some time and fixed the CMake list. It should also be helpful in bringing the Windows support to Sample-Platform which can use this to build it.

What’s next?

This was the first of total thirteen weeks. There’s a lot of work left. The work already done is also pretty raw and there’s a room for tons of improvement. My mentors have suggested a few changes as well. E.g. The tool needs to be able to read directly from the stream (like the data being piped from FFMPEG). I will stick to the timeline and continue the work. I am in constant contact with my mentors and shall work in accordance with their feedbacks.

The second week has already commenced and the evaluations are in the June end. I hope other GSoC participants are enjoying their work as well! 📈

Google Summer of Code, Week 1 : The Beginning! 🕺 was originally published by Saurabh Shrivastava at //Saurabh Shrivastava on June 09, 2017.

GSoC 2017, Coding Period Begins! ⚡

2017-05-29T19:30:00+05:30

This marks the ending of a month long community bonding period. Fine-tuning deliverable, setting up timeline, early coding and much more, here’s my GSoC 2017 community bonding experience with CCExtractor Development. 😊

As it turns out, the little backstory written below is not little afterall. 😛 Click here to skip it and jump to later portion.

A little backstory

I began preparing for GSoC in November and at it took me no time to decide that CCExtractor is the one I would love to work with. It was one of the first orgs I researched about and I immediately liked them. They were amidst Google Code-In and it was actually pretty awesome to watch the participants and mentors working together.

I distinctly remember the starting few days. Though I had joined CCExtractor’s discussion group on Slack, I did not say anything in the beginning. I only stayed their and read their conversations. It was not because I didn’t have anything to say or that I did not want to introduce myself, but because I was hesitant. I had never done this before. I was dead scared, and nervous. But then I saw how lovely the community is. The Code-In participants were in the age group of 13-18 and seeing them work together gave me a boost in confidence. The mentors were super - supportive. Even when someone was being PITA they handled it pretty flawlessly. So, I thought, if these kids can do it, why can’t I (though in reality, they were completing Code-In taks and I was merely trying to introduce myself 😛). So, I introduced myself with my first PR which was a mere correction of a typo in a readme file. I was surprised to see how welcoming everyone was. Not only the mentors, even the participants gave me warm welcome. My hesitation went from 100% to 0% in no time.

That was it. From that moment I actively started participating in the discussions, code reviews, random chats and what not. I started contributing and also helped the code-in students in verifying their code and helping them test things. Along with the main CCExtractor tool, I also had a lot of fun working on Sample-Platform! Also, contributing to sample-platform taught me a great deal of things - not just talking about code - but other important things like open-source ethics, working with people, making good PRs, asking good and quality questions, doing research, proper version control et cetra. Thank you Carlos and Willem, I can not thank you both enough! During this period I also became good friends with Alex and Evgeny who were code-in participants (and winners) and now GSoC mentors. It’s super-fun to work with all of these guys and its awesome to be a part of the community!

Communinty Bonding Period

The amazing news of my proposal getting selected for this GSoC was announced on 4th of May which also marked the begining of community bonding period. Community bonding is basically a month allocated to learn about one’s organization’s processes - release and otherwise - developer interactions, codes of conduct, et cetera. What once choose to do during this period varies.

Since I was already active in the community since quite some time, I was already familiar with most of the people and with the general practices. I have two mentors - Carlos Fernandez Sanz (who originally built CCExtractor) and Alex Bratosin (CCExtractor GCI 2016 Winner).

So here are the things I did during the community bonding period.

Set-up a blog.

This was actually quite important. To note down my GSoC progress and to keep mentors and everyone in the loop, I created this blog. It’s actually a simple Jekyll blog hosted on Github Pages.

Blog link : https://saurabhshri.github.io/.

Additionally I decided that it would be easier for my mentors and me to keep a check on work if we have a detailed checklist of the tasks which is to be done. My mentors agreeed, and hence I am maintaining the milestones and deliverable in the form of a checklist as Github Gist, so that it’s easily embeddable.

Milestones/Weekly deliverable checklist : https://saurabhshri.github.io/gsoc/ .

Till now I have written about three posts (fourth including this) about my GSoC work on the blog the first being announcement of me getting accepted. All my GSoC work progress and related work is tagged under the “Gsoc” category.

Category wise posts : https://saurabhshri.github.io/categories/ .

Screenshot of one of the blog posts
Brought myself a chair.

Students participating in GSoC are expected to work at least 40 hours a week, and never having done anything for such a long time before, it was an easy realization that a proper chair is a must. I brought a nice ergonomic chair with lambar support and adjustable height and it really helps in the backpain caused due to bending in front of laptop. I think it was worth an investment.
Completed GSoC formalities including setting up Payoneer account.

This took more efforts than I originally expected. Setting up mode of payment (i.e. deciding to recieve money in USD or INR) took some significant amount of time. I visited several banks to enquire about their fees. Plus when I decided and set-up the method, it was chosen wrongly dure to some error on Payoneer’s end. I had to again send some documents and information to them on email, after around two weeks of email interchanges the account was finally set-up.

Google added as funding source in Payoneeer account

I will post a detailed post about setting up payment process to help future students sometime soon.
Fine tuning deliverable.

I spent a lot of time reading about the techniques I might incorporate while doing the project. Though I had made a pretty detailed plan while making the proposal, I fine tuned the deliverable for first phase. They can be found in the checklist I mentioned above.
Creating sample repository.

The kind of samples that are required for my tool are a bit restrcitive. Two of the primary conditions are :

a. samples must have subtitles
b. samples must have clear speech audio

To find such samples I enquired various people like creator of Kaldi based forced-aligner Gentle and CMUSphinx community. Nikolay from CMUSphinx gave me a very good idea to use Ted Talks as sample. They fit both my primary requirements and are easily available to download as well.

Some Ted samples

Also, Carlos is sending a 2 TB disk loaded with transport streams recorded from various TV stations. Let’s see when I get to recieve that.
Started coding early.

I decided to start coding early and hence started to work on a subtitle parser which is one of the primary requirement of my tool. I have already completed it (at least the part required in my project).

Subtitle Parser

Link to parser : https://github.com/saurabhshri/simple-yet-powerful-srt-subtitle-parser-cpp .

Link to relevant blog posts :

a. Creating a full blown (SRT) Subtitle Parser

b. Simple yet powerful single header srt subtitle parsing library in cpp

I have also begun working on setting up testing scripts which will help me during coding period.
Researching

This is something which I will need to do throughout. A lot of things which I am planning to do in my project like VAD, ASR, Phoneme Recognition, LM Training et cetera are something which are new to me. It’ll be interesting to learn about them and implement them. I am quite excited and a little nervous about them. :)
Setting up development environement.

I inititally hoped to recieve a seperate server to work upon, but I understand that it is not entirely possible to give me my own separate server. I will be working on the gsocdev3 server on which I used to work before. It’s an excellent machine. The internet is quite good and it has huge sample collection. I also have my laptop with several VMs all ready to test the code I write. I have also enrolled myself in AWS free tier in case I need it.

What’s next?

The coding period officially commences from 30th May. I will stick to the timeline and work to tick those square boxes in the checklist. I will try to complete the tasks ahead of time in order to save time in case I get stuck at some point (which is often the case developing something new). Also, first phase evaluations begin on 26 June.

I wish all the participants a fun and productive summer. 🙋

GSoC 2017, Coding Period Begins! ⚡ was originally published by Saurabh Shrivastava at //Saurabh Shrivastava on May 29, 2017.

GSoC 2017 : End of Community Bonding Period

2017-05-29T15:10:00+05:30

Community bonding period, where students spend time learning and getting ready for their project, ends on 30th of May.

The Google Summer of Code 2017 community bonding period began with the announcement of accepted projects on 4th of May and will end tomorrow, on 30th of May. This will mark the official commencement of Coding Period, which will last for 3 months. Students shall begin writing code for their projects under their mentors based on the timeline and milestones they set during this community bonding period. First evaluations are scheduled to take place from 26th June till 30th June.

I will post about my community bonding experience and the work progress in my next blog post soon. :) Meanwhile you can find my profile on Github here : https://github.com/saurabhshri and my GSoC timeline and deliverable checklist here : https://saurabhshri.github.io/gsoc/ .

I wish all the best too the GSoC 2017 participants (including me). Let’s make something awesome this summer!

GSoC 2017 : End of Community Bonding Period was originally published by Saurabh Shrivastava at //Saurabh Shrivastava on May 29, 2017.

Simple yet powerful single header srt subtitle parsing library in cpp

2017-05-29T11:11:00+05:30

Srtparser.h : Simple, yet powerful single header C++ SRT Subtitle Parser Library. 💖

This is a follow-up blog post to my previous post, where I began implementing a fully functional but simple to use subtitle parser in C++ for my GSoC project.

I am happy to announce that the subtitle parser is ready. You may access it here : https://github.com/saurabhshri/simple-yet-powerful-srt-subtitle-parser-cpp

I have tried my best to document it in the Github repo, and it should be fairely easy to use. But in case you have any doubt or you need any help, feel free to contact me or raise an issue in that Github repo, I will be happy to help. 😁

The parser is super easy to use and has tons of features ✨ :

It is a single header C++ (CPP) file, and can be easily used in any project.
It is focused on portability, efficiency and simplicity and has no external dependency.
Wide variety of functions at programmers’ disposal to parse srt file as per need.
Some amazing and useful capabilities such as :
- extracting and stripping HTML and other styling tags from subtitle text.
- extracting and stripping speaker names.
- extracting and stripping non dialogue texts.
- extracting words as list.
- get time in both string and in milliseconds.
It is super easy to extend and customize.

I could not find any other subtitle parser / parsing library which was apt for my usage and I ended up creating one myself. Hope this is helpful to other developers as well. This is definitely one of the best ones out there. Feel free to use it.

How to use srtparser.h subtitle parser

Include the header file in your program.
```
     #include "srtparser.h"
```

Create SubtitleParserFactory object. Use this factory object to create SubtitleParser object.

     SubtitleParserFactory *subParserFactory = new SubtitleParserFactory("inputFile.srt");
     SubtitleParser *parser = subParserFactory->getParser();

Use the parser. 😉

     std::vector<SubtitleItem*> sub = parser->getSubtitles();
     long int startTime = sub→getStartTime();

Read more about the available functions in this easy to read and explanatory table : https://github.com/saurabhshri/simple-yet-powerful-srt-subtitle-parser-cpp#parser-functions .

You may also checkout a demo program using this library in example/ directory.

Elements present in subtitle - item

Following is the list of all fields present in subtitle item :

    long int _startTime;                    //in milliseconds
    
    long int _endTime;
    
    std::string _text;                      //actual line, as present in subtitle file
    
    long int timeMSec(std::string value);   //converts time string into ms

    int _subNo;                              //subtitle number
    
    std::string _startTimeString;           //time as in srt format
    
    std::string _endTimeString;
    
    bool _ignore;                           //should subtitle be ignore; used when the subtitle is empty after processing
    
    std::string _justDialogue;              //contains processed subtitle - stripped style, non dialogue text removal etc.
    
    int _speakerCount;                      //count of number of speakers
    
    std::vector<std::string> _speaker;      //list of speakers in a single subtitle
    
    int _nonDialogueCount;                  //count of non spoken words in a subtitle
    
    std::vector<std::string> _nonDialogue;  //list of non dialogue words, e.g. (applause)
    
    int _wordCount;                         //number of words in _justDialogue
    
    std::vector<std::string> _word;         //list of words in dialogue
    
    int _styleTagCount;                     //count of style tags in a single subtitle
    
    std::vector<std::string> _styleTag;     //list of style tags in that subtitle

Downloading the header file.

You may download the header file in any way desireable.

Simply clone the repo using

git clone https://github.com/saurabhshri/simple-yet-powerful-srt-subtitle-parser-cpp.git
Download the zip file from

https://github.com/saurabhshri/simple-yet-powerful-srt-subtitle-parser-cpp/archive/master.zip

License

srtparser.h library is licensed under MIT License (find it here). Feel free to use it in your application. :) Happy development!

Contribution and Feature request/ Bug

Feel free to raise an issue or make a feature request here.

Also, feel free to contribute to the project. Your help would highly be appreciated! 😀

Simple yet powerful single header srt subtitle parsing library in cpp was originally published by Saurabh Shrivastava at //Saurabh Shrivastava on May 29, 2017.

Creating a full blown (SRT) Subtitle Parser

2017-05-16T00:00:00+05:30

Creating a C++ subtitle parsing library to fetch and process subtitle file easily and efficiently.

EDIT: I have completed building the parser, read more about it here : https://saurabhshri.github.io/2017/05/gsoc/simple-yet-powerful-single-header-srt-subtitle-parsing-library-in-cpp.

For my GSoC 2017 project, CCAligner - Word by Word Audio Subtitle Synchronization Tool, the very first step required is processing of subtitle to extract primarily two things - the words those are being spoken, and time duration in which they are being spoken.

A usual SubRip (SRT) subtile file has 4 basic components :

A number indicating which subtitle it is in the sequence.
The time that the subtitle should appear on the screen, and then disappear.
The subtitle itself.
A blank line indicating the start of a new subtitle.

All of these, of course are textual.

E.g.

1
00:00:00,520 --> 00:00:03,536
Chris Anderson: Elon, hey, welcome back to TED.

2
00:00:03,560 --> 00:00:04,936
<i>(Applause)</i>

3
00:00:04,960 --> 00:00:06,536
Elon Musk: Thanks for having me.

4
00:00:06,560 --> 00:00:09,416
CA: So, in the next half hour or so,

Now as you can see, a subtitle file may also include some text in the text field which is not actually spoken but is there to convey some information. The second subtitle is a perfect example of this. The words (applause) are not spoken but are present in text field. Moreover, styling tags may also be associated as they are valid in SubRip (SRT) format. The <i> .. </i> for example, denotes that the line is to be displayed in italics.

My project (CCAligner), as per plan, should take two files as input - the video/audio file itself and it’s subtitle file. In order to perform alignment, the tool needs to read subtitle file and extract meaningful data in required format. This requires parsing i.e. the tool, needs to parse the subtitle file and present / extract data in suitable format. In order to do so, I am creating a parser which shall perform this task for me.

All in all, the subtitle (SRT) parser must have at least following functionalities :

Should be written in C++ (CPP) language, as my tool will primarily be in C++.
Should be capable of returning starting time, ending time and actual subtitle text with simple function calls.
Should be able to extract and strip HTML and other styling tags (e.g. hello to hello).
Should be able to extract and strip speaker names (e.g. Elon: Hi to Hi).
Should be able to extract and strip non dialogue elements (e.g. (applause) to {blank}).
Should be easy to use and efficient in functioning.

One of the major advantages of Open-Source is, that one does not need to reinvent the wheel. :) So obviously before head-diving into writing the code I did a bit of search to see what work is already been done. There are not a lot of SRT parser I could find which were written in C++, but I found a one in which there is a base, but everything is very raw and perfect for adding onto. Obviously I wasn’t expecting to find a parser exactly as per my need. Here’s the repository.

I have already begun working on it. The original parser could only return starting and ending time (in ms) and subtitle text. I have modified it to be capable of performing the functionalities I listed above. The work is still raw, but I am sure it should come out great.

Right now my srt parser successfully does the following :

Return timestamps in both string format and in ms.
Capable of strippin style tags, non dialogue data, speaker names.
Extract speaker names.

But there are some issues, after all I have only just begun working on it. These issues are expected to be resolved soon.

Style tags and non dialogue texts are stripped but not stored.
Only single-name speaker name is extracted, i.e. works for Elon: Hi but only extracts last name for Elon Musk: Hi. I’ll add a check to extract both.
To convert time to ms, I am currently usin regex, which I do not like. Plus it os only available in C++ 11 and above. So, I have to write an alternative for that.

I am also planning to make this parser a single header library. After solving the above issues I’ll work on that and upload it as an entirely new repository.

Current Parse Output :

I have implemented a lot of functions in my SRT parser,here’s a demo of some of them :

for(SubtitleItem * element : sub)
    {
        myfile<<"start : "<<element->getStartTime()<<endl;
        myfile<<"end : "<<element->getEndTime()<<endl;
        myfile<<"text : "<<element->getText()<<endl;
        myfile<<"justDialogue : "<<element->getDialogue()<<endl;
        myfile<<"speakerCount : "<<element->getSpeakerCount()<<endl;

        if(element->getSpeakerCount())
        {
            std::vector<std::string> name = element->getSpeakerNames();
            for(std::string display : name)
                myfile<<"speakers : "<<display<<", ";
            myfile<<endl;
        }
        myfile<<"ignore : "<<element->getIgnoreStatus()<<endl;
        myfile<<"____________________________________________"<<endl<<endl;
    }
    

For the input :

1
00:00:00,520 --> 00:00:03,536
Chris Anderson:
Elon, hey, welcome back to TED.

2
00:00:03,560 --> 00:00:04,936
It's great to have you here.

3
00:00:04,960 --> 00:00:06,536
Elon Musk: Thanks for having me.

4
00:00:06,560 --> 00:00:09,416
CA: So, in the next half hour or so,

I recieved the following output :

start : 520
end : 3536
text : 
Chris Anderson:
Elon, hey, welcome back to TED.


justDialogue : 
Chris 
Elon, hey, welcome back to TED.


speakerCount : 1
speakers : Anderson, 
ignore : 0
____________________________________________

start : 3560
end : 4936
text : 
It's great to have you here.


justDialogue : 
It's great to have you here.


speakerCount : 0
ignore : 0
____________________________________________

start : 4960
end : 6536
text : 
Elon Musk: Thanks for having me.


justDialogue : 
Elon Thanks for having me.


speakerCount : 1
speakers : Musk, 
ignore : 0
____________________________________________

start : 6560
end : 9416
text : 
CA: So, in the next half hour or so,


justDialogue : 
So, in the next half hour or so,


speakerCount : 1
speakers : CA, 
ignore : 0
____________________________________________

Looks good, right? :)

Of course there’s a lot of work left. I will also spend quite a time in optimizing the performance and documenting it so that other people can also use it.

If you guys have any suggestions for the parser, or would like to request a feature, feel free to post in the comments or mail me. I will be adding link to the repository as soon as I upload it.

Creating a full blown (SRT) Subtitle Parser was originally published by Saurabh Shrivastava at //Saurabh Shrivastava on May 16, 2017.

Here's why I waited 3 days before revealing about my GSoC selection.

2017-05-08T00:00:00+05:30

I got accepted into Google Summer of Code 2017, the result of which was announced on May 4th. I did not tell anyone about this until yesterday, here’s why!

While I had been preparing for GSoC since November, no one in my vicinity knew about it. It wasn’t a very big secret and I had no major reason for hiding it - except for the fact I did not want to get disctracted by the thoughts of what my friends and peers were thinking about it and the constant questioning that would have risen. Working quitely gave me the benefit of remaining focussed and working without any burden of people’s opinions and expectations. It also gave me an amazing opprtunity to surprise my family and closed ones when I finally got accepted into GSoC. I will definitely write about that and update this post with link to it. So visit again some time later if you are interested in reading that.

The thing is, when I decided to try for GSoC (it was around November previous year), I made a promise to myself that my mother would be the first person I tell about this if I get accepted. The problem was, when the result was announced, I was amidst my semester examinations. I sure did not want to miss my mother’s expression by telling her over the phone, I wanted to be there when I tell her this. So, even being the blabber-mouth I am, I somehow had to keep this news to myself.

When the result was announced, I rushed out of room with my laptop and cell-phone to another college building, because there were lot of people in my room at that time. I sat there and logged into the dashboard to find this amazing news. :) I immediately had the conversation regarding the same with the community and then went back to my room to prepare for next exam.

My exams finally ended on 6th May, and I took the morning train to home the immediate next day! I had arranged a surprise announcement for my family and friends and then I finally told my parents about my selection. I would like to thank everyone for their wishes and compliments and I am really sorry to everyone because I did not tell you all earlier about this. But I hope you understand the reason now.

I finally published the long-due post about me getting accepted into GSoC prior to this post (read that here) and now that I am home, it’s time to get to work as per my proposed timeline! :)

Thank you for reading, feel free to comment with your views and questions, if any!

Here's why I waited 3 days before revealing about my GSoC selection. was originally published by Saurabh Shrivastava at //Saurabh Shrivastava on May 08, 2017.

Accepted in Google Summer of Code 2017!

2017-05-08T00:00:00+05:30

My proposal CCAligner - Word by Word Subtitle Synchronization with CCExtractor Development has been accepted for Google Summer of Code (GSoC) 2017!

If you are reading this, chances are, you already know what’s GSoC. In case you don’t, here’s official link to the same : http://g.co/gsoc .

I am very happy to announce that my proposal to build a tool for word by word audio subtitle synchronization has been selected for Google Summer of Code. I will be working with the organization CCExtractor Development which made the de-facto subtitle extraction tool - CCExtractor. I am super excited to work with my mentors Carlos Fernandez Sanz (who originally built CCExtractor) and Alex Bratosin (CCExtractor GCI 2016 Winner).

What is my project about?

I have named my project CCAligner as it conveniently lays out it’s basic functionality and also adheres to the name of it’s parent tool CCExtractor. So, what generally happens is that the usual subtitle files (such as SubRips) have line by line synchronization in them i.e. the subtitles containing the dialogue appear when the person starts talking and disappears when the dialogue finishes. This continues for the whole video. For example :

1274
01:55:48,484 – 01:55:50,860
The Force is strong with this one

In the above example, the dialogue #1274 - The Force is strong with this one appears at 1:55:48 remains in the screen for two seconds and disappears at 1:55:50.

The aim of the project is to tag the word as it is spoken, similar to that in karaoke systems.

E.g.

The———[6948484:6948500]
Force——[6948501:6948633]
is————[6948634:6948710]
strong—–[6948711:6949999]
with——–[6949100:6949313]

In the above example each word from subtitle is tagged with beginning and ending timestamps based on audio.

Why is this useful?

While watching a video, it makes sense to have a whole / part of sentence displayed on screen rather than individual words as they are spoken. But there are cases where having timing information of each word is very important. Think of a scenario where you have to tag an occurrence of an event, marked by a special word, then having the information about when the word was spoken is what we need. This is a very basic example to just give you an idea. I have written about various possible applications of this tool in my proposal, and do give it a read if you are interested.

I really hope by the end of summer, the tool gets ready to be used. The basic flow of usage would be really simple. Just call the tool, pass the audio file, the subtitle file, choose the mode, the output type and the result should be word by word subtitle synchronization.

What am I doing right now?

Right now, it’s community bonding period. I just finished my exams and returned home. As mentioned in the timeline, I will be spending this month fine-tuning my deliverables by discussing with my mentors, and also making a sample repository for me to test the tool.

The GSoC result was announced on 4th, why such a late post? Am I lazy?

While I might be lazy, which I most certainly am, it has nothing to do with me being late. I fell in love with open-source ever since I made my first contribution, and I extremely excited for this GSoC. I was amidst my semester examinations when the result was announced. I had already intimidated my mentors about the same and they themselves advised me to prioritize exams. I will be publishing few more posts about the same later on.

Where’s the proof of selection and my proposal?

I am including this, only to show-off my name on GSoC website :P . This is the official link about my project on Google Summer of Code website - https://summerofcode.withgoogle.com/projects/#5589068587991040 . I shall also soon add myself on CCExtractor’s website.

About my proposal, you may find my GSoC 2017 proposal for CCExtractor Development here and in case that link doesn’t work (please comment about the same, I shall replace it), here’s the mirror.

Follow this blog to read my future posts. Thank you for reading. Feel free to comment with your views, questions and criticism, if any. I would love to discuss them. :)

Accepted in Google Summer of Code 2017! was originally published by Saurabh Shrivastava at //Saurabh Shrivastava on May 08, 2017.

Hello World!

2017-03-11T00:00:00+05:30

Hello everybody! This blog will contain all the updates by me. I will try my best to update it regularly. Stay tuned!

I am Saurabh Shrivastava a 3rd year Information Technology Engineering undergrad at IET DAVV, Indore, India. My interests lie in the field of computers. I was introduced to programming in 11th standard and scored a perfect 100 in my 12th Computer Science Board Examinations. I love programming in C and C++ and am constantly learning new things.

I recently got introduced to the world of open source and I haven’t looked back since. I absolutely love the idea of open source software and am thouroughly enjoying it. I am trying my part to contribute as I learn :) It feels amazing to know that your contribution is actually out there and people are using it. I also started competitive coding at CodeChef but haven’t been active since sometime. I shall start again as and when I get time (should be soon after this semester ends).

You may read more about me in the About Me section which I should be updating soon.

This blog post is basically to test this newly set up blog. You may read more about the details of blog set-up below :

Platform: The blog is hosted on Github Pages with the help of Jekyll. The blog was up in literally minutes without much hassle.
Theme: The theme is Notepad by @hmfaysal. I have always been a fan of plain and simple design.
Source: You may find the source of this blog here.

Thank you for reading! :) Comment here to share your suggestions.

Hello World! was originally published by Saurabh Shrivastava at //Saurabh Shrivastava on March 11, 2017.

america	AH M EH R AH K AH	AE M EY R AY K AE
hypocrit	HH IH P IH K R IH T	HH AY P AA K R IH T
trump	T R AH M P	T R AH M P
saurabh	S AO R AE B	S OW R AE B HH