Andreas Pogiatzis
2 min read · Sep 27, 2019


Hello, and thanks for your response! You are raising very valid points here, which are worth answering to make things clearer for other readers as well.

It is true that the code I have provided in this post does not include any part that glues the word pieces back together. I believe this depends on the use case, so I didn’t want to give just one concrete way of merging them. Still, I should have mentioned that in the post, so thanks for pointing it out.

To give you an example, in my last project I analysed how impactful the vectors of the second word pieces were, and it turned out that 95% of them were simply irrelevant, like ##ing, ##ed, etc. As a result, my merging strategy was to refer back to the original token list and drop the embedding of the second piece of the word, so I ended up with just one contextualized embedding and a one-to-one mapping of word to embedding. Alternatively, you can average, sum, or multiply the piece vectors to combine them; I am not sure how that would perform, and I am guessing it would take some experiments to find out. A sketch of both approaches follows below. Coming back to my initial argument, you may not even need to glue them back together. More precisely, if you are just feeding the output to another model and you don’t require word-level granularity, it is not necessary to merge them, e.g. sentiment analysis of a paragraph, or document duplicate identification.
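Here is a minimal sketch of the two merging strategies mentioned above (keep only the first piece, or average the pieces). It assumes you already have the WordPiece token list (continuation pieces prefixed with "##") and a matching embedding tensor from BERT; the function and variable names are illustrative and not from the original post.

```python
import torch

def merge_wordpieces(tokens, embeddings, strategy="first"):
    """Collapse WordPiece embeddings back to one vector per word."""
    words, vectors, current = [], [], []
    for token, vec in zip(tokens, embeddings):
        if token.startswith("##") and current:
            # Continuation piece: attach it to the word being built.
            current.append(vec)
            words[-1] += token[2:]
        else:
            # A new word starts here; flush the previous one.
            if current:
                vectors.append(_combine(current, strategy))
            current = [vec]
            words.append(token)
    if current:
        vectors.append(_combine(current, strategy))
    return words, torch.stack(vectors)

def _combine(pieces, strategy):
    if strategy == "first":   # keep only the first piece's embedding
        return pieces[0]
    if strategy == "mean":    # average all pieces of the word
        return torch.stack(pieces).mean(dim=0)
    raise ValueError(f"Unknown strategy: {strategy}")

# Example usage with dummy embeddings:
tokens = ["play", "##ing", "outside"]
embeddings = torch.randn(len(tokens), 768)
words, word_vectors = merge_wordpieces(tokens, embeddings, strategy="mean")
print(words)               # ['playing', 'outside']
print(word_vectors.shape)  # torch.Size([2, 768])
```

Either way, you keep a one-to-one mapping between words and contextualized vectors, which is what I needed in my project.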

Additionally, I don’t think the title should change, because these are indeed contextual word embeddings using BERT. The fact that you sometimes also get word piece embeddings is a subtle technicality which I explained here, and even though I agree that this can make the title not 100% accurate, it is certainly more accurate than changing it to “Contextualized word piece embeddings”, which could confuse some readers!

With regard to your second point, about the code being a duplicate of the pytorch-transformers example: well, it is, since my code is adapted from Google’s official BERT examples, which use the pytorch-transformers code, and that is pretty obvious. In my humble opinion, no reference is needed there.

Hope this answers your questions.
