Andreas Pogiatzis
1 min read · Jun 11, 2019


I don’t think I can say for sure. More precisely, it all boils down to how much of your text it can cover. If a good amount of your domain-specific corpus can be covered by the existing WordPiece vocabulary, then you should be fine.

I believe that it does not sum the embeddings; it feeds them to the model separately. So in the case of “schizophrenia”, if the BERT tokenizer can split it into pieces, then you can benefit from the pre-trained weights. Otherwise it will just be another unknown token.
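Here is a minimal sketch of that check, assuming the Hugging Face transformers package and the bert-base-uncased checkpoint (neither is mentioned above, they are just a convenient way to inspect the WordPiece tokenizer):

```python
# Sketch: see whether an out-of-vocabulary medical term gets split
# into WordPiece sub-tokens or falls back to [UNK].
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

pieces = tokenizer.tokenize("schizophrenia")
print(pieces)
# If the word is not in the vocabulary as a whole token, you should see
# several "##"-prefixed pieces rather than a single "[UNK]".
```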

It may be worth running some experiments to see how the BERT tokenizer performs on your corpus, or even skimming through the provided vocab.txt to see if there are entries similar to what you are looking for.
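One rough way to run that experiment, again assuming the transformers tokenizer and a plain text file corpus.txt as a stand-in for your domain corpus (both are my assumptions, not something from the original question):

```python
# Sketch: measure how aggressively the pre-trained WordPiece vocabulary
# fragments the words in a domain-specific corpus.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

total_words = 0
total_pieces = 0
unknown = 0

with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        for word in line.split():
            pieces = tokenizer.tokenize(word)
            total_words += 1
            total_pieces += len(pieces)
            unknown += pieces.count("[UNK]")

print(f"avg pieces per word: {total_pieces / max(total_words, 1):.2f}")
print(f"[UNK] rate: {unknown / max(total_words, 1):.2%}")
```

A high average pieces-per-word or a noticeable [UNK] rate would suggest the pre-trained vocabulary covers your domain poorly.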

Otherwise, if you want to build your own vocabulary, here is a GitHub repository that has WordPiece vocabulary building implemented. However, you may want to test it first because I haven’t tried it.

https://github.com/kwonmha/bert-vocab-builder
