May 4, 2019
Hey,
If I remember correctly, the only difference is that the labels are an array (one label per token) instead of a single value, no?
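To make that difference concrete, here is a rough sketch in plain Python. The sentence, tag names, label ids, and the `pad_labels` helper are all made up for illustration and are not taken from any particular codebase.

```python
# Hypothetical toy data; tag names and label ids are illustrative only.
tokens = ["John", "lives", "in", "Berlin"]

# Sentence classification: one label for the whole input.
classification_label = 1

# NER: one label per token, so the target is an array.
tag_to_id = {"PAD": 0, "O": 1, "B-PER": 2, "B-LOC": 3}
ner_tags = ["B-PER", "O", "O", "B-LOC"]
ner_labels = [tag_to_id[tag] for tag in ner_tags]


def pad_labels(labels, max_len, pad_id=0):
    """Truncate or right-pad a label sequence to exactly max_len entries."""
    return labels[:max_len] + [pad_id] * max(0, max_len - len(labels))


# Models usually expect fixed-length inputs, so the label array is padded
# to the same length as the (padded) token sequence.
padded = pad_labels(ner_labels, max_len=6)
print(padded)  # [2, 1, 1, 3, 0, 0]
```

Keep in mind that BERT splits words into sub-word pieces, so in practice the per-token labels also have to be aligned with those pieces. How to do that is a design choice this sketch does not cover.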
Also, here is an NER example using BERT in a BiLSTM-CRF architecture. There is also an option to set the BERT layer as trainable (i.e. fine-tuned) or not.
Note that the code was very rushed, so there are many improvements possible. Some parts also need further explanation, but I may as well take the chance to write a blog post about it.
Hope this helps, mate.