1. About SentencePiece
- GitHub: https://github.com/google/sentencepiece
- Paper, "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing": https://aclanthology.org/D18-2012.pdf
Unsupervised text tokenizer for Neural Network-based text generation.
SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training.
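To make "the vocabulary size is predetermined" concrete, here is a minimal standard-library sketch of a subword trainer that stops once a fixed vocabulary size is reached. It is a toy illustration only, not SentencePiece's implementation or API: the names `train_toy_bpe`, `merge_pair`, and `WORD_START` are hypothetical, and the greedy pair-merging rule and whitespace pre-splitting are simplifying assumptions made for this example.

```python
from collections import Counter

WORD_START = "\u2581"  # marker prepended to each word (assumption for this toy)


def merge_pair(word, pair):
    """Replace every adjacent occurrence of `pair` in `word` with one merged symbol."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)


def train_toy_bpe(corpus, vocab_size):
    """Greedily merge the most frequent symbol pair until the vocabulary has `vocab_size` entries.

    If the corpus already contains more distinct characters than `vocab_size`,
    no merges happen and the returned vocabulary is just the character set.
    """
    # Each word becomes a tuple of single-character symbols, weighted by frequency.
    words = Counter(
        tuple(WORD_START + w) for line in corpus for w in line.split()
    )
    vocab = {sym for word in words for sym in word}
    merges = []
    while len(vocab) < vocab_size:
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is a single symbol; nothing left to merge
        # Most frequent pair; ties are broken by the pair itself so runs are deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab.add(best[0] + best[1])
        merged = Counter()
        for word, freq in words.items():
            merged[merge_pair(word, best)] += freq
        words = merged
    return vocab, merges


if __name__ == "__main__":
    corpus = ["low lower lowest", "new newer newest"]
    vocab, merges = train_toy_bpe(corpus, vocab_size=12)
    print(sorted(vocab))
    print(merges)
```

The point of the sketch is the stopping condition: the target size is chosen before training, and the trainer adds subword units only until that budget is filled, so the downstream neural model can fix its embedding table size in advance.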
SentencePiece implements subword units (e.g., byte-p