Hacker News (Chinese edition)

Hi,

I have a largish (2 GB) corpus of curated, high-quality text in a low-resource language, and I want to build a model that would provide an advanced "autocomplete" service for writers.

I'm thinking of taking a decoder-only model such as Llama, Mistral or Gemma, slicing off the embedding layers (which are based on languages I don't need), creating new ones (perhaps initialized from a FastText model trained on the corpus), pairing them with a tokenizer newly built from my corpus, and then training the model on my corpus until convergence.

Additional potential details include: a custom loss function for synonym-aware training (based on a custom high-quality thesaurus), where synonyms of the "correct" word are somewhat rewarded; and POS-tagging the corpus with a language-specific POS tagger, then adding a POS-tagging head to the model as a multi-task learning objective, to force grammatical generation.

In order to be able to use a good model as the base, I will probably be forced to use PEFT (LoRA). My current setup is whatever is available on Colab Pro+, so I can probably use models in the 7B-12B range?

My main question is: which base model would be best for this task? (Again, for completion of general writing of all kinds, not programming or advanced reasoning.)

Also, will the synonym and POS additions help or hurt?

Anything else I might be missing?

Thanks!
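To make the synonym-aware loss idea a bit more concrete, here is a minimal sketch of one possible reading of it: ordinary cross-entropy, but with a small share of the target probability moved from the "correct" token onto its thesaurus synonyms (essentially label smoothing restricted to synonyms). Everything named here is an assumption for illustration: the function names, the `synonym_mass` value of 0.1, and the idea that synonyms map to single token ids. It uses plain Python for one token position; a real training loop would do the same computation on batched tensors.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def synonym_soft_targets(target_id, synonym_ids, vocab_size, synonym_mass=0.1):
    """Target distribution: most mass on the correct token,
    `synonym_mass` (a hypothetical hyperparameter) split evenly among synonyms."""
    targets = [0.0] * vocab_size
    synonyms = sorted(set(synonym_ids) - {target_id})
    if not synonyms:
        # No synonyms known: fall back to a one-hot target.
        targets[target_id] = 1.0
        return targets
    targets[target_id] = 1.0 - synonym_mass
    share = synonym_mass / len(synonyms)
    for s in synonyms:
        targets[s] = share
    return targets


def synonym_aware_loss(logits, target_id, synonym_ids, synonym_mass=0.1):
    """Cross-entropy between the soft synonym targets and the model's prediction."""
    log_probs = log_softmax(logits)
    targets = synonym_soft_targets(target_id, synonym_ids, len(logits), synonym_mass)
    return -sum(t * lp for t, lp in zip(targets, log_probs) if t > 0.0)


if __name__ == "__main__":
    # Toy vocabulary of 4 tokens; token 0 is correct, token 1 is its synonym.
    prefers_synonym = [2.0, 1.0, 0.0, 0.0]
    prefers_unrelated = [2.0, 0.0, 1.0, 0.0]
    print(synonym_aware_loss(prefers_synonym, 0, [1]))
    print(synonym_aware_loss(prefers_unrelated, 0, [1]))
```

With this formulation, a prediction that puts its second-highest probability on a synonym is penalized less than one that puts it on an unrelated token, while the correct word still receives most of the target mass. With an empty synonym list it reduces to standard cross-entropy.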

Ask HN: What is the best base model for CLM fine-tuning?