Ask HN: What is the best base model for CLM fine-tuning?

5 points by philomath868, 9 months ago
Hi,

I have a largish (2 GB) corpus of curated, high-quality text in a low-resource language, and I want to build a model that would provide an advanced "autocomplete" service for writers.

I'm thinking of taking a decoder-only model such as Llama, Mistral or Gemma, slicing off the embedding layers (which are based on languages I don't need), and creating new ones (perhaps initialized from a FastText model trained on the corpus), paired with a tokenizer newly created from my corpus. I would then train the model on my corpus until convergence.

Additional potential details include:

- a custom loss function for synonym-aware training (based on a custom high-quality thesaurus), where synonyms of the "correct" word are somewhat rewarded;
- POS-tagging the corpus with a language-specific POS tagger and adding a POS-tagging head to the model as a multi-task learning objective, to force grammatical generation.

In order to be able to use a good model as the base, I will probably be forced to use PEFT (LoRA). My current setup is whatever is available on Colab Pro+, so I can probably use models in the 7B-12B range?

My main question is: which base model would be best for this task? (Again, this is for completion of general writing of all kinds, not programming or advanced reasoning.)

Also, will the synonym and POS additions help or hurt?

Anything else I might be missing?

Thanks!
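To make the synonym-aware loss idea above a bit more concrete, here is a minimal sketch in plain Python. It is only one possible reading of "synonyms are somewhat rewarded": the true next token keeps most of the target probability, and a small share (`synonym_mass`, a made-up parameter) is split equally among its thesaurus synonyms. The function names and the default of 0.1 are assumptions for illustration, not an established recipe. It also assumes each synonym maps to a single token ID, which a subword tokenizer would not guarantee.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def synonym_soft_targets(vocab_size, target_id, synonym_ids, synonym_mass=0.1):
    """Build a soft target distribution over the vocabulary.

    Hypothetical scheme: the true token gets 1 - synonym_mass, and
    synonym_mass is shared equally among its synonyms. With no synonyms,
    this falls back to an ordinary one-hot target.
    """
    synonyms = sorted({s for s in synonym_ids if s != target_id})
    targets = [0.0] * vocab_size
    if not synonyms:
        targets[target_id] = 1.0
        return targets
    targets[target_id] = 1.0 - synonym_mass
    share = synonym_mass / len(synonyms)
    for s in synonyms:
        targets[s] = share
    return targets


def synonym_aware_loss(logits, target_id, synonym_ids, synonym_mass=0.1):
    """Cross-entropy between the soft targets and the model's prediction."""
    log_probs = log_softmax(logits)
    targets = synonym_soft_targets(
        len(logits), target_id, synonym_ids, synonym_mass
    )
    return -sum(t * lp for t, lp in zip(targets, log_probs) if t > 0.0)


# Toy example: vocabulary of 4 tokens, true token 0, token 1 is its synonym.
# The model strongly predicts the synonym, so it is penalized less than
# it would be under a plain one-hot cross-entropy.
logits = [0.0, 5.0, 0.0, 0.0]
plain = synonym_aware_loss(logits, target_id=0, synonym_ids=[])
soft = synonym_aware_loss(logits, target_id=0, synonym_ids=[1])
print(f"plain CE: {plain:.3f}, synonym-aware CE: {soft:.3f}")
```

In an actual training loop the same soft targets would be built as tensors and passed to the framework's cross-entropy. One thing to watch for is that any mass moved to synonyms also teaches the model to prefer the true word slightly less, so `synonym_mass` would probably need tuning against held-out perplexity.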