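Ask HN: Better ways to detect plagiarism in a self-hosted LMS?
1 point • by pigon1002 • 4 days ago
I'm building an open-source learning management system (LMS) and added plagiarism detection using OpenSearch's more_like_this query to find candidates, plus character n-grams for similarity scoring.
Basically, when a student submits an answer, I search for similar answers from other students on the same question. It works reasonably well but feels a bit hacky, since I'm just reusing the search engine I already had.
Current setup: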
```
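import re

# Runs inside a classmethod of the document class.
# Restrict the search to answers for this question.
search = cls.search().filter(
    "nested",
    path="answers",
    query={"term": {"answers.question_id": str(question_id)}},
)
search = search.query(
    "nested",
    path="answers",
    query={
        "more_like_this": {
            "fields": ["answers.answer"],
            "like": text,
            "min_term_freq": 1,
            "minimum_should_match": "1%",
        }
    },
)

# Get the top 10 hits (assumes the opensearch-dsl Search API), then re-rank in Python.
response = search[:10].execute()


def normalize(t):
    """Remove all whitespace so spacing changes do not affect the score."""
    return re.sub(r"\s+", "", t.strip())


def char_ngrams(t, n=3):
    """Return the set of overlapping character n-grams in t."""
    return {t[i:i + n] for i in range(len(t) - n + 1)}


text_ngrams = char_ngrams(normalize(text))
similar = []  # assumed way of "flagging": collect (hit, ratio) pairs
for hit in response.hits:
    answer_ngrams = char_ngrams(normalize(hit.answer))
    union = len(text_ngrams | answer_ngrams)
    if union == 0:  # both texts are shorter than 3 characters
        continue
    # Jaccard similarity as an integer percentage.
    ratio = int(len(text_ngrams & answer_ngrams) / union * 100)
    if ratio >= 60:
        similar.append((hit, ratio))  # flag as similar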
```
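For anyone who wants to try the re-ranking step without an OpenSearch cluster, here is a minimal standalone sketch of the same character-trigram Jaccard scoring. The helper name `jaccard_percent` and the sample strings are made up for illustration, and returning 0 when both texts are too short to produce any trigram is an assumption, not something the setup above defines.

```python
import re


def normalize(t: str) -> str:
    """Remove all whitespace so spacing differences are ignored."""
    return re.sub(r"\s+", "", t)


def char_ngrams(t: str, n: int = 3) -> set:
    """Set of overlapping character n-grams; empty if t is shorter than n."""
    return {t[i:i + n] for i in range(len(t) - n + 1)}


def jaccard_percent(a: str, b: str, n: int = 3) -> int:
    """Jaccard similarity of the two n-gram sets, as an integer percentage.

    Hypothetical helper: returns 0 when neither text yields any n-gram.
    """
    grams_a = char_ngrams(normalize(a), n)
    grams_b = char_ngrams(normalize(b), n)
    union = grams_a | grams_b
    if not union:
        return 0
    return int(len(grams_a & grams_b) / len(union) * 100)


# "abcd" -> {abc, bcd}, "abce" -> {abc, bce}: 1 shared trigram out of 3 total.
print(jaccard_percent("abcd", "abce"))  # 33
# Whitespace is stripped before comparison, so these count as identical.
print(jaccard_percent("a b c d", "abcd"))  # 100
```

Because the score is computed on sets, repeated phrases count only once and word order matters only within a three-character window.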
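Constraints:
* Self-hosted only, no external APIs
* A few thousand students
* I want to keep operations simple, and I'm already running OpenSearch anyway
Questions:
* Is this approach reasonable, or am I missing something obvious?
* What do other self-hosted systems use? I checked the Moodle docs, but their plagiarism plugins mostly call external services
* Has anyone tried lightweight ML models for this that don't need a GPU?
The search engine approach works, but I'm curious whether there is a better way that fits our constraints.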