NLPCC 2026

Task Guidelines

NLPCC 2026 Shared Task 10: The Reliability of AI-Assisted Scientific Reporting

1. Track Definitions

Track 1: Claim-Level Faithfulness to Experimental Results

Each example contains:

  • Evidence bundle: a result paragraph, optionally with tables/charts, extracted from an open-access NLP paper.
  • Claim paragraph: typically 2-4 sentences, generated to summarize or interpret the evidence bundle.

For each sentence in the paragraph, systems must output one label from the following set:

| Label | Description |
| --- | --- |
| Supported | The sentence is adequately supported by the evidence bundle. |
| Unsupported Causal Mechanistic | The sentence introduces an unsupported causal or mechanistic interpretation. |
| Unsupported Entity | The sentence mentions unsupported datasets, metrics, model variants, baselines, or related scientific entities. |
| Scope Overgeneralization | The sentence extends the supported scope beyond what the evidence warrants. |
| Contradiction | The sentence directly conflicts with the evidence. |

Each sentence is assigned a single primary label. In rare cases, a sentence may be annotated with multiple applicable labels; predicting any one of them is considered correct for evaluation purposes.

Track 1 Train-Dev Data Format

Each record in the released JSONL file contains the following fields:

{
  "claim_text": "string — the claim paragraph",
  "evidence_bundle": [
    { "type": "text", "text": "result paragraph text ..." },
    { "type": "table", "table_caption": ["Table 3: ..."], "img_path": "images/xxx.jpg" },
    { "type": "image", "image_caption": ["Figure 1: ..."], "img_path": "images/xxx.jpg" }
  ],
  "sentence_label": [
    { "sentence": "sentence text ...", "types": ["Supported"] },
    { "sentence": "sentence text ...", "types": ["Unsupported Entity"] }
  ]
}

| Field | Type | Description |
| --- | --- | --- |
| claim_text | string | The full claim paragraph. |
| evidence_bundle | list of objects | One or more evidence items. Each item has a type field ("text", "table", or "image") and type-specific content fields. |
| sentence_label | list of objects | Per-sentence annotations. Each object contains a sentence (string) and types (list of labels; typically one, see note above). |

At test time, the types field in sentence_label will be withheld. Participants will receive claim_text, evidence_bundle, and the sentence segmentation.

The image files referenced by img_path in the evidence bundle are provided separately in images.zip (train-dev) and images-testp1.zip (Phase 1 test). Inside the archives, files are stored at the archive root (e.g. <sha>.jpg). To match the relative images/<sha>.jpg paths used in the JSONL records, extract each archive into a directory named images/ (for example unzip data/images.zip -d data/images/).
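
As an illustration of the record format above, the following minimal sketch reads a JSONL file, resolves the img_path values against a data directory, and lists the per-sentence labels. The helper names (load_jsonl, evidence_image_paths, sentences_with_labels), the data root "data", and the example record are assumptions made for this sketch; they are not part of the official baseline kit.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSONL file: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def evidence_image_paths(record, data_root):
    """Resolve img_path values of evidence items against data_root."""
    root = Path(data_root)
    return [
        root / item["img_path"]
        for item in record["evidence_bundle"]
        if item.get("img_path")  # text items carry no img_path
    ]


def sentences_with_labels(record):
    """Return (sentence, labels) pairs; labels is [] when `types` is withheld."""
    return [
        (entry["sentence"], entry.get("types", []))
        for entry in record["sentence_label"]
    ]


# Hypothetical record that follows the documented schema (not real task data).
example = {
    "claim_text": "Model A beats the baseline. It also wins on every dataset.",
    "evidence_bundle": [
        {"type": "text", "text": "Model A outperforms the baseline on one dataset."},
        {"type": "table", "table_caption": ["Table 3: ..."], "img_path": "images/abc.jpg"},
    ],
    "sentence_label": [
        {"sentence": "Model A beats the baseline.", "types": ["Supported"]},
        {"sentence": "It also wins on every dataset.", "types": ["Scope Overgeneralization"]},
    ],
}

print(evidence_image_paths(example, "data"))
print(sentences_with_labels(example))
```

Using entry.get("types", []) lets the same helper run on test files, where the types field is withheld.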

Track 1 Train-Dev Data Statistics

| Statistic | Count |
| --- | --- |
| Records | 3,333 |
| Total sentences | 17,547 |

Negative label distribution (counts are not mutually exclusive):

| Label | Count |
| --- | --- |
| Unsupported Entity | 639 |
| Scope Overgeneralization | 605 |
| Unsupported Causal Mechanistic | 276 |
| Contradiction | 216 |

Track 2: Citation-Level Faithfulness to External Evidence

Each example contains:

  • Atomic claim: a single scientific claim associated with one cited paper.
  • Cited paper (full text): the cited paper in structured textual form, represented as a list of paragraphs with stable paragraph IDs (e.g., P1, P2, ...).

Systems must output:

  1. One support label from the following set:

     | Label | Description |
     | --- | --- |
     | Supported | The cited paper provides evidence directly supporting the claim. |
     | Overstate | The claim overstates what the cited paper supports. |
     | Topical Match | The cited paper is topically related but does not provide the evidential support required for the claim. |
     | Irrelevant | The cited paper has no meaningful support relation to the claim. |

  2. A ranked list of up to k = 3 evidence paragraph IDs, identifying the paragraphs in the cited paper that are most relevant to the support judgment.

Track 2 Train-Dev Data Format

{
  "claim_text": "string — the scientific claim",
  "cited_paper_full_text": [
    {"P1": "paragraph text ..."},
    {"P2": "paragraph text ..."},
    ...
  ],
  "evidence_para_ids": ["P12", "P13"],
  "evidence_texts": { "P12": "paragraph text ...", "P13": "paragraph text ..." },
  "label": "Supported | Overstate | Topical Match | Irrelevant"
}

| Field | Type | Description |
| --- | --- | --- |
| claim_text | string | The atomic scientific claim to be verified. |
| cited_paper_full_text | list of objects | The full cited paper, represented as a list of {paragraph_id: text} objects in document order. |
| evidence_para_ids | list of strings | Gold paragraph IDs that serve as evidence for the support judgment. |
| evidence_texts | object | A convenience mapping from each gold evidence paragraph ID to its text content. |
| label | string | The gold support label (one of: Supported, Overstate, Topical Match, Irrelevant). |

At test time, the label, evidence_para_ids, and evidence_texts fields will be withheld. Participants will receive only claim_text and cited_paper_full_text.
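
Because cited_paper_full_text is a list of single-key objects, it is often convenient to flatten it into one mapping from paragraph ID to text. The sketch below shows one way to do that and to check that gold evidence IDs resolve; the function names and the example record are assumptions made for illustration.

```python
def paragraph_index(record):
    """Flatten [{"P1": "..."}, {"P2": "..."}] into one dict, keeping document order."""
    index = {}
    for paragraph in record["cited_paper_full_text"]:
        index.update(paragraph)  # each object holds a single {paragraph_id: text} pair
    return index


def missing_evidence_ids(record):
    """Return gold evidence IDs that do not appear in the cited paper (ideally none)."""
    index = paragraph_index(record)
    return [pid for pid in record.get("evidence_para_ids", []) if pid not in index]


# Hypothetical record that follows the documented schema (not real task data).
example = {
    "claim_text": "Method X improves accuracy on benchmark Y.",
    "cited_paper_full_text": [
        {"P1": "We introduce method X."},
        {"P2": "Method X improves accuracy on benchmark Y by two points."},
    ],
    "evidence_para_ids": ["P2"],
    "evidence_texts": {"P2": "Method X improves accuracy on benchmark Y by two points."},
    "label": "Supported",
}

index = paragraph_index(example)
print(list(index))                    # paragraph IDs in document order
print(missing_evidence_ids(example))  # empty list when all gold IDs resolve
```

At test time evidence_para_ids is absent, so missing_evidence_ids falls back to an empty list.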

Track 2 Train-Dev Data Statistics

| Label | Count |
| --- | --- |
| Supported | 1,029 |
| Irrelevant | 382 |
| Topical Match | 376 |
| Overstate | 375 |
| Total | 2,162 |

2. Evaluation

Metrics

Track 1 uses two official metrics:

  1. Sentence-level Macro-F1 over all labels — balanced performance across all categories.
  2. Paragraph Exact Match (PEM) — a paragraph is correct only if all sentence labels are predicted correctly.

Track 1 ranking score = (Macro-F1 + PEM) / 2
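
The two Track 1 metrics and the ranking score could be computed roughly as follows. This is a sketch, not the official evaluator. It rests on two assumptions that the guidelines do not spell out: a sentence with several gold labels is scored against the predicted label when that label is among them (otherwise against the first gold label), and a label with no gold or predicted instances contributes an F1 of 0 to the macro average. The offline_eval scripts remain the authoritative reference.

```python
LABELS = [
    "Supported",
    "Unsupported Causal Mechanistic",
    "Unsupported Entity",
    "Scope Overgeneralization",
    "Contradiction",
]


def resolve_gold(gold_types, pred):
    """Assumption: any gold label counts as a hit; otherwise fall back to the first one."""
    return pred if pred in gold_types else gold_types[0]


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-label F1 (assumption: F1 = 0 for labels never seen)."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def track1_scores(paragraphs):
    """paragraphs: list of (gold_types_per_sentence, predicted_labels) pairs."""
    gold_flat, pred_flat, exact = [], [], 0
    for gold_types, preds in paragraphs:
        resolved = [resolve_gold(g, p) for g, p in zip(gold_types, preds)]
        gold_flat.extend(resolved)
        pred_flat.extend(preds)
        exact += resolved == list(preds)  # paragraph counts only if every sentence matches
    f1 = macro_f1(gold_flat, pred_flat)
    pem = exact / len(paragraphs)
    return {"macro_f1": f1, "pem": pem, "ranking": (f1 + pem) / 2}


# Hypothetical predictions for two paragraphs (not real task data).
paragraphs = [
    ([["Supported"], ["Unsupported Entity", "Scope Overgeneralization"]],
     ["Supported", "Scope Overgeneralization"]),
    ([["Contradiction"], ["Supported"]],
     ["Supported", "Supported"]),
]
print(track1_scores(paragraphs))
```

In this toy example the first paragraph is an exact match (the multi-label sentence is hit through its second label) and the second is not, so PEM is 0.5.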

Track 2 uses two official metrics:

  1. Label Macro-F1 over the four support labels — balanced performance on support-relation classification.
  2. Joint@3 — a prediction is counted as successful only if the system assigns the correct support label and retrieves at least one correct evidence paragraph in its top-3 evidence predictions.

Track 2 ranking score = (Macro-F1 + Joint@3) / 2

For Track 2, when the gold evidence list of a sample is empty, Joint@3 is counted as successful only if the predicted label is correct and the predicted evidence list is also empty (the match-empty policy of the offline evaluator). This is the policy used for the official ranking on both the public leaderboard (Phase 1) and the hidden test set (Phase 2).
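
The Joint@3 rule, including the match-empty policy, can be sketched as below. This is an illustrative reading of the description above, not the official evaluator. Averaging the per-sample outcomes into a single rate is an assumption, as are the function names and the example samples.

```python
def joint_at_3(gold_label, gold_ids, pred_label, pred_ids, k=3):
    """True if the label is correct and the evidence condition holds."""
    if pred_label != gold_label:
        return False
    if not gold_ids:
        # Match-empty policy: empty gold evidence requires an empty prediction.
        return not pred_ids
    gold = set(gold_ids)
    return any(pid in gold for pid in pred_ids[:k])  # only the top-k predictions count


def joint_at_3_rate(samples):
    """Assumption: Joint@3 is the fraction of samples judged successful."""
    hits = sum(joint_at_3(*sample) for sample in samples)
    return hits / len(samples)


# Hypothetical samples: (gold_label, gold_ids, pred_label, pred_ids).
samples = [
    ("Supported", ["P12", "P13"], "Supported", ["P3", "P13", "P7"]),  # hit in top 3
    ("Overstate", ["P4"], "Overstate", ["P1", "P2", "P3", "P4"]),     # P4 is ranked 4th
    ("Irrelevant", [], "Irrelevant", []),                             # match-empty success
    ("Irrelevant", [], "Irrelevant", ["P1"]),                         # non-empty prediction
    ("Supported", ["P2"], "Topical Match", ["P2"]),                   # wrong label
]
print(joint_at_3_rate(samples))
```

Only the first and third samples succeed, which shows that a correct label alone is not sufficient.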

Two-Phase Evaluation

Evaluation is conducted in two phases to balance development flexibility with robustness against overfitting.

Phase 1 — Open Evaluation (May 26 – June 20, 2026):
The Phase 1 test data, baseline prompting kit, and offline evaluation scripts were released on May 26 (see the data/, baseline_prompting/, and offline_eval/ directories in the repository). To obtain Phase 1 scores, participants must submit result files in the official format to the Phase 1 Codabench platform. The public leaderboard will remain open until the final submission deadline.

Phase 2 — Hidden Evaluation (June 11 – June 20, 2026):
On June 11, the organizers will release a previously unseen held-out test set (without labels). Participants must run their final models on this hidden test set and submit result files to the platform before the submission deadline (June 20).

Final Score:
The final ranking score is the average of the Phase 1 (open) and Phase 2 (hidden) scores. This two-phase design discourages overfitting to the open test data and better reflects system generalizability.
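
As a worked example of the averaging, the sketch below combines two per-phase ranking scores for one track. It assumes an unweighted mean of the two phases, and the numbers are hypothetical.

```python
def final_score(phase1_score, phase2_score):
    """Assumption: unweighted mean of the open and hidden ranking scores."""
    return (phase1_score + phase2_score) / 2


# Hypothetical per-phase ranking scores for one track.
phase1 = 0.50  # e.g. (Macro-F1 + PEM) / 2 on the open test set
phase2 = 0.40  # the same score on the hidden test set
print(final_score(phase1, phase2))
```

A system that scores well only on the open test set is pulled down by its hidden-set score, which is the intended effect of the two-phase design.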

Data Release and Usage Policy

All data released for this shared task, including train/dev data, open test data, and hidden test data, is provided exclusively for participation in NLPCC 2026 Shared Task 10 during the competition period.

During the competition period, all credits and attribution for the released data belong to the organizers. Participants and third parties may not redistribute, mirror, republish, relabel, or create alternative releases of the data under any other name.

After the competition concludes, all task data, including both training and test data, will be redistributed under an open license and may be used for scientific research purposes.

3. Submission Requirements

Each participating team must submit:

  1. Result files in the official submission format, submitted to the platform for both Phase 1 (open) and Phase 2 (hidden) test inputs. The Phase 1 public platform is Codabench competition 16666.
  2. A technical report that comprehensively describes:
    • Training data sources and any data augmentation or preprocessing steps
    • Model architecture and design choices
    • Prompting strategy or retrieval setup (if applicable)
    • Training procedure and hyperparameters
    • Evaluation methodology and any ablation results
    The technical report must provide sufficient detail for full reproducibility. The organizers reserve the right to request verification materials — including source code, model weights, and training scripts — for any submission where reproducibility concerns arise.

4. Schedule

| Date | Event |
| --- | --- |
| March 20, 2026 | Shared task announcement and call for participation |
| March 20, 2026 | Registration opens |
| April 15, 2026 | Release of detailed task guidelines and training data |
| May 25, 2026 | Registration deadline |
| May 26, 2026 | Phase 1 data, evaluation entry, and offline evaluation scripts released |
| June 11, 2026 | Hidden (held-out) test data release, no labels (Phase 2 begins) |
| June 20, 2026 | Deadline for participants to submit all results (Phase 1 + Phase 2) |
| June 30, 2026 | Evaluation results released; call for system reports |

5. FAQ

Q: Do teams have to participate in both tracks?

A: No. Teams may participate in either track or both. Tracks are evaluated independently.

Q: How are awards determined?

A: We follow the NLPCC official guidelines. The top-ranked team in each track will receive a certificate jointly issued by NLPCC and CCF-NLP. We are also considering an award for the top-ranked team based on the average score across both tracks (to be confirmed).

Q: Does each Track 1 sample contain only one piece of evidence?

A: No. An evidence bundle may contain multiple items: for example, a result paragraph together with one or more additional paragraphs, tables, or charts. Systems should consider all provided evidence when making predictions.

Q: Track 1 may include multimodal inputs (e.g., tables or charts). Are participants required to use end-to-end multimodal models?

A: No. There are no restrictions on system design. Participants may use any approach they see fit.

Q: In Track 2, what if a claim references multiple citations?

A: Each Track 2 sample pairs one atomic claim with one cited paper. Participants only need to consider the full-text material provided for that sample; the task does not require resolving references beyond the given cited paper.

Q: Can we submit multiple times to the online platform during Phase 1 (open evaluation)?

A: Yes. Participants may submit multiple times during the open evaluation period. The leaderboard will reflect the latest submission.

Q: Can we use external datasets or additional papers beyond the released train/dev data for training?

A: Yes. Training data sources are unrestricted, but all external data sources used must be disclosed in the technical report. We will categorize and summarize the different approaches in the final task overview paper; the choice of training data will not affect scoring or ranking.

Q: Can we use proprietary LLMs (e.g., GPT-5, Claude) via API in our systems?

A: Yes. However, participants must clearly report the model name, version (with request date), and API costs in the technical report.
