Task Guidelines

NLPCC 2026 Shared Task 10: The Reliability of AI-Assisted Scientific Reporting

1. Track Definitions

Track 1: Claim-Level Faithfulness to Experimental Results

Each example contains:

Evidence bundle: a result paragraph, optionally with tables/charts, extracted from an open-access NLP paper.
Claim paragraph: typically 2-4 sentences, generated to summarize or interpret the evidence bundle.

For each sentence in the paragraph, systems must output one label from the following set:

Label	Description
`Supported`	The sentence is adequately supported by the evidence bundle.
`Unsupported Causal Mechanistic`	The sentence introduces unsupported causal or mechanistic interpretation.
`Unsupported Entity`	The sentence mentions unsupported datasets, metrics, model variants, baselines, or related scientific entities.
`Scope Overgeneralization`	The sentence extends the supported scope beyond what the evidence warrants.
`Contradiction`	The sentence directly conflicts with the evidence.

Each sentence is assigned a single primary label. In rare cases, a sentence may be annotated with multiple applicable labels; predicting any one of them is considered correct for evaluation purposes.

Track 1 Train-Dev Data Format

Each record in the released JSONL file contains the following fields:

{
  "claim_text": "string — the claim paragraph",
  "evidence_bundle": [
    { "type": "text", "text": "result paragraph text ..." },
    { "type": "table", "table_caption": ["Table 3: ..."], "img_path": "images/xxx.jpg" },
    { "type": "image", "image_caption": ["Figure 1: ..."], "img_path": "images/xxx.jpg" }
  ],
  "sentence_label": [
    { "sentence": "sentence text ...", "types": ["Supported"] },
    { "sentence": "sentence text ...", "types": ["Unsupported Entity"] }
  ]
}

Field	Type	Description
`claim_text`	string	The full claim paragraph.
`evidence_bundle`	list of objects	One or more evidence items. Each item has a `type` field (`"text"`, `"table"`, or `"image"`) and type-specific content fields.
`sentence_label`	list of objects	Per-sentence annotations. Each object contains a `sentence` (string) and `types` (list of labels; typically one, see note above).

At test time, the types field in sentence_label will be withheld. Participants will receive claim_text, evidence_bundle, and the sentence segmentation.

The image files referenced by img_path in the evidence bundle are provided separately in images.zip (train-dev) and images-testp1.zip (Phase 1 test). Inside the archives, files are stored at the archive root (e.g. <sha>.jpg). To match the relative images/<sha>.jpg paths used in the JSONL records, extract each archive into a directory named images/ (for example unzip data/images.zip -d data/images/).
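A minimal sketch of reading Track 1 records and resolving their image paths, assuming the archive has been extracted into an images/ directory under the data root as described above. The helper names (load_jsonl, image_paths, evidence_text) are illustrative and not part of the official kit. Using captions as a text stand-in for tables and figures is only one possible choice for text-only systems.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read one JSON object per non-empty line of a JSONL file."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def image_paths(record, data_root):
    """Resolve img_path for every evidence item that has one.

    Assumes the images were extracted to <data_root>/images/, so that the
    relative images/<sha>.jpg paths in the record resolve directly.
    """
    root = Path(data_root)
    return [root / item["img_path"]
            for item in record["evidence_bundle"]
            if "img_path" in item]


def evidence_text(record):
    """Join text items and the captions of table/image items (illustrative)."""
    parts = []
    for item in record["evidence_bundle"]:
        if item["type"] == "text":
            parts.append(item["text"])
        else:
            # Table and image items carry a list of caption strings.
            captions = item.get("table_caption") or item.get("image_caption") or []
            parts.extend(captions)
    return "\n".join(parts)
```

With the real data, load_jsonl would be pointed at the released Track 1 JSONL file and image_paths at the directory that contains the extracted images/ folder.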

Track 1 Train-Dev Data Statistics

Statistic	Count
Records	3,333
Total sentences	17,547

Negative label distribution (counts are not mutually exclusive):

Label	Count
`Unsupported Entity`	639
`Scope Overgeneralization`	605
`Unsupported Causal Mechanistic`	276
`Contradiction`	216

Track 2: Citation-Level Faithfulness to External Evidence

Each example contains:

Atomic claim: a single scientific claim associated with one cited paper.
Cited paper (full text): the cited paper in structured textual form, represented as a list of paragraphs with stable paragraph IDs (e.g., P1, P2, ...).

Systems must output:

One support label from the following set:

Label	Description
`Supported`	The cited paper provides evidence directly supporting the claim.
`Overstate`	The claim overstates what the cited paper supports.
`Topical Match`	The cited paper is topically related but does not provide the evidential support required for the claim.
`Irrelevant`	The cited paper has no meaningful support relation to the claim.

A ranked list of up to k = 3 evidence paragraph IDs, identifying the paragraphs in the cited paper that are most relevant to the support judgment.

Track 2 Train-Dev Data Format

{
  "claim_text": "string — the scientific claim",
  "cited_paper_full_text": [
    {"P1": "paragraph text ..."},
    {"P2": "paragraph text ..."},
    ...
  ],
  "evidence_para_ids": ["P12", "P13"],
  "evidence_texts": { "P12": "paragraph text ...", "P13": "paragraph text ..." },
  "label": "Supported | Overstate | Topical Match | Irrelevant"
}

Field	Type	Description
`claim_text`	string	The atomic scientific claim to be verified.
`cited_paper_full_text`	list of objects	The full cited paper, represented as a list of `{paragraph_id: text}` objects in document order.
`evidence_para_ids`	list of strings	Gold paragraph IDs that serve as evidence for the support judgment.
`evidence_texts`	object	A convenience mapping from each gold evidence paragraph ID to its text content.
`label`	string	The gold support label (one of: `Supported`, `Overstate`, `Topical Match`, `Irrelevant`).

At test time, the label, evidence_para_ids, and evidence_texts fields will be withheld. Participants will receive only claim_text and cited_paper_full_text.
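As a rough illustration of working with this format, the sketch below flattens cited_paper_full_text into an ordered mapping from paragraph ID to text and ranks paragraphs by naive word overlap with the claim. The overlap heuristic is a placeholder assumption for demonstration only, not the official baseline, and the function names are hypothetical.

```python
def paragraph_index(record):
    """Flatten the list of single-key {paragraph_id: text} objects into one
    dict. Python dicts preserve insertion order, so document order is kept."""
    index = {}
    for para in record["cited_paper_full_text"]:
        index.update(para)
    return index


def top_k_by_overlap(claim, index, k=3):
    """Rank paragraph IDs by the number of lowercase whitespace tokens they
    share with the claim. Ties keep document order because sorted() is stable."""
    claim_tokens = set(claim.lower().split())

    def overlap(pid):
        return len(claim_tokens & set(index[pid].lower().split()))

    ranked = sorted(index, key=lambda pid: -overlap(pid))
    return ranked[:k]
```

A real system would likely replace the overlap score with a stronger retriever; the point here is only the shape of the input and of the ranked list of up to three paragraph IDs.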

Track 2 Train-Dev Data Statistics

Label	Count
`Supported`	1,029
`Irrelevant`	382
`Topical Match`	376
`Overstate`	375
Total	2,162

2. Evaluation

Metrics

Track 1 uses two official metrics:

Sentence-level Macro-F1 over all labels — balanced performance across all categories.
Paragraph Exact Match (PEM) — a paragraph is correct only if all sentence labels are predicted correctly.

Track 1 ranking score = (Macro-F1 + PEM) / 2
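The two Track 1 metrics and the ranking score can be sketched as follows. This is an unofficial approximation: the offline evaluation scripts in the repository are authoritative. Two points are assumptions of this sketch rather than documented behavior: for a sentence with several gold labels, a prediction matching any of them is scored as that label, otherwise the first listed gold label is taken as the reference; and a label that never occurs in gold or predictions contributes an F1 of 0.

```python
LABELS = [
    "Supported",
    "Unsupported Causal Mechanistic",
    "Unsupported Entity",
    "Scope Overgeneralization",
    "Contradiction",
]


def resolve_gold(gold_types, pred):
    # Assumption: any-of-gold matching; fall back to the first gold label.
    return pred if pred in gold_types else gold_types[0]


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-label F1 over the given label set."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # Assumption: a label absent from both gold and predictions scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def track1_score(gold_paragraphs, pred_paragraphs):
    """gold_paragraphs: one list per paragraph, each holding the gold `types`
    list of every sentence. pred_paragraphs: one list of predicted labels per
    paragraph, aligned sentence by sentence with the gold segmentation."""
    flat_gold, flat_pred, exact = [], [], 0
    for gold_par, pred_par in zip(gold_paragraphs, pred_paragraphs):
        resolved = [resolve_gold(g, p) for g, p in zip(gold_par, pred_par)]
        flat_gold.extend(resolved)
        flat_pred.extend(pred_par)
        # PEM: the paragraph counts only if every sentence label is correct.
        exact += resolved == list(pred_par)
    f1 = macro_f1(flat_gold, flat_pred)
    pem = exact / len(gold_paragraphs)
    return {"macro_f1": f1, "pem": pem, "ranking": (f1 + pem) / 2}
```

Because PEM requires every sentence in a paragraph to be right, a single wrong sentence label lowers both halves of the ranking score.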

Track 2 uses two official metrics:

Label Macro-F1 over the four support labels — balanced performance on support-relation classification.
Joint@3 — a prediction is counted as successful only if the system assigns the correct support label and retrieves at least one correct evidence paragraph in its top-3 evidence predictions.

Track 2 ranking score = (Macro-F1 + Joint@3) / 2

For Track 2, when the gold evidence list of a sample is empty, Joint@3 is counted as successful only if the predicted label is correct and the predicted evidence list is also empty (the match-empty policy of the offline evaluator). This is the policy used for the official ranking on both the public leaderboard (Phase 1) and the hidden test set (Phase 2).
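The Joint@3 rule, including the match-empty policy for samples with no gold evidence, can be sketched as below. This is an illustrative reimplementation of the description above, not the offline evaluator itself, and the function names are hypothetical.

```python
def joint_at_k(gold_label, gold_ids, pred_label, pred_ids, k=3):
    """Return True if the label is correct and the evidence condition holds."""
    if pred_label != gold_label:
        return False
    top = list(pred_ids)[:k]  # only the first k ranked predictions count
    if not gold_ids:
        # Match-empty policy: with no gold evidence, the prediction must
        # also contain no evidence paragraphs.
        return not top
    gold = set(gold_ids)
    return any(pid in gold for pid in top)


def joint_score(examples, k=3):
    """Mean Joint@k over (gold_label, gold_ids, pred_label, pred_ids) tuples."""
    hits = [joint_at_k(g_label, g_ids, p_label, p_ids, k)
            for g_label, g_ids, p_label, p_ids in examples]
    return sum(hits) / len(hits)
```

One practical consequence of the match-empty policy is that a system which always returns three paragraph IDs cannot score on samples whose gold evidence list is empty, even when its label is correct.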

Two-Phase Evaluation

Evaluation is conducted in two phases to balance development flexibility with robustness against overfitting.

Phase 1 — Open Evaluation (May 26 – June 16, 2026):
The Phase 1 test data, baseline prompting kit, and offline evaluation scripts were released on May 26 (see the data/, baseline_prompting/, and offline_eval/ directories in the repository). To obtain Phase 1 scores, participants must submit result files in the official format to the Codabench platform. Public Phase 1 submissions on the platform close on June 16 (UTC+8).

Phase 2 — Hidden Evaluation (June 11 – June 20, 2026):
On June 11, the organizers released a previously unseen held-out test set (without labels). Because Codabench cannot accept submissions for two phases in parallel, Phase 2 submissions are collected only after Phase 1 closes: participants must run their final models on the hidden test set and submit Phase 2 prediction files through Codabench during June 17–20 (UTC+8). During Phase 2, Codabench is used only to collect submission files and will not publicly display a Phase 2 leaderboard.

Final Score:
During the Phase 2 testing period, each team's Phase 1 result is taken automatically as the best score among its existing Phase 1 submission records. For the final ranking, the organizers will review all Phase 1 and Phase 2 submissions, privately confirm with each participating team which files will be used for final evaluation, and retain the best score for each track in each phase. Final results are scheduled to be released on June 30 (UTC+8).

Data Release and Usage Policy

All data released for this shared task, including train/dev data, open test data, and hidden test data, is provided exclusively for participation in NLPCC 2026 Shared Task 10 during the competition period.

During the competition period, all credits and attribution for the released data belong to the organizers. Participants and third parties may not redistribute, mirror, republish, relabel, or create alternative releases of the data under any other name.

After the competition concludes, all task data, including both training and test data, will be redistributed under an open license and may be used for scientific research purposes.

3. Submission Requirements

No online/network tools: During inference on test inputs (Phase 1 and Phase 2), systems must not invoke tools that access the internet to retrieve external information at prediction time. This includes, but is not limited to, web search, web browsing, live retrieval APIs, and agent tools that fetch content from the internet. Calling a proprietary LLM through its API remains permitted, subject to the reporting requirements described in the FAQ.

Each participating team must submit:

Result files in the official submission format, submitted to Codabench competition 16666 for both the Phase 1 (open) and Phase 2 (hidden) test inputs. A team participating in both tracks should submit its Track 1 and Track 2 results together in one combined submission.
Final materials confirmation through the Feishu confirmation form, including Phase 1 confirmation metadata, Phase 2 confirmation metadata, and a concise system description PDF. This concise system description is not the conference submission version.
A technical report that comprehensively describes:
- Training data sources and any data augmentation or preprocessing steps
- Model architecture and design choices
- Prompting strategy or retrieval setup (if applicable)
- Training procedure and hyperparameters
- Evaluation methodology and any ablation results
The technical report must provide sufficient detail for full reproducibility. The organizers reserve the right to request verification materials — including source code, model weights, and training scripts — for any submission where reproducibility concerns arise.

The official NLPCC call for shared-task technical reports is expected to be released later than the main conference paper call.

4. Schedule

Date	Event
March 20, 2026	Shared task announcement and call for participation
March 20, 2026	Registration opens
April 15, 2026	Release of detailed task guidelines and training data
May 25, 2026	Registration deadline
May 26, 2026	Phase 1 data, evaluation entry, and offline evaluation scripts released
June 11, 2026	Hidden (held-out) test data release, no labels (Phase 2 begins)
June 16, 2026	Phase 1 public submissions close on Codabench (UTC+8)
June 17–20, 2026	Phase 2 submission window on Codabench (UTC+8)
June 30, 2026	Final results released

5. FAQ

Q: Do teams have to participate in both tracks?

A: No. Teams may participate in either track or both. Tracks are evaluated independently.

Q: How are awards determined?

A: We follow the NLPCC official guidelines. The top-ranked team in each track will receive a certificate jointly issued by NLPCC and CCF-NLP. We are also considering an award for the top-ranked team based on the average score across both tracks (to be confirmed).

Q: Does each Track 1 sample contain only one piece of evidence?

A: No. An evidence bundle may contain multiple materials, for example a result paragraph together with one or more additional paragraphs, tables, or charts. Systems should consider all provided evidence when making predictions.

Q: Track 1 may include multimodal inputs (e.g., tables or charts). Are participants required to use end-to-end multimodal models?

A: No. There are no restrictions on system design. Participants may use any approach they see fit.

Q: In Track 2, what if a claim references multiple citations?

A: Each Track 2 data sample pairs one atomic claim with a single cited paper. Participants only need to consider the full-text material provided for that sample; the task does not require resolving references beyond the given cited paper.

Q: Can we submit multiple times to the online platform during Phase 1 (open evaluation)?

A: Yes. Participants may submit multiple times during the open evaluation period. The leaderboard will reflect the latest submission.

Q: Can we use external datasets or additional papers beyond the released train/dev data for training?

A: Yes. Training data sources are unrestricted, but all external data sources used must be disclosed in the technical report. We will categorize and summarize the different approaches in the final task overview paper; the choice of training data will not affect scoring or ranking.

Q: Can we use proprietary LLMs (e.g., GPT-5, Claude) via API in our systems?

A: Yes. However, participants must clearly report the model name, the version (with the request date), and the API costs in the technical report.

Q: Can we use online/network tools (e.g., web search or live retrieval) during prediction?

A: No. Online or network-connected tools that fetch external information from the internet are prohibited during inference on test inputs.
