The Most Complete Reasoning Dataset Series | Chain-of-Thought (CoT) Datasets

Reasoning Datasets and Chain-of-Thought

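Reasoning datasets are designed specifically to train and evaluate a model's reasoning ability. They typically cover logical, commonsense, mathematical, and causal reasoning tasks, and help models handle multi-step problems and complex reasoning scenarios. With the development of large language models (LLMs) and reasoning methods such as CoT, reasoning tasks have become increasingly important in natural language processing (NLP).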

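Chain-of-Thought (CoT) is a reasoning strategy in NLP that is widely used with large language models such as GPT. Its core idea is to have the model produce an ordered chain of reasoning instead of jumping straight to a conclusion, imitating the step-by-step way people think. Concretely, CoT breaks a complex task into a series of subtasks or intermediate steps, each of which spells out part of the reasoning, and the final answer follows from those steps. Reasoning step by step improves accuracy on complex problems and also makes the model's output easier to interpret.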
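As a rough illustration of the difference between asking for an answer directly and asking for a reasoning chain, here is a minimal Python sketch. The function names are hypothetical, and the trailing phrase "Let's think step by step." is just one commonly used zero-shot CoT cue, not something prescribed by any of the datasets below.

```python
def direct_prompt(question: str) -> str:
    """Ask for the answer only, with no intermediate reasoning."""
    return f"Q: {question}\nA: The answer is"


def cot_prompt(question: str) -> str:
    """Ask the model to write out its reasoning before the final answer."""
    return f"Q: {question}\nA: Let's think step by step."


# Illustrative question, paraphrased from a grade-school math word problem.
question = (
    "Natalia sold 48 clips in April and half as many in May. "
    "How many clips did she sell in total?"
)
print(direct_prompt(question))
print(cot_prompt(question))
```

The only difference is the text the model is asked to continue: the first prompt invites a bare number, the second invites the intermediate steps (48 / 2 = 24, then 48 + 24 = 72) before the answer.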

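To advance CoT methods, researchers and developers have created many open-source datasets for training and evaluating reasoning, especially on multi-step and complex problems. In this first installment of the Reasoning dataset series, we collect open-source reasoning datasets that apply the CoT idea, spanning commonsense reasoning, mathematical reasoning, situational reasoning, and paragraph comprehension.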

Open-Source CoT Datasets

NuminaMath-CoT

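Publisher: AI-MO
Download: https://projectnumina.ai/
Released: 2024
Size: about 1 GB (roughly 860,000 math problems with step-by-step solutions)
Description: NuminaMath-CoT is a mathematical reasoning dataset for training and evaluating large language models. Every problem comes with a step-by-step solution in Chain-of-Thought (CoT) format, which helps a model stay accurate on complex problems. The dataset covers basic arithmetic as well as harder algebra, geometry, and number theory problems. Each record has source, problem, solution, and messages fields; the messages field repeats the problem and solution as a user/assistant conversation.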

source synthetic_math
problem Consider the terms of an arithmetic sequence: $-\frac{1}{3}, y+2, 4y, \ldots$. Solve for $y$.
solution For an arithmetic sequence, the difference between consecutive terms must be equal. Therefore, we can set up the following equations based on the sequence given: \[ (y + 2) - \left(-\frac{1}{3}\right) = 4y - (y+2) \] Simplify and solve these equations: \[ y + 2 + \frac{1}{3} = 4y - y - 2 \] \[ y + \frac{7}{3} = 3y - 2 \] \[ \frac{7}{3} + 2 = 3y - y \] \[ \frac{13}{3} = 2y \] \[ y = \frac{13}{6} \] Thus, the value of $y$ that satisfies the given arithmetic sequence is $\boxed{\frac{13}{6}}$.
messages [ { "content": "Consider the terms of an arithmetic sequence: $-\\frac{1}{3}, y+2, 4y, \\ldots$. Solve for $y$.", "role": "user" }, { "content": "For an arithmetic sequence, the difference between consecutive terms must be equal. Therefore, we can set up the following equations based on the sequence given:\n\\[ (y + 2) - \\left(-\\frac{1}{3}\\right) = 4y - (y+2) \\]\n\nSimplify and solve these equations:\n\\[ y + 2 + \\frac{1}{3} = 4y - y - 2 \\]\n\\[ y + \\frac{7}{3} = 3y - 2 \\]\n\\[ \\frac{7}{3} + 2 = 3y - y \\]\n\\[ \\frac{13}{3} = 2y \\]\n\\[ y = \\frac{13}{6} \\]\n\nThus, the value of $y$ that satisfies the given arithmetic sequence is $\\boxed{\\frac{13}{6}}$.", "role": "assistant" } ]
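The sample above ends its solution with a `\boxed{...}` answer. A minimal sketch of how that final answer could be pulled out of the solution text, and how the stated value can be checked against the problem, is shown below. The helper name `extract_boxed` is hypothetical and not part of the dataset or any library.

```python
from fractions import Fraction
from typing import Optional


def extract_boxed(solution: str) -> Optional[str]:
    """Return the contents of the last boxed expression, handling nested braces."""
    marker = r"\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in solution[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never balanced


# Last sentence of the sample solution above.
solution_tail = (
    r"Thus, the value of $y$ that satisfies the given arithmetic sequence "
    r"is $\boxed{\frac{13}{6}}$."
)
print(extract_boxed(solution_tail))

# Substitute y = 13/6 back into the sequence -1/3, y + 2, 4y
# and confirm that consecutive differences are equal.
y = Fraction(13, 6)
first, second, third = Fraction(-1, 3), y + 2, 4 * y
print(second - first == third - second)
```

Both differences come out to 9/2, so the boxed value is consistent with the problem statement.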

LLaVA-CoT-100k

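Publisher: PKU-YUAN-Lab
Download: https://huggingface.co/datasets/Xkev/LLaVA-CoT-100k
Released: 2024
Size: about 10 GB (100,000 multi-step reasoning tasks)
Description: LLaVA-CoT-100k contains 100,000 multi-step reasoning tasks and is intended to strengthen the reasoning of vision-language models. Each question requires the model to extract key information from an image and then reason over it in text, step by step, to reach an answer. Responses are organized into four tagged stages (summary, caption, reasoning, conclusion), which makes the dataset suitable for multimodal reasoning training. The sample below contrasts a Llama-3.2-11B-Vision-Instruct response with a LLaVA-CoT response to the same question.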


Question	Subtract all tiny shiny balls. Subtract all purple objects. How many objects are left? Options: A. 4, B. 8, C. 2, D. 6
Llama-3.2-11B-Vision-Instruct	To determine the correct answer, we need to follow the instructions provided. The first instruction is to subtract all tiny shiny balls. Since there are three tiny shiny balls in the image, we would be left with no shiny balls after performing this action. The second instruction is to subtract all purple objects. However, upon closer inspection, it becomes apparent that there are no shiny purple objects present in the image. Therefore, we cannot subtract any purple objects. Given that we have already removed all the tiny shiny balls and there were no purple objects to begin with, the correct answer is C. 2, as initially, there were 2 purple objects, but they are not to be subtracted according to the instruction.
LLaVA-CoT	Summary stage: What’s the problem? What should I do? <SUMMARY> I will examine the image to identify and count all the objects, focusing on the tiny shiny balls and purple objects. Then, I will subtract these objects from the total count and determine the number of remaining objects. </SUMMARY> Caption stage: What can I know from the image? <CAPTION> The image shows various 3D geometric shapes on a flat surface, including spheres, cubes, and cylinders in different colors. Notably, there are small shiny green spheres and a purple cylinder. </CAPTION> Reasoning stage: How to solve the problem step-by-step? <REASONING> First, I will count all the objects in the image: there are 10 objects in total. Next, I will identify and subtract the tiny shiny balls. There is one small shiny green sphere. Then, I will identify and subtract the purple objects. There is one purple cylinder. By subtracting the two objects (the shiny ball and the purple cylinder) from the total, I find that 8 objects remain. </REASONING> Conclusion stage: What is the final answer? <CONCLUSION> B </CONCLUSION>
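Because the LLaVA-CoT response wraps each stage in its own tag, the stages can be separated with a few lines of Python. The sketch below is one possible approach, assuming the tags appear exactly as in the sample above; the function name `split_stages` is hypothetical and the response string is an abridged version of the sample.

```python
import re
from typing import Dict

STAGES = ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION")


def split_stages(response: str) -> Dict[str, str]:
    """Map each stage name (lowercased) to the text between its tags."""
    stages = {}
    for tag in STAGES:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
        if match:
            stages[tag.lower()] = match.group(1).strip()
    return stages


# Abridged from the LLaVA-CoT sample response above.
response = (
    "<SUMMARY> I will examine the image to identify and count all the objects. </SUMMARY> "
    "<CAPTION> The image shows various 3D geometric shapes on a flat surface. </CAPTION> "
    "<REASONING> There are 10 objects in total. Subtracting one shiny ball and "
    "one purple cylinder leaves 8 objects. </REASONING> "
    "<CONCLUSION> B </CONCLUSION>"
)

stages = split_stages(response)
print(stages["conclusion"])
```

Keeping the conclusion separate from the reasoning makes it straightforward to score the final choice while still inspecting the intermediate stages.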

CoT-Collection

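Publisher: kaist-ai
Download: https://huggingface.co/datasets/kaist-ai/CoT-Collection
Released: 2023
Size: about 3 GB (about 1.84 million rationales across 1,060 tasks)
Description: CoT-Collection is a multi-task reasoning dataset that spans a wide range of areas, from mathematics to logical reasoning. Each record pairs a question (source) and its answer (target) with a detailed rationale, so a model trained on it learns to show a complete reasoning chain when answering. The collection is meant to train models on complex reasoning problems that test logical and abstract thinking as well as calculation.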

source Article: Phytochemistry is a branch of plant biochemistry primarily concerned with the chemical substances produced by plants during secondary metabolism. Some of these compounds are toxins such as the alkaloid coniine from hemlock. Others, such as the essential oils peppermint oil and lemon oil are useful for their aroma, as flavourings and spices (e.g., capsaicin), and in medicine as pharmaceuticals as in opium from opium poppies. Many medicinal and recreational drugs, such as tetrahydrocannabinol (active ingredient in cannabis), caffeine, morphine and nicotine come directly from plants. Others are simple derivatives of botanical natural products. For example, the pain killer aspirin is the acetyl ester of salicylic acid, originally isolated from the bark of willow trees, and a wide range of opiate painkillers like heroin are obtained by chemical modification of morphine obtained from the opium poppy. Popular stimulants come from plants, such as caffeine from coffee, tea and chocolate, and nicotine from tobacco. Most alcoholic beverages come from fermentation of carbohydrate-rich plant products such as barley (beer), rice (sake) and grapes (wine). Now answer this question: Where do some medicines and recreational drugs come from?
target from plants
rationale The article states that many medicinal and recreational drugs, such as tetrahydrocannabinol (active ingredient in cannabis), caffeine, morphine and nicotine come directly from plants. These are some examples of the medicines found in plants mentioned by the author. Thus it can be stated with certainty that some medicines do indeed come from plants. Therefore, "from plants" is the correct answer option to this question based on the context provided."
task squad_v1
type CoT
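A record like the one above can be turned into a prompt/completion pair in which the rationale comes before the answer. The following is a minimal sketch under stated assumptions: the function name and the "Therefore, the answer is" template are hypothetical choices for illustration, not the format used by the dataset authors, and the record is abridged from the sample.

```python
from typing import Dict, Tuple


def to_training_pair(record: Dict[str, str]) -> Tuple[str, str]:
    """Build (prompt, completion) so the rationale precedes the final answer."""
    prompt = record["source"].strip()
    completion = (
        f"{record['rationale'].strip()} "
        f"Therefore, the answer is: {record['target'].strip()}"
    )
    return prompt, completion


# Abridged from the squad_v1 sample above.
record = {
    "source": (
        "Article: Many medicinal and recreational drugs come directly from plants. "
        "Now answer this question: Where do some medicines and recreational drugs come from?"
    ),
    "target": "from plants",
    "rationale": (
        "The article states that many medicinal and recreational drugs "
        "come directly from plants."
    ),
    "task": "squad_v1",
    "type": "CoT",
}

prompt, completion = to_training_pair(record)
print(completion)
```

Putting the rationale first means the model has to generate its reasoning before it commits to the target answer.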

cot_flan

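Publisher: causal-lm
Download: https://huggingface.co/datasets/causal-lm/cot_flan
Released: 2023
Size: about 3 GB (a large number of reasoning tasks for language models)
Description: cot_flan is an instruction-style dataset for language models that uses CoT to improve reasoning and derivation. Its tasks come from several domains and include general, logical, and mathematical reasoning. Each record has instruction, input, and output fields, and the output spells out the reasoning steps before stating the answer.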

instruction There were students from three different schools taking part in a debate organized by the town's mayor on a stage. The first school had twice as many students as the second school, and the second school had 40 more students than the third school. If the third school had 200 students, and after the debate was over, every student shook the mayor's hand, calculate the total number of students who shook the mayor's hand on the stage. Thoughts? Step-by-step reasoning:
input
output If the third school had 200 students, and the second school had 40 more students than the third school, the second school had 200 + 40 = 240 students. The third and second schools had 240 + 200 = 440 students on stage taking part in the debate. Since the first school had twice as many students as the second school, and the second school had 240 students, the first school had 240 * 2 = 480 students on stage taking part in the debate. If after the debate was over, every student at the stage shook the mayor's hand, the number of students who shook the mayor's hand is 480 + 440 = 920 students. Thus, the answer is 920.
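The reasoning in the sample output can be checked by recomputing each step and comparing the result with the number stated after "the answer is". The sketch below assumes that phrasing appears in the output, as it does in this sample; the helper name `final_number` is hypothetical.

```python
import re
from typing import Optional


def final_number(output: str) -> Optional[str]:
    """Return the last number that follows the phrase 'answer is', if any."""
    matches = re.findall(
        r"answer is\s+(-?\d[\d,]*(?:\.\d+)?)", output, flags=re.IGNORECASE
    )
    return matches[-1].replace(",", "") if matches else None


# Last sentences of the sample output above.
output_tail = (
    "the number of students who shook the mayor's hand is 480 + 440 = 920 students. "
    "Thus, the answer is 920."
)

# Recompute the chain of steps from the problem statement.
third_school = 200
second_school = third_school + 40      # 40 more than the third school
first_school = 2 * second_school       # twice the second school
total = first_school + second_school + third_school

print(final_number(output_tail), total)
```

Here the recomputed total and the stated answer agree, which is the kind of consistency check that can be used to filter noisy CoT samples.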

GSM8K (Grade-School Math 8K)

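Publisher: OpenAI
Download: https://github.com/openai/grade-school-math?tab=readme-ov-file
Released: 2021
Size: about 2 GB (about 8,500 math problems with worked solutions)
Description: GSM8K contains about 8,500 grade-school math word problems. Each problem comes with a detailed solution that reaches the answer through a sequence of basic arithmetic steps. The CoT idea shows up here as step-by-step derivation: the model is expected to show every reasoning step, not just the final answer, which improves the accuracy of its reasoning and calculation.

[Figure: GSM8K example]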

cot_gsm8k

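Publisher: Dahoas
Download: https://huggingface.co/datasets/Dahoas/cot_gsm8k
Released: 2023
Size: about 2.5 GB (more than 8,000 math problems)
Description: cot_gsm8k is an extended version of GSM8K that focuses on strengthening mathematical reasoning with CoT. Each record keeps the original question and step-by-step answer and adds prompt and response fields that present the same content in a Q/A format. The dataset is well suited to training and evaluating reasoning-capable AI systems, especially for elementary and middle-school math.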

question Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
answer Natalia sold 48/2 = <<48/2=24>>24 clips in May. Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May. #### 72
prompt Q: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? A:
response Natalia sold 48/2 = <<48/2=24>>24 clips in May. Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May. #### 72
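In the sample above, intermediate calculations are written as `<<expression=value>>` and the final answer follows `####`. A minimal sketch of how both could be parsed and verified is shown below. The function names are hypothetical, and the small evaluator supports only the four basic operators, which is an assumption about the annotations rather than a documented guarantee.

```python
import ast
import operator
import re
from typing import List, Optional, Tuple

CALC = re.compile(r"<<([^=<>]+)=([^<>]+)>>")
OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expr: str) -> float:
    """Evaluate +, -, *, / arithmetic without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")
    return float(walk(ast.parse(expr.strip(), mode="eval")))


def check_annotations(answer: str) -> List[Tuple[str, float, float]]:
    """Return (expression, stated value, recomputed value) per annotation."""
    return [(e, float(v), evaluate(e)) for e, v in CALC.findall(answer)]


def final_answer(answer: str) -> Optional[str]:
    """Return the text after the last '####' marker, or None if absent."""
    return answer.split("####")[-1].strip() if "####" in answer else None


answer = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May. "
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May. #### 72"
)
for expr, stated, recomputed in check_annotations(answer):
    print(expr, stated == recomputed)
print(final_answer(answer))
```

Walking the syntax tree instead of calling `eval` keeps the check limited to plain arithmetic, so an unexpected annotation raises an error rather than executing arbitrary code.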

MATH (Mathematics Dataset)

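Publisher: Hendrycks et al.
Download: MATH GitHub
Released: 2021
Size: about 1 GB (12,500 competition-style math problems)
Description: MATH contains math problems at a range of difficulty levels, and each one needs multiple reasoning steps to solve. CoT applies here as step-by-step solutions to hard algebra, geometry, probability, and other problems. Decomposing and analyzing a problem one step at a time, instead of answering directly, makes the solution process more transparent.

[Figure: MATH example]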

CommonsenseQA

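Publisher: Tel Aviv University and the Allen Institute for AI
Download: https://www.tau-nlp.org/commonsenseqa
Released: 2019
Size: about 1 GB (12,247 questions)
Description: CommonsenseQA is a commonsense reasoning dataset of multiple-choice questions, each with five answer options, whose answers must be inferred from commonsense knowledge. CoT applies here by breaking a question into several reasoning steps so that the model builds a plausible reasoning chain and then picks the option that best matches common sense. Step-by-step reasoning helps most when the question provides no explicit context.

[Figure: CommonsenseQA example]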
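For multiple-choice datasets such as this one, a CoT response usually has to be reduced to a single option letter before it can be scored. The sketch below shows one simple way to do that, assuming the response states its choice after the words "answer is"; the function name and the example response are hypothetical and do not come from the dataset.

```python
import re
from typing import Optional


def extract_choice(output: str, labels: str = "ABCDE") -> Optional[str]:
    """Return the last option letter that follows 'answer is', if any."""
    pattern = rf"answer is\s*\(?([{labels}])\b\)?"
    matches = re.findall(pattern, output)
    return matches[-1] if matches else None


# Hypothetical CoT response, written only to illustrate the parsing step.
cot_output = (
    "A revolving door is meant to control traffic in and out of a busy building. "
    "Banks are busy buildings that also need security, so the answer is (A)."
)
print(extract_choice(cot_output))
```

Taking the last match rather than the first is a deliberate choice: a reasoning chain may mention and reject other options before it settles on its final one.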

SWAG (Situations With Adversarial Generations)

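Publisher: University of Washington and the Allen Institute for AI
Download: https://rowanzellers.com/swag/
Released: 2018
Size: about 2 GB (113k situational questions)
Description: SWAG contains about 113,000 multiple-choice questions grounded in everyday situations; the model must infer the most likely next event. With CoT, the model breaks the situation down into steps and derives a plausible continuation from them. This helps it capture the implicit relationships in the scene and choose the ending that best matches common sense.

[Figure: SWAG example]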

DROP (Discrete Reasoning Over Paragraphs)

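Publisher: Allen Institute for AI (AI2)
Download: https://github.com/allenai/allennlp-reading-comprehension/blob/master/allennlp_rc/eval/drop_eval.py
Released: 2019
Size: about 2 GB (about 96,000 questions over paragraphs)
Description: DROP focuses on paragraph-level reasoning. Its questions typically require discrete operations such as addition, subtraction, and counting over facts stated in a passage. With CoT, the model first extracts the relevant information from the paragraph and then computes step by step to reach the final answer, which improves accuracy on questions that combine reading comprehension with numerical reasoning.

[Figure: DROP example]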

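ReClor (Reading Comprehension Dataset Requiring Logical Reasoning)

Publisher: National University of Singapore
Download: https://whyu.me/reclor/
Released: 2020
Size: about 1 GB (6,138 reasoning questions)
Description: ReClor contains several types of logical reasoning questions drawn from standardized exams, and the model must reason logically over a short passage to answer them. CoT applies here by breaking each question into several reasoning steps, so the model reaches the correct answer along a more systematic and orderly path. This step-by-step structure makes hard reasoning questions easier to understand and solve.

[Figure: ReClor example]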

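AQUA-RAT (Algebra Question Answering with Rationales)

Publisher: Google DeepMind
Download: https://github.com/google-deepmind/AQuA
Released: 2017
Size: about 500 MB (about 100,000 algebraic word problems)
Description: AQUA-RAT contains multiple-choice algebraic word problems, each paired with a natural-language rationale that explains the solution step by step. CoT applies here directly: the model works through the problem in several steps and derives the answer from the information given. This lets the model handle harder reasoning tasks and improves both the accuracy and the interpretability of its answers.

[Figure: AQUA-RAT example]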

Summary

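This installment of the Reasoning dataset series introduced a varied set of datasets built around the Chain-of-Thought (CoT) method. The series aims to give researchers and developers a rich collection of open-source datasets. In future installments we will cover more reasoning datasets and more challenging reasoning tasks, to help readers better understand and apply them.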