From 1547c531a2c0a791d8ce5738eb7fd24221d5dad5 Mon Sep 17 00:00:00 2001
From: Shreyas Pimpalgaonkar
Date: Mon, 27 Jan 2025 11:11:52 -0800
Subject: [PATCH] add curator

---
 README.md                 | 12 ++++++++++++
 docs/curator/README.md    | 30 ++++++++++++++++++++++++++++++
 docs/curator/README_cn.md | 29 +++++++++++++++++++++++++++++
 3 files changed, 71 insertions(+)
 create mode 100644 docs/curator/README.md
 create mode 100644 docs/curator/README_cn.md

diff --git a/README.md b/README.md
index d781998..a188c76 100644
--- a/README.md
+++ b/README.md
@@ -160,6 +160,18 @@ English/[简体中文](https://github.com/deepseek-ai/awesome-deepseek-integrati
+
+### Synthetic data curation
+
+Icon Curator An open-source tool to curate large scale datasets for post-training LLMs.
+
+
 
 ### IM Application Plugins
 
diff --git a/docs/curator/README.md b/docs/curator/README.md
new file mode 100644
index 0000000..c307d9d
--- /dev/null
+++ b/docs/curator/README.md
@@ -0,0 +1,30 @@
+
+![image](https://raw.githubusercontent.com/bespokelabsai/curator/main/docs/Bespoke-Labs-Logomark-Red-crop.png)
+
+
+# [Curator](https://github.com/bespokelabsai/curator)
+
+
+Curator is an open-source tool to curate large scale datasets for post-training LLMs.
+
+Curator was used to curate [Bespoke-Stratos-17k](https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k), a reasoning dataset to train a fully open reasoning model [Bespoke-Stratos](https://www.bespokelabs.ai/blog/bespoke-stratos-the-unreasonable-effectiveness-of-reasoning-distillation).
+
+
+### Curator supports:
+
+- Calling Deepseek API for scalable synthetic data curation
+- Easy structured data extraction
+- Caching and automatic recovery
+- Dataset visualization
+- Saving $$$ using batch mode
+
+### Call Deepseek API with Curator easily:
+
+![image](https://pbs.twimg.com/media/GiLHb-xasAAbs4m?format=jpg&name=4096x4096)
+
+# Get Started here
+
+- [Colab Example](https://colab.research.google.com/drive/1Z78ciwHIl_ytACzcrslNrZP2iwK05eIF?usp=sharing)
+- [Github Repo](https://github.com/bespokelabsai/curator)
+- [Documentation](https://docs.bespokelabs.ai/)
+- [Discord](https://discord.com/invite/KqpXvpzVBS)
diff --git a/docs/curator/README_cn.md b/docs/curator/README_cn.md
new file mode 100644
index 0000000..2c7dbe2
--- /dev/null
+++ b/docs/curator/README_cn.md
@@ -0,0 +1,29 @@
+![image](https://raw.githubusercontent.com/bespokelabsai/curator/main/docs/Bespoke-Labs-Logomark-Red-crop.png)
+
+
+# [Curator](https://github.com/bespokelabsai/curator)
+
+
+Curator 是一个用于后训练大型语言模型 (LLMs) 和结构化数据提取的制作与管理可扩展的数据集的开源工具。
+
+Curator 被用来制作 [Bespoke-Stratos-17k](https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k),这是一个用于训练完全开源的推理模型 [Bespoke-Stratos](https://www.bespokelabs.ai/blog/bespoke-stratos-the-unreasonable-effectiveness-of-reasoning-distillation) 的推理数据集。
+
+
+### Curator 支持:
+
+- 调用 Deepseek API 进行可扩展的合成数据管理
+- 简便的结构化数据提取
+- 缓存和自动恢复
+- 数据集可视化
+- 使用批处理模式节省费用
+
+### 轻松使用 Curator 调用 Deepseek API:
+
+![image](https://pbs.twimg.com/media/GiLHb-xasAAbs4m?format=jpg&name=4096x4096)
+
+# 从这里开始
+
+- [Colab 示例](https://colab.research.google.com/drive/1Z78ciwHIl_ytACzcrslNrZP2iwK05eIF?usp=sharing)
+- [Github 仓库](https://github.com/bespokelabsai/curator)
+- [文档](https://docs.bespokelabs.ai/)
+- [Discord](https://discord.com/invite/KqpXvpzVBS)
\ No newline at end of file