From ec0774bc3e9f12912721d191e17442412ae73585 Mon Sep 17 00:00:00 2001
From: alexaustin007
Date: Mon, 27 Jan 2025 20:10:54 -0500
Subject: [PATCH] In the README.md the sample dataset format mentions 'instruction' and 'output' fields, but an example JSON line would be helpful.

---
 README.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/README.md b/README.md
index 77ad8f7..7375287 100644
--- a/README.md
+++ b/README.md
@@ -282,6 +282,20 @@ pip install -r finetune/requirements.txt
 
 Please follow [Sample Dataset Format](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) to prepare your training data.
 Each line is a json-serialized string with two required fields `instruction` and `output`.
+Example of a JSON-serialized string, one formatted for use in Python and another for use in SQL.
+
+###Python##
+{
+    "instruction": "Write a Python function to calculate factorial",
+    "output": "def factorial(n):\n    if n == 0:\n        return 1\n    else:\n        return n * factorial(n-1)"
+}
+
+###SQL###
+{
+    "instruction": "Create a SQL query to find duplicate emails",
+    "output": "SELECT email FROM users GROUP BY email HAVING COUNT(*) > 1;"
+}
+
 
 After data preparation, you can use the sample shell script to finetune `deepseek-ai/deepseek-coder-6.7b-instruct`.
 Remember to specify `DATA_PATH`, `OUTPUT_PATH`.
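
Review note: the README text says each line of the training file is a JSON-serialized string, while the examples in this patch are pretty-printed across several lines, so in an actual data file each object would presumably need to sit on a single line. The sketch below shows one way to write and check such a file in Python. The helper names `write_jsonl` and `validate_jsonl` and the file name `train_data.jsonl` are hypothetical and not part of the repository; only the `instruction` and `output` field names come from the README.

```python
import json
import os
import tempfile

# The two fields the README describes as required.
REQUIRED_FIELDS = ("instruction", "output")

# The two records from the patch, as Python dicts.
SAMPLES = [
    {
        "instruction": "Write a Python function to calculate factorial",
        "output": (
            "def factorial(n):\n"
            "    if n == 0:\n"
            "        return 1\n"
            "    else:\n"
            "        return n * factorial(n-1)"
        ),
    },
    {
        "instruction": "Create a SQL query to find duplicate emails",
        "output": "SELECT email FROM users GROUP BY email HAVING COUNT(*) > 1;",
    },
]


def write_jsonl(records, path):
    """Write one JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # json.dumps escapes embedded newlines as \n,
            # so every record stays on a single physical line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def validate_jsonl(path):
    """Return the number of valid records; raise ValueError on a bad line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            record = json.loads(line)  # JSONDecodeError is a ValueError
            missing = [k for k in REQUIRED_FIELDS if k not in record]
            if missing:
                raise ValueError(f"line {lineno}: missing fields {missing}")
            count += 1
    return count


if __name__ == "__main__":
    # Write to a temporary directory so the sketch leaves no files behind.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "train_data.jsonl")
        write_jsonl(SAMPLES, path)
        print(validate_jsonl(path), "valid records")
```

The resulting file path is what would be passed as `DATA_PATH` to the finetuning script; whether the script tolerates blank lines or extra fields is not confirmed by the README, so the validator here checks only for the two required keys.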