Synthetic drop-in replacements for Kialo debate datasets.
The Kialo debates are a 👑 gold mine for NLP researchers, AI engineers, computational sociologists, and critical thinking scholars. Yet the mine is legally ⛔️ barred to them: debate data downloaded or scraped from the website may not be used for research or commercial purposes without explicit permission or a license agreement.
That's why the DebateLab team has built this Python module for creating synthetic debate corpora, which may serve as drop-in replacements for the Kialo data. We synthesize such data from scratch by simulating multi-agent debate and collaborative argument mapping with 🤖 LLM-based agents.
- permissive ODC license
- reproducible and extendable
- open-source code base
- works with open LLMs
- one-line import as networkx graphs
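
To illustrate what working with a debate as a networkx graph can look like, here is a minimal sketch. The schema is an assumption made for this example only (node ids, a `claim` node attribute, a `valence` edge attribute, edges pointing from an argument to its target); the actual corpora may use different attribute names, and `direct_arguments` is a hypothetical helper, not part of this module.

```python
import networkx as nx

# Hypothetical schema (not taken from the corpus): nodes carry a "claim" text,
# edges point from an argument to its target and carry a "valence".
debate = nx.DiGraph()
debate.add_node("motion", claim="Cities should ban private cars from their centers.")
debate.add_node("a1", claim="Fewer cars mean cleaner air.")
debate.add_node("a2", claim="Local shops depend on customers who drive.")
debate.add_node("a3", claim="Delivery services can replace car trips to shops.")
debate.add_edge("a1", "motion", valence="pro")
debate.add_edge("a2", "motion", valence="con")
debate.add_edge("a3", "a2", valence="con")


def direct_arguments(graph, target, valence):
    """Return the ids of arguments that directly support or attack `target`."""
    return sorted(
        source
        for source, _, data in graph.in_edges(target, data=True)
        if data["valence"] == valence
    )


pros = direct_arguments(debate, "motion", "pro")
cons = direct_arguments(debate, "motion", "con")

# Edges point towards the target, so every node with a path to the motion
# argues about it, directly or indirectly.
all_arguments = nx.ancestors(debate, "motion")
```

With a graph in this shape, standard networkx traversal and analysis functions apply without any custom parsing.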
| id | llm | # debates | ~# claims | link | contributed by |
|---|---|---|---|---|---|
| synthetic_corpus-001 | Llama-3.1-405B-Instruct | 1000/50/50¹ | 560k/28k/28k¹ | HF hub→ | DebateLab² |
| synthetic_corpus-001-DE | Llama-3.1-SauerkrautLM-70b-Instruct³ | 1000/50/50¹ | 560k/28k/28k¹ | HF hub→ | DebateLab |
¹ per train / eval / test split
² with ❤️ generous support from 🤗 HuggingFace
³ as translator
The following steps sketch the procedure by which debates are simulated:
- Determine the debate's `tag cloud` by randomly sampling 8 topic tags.
- Given the `tag cloud`, let 🤖 generate a debate `topic` (e.g., a question).
- Given the `topic`, let 🤖 generate a suitable `motion` (i.e., the central claim).
- Recursively generate an argument tree, starting with the `motion` as `target argument` (code→):
  - Let 🤖 identify the implicit `premises` of the `target argument` (code→).
  - Let 🤖 generate k `pros` for different `premises` of the `target argument` (code→):
    - Choose the `premise` to target as a function of the `premises`' plausibility.
    - Let 🤖 assume a randomly sampled persona.
    - Generate 2k candidate arguments and select the k most salient ones.
  - Let 🤖 generate k `cons` against different `premises` of the `target argument` (code→):
    - Choose the `premise` to target as a function of the `premises`' implausibility.
    - Let 🤖 assume a randomly sampled persona.
    - Generate 2k candidate arguments and select the k most salient ones.
  - Check for and resolve duplicates via semantic similarity / vector store (code→).
  - Add `pros` and `cons` to the argument tree, and use each of these as a new `target argument` that is argued for and against, unless max depth has been reached.
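
The control flow of the procedure above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the module's implementation: `stub_llm` stands in for real LLM calls, plausibility and salience scores are random placeholders, duplicate detection is simplified to an exact match on normalized text (instead of semantic similarity with a vector store), and all names (`Argument`, `build_tree`, `TAG_POOL`, `PERSONAS`, and so on) are hypothetical.

```python
import itertools
import random
from dataclasses import dataclass, field

# All names below are illustrative; none are taken from the actual module.
TAG_POOL = ["climate", "education", "privacy", "health", "trade", "housing",
            "energy", "media", "transport", "taxation", "migration", "ethics"]
PERSONAS = ["economist", "activist", "skeptic", "engineer"]
_ids = itertools.count()


@dataclass
class Argument:
    text: str
    valence: str = "motion"  # "motion", "pro", or "con"
    children: list["Argument"] = field(default_factory=list)


def stub_llm(kind: str, prompt: str) -> str:
    """Placeholder for an LLM call: ignores the prompt, returns a unique label."""
    return f"{kind}-{next(_ids)}"


def identify_premises(target, rng, n=3):
    """Return (premise, plausibility) pairs; plausibility is a random stand-in."""
    return [(stub_llm("premise", target.text), rng.uniform(0.1, 0.9)) for _ in range(n)]


def generate_arguments(premises, valence, k, rng):
    """Draft 2k candidates and keep the k with the highest (stand-in) salience."""
    # Pros favor plausible premises, cons favor implausible ones.
    weights = [p if valence == "pro" else 1.0 - p for _, p in premises]
    candidates = []
    for _ in range(2 * k):
        premise, _plausibility = rng.choices(premises, weights=weights, k=1)[0]
        persona = rng.choice(PERSONAS)
        text = stub_llm(valence, f"As {persona}, argue {valence} regarding: {premise}")
        candidates.append((rng.random(), text))
    candidates.sort(reverse=True)
    return [Argument(text, valence) for _, text in candidates[:k]]


def build_tree(target, depth, max_depth, k, rng, seen):
    """Recursively attach pros and cons to `target` until max_depth is reached."""
    if depth >= max_depth:
        return
    premises = identify_premises(target, rng)
    for valence in ("pro", "con"):
        for argument in generate_arguments(premises, valence, k, rng):
            key = argument.text.strip().lower()  # exact-match stand-in for similarity
            if key in seen:
                continue
            seen.add(key)
            target.children.append(argument)
    for child in target.children:
        build_tree(child, depth + 1, max_depth, k, rng, seen)


def tree_size(node):
    """Count all arguments in the tree, including the root."""
    return 1 + sum(tree_size(child) for child in node.children)


rng = random.Random(0)
tag_cloud = rng.sample(TAG_POOL, 8)
topic = stub_llm("topic", ", ".join(tag_cloud))
motion = Argument(stub_llm("motion", topic))
build_tree(motion, depth=0, max_depth=2, k=2, rng=rng, seen=set())
```

Passing the random generator and the `seen` set explicitly keeps the sketch reproducible and makes the deduplication step easy to swap for an embedding-based check.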
Configure `workflows/synthetic_corpus_generation.py`. Then:
hatch shell
python workflows/synthetic_corpus_generation.py