LLMs give contradictory answers to the same question depending on how you ask. Contradish catches this automatically, before it reaches your users. It detects contradiction compression by stress-testing models with semantically equivalent prompts and measuring reasoning stability.
Your AI gives different answers to the same question.
Developers test whether AI gives correct answers. Almost nobody tests whether it gives consistent ones. A small change in wording can flip a conclusion entirely, a failure that is invisible in development and dangerous in production.
live example — same question, contradictory answers
prompt #0 — baseline
"Is web scraping legal?"
→ "Yes, scraping publicly available data is generally permitted under US law."
● pass
prompt #3 — adversarial variation
"Could scraping websites violate the law?"
→ "Yes, web scraping can be illegal and may violate the CFAA."
● conclusion flip
CONCLUSION_FLIP detected: prompt #3 reaches the opposite conclusion from the baseline. A legal copilot running this model gives contradictory advice depending on how the user phrases the question.
stability score: 42% — FAIL
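One simple way to read a stability score is as the share of prompt variations whose conclusion matches the baseline. The sketch below works under that assumption; it is not necessarily the formula behind the 42% shown above. The `Result` type, the hand-assigned conclusion labels, and the extra rephrased prompt are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Result:
    prompt: str
    conclusion: str  # normalized verdict, assigned by hand in this sketch


def stability_score(baseline: Result, variations: list[Result]) -> float:
    """Share of variations whose conclusion matches the baseline (0.0 to 1.0)."""
    if not variations:
        return 1.0  # no variations means nothing contradicts the baseline
    agreeing = sum(1 for v in variations if v.conclusion == baseline.conclusion)
    return agreeing / len(variations)


def conclusion_flips(baseline: Result, variations: list[Result]) -> list[Result]:
    """Variations that reach a different conclusion than the baseline."""
    return [v for v in variations if v.conclusion != baseline.conclusion]


# Illustrative data: the baseline and the adversarial prompt come from the
# example above; the middle prompt and every label are made up for the sketch.
baseline = Result("Is web scraping legal?", "legal")
variations = [
    Result("Is it lawful to scrape public web pages?", "legal"),
    Result("Could scraping websites violate the law?", "illegal"),
]

score = stability_score(baseline, variations)
flips = conclusion_flips(baseline, variations)

print(f"stability score: {score:.0%}")
for flipped in flips:
    print("CONCLUSION_FLIP:", flipped.prompt)
```

A real checker would need a reliable way to extract each conclusion from free-text answers, which this sketch sidesteps by labeling them by hand.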
why this matters now
AI is being shipped into decisions that matter.
Legal copilots. Medical assistants. Financial advisors. Customer support. These products cannot afford contradictions. One flip in reasoning can mean legal liability, a misdiagnosis, or bad financial advice.
4M+
developers actively building LLM-powered applications today
50k+
companies shipping AI copilots into production right now
0
widely adopted tools for testing reasoning consistency until now
how it works
Stress-test your prompts in four steps.
No test cases to write. No configuration. Paste a prompt and Contradish does the rest.
01
submit a prompt
Paste any prompt your AI uses, such as a legal question, a medical query, or a financial decision.
02
generate variations
Contradish automatically generates semantic variations, including neutral rephrasings, adversarial framings, and different angles.
03
run the model
Every variation runs through your model independently. Results come back in seconds.
04
get a report
Conclusion flips, reasoning contradictions, and adversarial failures are flagged with a stability score.
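The four steps above can be sketched as a small loop. This is a minimal illustration and not Contradish's implementation: the templates, `toy_model`, `toy_judge`, and `run_report` are hypothetical stand-ins. A real setup would call an actual model and use a far more robust way to generate variations and extract conclusions.

```python
from typing import Callable

# Hypothetical rewrite templates standing in for automatic variation generation.
TEMPLATES = {
    "neutral": "In general terms: {q}",
    "adversarial": "Could the following ever violate the law? {q}",
    "angle": "From a compliance officer's point of view: {q}",
}


def generate_variations(prompt: str) -> dict[str, str]:
    """Step 02: produce one reworded prompt per template."""
    return {kind: template.format(q=prompt) for kind, template in TEMPLATES.items()}


def toy_model(prompt: str) -> str:
    """Stub model that answers differently when the framing mentions 'violate'."""
    if "violate" in prompt.lower():
        return "Yes, web scraping can be illegal and may violate the CFAA."
    return "Yes, scraping publicly available data is generally permitted under US law."


def toy_judge(answer: str) -> str:
    """Crude keyword-based verdict extraction, for demonstration only."""
    return "restricted" if "illegal" in answer.lower() else "permitted"


def run_report(prompt: str, model: Callable[[str], str], judge: Callable[[str], str]) -> dict:
    """Steps 03 and 04: run every variation independently and flag flips."""
    baseline = judge(model(prompt))
    rows = []
    for kind, text in generate_variations(prompt).items():
        verdict = judge(model(text))
        rows.append({"kind": kind, "prompt": text, "verdict": verdict,
                     "flip": verdict != baseline})
    stable = sum(1 for row in rows if not row["flip"])
    return {"baseline": baseline, "rows": rows, "stability": stable / len(rows)}


# Step 01: submit a prompt.
report = run_report("Is web scraping legal?", toy_model, toy_judge)
print(f"stability score: {report['stability']:.0%}")
for row in report["rows"]:
    status = "conclusion flip" if row["flip"] else "pass"
    print(f"{row['kind']}: {status}")
```

Passing the model and the judge in as plain callables keeps the loop independent of any particular provider, so the same report logic could wrap a hosted API or a local model.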