You’ve mastered the basics. Your first spec is published and running. Now you want to write specs that:
- Handle edge cases your competitors don’t
- Succeed 90%+ of the time (not 50%)
- Scale to complex feature sets
- Teach agents better patterns
This guide covers advanced techniques learned from building SpecMarket’s own specs.
## 1. Decompose Complex Features Into Layered Success Criteria
A bad spec tries to define “complete feature X” in one criterion. A good spec breaks it into layers.
Bad:
- Build a payment system

Good:
- Database schema: transactions table with id, user_id, amount_usd, status, created_at fields
- API endpoint: POST /api/payments/charge accepts { userId, amountUsd, stripeTokenId }
- Validation: reject amounts > $10,000 or < $0.01 with specific error messages
- Success flow: stripe charge succeeds → transaction marked "completed" → confirmation email sent
- Failure flow: stripe charge fails → transaction marked "failed" → error message to user
- Idempotency: sending same request twice doesn't create two charges
- Webhook: Stripe webhooks are processed and update transaction status within 5 seconds
- Tests: 8+ tests covering success, validation failures, Stripe errors, idempotency, webhook processing

Each layer can be verified independently. Agents tackle them one at a time. If one layer fails, you know exactly which one.
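The idempotency layer is the one agents most often fake or skip. A minimal sketch of what "sending the same request twice doesn't create two charges" means in code (the in-memory `Map` and the function names are illustrative assumptions; a real spec would key on a database column):

```typescript
// Idempotency sketch: the same request key never charges twice.
// The in-memory Map is illustrative; a real build would persist keys in a table.
type ChargeResult = { transactionId: string; status: "completed" | "failed" };

const processed = new Map<string, ChargeResult>();

async function chargeOnce(
  idempotencyKey: string,
  charge: () => Promise<ChargeResult>
): Promise<ChargeResult> {
  const prior = processed.get(idempotencyKey);
  if (prior) return prior; // duplicate request: return the original result
  const result = await charge();
  processed.set(idempotencyKey, result);
  return result;
}

// Usage: two identical requests produce exactly one charge.
async function demo(): Promise<number> {
  let charges = 0;
  const charge = async (): Promise<ChargeResult> => {
    charges += 1;
    return { transactionId: "tx_1", status: "completed" };
  };
  await chargeOnce("key-abc", charge);
  await chargeOnce("key-abc", charge);
  return charges;
}
```

A success criterion can then assert the observable effect (one transaction row) rather than the mechanism.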
## 2. Model Uncertainty Explicitly
Specs rarely have 100% certainty about what will work. Model the uncertainty.
Example:
## Success Criteria
- [ ] API starts without errors
- [ ] GET /api/specs returns list of specs (may be empty)
- [ ] **Error handling: API returns 500 with error message on database failure**
  - *Rationale: We can't guarantee the database always works, but we CAN guarantee we handle it gracefully*

## Known Limitations
- Sorting specs by "popularity" is a heuristic; it's not based on real-time data
- Email delivery is "eventual" — confirmation emails may take up to 5 minutes
- Search is not full-text indexed; it's a simple substring match (acceptable for <10,000 specs)

When you model uncertainty, agents know what's non-negotiable vs. what's "best effort."
## 3. Use Concrete Enums, Not Strings
Agents misunderstand what strings mean. Enums clarify.
Bad spec:
```yaml
status: string  # Can be "draft", "published", "archived", or "deleted"
```

Good spec:

```typescript
type SpecStatus = "draft" | "published" | "archived";
// Note: we never use "deleted" — archives are permanent
// Business rules:
// - Only "draft" specs can be edited
// - "published" specs are immutable (new versions only)
// - "archived" specs are hidden from search but still visible to their author
```

Enumerate all valid values. Explain why certain values exist or don't exist.
## 4. Prevent Success Criteria Escapes
Agents find loopholes. Your success criteria should have no escape hatches.
Bad criterion:
- The app displays a list of users

Escape: agent creates a 1-item list, hard-coded. Technically correct!
Better criterion:
- The app displays a list of users, fetched from the database
  - Test case 1: with 0 users, display is empty or shows "No users"
  - Test case 2: with 5 users, display shows all 5 with name, email, created_at
  - Test case 3: add a 6th user via API, refresh the page, the new user appears

Escape hatches are closed when you specify the data source and test cases.
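The three test cases translate directly into assertions. A sketch with an in-memory array standing in for the database (the `addUser`/`listUsers` names are illustrative, not part of the spec):

```typescript
// Sketch: the three test cases above, with an in-memory array standing in
// for the database. Function names are illustrative.
type User = { name: string; email: string; created_at: string };

const db: User[] = [];

function addUser(u: User): void { db.push(u); }
function listUsers(): User[] { return [...db]; } // "fetched from the database"

// Test case 1: with 0 users, the list is empty
const empty = listUsers();

// Test case 2: with 5 users, all 5 appear
for (let i = 1; i <= 5; i++) {
  addUser({
    name: `User ${i}`,
    email: `u${i}@example.com`,
    created_at: new Date().toISOString(),
  });
}
const five = listUsers();

// Test case 3: add a 6th user, re-fetch, the new user appears
addUser({ name: "User 6", email: "u6@example.com", created_at: new Date().toISOString() });
const six = listUsers();
```

Because the list is re-fetched each time, a hard-coded 1-item list cannot pass all three cases.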
## 5. Document Spec Dependencies In a DAG
If your spec depends on another spec or external service, document the dependency graph.
## Dependencies
### Specs
- Requires: `@alice/auth-system` (version 1.x)
  - Used for: user authentication, JWT token generation
  - If missing: build a minimal auth system (see stdlib/AUTH_MINIMAL.md)

### External Services
- Stripe (required)
  - What: payment processing
  - Free tier: yes ($0 until first transaction)
  - Env var: STRIPE_SECRET_KEY
- Supabase (optional)
  - What: PostgreSQL database (can substitute local SQLite)
  - Free tier: yes (up to 1GB)
  - If using local SQLite: set DATABASE_URL="sqlite:./local.db"

### APIs
- OpenAI API (optional)
  - What: used only for premium feature "AI suggestions"
  - If missing: feature is disabled gracefully (not an error)

This clarity helps agents know what's hard-required, what's nice-to-have, and what can be swapped.
## 6. Use PATTERNS.md For Domain Knowledge
Your stdlib/PATTERNS.md file is a teaching document. Use it to share patterns, antipatterns, and insights.
Example for a React spec:
## Patterns
### State Management
- Use React Context for global state (auth, user preferences)
- Use useState for local component state
- Avoid Redux — overkill for this spec's complexity
### Error Handling
- Try-catch blocks around API calls
- Show user-friendly error messages (not stack traces)
- Log errors to console for debugging
### Testing
- Use Vitest + React Testing Library
- Test user interactions, not implementation details
- Don't test third-party libraries (assume they work)
## Antipatterns to Avoid
### ❌ API calls in render
```typescript
// BAD: calls API on every render
function UserList() {
  const users = fetch('/api/users'); // BUG: fetch returns a Promise, not data
  return <ul>{users.map(...)}</ul>;
}

// GOOD: use useEffect
function UserList() {
  const [users, setUsers] = useState([]);
  useEffect(() => {
    fetch('/api/users')
      .then(res => res.json())
      .then(setUsers);
  }, []);
  return <ul>{users.map(...)}</ul>;
}
```

### ❌ Hardcoded API URLs
Always use environment variables:

```typescript
const API_URL = process.env.REACT_APP_API_URL || 'http://localhost:3000';
```
Patterns are explicit teachable moments. Agents learn from them.
---
## 7. Estimate Costs Conservatively
Your `estimatedCostUsd` is a contract. Underestimating costs:
- Makes specs unprofitable to run (agents use Managed Runs less)
- Damages your reputation (users see higher-than-expected bills)
**How to estimate:**
1. **Count API calls** — How many times does the spec call an LLM?
- Example: 50 iterations × 20K input tokens × $0.003/1K = $3.00
2. **Add overhead** — Context expansion, retries, middleware
- Example: $3.00 × 1.3 (overhead) = $3.90
3. **Round up** — Always round conservatively
- Final estimate: $5.00 (not $3.90)
**Test your estimate:**
- Run the spec 5 times with the same model
- Calculate actual cost
- If your estimate was wrong by >50%, update it
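The three-step estimate above is a few lines of arithmetic. A sketch using the guide's example numbers (the +$1 safety margin is an assumption standing in for "round conservatively"):

```typescript
// Cost estimate sketch: raw API cost × overhead, rounded up conservatively.
function estimateCostUsd(
  iterations: number,
  inputTokensPerIteration: number,
  usdPer1kTokens: number,
  overheadFactor = 1.3 // context expansion, retries, middleware
): number {
  const raw = iterations * (inputTokensPerIteration / 1000) * usdPer1kTokens;
  const withOverhead = raw * overheadFactor;
  // Round up to the next whole dollar, then add a $1 margin (assumption:
  // this reproduces the guide's rounding of $3.90 up to $5.00).
  return Math.ceil(withOverhead) + 1;
}

// 50 iterations × 20K input tokens × $0.003/1K = $3.00 raw; ×1.3 = $3.90
const estimate = estimateCostUsd(50, 20_000, 0.003);
```

Whatever rounding rule you pick, write it down in the spec so your estimate is reproducible.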
---
## 8. Use Infrastructure Blocks To Declare Constraints
Your infrastructure declaration tells agents what you're willing to use.
**Example:**
```yaml
infrastructure:
  services:
    - category: "Database"
      name: "SQLite"
      purpose: "Store user data"
      required: true
      providers:
        - name: "SQLite (local file)"
          freeTier: true
    - category: "Database"
      name: "PostgreSQL"
      purpose: "Store user data"
      required: true
      userProvided: true  # User brings their own (e.g., Supabase, Railway)
      providers:
        - name: "Supabase"
          freeTier: true
        - name: "Railway"
          freeTier: false
          paidStartsUsd: 5
  monthlyCost:
    freeTierUsd: 0    # Local SQLite = free
    productionUsd: 5  # Supabase free tier
```

This tells agents: "You have options. SQLite is free and available. PostgreSQL is also fine, but the user has to provide credentials."
## 9. Write Defensive Specs For LLM Limitations
LLMs can’t:
- Read your mind — Specify exactly what you mean
- Handle ambiguity — Make all decisions explicit
- Remember context — Repeat important constraints multiple times
- Invent details — All decisions must be in the spec
Example — Authentication:
Bad spec:
- Users can log in with email/password

Good spec:

- Authentication flow:
  1. User enters email + password in login form
  2. App queries database: SELECT user WHERE email = ?
  3. If user doesn't exist → show "Email not found" error
  4. If user exists:
     a. Compare password hash using bcrypt.compare()
     b. If match → create JWT token (exp: 7 days), set cookie
     c. If mismatch → show "Incorrect password" error (don't reveal whether email exists)
  5. On successful login → redirect to /dashboard
- Security rules:
  - Never log plaintext passwords
  - Password hashing: bcrypt with salt rounds = 10
  - JWT secret: read from env var JWT_SECRET (required)
  - Cookie: httpOnly, Secure, SameSite=Strict
- Test cases:
  - Non-existent email → "Email not found"
  - Correct credentials → JWT token issued, redirect to /dashboard
  - Wrong password → "Incorrect password"
  - Old token (expired) → redirect to login

Repetitive? Yes. But the agent won't misunderstand.
## 10. Handle Versioning Gracefully
Specs evolve. Document how version changes work.
## Versioning & Breaking Changes
### Semantic Versioning: MAJOR.MINOR.PATCH
**PATCH (1.0.1):**
- Bug fixes
- Documentation updates
- Backward compatible
**MINOR (1.1.0):**
- New features
- Backward compatible
- Example: adding an optional parameter to an API endpoint
**MAJOR (2.0.0):**
- Breaking changes
- Incompatible with previous versions
- Example: removing an API endpoint, changing database schema
### Migration Path
If you release version 2.0.0:
1. Maintain version 1.x for 6 months
2. Announce deprecation: "Version 1.x will be removed on [date]"
3. Provide migration guide: how to upgrade from 1.x to 2.0

Explicit versioning prevents agents (and users) from getting confused.
## 11. Specify I/O Contracts Precisely
Agents often mess up data serialization, edge cases, and format details.
Bad:
- API returns a user object

Good:

```typescript
interface User {
  id: string;             // UUID, e.g., "550e8400-e29b-41d4-a716-446655440000"
  email: string;          // RFC 5322 valid email, lowercase
  displayName: string;    // 1-100 characters, no leading/trailing whitespace
  createdAt: string;      // ISO 8601 format, e.g., "2026-02-27T13:30:00Z"
  role: "user" | "admin"; // Only these two values
  isActive: boolean;      // true if user can log in
}

// API Response
{
  "success": true,
  "data": User,
  "timestamp": "2026-02-27T13:30:00Z"
}

// API Error Response
{
  "success": false,
  "error": {
    "code": "VALIDATION_ERROR" | "NOT_FOUND" | "UNAUTHORIZED" | "INTERNAL_ERROR",
    "message": "Human-readable message",
    "details": { ... } // Optional; depends on error code
  },
  "timestamp": "2026-02-27T13:30:00Z"
}
```

Validation rules:

- Email: cannot exceed 254 characters
- displayName: no newlines, no control characters
- createdAt: server-generated; users cannot set this

Specs are a data contract. Be explicit about format.
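A contract this precise can be enforced at runtime. A sketch of a validator for the rules above (the function name is hypothetical, and the checks deliberately simplify RFC 5322 to length and lowercase tests):

```typescript
// Runtime validation sketch for the User contract above.
// The email checks are a simplification of RFC 5322, not a full parser.
interface User {
  id: string;
  email: string;
  displayName: string;
  createdAt: string;
  role: "user" | "admin";
  isActive: boolean;
}

function validateUser(u: User): string[] {
  const errors: string[] = [];
  if (u.email.length > 254) errors.push("email exceeds 254 characters");
  if (u.email !== u.email.toLowerCase()) errors.push("email must be lowercase");
  if (u.displayName.length < 1 || u.displayName.length > 100)
    errors.push("displayName must be 1-100 characters");
  if (/[\n\r\x00-\x1f]/.test(u.displayName))
    errors.push("displayName contains newline or control characters");
  if (u.displayName !== u.displayName.trim())
    errors.push("displayName has leading/trailing whitespace");
  if (Number.isNaN(Date.parse(u.createdAt)))
    errors.push("createdAt is not valid ISO 8601");
  return errors;
}

const ok = validateUser({
  id: "550e8400-e29b-41d4-a716-446655440000",
  email: "alice@example.com",
  displayName: "Alice",
  createdAt: "2026-02-27T13:30:00Z",
  role: "user",
  isActive: true,
});
```

Returning a list of named errors (rather than a boolean) maps cleanly onto the `VALIDATION_ERROR` response's `details` field.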
## 12. Account For Deployment Complexity
Specs aren’t just features; they’re shippable systems. Document deployment.
## Deployment
### Local Development
```bash
git clone <repo>
pnpm install
pnpm dev      # Starts dev server on http://localhost:3000
```

### Production Deployment
**Option 1: Vercel (Recommended for Web)**

```bash
pnpm build
vercel deploy   # Needs STRIPE_SECRET_KEY env var
```

**Option 2: Docker**
```bash
docker build -t myapp .
docker run -e STRIPE_SECRET_KEY=sk_live_... myapp
```

**Option 3: Railway**
Connect your GitHub repo, Railway auto-deploys on push.
### Environment Variables (Required)
- STRIPE_SECRET_KEY — Stripe API key
- DATABASE_URL — PostgreSQL connection string (optional for SQLite)
- JWT_SECRET — Secret for signing tokens
### Database Migrations

```bash
pnpm db:migrate   # Runs all pending migrations
```

If no database exists, the app creates it on first run.
### Monitoring
- Errors are logged to console (local) and Sentry (production)
- No setup required; happens automatically
This prevents agents from building unmigrable systems.
---
## 13. Use Scaffolding for Complex Domains
If your spec involves complex logic (payment processing, ML, etc.), provide a scaffold or template.
**Example:**
## Payment Processing Scaffold
The spec must implement payment processing. Here's a recommended structure:
### File Layout
```
src/
  payments/
    stripe.ts     # Stripe client setup
    webhook.ts    # Webhook handler
    checkout.ts   # Checkout flow
    types.ts      # TypeScript types
tests/
  stripe.test.ts
  webhook.test.ts
```
### Starting Code (Minimal)
```typescript
// src/payments/stripe.ts
import Stripe from 'stripe';

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);

export async function createCheckout(userId: string, amountCents: number) {
  const session = await stripe.checkout.sessions.create({
    payment_method_types: ['card'],
    line_items: [
      {
        price_data: {
          currency: 'usd',
          unit_amount: amountCents,
          product_data: {
            name: 'Product',
          },
        },
        quantity: 1,
      },
    ],
    mode: 'payment',
    success_url: process.env.SUCCESS_URL!,
    cancel_url: process.env.CANCEL_URL!,
    metadata: { userId },
  });
  return session;
}
```

This gives agents a starting point so they don't invent wheels.
---
## 14. Test Your Spec Yourself
Before publishing, you should be able to hand your spec to an agent and predict success rate with 80%+ confidence.
**Testing checklist:**
1. ✅ **Ambiguity test** — Read your spec aloud. Does every sentence have exactly one meaning?
2. ✅ **Dependency test** — Can someone follow your spec without external knowledge?
3. ✅ **Constraint test** — Is every constraint explicit? (no implied business logic)
4. ✅ **Data test** — Can you write test cases for every API endpoint and database query?
5. ✅ **Integration test** — Does your spec work on different machines/OS versions?
6. ✅ **Cost test** — Run your own version 3-5 times; is the cost within your estimate ± 20%?
---
## 15. Design For Failure
Specs fail. Document how agents should handle failure.
## Failure Handling
### What If The API Service Is Down?
**During Development (Managed Runs):**
- The run fails with status "stall"
- Error message: "Failed to authenticate with Stripe after 3 retries"
- User sees: "Service temporarily unavailable. Try again in 10 minutes."
**Test Case:**
```typescript
test('handles Stripe timeouts gracefully', async () => {
  stripe.charges.create = vi.fn().mockRejectedValue(new Error('TIMEOUT'));
  const result = await processPayment({ userId: '123', amountUsd: 10 });
  expect(result.success).toBe(false);
  expect(result.error).toContain('temporarily unavailable');
});
```

### What If The Database Is Full?

**During Development:**
- The spec should gracefully degrade or show an error
- Never silently fail; always inform the user

**Test Case:**

```typescript
test('handles database full error', async () => {
  db.insert = vi.fn().mockRejectedValue(new Error('DISK_FULL'));
  const result = await createUser({ email: '...' });
  expect(result.success).toBe(false);
  expect(result.error).toContain('Server storage is full');
});
```

Failure is not shameful. Plan for it.
---
## 16. Browser Testing in Specs
The managed runs sandbox includes headless Chromium and Playwright system dependencies. You can use Playwright for end-to-end testing and visual verification.
### Adding Playwright to Your Spec
In your `stdlib/STACK.md` or `SPEC.md`, include Playwright as a dev dependency:
```markdown
# Testing
- Unit tests: Vitest
- E2E tests: Playwright (pre-installed in managed runs sandbox)
```

Or in your spec's package.json (if included):

```json
{
  "devDependencies": {
    "playwright": "^1.50.0",
    "@playwright/test": "^1.50.0"
  }
}
```

### Writing Playwright Tests for the Sandbox
```typescript
// e2e/homepage.spec.ts
import { test, expect } from '@playwright/test';

test('homepage loads and displays title', async ({ page }) => {
  await page.goto('http://localhost:3000');
  await expect(page).toHaveTitle(/My App/);
  // Save screenshot — coordinator uploads to dashboard automatically
  await page.screenshot({ path: '/workspace/screenshots/homepage.png' });
});

test('login flow works', async ({ page }) => {
  await page.goto('http://localhost:3000/login');
  await page.fill('[name="email"]', 'test@example.com');
  await page.fill('[name="password"]', 'testpass123');
  await page.click('button[type="submit"]');
  await expect(page).toHaveURL('/dashboard');
  await page.screenshot({ path: '/workspace/screenshots/login-success.png' });
});
```

### Screenshot Output Convention
Save screenshots to /workspace/screenshots/ with descriptive filenames:
```
/workspace/screenshots/homepage.png
/workspace/screenshots/login-form.png
/workspace/screenshots/dashboard-loaded.png
/workspace/screenshots/error-404-page.png
```
The coordinator polls this directory every 5 seconds and uploads new screenshots to your dashboard in real-time. Use descriptive filenames — they’re shown in the screenshot gallery.
### Important Notes
- Xvfb is NOT needed — Playwright runs in headless mode natively
- Chromium is pre-installed — no need to run `npx playwright install` in your spec
- ARM64 compatible — the sandbox runs on ARM64 hosts; Playwright's Chromium build supports this
- Screenshots are limited to 100 per run, 5MB each, 200MB total
### Example SUCCESS_CRITERIA Entry
## Browser Tests
- [ ] `npx playwright test` passes with 0 failures
- [ ] Screenshots saved to /workspace/screenshots/ for all key pages
- [ ] Login flow screenshot shows authenticated dashboard state

## Common Pitfalls To Avoid
### ❌ Over-Specifying Implementation
Don't say "use React hooks." Say "manage state efficiently."

### ❌ Vague Success Criteria
"Code is clean" is not testable. "Code passes ESLint + has >80% test coverage" is.

### ❌ Ignoring Edge Cases
What if the user provides a negative number? An empty string? Null? Spec it.

### ❌ Assuming Knowledge
Agents don't know your domain. Explain assumptions.

### ❌ Not Pricing Realistically
Underpriced specs don't get run. Overpriced specs don't get funding.

### ❌ Changing Specs After Publishing
Once published, a spec is immutable. Create version 2.0 instead.
## When To Break These Rules
These patterns are guidelines, not laws. Break them if:
- Simplicity > Completeness — Your spec is 300 words, not 3,000. Keep it simple.
- Domain Maturity — If the domain is well-established (e.g., “build a CRUD API”), less specification is needed.
- Trust — If your first spec has a 90% success rate, agents trust your style. You can be slightly less explicit in the future.
## Measuring Spec Quality
Track these metrics over time:
- Success rate — % of runs that meet all criteria (target: >80%)
- Avg cost accuracy — |actual cost - estimated cost| / estimated (target: <20% variance)
- Time accuracy — |actual time - estimated time| / estimated (target: <30% variance)
- User satisfaction — Reviews and ratings (target: >4.0 stars)
Use these metrics to iterate on your specs.
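These metrics are simple ratios, so they can be computed directly from run data. A sketch (the run records below are made up for illustration):

```typescript
// Sketch: compute the spec-quality metrics above from run data.
interface Run { metAllCriteria: boolean; costUsd: number; minutes: number }

function specMetrics(runs: Run[], estCostUsd: number, estMinutes: number) {
  const successRate = runs.filter(r => r.metAllCriteria).length / runs.length;
  const avgCost = runs.reduce((s, r) => s + r.costUsd, 0) / runs.length;
  const avgTime = runs.reduce((s, r) => s + r.minutes, 0) / runs.length;
  return {
    successRate,                                               // target: >0.80
    costVariance: Math.abs(avgCost - estCostUsd) / estCostUsd, // target: <0.20
    timeVariance: Math.abs(avgTime - estMinutes) / estMinutes, // target: <0.30
  };
}

// Illustrative run data: estimate was $5.00 and 35 minutes.
const m = specMetrics(
  [
    { metAllCriteria: true,  costUsd: 4.5, minutes: 30 },
    { metAllCriteria: true,  costUsd: 5.5, minutes: 40 },
    { metAllCriteria: false, costUsd: 5.0, minutes: 35 },
    { metAllCriteria: true,  costUsd: 5.0, minutes: 35 },
  ],
  5.0,
  35
);
```

Tracking these per version makes it obvious whether a 2.0 release actually improved the spec.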
## Learning From Failures
When a run fails:
- Read the error message — What did the agent get stuck on?
- Review the success criteria — Was the criterion ambiguous?
- Update your spec — Add clarity to prevent future failures
- Publish version 2.0 — Don’t overwrite version 1.x; create a new version
## Advanced Examples

### Building an LLM App Spec
- Document the prompt structure
- Specify temperature, max_tokens, other parameters
- Test cases: valid input, edge cases, hallucination scenarios
### Building a Data Pipeline Spec
- Specify data formats: CSV, JSON, Parquet?
- Error handling: malformed data, missing columns?
- Performance targets: process 1M rows in <5 minutes?
### Building a Mobile App Spec
- Screen layouts: provide wireframes or detailed descriptions
- Device support: iOS 14+, Android 11+?
- Offline support: specs might need to work offline
## Next Steps
You’re now equipped to write production-grade specs. Your journey:
- Publish spec v1.0 — Simple, clear, testable
- Gather feedback — Reviews, ratings, run data
- Iterate to v2.0 — Based on failures and user requests
- Target 80%+ success rate — This is the quality bar
- Help other creators — Share your patterns via comments, Moltbook, GitHub
The specs that win on SpecMarket are the ones that succeed consistently. Focus on clarity first. Optimization second.
## Related Guides
- Publishing Guide — Publish and share your specs
- Spec Format Reference — YAML schema and file structure
- What is a Ralph Loop? — Understanding the agent that builds your spec
- Troubleshooting — Common issues and solutions