Migrating a large legacy test suite—like Airbnb’s transition of nearly 3,500 Enzyme tests to React Testing Library (RTL)—typically takes over a year of manual engineering work. By leveraging large language models (LLMs) and robust automation pipelines, teams can compress that timeline to mere weeks. This article covers why legacy test migrations are so challenging, how modern LLMs excel at code transformation, and how to design a pipeline that scales a model-driven migration with human oversight, while preserving test intent and coverage.
Why Switch to React Testing Library (RTL)?
Migrating legacy test frameworks is a daunting task for any engineering organization. When Airbnb began adopting React Testing Library (RTL) for new tests in 2020, they faced thousands of Enzyme-based tests that no longer aligned with modern React practices. Without automated support, refactoring those tests by hand would have taken an estimated 1.5 years of engineering effort.
Large language models (LLMs) offer a new path forward: by interpreting code intent and translating API calls, they can drive bulk migrations while preserving the original test logic. In six weeks, Airbnb’s team completed what would have been 18 months of manual work, demonstrating LLMs’ potential to revolutionize large-scale code transformations.
The Challenge of Legacy Test Suites
Enzyme’s Historical Role
Enzyme had been Airbnb’s primary React testing framework since 2015, giving engineers deep access to component internals and making white-box testing straightforward.
Misalignment with Modern React
As React evolved, Enzyme’s shallow and full DOM rendering approaches no longer matched the emerging best practice of testing components through their public interfaces and user events. RTL encourages testing behavior over implementation details, which leads to more robust, maintainable tests.
The Cost of Manual Migration
Hand-rewriting each Enzyme test to RTL idioms is time-consuming. Beyond simple API swaps, engineers must ensure that mocks, asynchronous flows, and custom utilities continue to work correctly. Deleting Enzyme tests outright would create coverage gaps and risk regressions. Traditional codemod scripts can handle syntax changes but struggle with semantic intent, making pure automation brittle for a codebase of thousands of tests.
Rise of LLMs for Code Transformation
Core LLM Capabilities
OpenAI’s GPT-4 and similar LLMs excel at few-shot learning: given prompts with examples, they can generalize patterns to new inputs. For test migration, you provide input-output pairs (Enzyme→RTL) and let the model apply those transformations across files.
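For illustration, a few-shot prompt for this task might be laid out as follows (the examples and delimiters are placeholders, not a published template):

```text
You are migrating Enzyme tests to React Testing Library.
Preserve test intent, variable names, and imports.

### Enzyme (input)
wrapper.find('button.save').simulate('click');
expect(wrapper.state('saved')).toBe(true);

### RTL (output)
fireEvent.click(screen.getByRole('button', { name: /save/i }));
expect(screen.getByText(/saved/i)).toBeInTheDocument();

### Enzyme (input)
<contents of the file to convert>

### RTL (output)
```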
Translating API Calls
LLMs can map Enzyme methods (`shallow`, `mount`, `find`, etc.) to RTL counterparts (`render`, `screen.getBy…`, `fireEvent`, etc.) while preserving variable names, custom utilities, and file imports. This semantic mapping is hard for regex- or AST-based scripts but natural for LLMs trained on vast code corpora. A representative before/after pair is sketched below.
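Here is what such a mapping looks like on a hypothetical `Counter` component (the component and its markup are assumptions, not code from Airbnb’s suite):

```tsx
import React from 'react';

// Before: Enzyme reaches into component internals (state, CSS selectors).
import { mount } from 'enzyme';
import Counter from './Counter'; // hypothetical component

it('increments the count (Enzyme)', () => {
  const wrapper = mount(<Counter />);
  wrapper.find('button.increment').simulate('click');
  expect(wrapper.state('count')).toEqual(1);
});

// After: RTL observes only what a user can see in the DOM.
import { render, screen, fireEvent } from '@testing-library/react';
import '@testing-library/jest-dom';

it('increments the count (RTL)', () => {
  render(<Counter />);
  fireEvent.click(screen.getByRole('button', { name: /increment/i }));
  expect(screen.getByText(/count: 1/i)).toBeInTheDocument();
});
```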
Handling Edge Cases
Complex tests may include custom matchers, asynchronous flows, or proprietary utilities. By including a handful of these examples in your prompt, LLMs can learn to refactor them too. When they misstep, human reviewers catch and correct them, and revised prompts then steer the model toward better outputs on subsequent runs.
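As a concrete example of an edge case worth including in the prompt, consider an asynchronous flow: RTL’s `findBy…` queries wait for an element to appear, replacing Enzyme-era manual flushing (the `UserProfile` component here is hypothetical):

```tsx
import React from 'react';
import { render, screen } from '@testing-library/react';
import '@testing-library/jest-dom';
import UserProfile from './UserProfile'; // hypothetical component that fetches data

it('shows the user name once the fetch resolves', async () => {
  render(<UserProfile id="42" />);
  // findByText polls the DOM until the text appears or a timeout elapses.
  expect(await screen.findByText(/jane doe/i)).toBeInTheDocument();
});
```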
Designing an LLM-Powered Migration Pipeline
To scale an LLM migration from a few files to thousands, you need a structured, repeatable pipeline. Here’s how Airbnb and other teams can build one:
1. Prompt Engineering
- Example-Driven Templates: Collect 5–10 representative Enzyme→RTL test pairs. Use these in your prompt to show the model exactly how to transform each pattern (a prompt-builder sketch follows this list).
- Minimal Context: Keep prompts concise to fit within token limits. Focus on method swaps and import updates, letting the model infer surrounding context.
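A minimal sketch of such a template builder, assuming a simple delimiter format (none of these names come from Airbnb’s tooling):

```ts
interface ExamplePair {
  enzyme: string;
  rtl: string;
}

// Assemble a few-shot prompt: N worked examples, then the file to convert.
function buildPrompt(examples: ExamplePair[], source: string): string {
  const shots = examples
    .map(({ enzyme, rtl }) => `### Enzyme\n${enzyme}\n### RTL\n${rtl}`)
    .join('\n\n');
  return [
    'Convert the following Enzyme test to React Testing Library.',
    'Preserve test intent, variable names, and custom utilities.',
    shots,
    `### Enzyme\n${source}\n### RTL`,
  ].join('\n\n');
}
```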
2. Batching and Parallelization
- Chunk Files: Group test files in batches (e.g., 50 at a time) to maximize throughput while staying under rate limits and token caps; a batching sketch follows this list. Airbnb ran 75% of their files through initial prompts in just four hours by processing 50-file batches in parallel.
- Cache Prompts: Reuse common scaffolding (imports, example snippets) across batches to reduce prompt size and cost.
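A minimal batching sketch in TypeScript (the `migrate` callback stands in for whatever LLM call your pipeline makes):

```ts
// Process files in fixed-size chunks: files within a batch run in
// parallel, while batches run sequentially to respect rate limits.
async function migrateInBatches(
  files: string[],
  migrate: (file: string) => Promise<void>,
  batchSize = 50,
): Promise<void> {
  for (let i = 0; i < files.length; i += batchSize) {
    const batch = files.slice(i, i + batchSize);
    // allSettled keeps one failing file from aborting its whole batch.
    await Promise.allSettled(batch.map(migrate));
  }
}
```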
3. Automated Validation
- Lint and Snapshot Checks: After conversion, run linters and snapshot tests to catch syntax errors or obvious regressions; failing files are flagged for human review (see the gate sketch after this list).
- CI Integration: Incorporate the pipeline into your CI/CD workflow so every pull request triggers a migration pass, keeping your tests up to date.
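One way to implement such a gate, sketched with Node’s `child_process` and the standard `eslint` and `jest` CLIs (the gating policy itself is an assumption):

```ts
import { execFileSync } from 'node:child_process';

// A converted file passes only if it lints cleanly and its tests
// still succeed; anything else is queued for review or retry.
function passesQualityGate(file: string): boolean {
  try {
    execFileSync('npx', ['eslint', file], { stdio: 'pipe' });
    execFileSync('npx', ['jest', file, '--ci'], { stdio: 'pipe' });
    return true;
  } catch {
    return false; // flag for human review or another LLM attempt
  }
}
```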
4. Human-in-the-Loop
- Review Failing Cases: Engineers examine errors, refine prompts, or write custom scripts for complex scenarios. This feedback loop steers the model toward better accuracy on subsequent iterations.
- Iterative Refinement: Adjust prompts to handle edge cases (e.g., asynchronous queries, custom renderers) and re-run the pipeline on failed files until you reach your quality targets.
5. Tooling and Integration
- CLI Wrappers: Wrap your LLM calls in a command-line tool or npm script, exposing simple commands like `migrate-tests --path src/components` for developers to run locally or in CI (a minimal wrapper is sketched after this list).
- Monitoring and Metrics: Track conversion success rates, prompt costs, and manual review time to quantify ROI. Airbnb achieved a 97% automated migration rate, leaving the remaining 3% for manual fixes.
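A minimal CLI entry point using Node’s built-in argument parser (the `migrate-tests` command name and flags are illustrative):

```ts
import { parseArgs } from 'node:util';

// Parse --path and --batch-size, then hand off to the pipeline.
const { values } = parseArgs({
  options: {
    path: { type: 'string', default: 'src' },
    'batch-size': { type: 'string', default: '50' },
  },
});

console.log(
  `Migrating tests under ${values.path} in batches of ${values['batch-size']}`,
);
// ...invoke migrateInBatches(...) from the batching step here.
```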
Best Practices & Common Pitfalls
- Avoid Hallucinations: Keep examples tightly scoped. If the model invents APIs or uses incorrect imports, shorten the prompt or add more representative examples to guide it.
- Manage Costs: LLM inference can be expensive at scale. Monitor token usage, prefer smaller models for simpler files, and consider open-source alternatives for cost optimization.
- Handle Non-Determinism: LLMs may produce slightly different outputs on each run. Lock your prompt templates and seed values when possible, or post-process outputs with deterministic scripts to normalize formatting.
- Maintain Idempotence: Ensure that running the migration twice produces the same result. If the LLM wraps rendered elements differently each time, add a post-migration formatter (e.g., Prettier) to enforce consistency, as in the sketch below.
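A normalization pass along these lines, assuming Prettier 3’s async `format` API:

```ts
import * as prettier from 'prettier';

// Format every LLM output the same way so repeated runs converge on
// byte-identical files, making the pipeline idempotent.
async function normalize(code: string): Promise<string> {
  return prettier.format(code, { parser: 'typescript' });
}
```

Because the formatter is deterministic, diff noise between runs disappears and re-running the migration on an already-migrated file becomes a no-op.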
Airbnb’s Migration Goals & Scope
Airbnb set out to migrate roughly 3,500 React component test files from Enzyme to React Testing Library (RTL), a task originally estimated to take 1.5 years of manual effort and ultimately completed in six weeks. The team needed to preserve test intent and code coverage while aligning with modern React practices. Success metrics included the automated migration rate, minimal manual fixes, and seamless developer adoption of RTL. This ambitious scope required both a robust automation pipeline and expert oversight to manage edge cases and ensure quality throughout the process.
Step-By-Step Migration Workflow
1. Preparation & Bootstrapping
Airbnb began by harvesting a corpus of example tests, pairing Enzyme and RTL versions to use as few-shot prompts for the LLM. They extracted common test patterns and assembled 40–50 related files per prompt to give the model deep contextual understanding.
2. Automated Conversion Pass
Files were processed in parallel batches of ~50, with each batch fed through a CLI-wrapped LLM call that attempted conversion, lint fixes, and type checks. The initial pass converted 75% of files in 4 hours, showcasing how parallelization and prompt caching can scale migrations effectively.
3. Validation & Quality Gates
Converted tests ran in CI pipelines where lint, type, and snapshot checks identified failures. Failing files were automatically rerouted back into the LLM pipeline for up to 10 retry attempts, each time with updated prompts that included the latest error messages and file states.
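A retry loop of that shape might look like the following sketch, with the model call and validators injected (both interfaces are assumptions, not Airbnb’s actual tooling):

```ts
interface ValidationResult {
  ok: boolean;
  errors: string;
}

// Re-prompt with the latest failure output until validation passes
// or the attempt budget is exhausted.
async function convertWithRetries(
  source: string,
  callModel: (prompt: string) => Promise<string>,
  validate: (code: string) => Promise<ValidationResult>, // lint + type + snapshot checks
  maxAttempts = 10,
): Promise<string | null> {
  let feedback = '';
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const candidate = await callModel(
      `Convert this Enzyme test to React Testing Library:\n${source}${feedback}`,
    );
    const result = await validate(candidate);
    if (result.ok) return candidate;
    feedback = `\n\nYour previous attempt failed with:\n${result.errors}`;
  }
  return null; // exhausted retries: hand off to an engineer
}
```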
4. Iterative Refinement (“Sample, Tune, Sweep”)
To tackle the remaining 25%, Airbnb used a feedback loop: they sampled failing files, analyzed common errors, updated prompt templates or added new few-shot examples, then swept the fixes across all failing files. Over four days, this loop raised the automated success rate from 75% to 97%, leaving under 100 files for manual intervention.
5. Manual Finish & Rollout
For the final 3%, engineers used the LLM-generated refactors as baselines, completing edge-case fixes by hand in under a week. End to end, the full migration wrapped up in six weeks, with RTL fully adopted across Airbnb’s codebase.
Results & Impact
- Time Savings: Condensed 1.5 years of projected work into six weeks, roughly a 90% reduction in timeline.
- Coverage & Accuracy: Achieved 97% automated migration while preserving test intent and maintaining existing coverage thresholds.
- Cost Efficiency: The combined cost of LLM API usage and six weeks of engineer time was significantly lower than the projected cost of 18 months of manual labor.
- Developer Experience: Engineers avoided context-switch fatigue and focused on writing new RTL tests and feature work instead of refactoring boilerplate.
Key Learnings & Recommendations
- High-Quality Prompts Matter Most: Invest time in building representative few-shot examples and including related file context; this drives accuracy far more than prompt-tuning minutiae.
- Distributed, Step-Based Pipelines Scale: Modeling your migration as a state machine with validation and refactor steps makes it easy to parallelize and monitor progress at scale.
- Brute-Force Retry Loops Work: Simple retry strategies with dynamically updated prompts often outperform overly complex prompt engineering, especially on mid-complexity files.
- Human-in-the-Loop for Edge Cases: Automate broadly but reserve human review for the long tail. Using LLM outputs as baselines accelerates manual fixes substantially.
Generalizing Beyond Airbnb
While Airbnb’s case focused on Enzyme→RTL, the same LLM-driven pipeline can apply to:
- Framework Upgrades: Migrating between major versions of libraries (e.g., AngularJS→Angular) by teaching the model API mappings and patterns.
- Language Transforms: Converting JavaScript to TypeScript, or Python 2 to Python 3, by feeding the model syntax and typing examples.
- Style & Security Fixes: Bulk-refactoring code to align with style guides or fix known vulnerability patterns in open-source dependencies.
Building internal tooling—CLI wrappers, CI hooks, retry managers, and dashboarding—turns these one-off migrations into repeatable workflows. Team-specific patterns (e.g., naming conventions, custom utilities) can be captured via shared prompt libraries, ensuring consistency across projects.
Conclusion & Future Outlook
Airbnb’s six-week migration proves that LLMs can radically accelerate large-scale code transformations, turning multi-quarter efforts into multi-week sprints. By combining scalable pipelines, smart prompt engineering, and human oversight, teams can preserve code intent, maintain coverage, and redeploy engineering resources to higher-value work.
At Software Development AI, we’re building next-generation tools and services to help enterprises replicate these successes—whether that means migrating test suites, modernizing legacy code, or automating compliance checks. If you’re ready to explore how AI can transform your development lifecycle, let’s talk.
External References
- Accelerating Large-Scale Test Migration with LLMs (Airbnb Engineering): https://medium.com/airbnb-engineering/accelerating-large-scale-test-migration-with-llms-9565c208023b