3.4 Reverse Engineering, Re-engineering & Software Configuration Management
The legacy code problem
Most software work is not writing new code from scratch — it is understanding, modifying and extending existing code. Sommerville's data: ~60–80% of total software cost is spent on maintenance and enhancement of legacy systems.
When you inherit code with no documentation, written by someone who has long left, in a style you'd never choose — that's a legacy system. Three engineering activities help:
| Activity | Goal |
|---|---|
| Reverse Engineering | Understand what the legacy code does |
| Re-engineering | Improve the legacy system without changing its function |
| Forward engineering | Build new code from a (recovered or new) design |
---
Reverse Engineering
Reverse engineering is the process of analysing an existing system to identify its components and their inter-relationships, producing a representation at a higher abstraction level.
Source Code ──► Design Documents ──► Requirements
(low) (medium) (high)
Goals of reverse engineering
- Recover lost design documentation
- Understand undocumented systems
- Identify reusable components
- Aid in re-engineering decisions
- Locate defects or vulnerabilities
Reverse engineering activities
- Source code analysis — parse the code into an abstract syntax tree (AST)
- Restructuring — improve internal structure without behaviour change
- Extraction — identify modules, data structures, control flow
- Abstraction — generate UML, ER, DFD from code
- Documentation generation — Doxygen / Javadoc-style automated docs
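As a minimal sketch of the analysis and extraction steps above, the snippet below uses Python's standard `ast` module to parse a small piece of source and recover which function calls which. The `LEGACY_SOURCE` string and the `call_graph` helper are hypothetical illustrations, not part of any tool named in these notes, and the sketch only follows direct calls by name (method calls such as `text.split` are ignored).

```python
import ast

# Hypothetical legacy snippet; any Python source string would do.
LEGACY_SOURCE = '''
def load(path):
    return open(path).read()

def parse(text):
    return text.split(",")

def main(path):
    return parse(load(path))
'''


def call_graph(source):
    """Map each top-level function name to the plain names it calls."""
    tree = ast.parse(source)  # source code analysis: text -> AST
    graph = {}
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):  # extraction: find the functions
            calls = set()
            for sub in ast.walk(node):
                # Only direct calls such as f(x); attribute calls are skipped.
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                    calls.add(sub.func.id)
            graph[node.name] = calls
    return graph


graph = call_graph(LEGACY_SOURCE)
for name, callees in graph.items():
    print(name, "->", sorted(callees))
```

The resulting dictionary is a crude higher-level model: it could be fed to a diagramming tool to draw the dependency picture that the abstraction step asks for.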
Tools
- UML reverse engineering: Enterprise Architect, StarUML, Visual Paradigm
- Decompilers: JD-GUI (Java), dotPeek (.NET), Ghidra (binaries)
- Source-code analysers: SonarQube, NDepend, Structure101
Reverse engineering is not...
| Reverse engineering is NOT | Why |
|---|---|
| Decompilation alone | Decompilation is a step, not the whole activity |
| Software piracy | RE is legitimate for maintenance, security, interoperability |
| Always legal | Some EULAs forbid it; check jurisdiction (DMCA, EU directives) |
---
Software Re-engineering
Re-engineering is the examination and alteration of an existing system to reconstitute it in a new form, followed by implementation of that new form, while preserving its essential function.
Re-engineering activities
Existing System
│
▼
[Reverse Engineering] ──► Higher-level model
│
▼
[Restructure / Refactor]
│
▼
[Forward Engineering] ──► Re-engineered System
- Inventory analysis — what do we have?
- Document restructuring — re-create missing documentation
- Reverse engineering — extract design
- Code restructuring — clean up the code (formatting, dead code removal, modularisation)
- Data restructuring — schema improvements, normalisation
- Forward engineering — build new modules using recovered design
When to re-engineer
| Re-engineer if... | Throw away if... |
|---|---|
| Business logic is still relevant | Requirements have fundamentally changed |
| Hardware/platform is being replaced | Better commercial off-the-shelf exists |
| Maintenance cost > 50% of replacement | Code is unsalvageable mess |
| Team has skills in current language | No team understands it anymore |
Famous re-engineering project: UK's National Air Traffic Services moved from a 1970s-era system to a re-engineered platform in the 2000s — they preserved business logic but rebuilt the technical foundation.
---
Reverse Engineering vs Re-engineering — comparison
| Aspect | Reverse Engineering | Re-engineering |
|---|---|---|
| Output | Understanding / documentation | New system |
| Changes code? | No | Yes |
| Effort | Smaller | Larger |
| Risk | Low | Medium-high |
| Goal | Comprehension | Improvement |
---
Code Restructuring
A subset of re-engineering that improves internal structure without changing external behaviour. The modern term is refactoring (popularised by Martin Fowler's 1999 book).
Common refactorings
| Refactoring | What it does |
|---|---|
| Rename variable | Make name reflect intent |
| Extract method | Move a code block to its own function |
| Extract class | Split a god-class into focused classes |
| Inline method | Reverse — eliminate trivial wrappers |
| Move method | Relocate to where it belongs |
| Replace conditional with polymorphism | Use OO instead of switch |
| Replace magic number with constant | if (status == 3) → if (status == APPROVED) |
| Remove dead code | Delete unreachable code |
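To make one row of the table concrete, here is a hedged before/after sketch of "Replace conditional with polymorphism". The shipping example, its class names and its prices are invented for illustration only.

```python
from abc import ABC, abstractmethod


# Before: behaviour is selected by a type code and a chain of conditionals.
def shipping_cost_before(kind, weight_kg):
    if kind == "standard":
        return 5.0 + 1.0 * weight_kg
    elif kind == "express":
        return 10.0 + 2.0 * weight_kg
    raise ValueError(f"unknown kind: {kind}")


# After: each variant is a class, and the conditional disappears.
class Shipping(ABC):
    @abstractmethod
    def cost(self, weight_kg):
        """Return the shipping cost for a parcel of the given weight."""


class Standard(Shipping):
    def cost(self, weight_kg):
        return 5.0 + 1.0 * weight_kg


class Express(Shipping):
    def cost(self, weight_kg):
        return 10.0 + 2.0 * weight_kg


# Callers hold a Shipping object and never inspect its type.
for option in (Standard(), Express()):
    print(type(option).__name__, option.cost(3))
```

Adding a new shipping kind now means adding one class rather than editing every conditional that switches on `kind`.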
Rules of refactoring
- Refactor in small steps
- Run tests after each step
- Never change behaviour and refactor in the same step
- Commit refactorings separately from feature changes
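The rules above can be sketched as a small workflow: pin the current behaviour with a test, apply one refactoring, and re-run the test. The invoice function, the discount and the test cases below are hypothetical; the status code 3 mirrors the magic-number example in the table.

```python
# Step 0: the legacy function, left untouched as the reference behaviour.
def invoice_total_legacy(prices, status):
    total = 0
    for p in prices:
        total += p
    if status == 3:
        total = total * 0.9
    return total


# Step 1: Replace magic number with constant.
APPROVED = 3
APPROVED_DISCOUNT = 0.9


# Step 2: Extract method (the summing loop becomes its own function).
def subtotal(prices):
    return sum(prices)


def invoice_total(prices, status):
    total = subtotal(prices)
    if status == APPROVED:
        total = total * APPROVED_DISCOUNT
    return total


# Characterisation test: inputs chosen to exercise both branches.
CASES = [([], 1), ([10, 20], 1), ([10, 20], 3), ([99.5], 3)]


def behaviour_preserved():
    """True when the refactored function matches the legacy one on every case."""
    return all(
        invoice_total(prices, status) == invoice_total_legacy(prices, status)
        for prices, status in CASES
    )


print("behaviour preserved:", behaviour_preserved())
```

In practice each step would be its own commit, with the check run between them, so that a failing test points at exactly one small change.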
---
Software Configuration Management (SCM)
SCM is the discipline of identifying, organising and controlling modifications to the software being built. Without SCM, large software projects collapse into chaos — multiple developers overwrite each other's work, releases are unreproducible, and bugs in production cannot be tied back to specific code versions.
Configuration items (SCIs)
Anything that can change during the project is a configuration item:
- Source code files
- Build scripts and Makefiles
- Documentation (SRS, design docs, manuals)
- Test cases and test data
- Configuration files (.env, .yaml)
- Third-party libraries (with versions)
- Database schema and migrations
- Issue tracker entries
---
The 5 SCM functions
1. Configuration Identification
- Give each SCI a unique identifier
- Establish a baseline — frozen version that becomes the reference
- Common baselines: requirements baseline, design baseline, product baseline
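One way to picture identification and baselining is the sketch below. It assumes a content hash is an acceptable unique identifier for a version of an item; real projects may use document numbers or commit IDs instead. `ConfigItem`, `Baseline`, `identify` and the file path are all hypothetical names.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigItem:
    path: str    # where the item lives, e.g. "src/main.py" (hypothetical)
    digest: str  # SHA-256 of the content; identifies this exact version


@dataclass(frozen=True)
class Baseline:
    name: str     # e.g. "product-1.0" (hypothetical label)
    items: tuple  # tuple of ConfigItem, fixed at creation time


def identify(path, content):
    """Give one version of an item an identifier derived from its content."""
    return ConfigItem(path, hashlib.sha256(content).hexdigest())


item = identify("src/main.py", b"print('hi')")
baseline = Baseline("product-1.0", (item,))
print(baseline.name, item.path, item.digest[:12])
```

Because both dataclasses are frozen, assigning to `baseline.name` or `item.digest` raises an error, which loosely mirrors the idea that a baseline changes only through a new, approved baseline.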
2. Version Control
Track every change to every SCI with a history:
- Version — a numbered state (1.0, 1.1, 1.2)
- Revision — a change to a version
- Variant — a parallel version for a different audience
- Release — a version delivered to users
Tools: Git (Linus Torvalds, 2005), Mercurial, SVN, ClearCase, Perforce.
3. Change Control
Every modification follows a formal process:
    Change Request (CR)
            │
            ▼
    Change Control Board (CCB) review
            │
      ┌─────┼──────┐
      │     │      │
    Reject Hold Approve
                   │
                   ▼
            Implement change
                   │
                   ▼
             Test & verify
                   │
                   ▼
            Update baseline
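The flow above can be modelled as a small state machine. This is an illustrative sketch rather than any standard's definition: the state names are invented, and it assumes that a request on hold can later be re-reviewed and either approved or rejected.

```python
from enum import Enum


class State(Enum):
    SUBMITTED = "submitted"
    REJECTED = "rejected"
    ON_HOLD = "on hold"
    APPROVED = "approved"
    IMPLEMENTED = "implemented"
    VERIFIED = "verified"
    BASELINED = "baselined"


# Allowed transitions, following the diagram.
ALLOWED = {
    State.SUBMITTED: {State.REJECTED, State.ON_HOLD, State.APPROVED},
    State.ON_HOLD: {State.REJECTED, State.APPROVED},  # assumption: re-review
    State.APPROVED: {State.IMPLEMENTED},
    State.IMPLEMENTED: {State.VERIFIED},
    State.VERIFIED: {State.BASELINED},
    State.REJECTED: set(),
    State.BASELINED: set(),
}


class ChangeRequest:
    def __init__(self, cr_id, summary):
        self.cr_id = cr_id
        self.summary = summary
        self.state = State.SUBMITTED
        self.history = [State.SUBMITTED]  # the paper trail

    def move_to(self, new_state):
        """Advance the CR, refusing any step the process does not allow."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"{self.cr_id}: {self.state.value} -> {new_state.value} not allowed"
            )
        self.state = new_state
        self.history.append(new_state)


cr = ChangeRequest("CR-1", "Fix rounding in invoice totals")
for step in (State.APPROVED, State.IMPLEMENTED, State.VERIFIED, State.BASELINED):
    cr.move_to(step)
print([s.value for s in cr.history])
```

Trying to implement a change that the CCB has not approved (for example `SUBMITTED` straight to `IMPLEMENTED`) raises `ValueError`, which is the point of change control.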
4. Configuration Auditing
Periodic verification that:
- All SCIs are properly identified
- All changes are documented and approved
- Baselines are consistent with reality
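A rough sketch of the third check, "baselines are consistent with reality", follows. It assumes the baseline is recorded as a mapping from path to SHA-256 digest and that the actual project contents are available as a mapping from path to bytes; both shapes, and the `audit` function itself, are hypothetical.

```python
import hashlib


def digest(content):
    """SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(content).hexdigest()


def audit(baseline, actual):
    """Compare recorded digests with actual contents and list discrepancies."""
    findings = {"missing": [], "modified": [], "untracked": []}
    for path, expected in baseline.items():
        if path not in actual:
            findings["missing"].append(path)    # in baseline, not on disk
        elif digest(actual[path]) != expected:
            findings["modified"].append(path)   # content differs from baseline
    # Items that exist but were never identified as configuration items.
    findings["untracked"] = sorted(set(actual) - set(baseline))
    findings["missing"].sort()
    findings["modified"].sort()
    return findings


baseline = {"a.py": digest(b"1"), "b.py": digest(b"2")}
actual = {"a.py": b"1", "b.py": b"changed", "c.py": b"3"}
print(audit(baseline, actual))
```

An empty result in all three lists would mean this baseline matches reality; checking that each change is tied to an approved change request would need the change-request log as a further input.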
5. Status Reporting
Communicate to stakeholders:
- Current state of each SCI
- Pending change requests
- Recent approved changes
- Baseline history
---
Version Control with Git — modern essentials
| Concept | Definition |
|---|---|
| Repository | Storage of all versions |
| Commit | A snapshot with a message |
| Branch | An independent line of development |
| Merge | Combine two branches |
| Pull request / Merge request | Proposed change for review |
| Tag | Named version (often a release) |
Branching strategies:
- Git Flow — develop, feature, release, hotfix branches (heavy)
- GitHub Flow — main + feature branches (light)
- Trunk-Based Development — one main branch, short-lived features
---
SCM Plan (SCMP) — typical contents (IEEE 828)
- Introduction
- SCM Management — roles and responsibilities
- SCM Activities — identification, change control, audits
- SCM Schedule — timing of audits and baselines
- SCM Resources — tools, hardware, training
- SCM Plan Maintenance — how the SCMP itself is updated
---
Key Terms — Lesson 3.4
The terms below define the vocabulary of legacy-system engineering and configuration management — every PYQ on RE/re-engineering or SCM expects them.
Legacy System — A software system that continues to deliver business value but uses outdated technology, lacks current documentation, or relies on people who have left the organisation. Most maintenance work in industry happens on legacy systems. Indian outsourcing firms built their early business on legacy mainframe migration.
Forward Engineering — The traditional development direction: requirements → design → code → executable. Forward engineering is what every SDLC model in Units I–III describes.
Reverse Engineering — Working in the opposite direction: from existing code or binaries back to a higher-level representation — design diagrams, requirements, or even alternative implementations. The goal is understanding, not modification. Output: documentation, UML diagrams, recovered specifications.
Re-engineering — Reverse engineering followed by forward engineering — understand the legacy system, then rebuild it in a new form (new technology, new architecture, sometimes new language) while preserving its essential business function. When the business logic is still relevant, re-engineering is often cheaper and less risky than redesigning from scratch.
Decompiler / Disassembler — Tools that convert compiled binaries back to higher-level form. A disassembler converts machine code to assembly. A decompiler goes further — assembly back to a high-level language (often C-like). JD-GUI for Java, dotPeek for .NET, Ghidra (NSA-released) and IDA Pro for native binaries.
Restructuring — A subset of re-engineering that changes the internal organisation of code without changing its external behaviour — modularising, removing dead code, normalising formatting, splitting god-classes. The modern term is refactoring.
Refactoring — The discipline of restructuring code in small, behaviour-preserving steps, popularised by Martin Fowler's 1999 book Refactoring. Common refactorings: Rename, Extract Method, Extract Class, Inline Method, Move Method, Replace Conditional with Polymorphism, Remove Dead Code. Refactoring relies on a thorough test suite to catch accidental changes in behaviour; tests give confidence, not a guarantee.
Code Smell — Kent Beck's term, popularised by Martin Fowler's Refactoring, for a surface symptom that suggests a deeper design problem: long methods, duplicated code, large classes, long parameter lists, feature envy, switch statements, divergent change, shotgun surgery. Smells aren't bugs; they're indicators that refactoring would help.
Technical Debt — Ward Cunningham's metaphor for the future cost of suboptimal design decisions taken to meet a short-term need — shortcuts, hard-coded values, missing tests, outdated dependencies. Like financial debt, technical debt accrues interest in the form of increased maintenance cost and is repaid through refactoring.
God Class / God Object — An anti-pattern where one class accumulates too many responsibilities — typically a class with hundreds of methods and thousands of lines. The classical example is a "Util" class that grew over years. Refactoring usually involves extracting cohesive, focused classes (Extract Class).
Joel Spolsky's "Things You Should Never Do" — Spolsky's classic essay (2000) arguing that rewriting working software from scratch is almost always a disastrous strategic mistake. The Netscape 6 rewrite took 3 years and lost the browser market; the same lesson applies to many ambitious rewrites since.
Software Configuration Management (SCM) — The discipline of identifying, organising, and controlling modifications to the software being built. Without SCM, multi-developer projects collapse into chaos. SCM has five canonical functions: identification, version control, change control, configuration auditing, status reporting.
Configuration Item (SCI) — Any artefact under SCM control — source code files, build scripts, documentation (SRS, design, manuals), test cases, test data, config files, third-party library versions, database schema, migrations, even issue-tracker entries. The principle: anything that can change, and whose change matters, should be a CI.
Baseline — A formally reviewed and approved version of a configuration item that becomes the reference for subsequent work. Common baselines: requirements baseline (after SRS sign-off), design baseline (after SDD sign-off), product baseline (after first release).
Version Control System (VCS) — A tool that tracks every change to every file in a project, recording who, what, when, and why. Git (Linus Torvalds, 2005) is overwhelmingly dominant; Mercurial, SVN (Subversion), Perforce, and ClearCase still survive in specific niches.
Git — Linus Torvalds's 2005 distributed version control system, now the de facto standard. Every developer's working copy is a complete repository, not just a checkout. Core concepts: commit (a snapshot with a message), branch (independent line of development), merge (combine branches), remote (a copy on another machine). The pull request (proposed change for review) is a feature of hosting platforms such as GitHub and GitLab, built on top of Git rather than part of Git itself.
Branch — An independent line of development in version control. Modern teams use feature branches (one per feature in development), a main/master branch (always working state), and sometimes release branches (stable cut for shipping).
Merge / Pull Request / Merge Request — Combining one branch's changes into another. In GitHub, this is a Pull Request (PR); in GitLab, a Merge Request (MR). PRs are the unit of code review in modern development.
Git Flow — A specific branching strategy (Vincent Driessen, 2010) with develop, feature, release, and hotfix branches in addition to main. Heavy but disciplined; common in older enterprise teams.
GitHub Flow / Trunk-Based Development — Lightweight alternatives to Git Flow. GitHub Flow has just main + short-lived feature branches. Trunk-Based Development goes further — every developer integrates to main multiple times per day, behind feature flags if needed. Both are preferred for high-velocity CI/CD environments.
Change Request (CR) — A formal request to modify a baselined configuration item. Each CR is logged, reviewed by the Change Control Board, and either approved, deferred, or rejected. The CR is the paper trail that prevents uncontrolled change.
Change Control Board (CCB) — The committee that reviews and decides on change requests affecting baselined items. CCB membership typically includes the project manager, technical lead, customer representative, and QA lead. The CCB exists to prevent scope creep and to maintain traceability.
Configuration Audit — A periodic independent verification that the actual state of the project matches the documented configuration — every SCI accounted for, every change traceable to an approved CR, every baseline consistent with reality. CMMI's Configuration Management process area calls for configuration audits, and ISO 9001-based quality systems commonly include them.
Status Accounting / Status Reporting — The SCM activity of communicating the current configuration state to stakeholders — what is in the baseline, what change requests are pending, what changes were approved recently. Reports are produced at agreed intervals (weekly, monthly) and at milestones.
IEEE 828 — The IEEE standard for Software Configuration Management Plans. Defines the recommended SCMP structure — introduction, SCM management, SCM activities, SCM schedule, SCM resources, plan maintenance.
Build Script / Build System — The script that compiles source code, runs tests, packages deliverables, and produces artefacts — Maven (Java), Gradle (Java/Android/Kotlin), npm/yarn (Node), pip/poetry (Python), Make (C/C++), Bazel (multi-language at scale). Build scripts are themselves SCIs.
Continuous Integration / Continuous Deployment (CI/CD) — The modern automation of build, test, and deploy. CI automatically builds and tests every commit. CD automatically deploys passing builds to staging (or production). CI/CD pipelines are themselves SCIs, defined in YAML files under version control.
Infrastructure-as-Code (IaC) — The modern DevOps practice of defining infrastructure (servers, networks, databases) in version-controlled text files — Terraform, AWS CloudFormation, Pulumi, Ansible, Kubernetes manifests. IaC brings SCM's discipline (review, baseline, audit, rollback) to infrastructure that was historically managed manually.
Tag / Release — A named pointer to a specific commit in Git, treated as immutable by convention and typically used to mark a released version (v1.0, v2.3.1). Tags are part of the SCM audit trail: "what was in production on date X" can be answered by checking out the tag.
Semantic Versioning (SemVer) — A convention for version numbers — MAJOR.MINOR.PATCH — where MAJOR bumps for breaking changes, MINOR for backward-compatible features, PATCH for bug fixes. The de-facto standard for libraries published to npm, PyPI, Maven Central.
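The bump rules can be sketched in a few lines. This is a simplified illustration that handles only plain MAJOR.MINOR.PATCH strings (with an optional leading "v") and ignores pre-release and build metadata; the `Version` class and the change labels "breaking", "feature" and "fix" are hypothetical names.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text):
        """Parse 'v2.3.1' or '2.3.1' into a Version."""
        major, minor, patch = (int(part) for part in text.lstrip("v").split("."))
        return cls(major, minor, patch)

    def bump(self, change):
        """Return the next version for the given kind of change."""
        if change == "breaking":  # MAJOR: reset minor and patch
            return Version(self.major + 1, 0, 0)
        if change == "feature":   # MINOR: reset patch
            return Version(self.major, self.minor + 1, 0)
        if change == "fix":       # PATCH
            return Version(self.major, self.minor, self.patch + 1)
        raise ValueError(f"unknown change type: {change}")

    def __str__(self):
        return f"{self.major}.{self.minor}.{self.patch}"


current = Version.parse("v2.3.1")
print(current.bump("fix"), current.bump("feature"), current.bump("breaking"))
```

Because `Version` is a tuple of integers, ordinary comparison orders versions numerically, so 1.10.0 sorts after 1.9.3, which a plain string comparison would get wrong.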
---
Study deep
- Configuration management is invisible until it fails. When CM works, no one notices. When it fails (lost code, broken build, can't reproduce a release) the project grinds to a halt. Invest in CM early.
- Git is the universal tool. Despite its complexity, Git has won industry-wide. Modern developers must be fluent in branching, merging, rebasing, conflict resolution. Indian outsourcing companies often use Git + Bitbucket + JIRA as the standard stack.
- DevOps blurs the boundary. Modern DevOps treats everything as code — infrastructure (Terraform), pipelines (GitHub Actions), configuration (Ansible). All under SCM. The set of SCIs has grown dramatically.
- Reverse engineering for security is huge. Malware analysis, vulnerability research, and forensics are all forms of reverse engineering. Tools like Ghidra (NSA, open-sourced 2019) and IDA Pro are industry standards.
- Re-engineering usually beats rewriting. Rewriting from scratch is tempting but has a poor track record: Joel Spolsky's essay "Things You Should Never Do" cites Netscape 6, which took 3 years to rebuild and lost the market. Re-engineering preserves the working business logic that took years to perfect.
PYQ pattern: "Differentiate reverse engineering and re-engineering." — Define both; table the comparison (output, changes code?, effort, risk, goal); end with an example (NATS air-traffic system).
PYQ pattern: "What is Software Configuration Management? Explain its activities." — Define SCM, name configuration items, list the 5 functions (identification, version control, change control, audit, reporting); mention Git as the modern tool.