Agentic Large Language Models for Conceptual Systems Engineering and Design
Table of Contents
Introduction
The Challenge of Early-Stage Engineering Design
Language Models in Engineering: An Evolving Frontier
Two-Agent vs. Multi-Agent Architectures
Case Study: Solar-Powered Water Filtration System
The Design-State Graph (DSG): Structure and Purpose
Experimental Setup and Methodology
Comparing LLaMA 3.3 70B and DeepSeek R1 70B
JSON Validity and Embodiment Tagging
Requirements Coverage: A Bottleneck
Code Compatibility and Execution
Workflow Completion Rates
Runtime and Graph Size Comparison
Why Multi-Agent Systems (MAS) Enhance Design Granularity
The Role of Reasoning-Distilled LLMs
Limitations of Current Systems
Recommendations for Future Development
Implications for Engineering Education and Practice
The Future of Agentic Design Tools
Conclusion
1. Introduction
Artificial intelligence is changing how we build things—from microchips to entire infrastructure systems. Large Language Models (LLMs), known for their success in natural language processing, are being tested in increasingly technical domains, including systems engineering and early-stage design. The question is not just whether LLMs can answer questions, but whether they can reason iteratively across multiple engineering tasks.
This article explores how agentic LLM systems, specifically two-agent systems (2AS) and structured multi-agent systems (MAS), perform in conceptual engineering design, using the design of a solar-powered water filtration system as the benchmark.
2. The Challenge of Early-Stage Engineering Design
Early-stage design is messy. It involves ambiguous requirements, iterative decomposition of problems, selection of physical components, and constant reevaluation. Human engineers perform well because they understand context, draw from diverse domains, and adapt in real time. For LLMs to participate meaningfully, they must:
Maintain task continuity across steps
Generate structured outputs (e.g., JSON, executable code)
Integrate physical and functional reasoning
But current LLM-based workflows often collapse under the weight of such complexity.
3. Language Models in Engineering: An Evolving Frontier
LLMs like LLaMA 3.3 70B and DeepSeek R1 70B (reasoning-distilled) are now capable of:
Reading design briefs and extracting requirements
Decomposing complex functions
Writing physics-based simulation code in Python
Collaborating via agent-like protocols
However, few studies have evaluated how to structure these models into effective workflows for technical design.
4. Two-Agent vs. Multi-Agent Architectures
The two systems tested in this research are:
a. Two-Agent System (2AS)
A generator agent that drafts the design
A reflector agent that critiques it
Operates in a generate-and-reflect loop (sketched below)
Simpler, with lower computational overhead
b. Multi-Agent System (MAS)
Nine roles, each with a specific design task
Communicate and iterate via a shared design space
Structured orchestration
Increased complexity, but potential for better detail and continuity
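In pseudocode terms, the 2AS loop is compact. The sketch below is illustrative only: the `call_llm` helper and the DESIGN_COMPLETE flag are assumed conventions, not the study's actual prompts or protocol.

```python
# Minimal sketch of the 2AS generate/reflect loop. `call_llm` is a
# hypothetical wrapper around any chat-completion backend, and the
# DESIGN_COMPLETE flag is an assumed convention, not the study's protocol.

def call_llm(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM backend")

def two_agent_loop(design_brief: str, max_rounds: int = 5) -> str:
    draft = call_llm("You are a design generator.", design_brief)
    for _ in range(max_rounds):
        critique = call_llm(
            "You are a design reflector. Flag missing requirements, "
            "vague embodiments, or broken simulation code. Reply "
            "DESIGN_COMPLETE if the draft needs no further work.",
            draft,
        )
        if "DESIGN_COMPLETE" in critique:
            break
        draft = call_llm(
            "You are a design generator. Revise the draft to address the critique.",
            f"Draft:\n{draft}\n\nCritique:\n{critique}",
        )
    return draft
```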
5. Case Study: Solar-Powered Water Filtration System
The design task provided to both systems involved creating a solar-powered water filtration system based on a "cahier des charges" (design brief). Goals included:
Clean water output for rural communities
Solar-powered energy source
Modular components
Simulatable system in Python
The systems were judged on how well they handled each step—from interpreting goals to writing working simulator code.
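To make the simulation goal concrete, here is a toy energy-balance model of the kind of simulator the agents were asked to write. Every parameter value is an illustrative assumption, not a figure from the study.

```python
# Toy energy-balance model of a solar-powered water filtration system.
# All parameter values are illustrative assumptions.

PANEL_AREA_M2 = 2.0             # photovoltaic panel area
PANEL_EFFICIENCY = 0.18         # fraction of irradiance converted to electricity
FILTER_ENERGY_J_PER_L = 5000.0  # energy to pump and filter one litre

def daily_output_litres(irradiance_w_m2: float, sun_hours: float) -> float:
    """Litres of filtered water per day from a first-order energy balance."""
    power_w = irradiance_w_m2 * PANEL_AREA_M2 * PANEL_EFFICIENCY
    energy_j = power_w * sun_hours * 3600.0
    return energy_j / FILTER_ENERGY_J_PER_L

if __name__ == "__main__":
    # e.g. 800 W/m^2 average irradiance over 5 peak sun hours
    print(f"{daily_output_litres(800.0, 5.0):.0f} L/day")  # ~1037 L/day
```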
6. The Design-State Graph (DSG): Structure and Purpose
A central innovation was the Design-State Graph (DSG), a JSON-serializable graph used to capture:
Extracted requirements
Functional blocks
Physical embodiments
Linked physics models
Each node in the graph corresponds to a design component. The DSG provides a persistent structure that both agents and humans can interpret, evaluate, and iterate on.
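The article does not reproduce the exact schema, but a single DSG node, serialized from Python, might look something like this (field names are assumptions):

```python
# One DSG node as a JSON-serializable Python dict. Field names are
# assumptions; the study's exact schema is not given in this article.
import json

node = {
    "id": "F2",
    "type": "function",                # requirement | function | embodiment
    "label": "Pump water through the filter stage",
    "satisfies": ["R1", "R3"],         # links back to extracted requirements
    "embodiment": {
        "component": "12 V DC diaphragm pump",
        "physics_model": "Q = P * eta / (rho * g * h)",  # flow from pump power
    },
}

print(json.dumps(node, indent=2))
```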
7. Experimental Setup and Methodology
Each model (LLaMA 3.3 70B and DeepSeek R1 70B) was run in both 2AS and MAS modes:
60 experiments total
2 models × 2 agent setups × 3 temperatures × 5 seeds
Metrics included:
JSON validity
Requirement coverage
Embodiment tagging
Code execution success
Workflow completion flags
Runtime
DSG size
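The factorial grid above is easy to enumerate programmatically; the temperature values in this sketch are placeholders, since the article reports only that three were used.

```python
# Enumerating the 2 x 2 x 3 x 5 = 60 run grid. Temperature values are
# placeholders; the article gives only their count.
from itertools import product

models = ["llama-3.3-70b", "deepseek-r1-70b"]
setups = ["2AS", "MAS"]
temperatures = [0.2, 0.5, 0.8]  # assumed values
seeds = range(5)

runs = list(product(models, setups, temperatures, seeds))
assert len(runs) == 60
```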
8. Comparing LLaMA 3.3 70B and DeepSeek R1 70B
LLaMA 3.3 70B
High fluency
Poor at task persistence
Inconsistent workflow completion
DeepSeek R1 70B (Reasoning-Distilled)
Superior reasoning chains
Better task completion detection
More stable across seeds
Less verbose, more structured outputs
DeepSeek clearly benefited from its reinforcement learning training regime, which emphasizes coherent and goal-oriented thinking.
9. JSON Validity and Embodiment Tagging
All configurations—regardless of agent setup or model—maintained 100% JSON validity and successfully tagged physical embodiments. This is a strong baseline, indicating that LLMs can reliably produce structured machine-readable outputs.
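JSON validity is the easiest of these metrics to check mechanically, for example:

```python
import json

def is_valid_json(raw_output: str) -> bool:
    """True if the model's raw output parses as JSON."""
    try:
        json.loads(raw_output)
        return True
    except json.JSONDecodeError:
        return False
```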
10. Requirements Coverage: A Bottleneck
Despite correct JSON and embodiment tags, requirement coverage remained below 20% across all runs. This means:
Most extracted designs missed key constraints from the design brief
Many outputs focused on mechanical or electrical components but skipped user needs or safety parameters
This highlights the gap between linguistic fluency and design fidelity.
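The article does not describe how coverage was scored; a simple recall-style metric over the DSG sketched in section 6 might look like this:

```python
# Recall-style coverage: the fraction of brief requirements referenced
# anywhere in the DSG. The "satisfies" field follows the node sketch in
# section 6; the study's actual scoring method is not described here.

def requirement_coverage(brief_reqs: set[str], dsg_nodes: list[dict]) -> float:
    covered: set[str] = set()
    for node in dsg_nodes:
        covered.update(node.get("satisfies", []))
    return len(covered & brief_reqs) / len(brief_reqs)

# A run covering only R1 out of {R1, ..., R5} scores 0.2, i.e. 20%.
```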
11. Code Compatibility and Execution
Python-based simulation code generation was a mixed success:
In 2AS setups, code compatibility peaked at 100% in a few runs
MAS runs averaged <50%, likely due to more complex interdependencies
Syntax was usually correct, but logic often failed to reflect real physics
Code generation remains a major challenge—valid ≠ functional.
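One plausible way to score execution success (an assumption, not the study's harness) is to run each generated simulator in a subprocess with a timeout:

```python
# Illustrative harness for scoring execution success: run each
# generated simulator in a subprocess and check its exit code.
import os
import subprocess
import sys
import tempfile

def runs_cleanly(code: str, timeout_s: float = 30.0) -> bool:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout_s
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)
```

A zero exit code only certifies that the script ran; it says nothing about whether the physics is right, which is exactly the valid-versus-functional gap noted above.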
12. Workflow Completion Rates
Only DeepSeek R1 70B consistently flagged when the task was complete, showing awareness of design progress. In contrast, LLaMA 3.3 often looped or skipped critical steps without feedback.
This kind of meta-cognition (knowing you’ve completed a task) is essential for scaling LLMs into autonomous agents.
13. Runtime and Graph Size Comparison
MAS workflows took longer but yielded:
More detailed graphs (5–6 nodes)
Greater modularity
Clearer separation of requirements and embodiments
2AS systems completed faster, but the DSGs were shallow, often collapsing the entire system into one or two vague blocks.
14. Why Multi-Agent Systems (MAS) Enhance Design Granularity
MAS offers benefits including:
Role specialization (e.g., one agent extracts requirements, another writes code)
Built-in error correction (agents can reflect or challenge earlier steps)
Persistent memory via DSG
This results in deeper system understanding, although at the cost of increased coordination.
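Orchestration itself can be simple. The sketch below dispatches hypothetical roles over a shared DSG; the role names and round-based schedule are assumptions, since the article does not enumerate all nine roles or their ordering.

```python
# Minimal MAS orchestration over a shared DSG. The role list is an
# assumption; the article says there are nine roles but does not name them.
from typing import Callable

Agent = Callable[[str, dict], dict]  # (design brief, DSG) -> updated DSG

ROLES = [
    "requirements_extractor",
    "function_decomposer",
    "embodiment_selector",
    "physics_modeler",
    "code_writer",
    "reflector",
]

def run_mas(brief: str, dsg: dict, agents: dict[str, Agent], rounds: int = 3) -> dict:
    """Each round, every role reads the shared DSG and writes back its update."""
    for _ in range(rounds):
        for role in ROLES:
            dsg = agents[role](brief, dsg)
    return dsg
```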
15. The Role of Reasoning-Distilled LLMs
DeepSeek R1 70B proved especially valuable due to:
Chain-of-thought generation
Consistent completion detection
High logical accuracy in decomposition tasks
Its training method—reinforcement learning without supervised fine-tuning—may be a game-changer in agentic design.
16. Limitations of Current Systems
Despite promising results, key limitations persist:
Low requirement recall (<20%)
Incomplete simulator functionality
Lack of visual design feedback
No integration with CAD or real-world simulators
These systems are still prototypes, not replacements for engineering teams.
17. Recommendations for Future Development
To improve performance:
Add retrieval-augmented generation to fetch relevant design cases
Incorporate feedback loops from real simulators (e.g., MATLAB, SolidWorks)
Train with engineering-specific corpora, not just internet data
Develop agent memory banks to recall design history across sessions
Explore graph neural networks for DSG refinement
18. Implications for Engineering Education and Practice
Agentic LLMs offer potential benefits in:
Rapid ideation for student projects
Teaching functional decomposition
Creating documentation and test benches
Translating requirements into functional blocks
They could become teaching assistants or design collaborators in the near future.
19. The Future of Agentic Design Tools
We are entering a new era where LLMs:
Don’t just respond, but collaborate
Maintain task states and design intent
Work in teams of agents with different goals
Deliver structured designs ready for validation or prototyping
Such systems could democratize engineering design, enabling global participation and rapid innovation.
20. Conclusion
This comprehensive evaluation of agentic LLM systems in conceptual systems engineering design reveals both exciting capabilities and sobering limitations. Multi-agent orchestration improves design fidelity, while reasoning-distilled models like DeepSeek R1 70B are crucial for workflow reliability.
However, until requirement coverage and code functionality improve, these tools will remain augmentative, not autonomous. With targeted improvements, agentic LLMs could transform not just design workflows but how we define creativity and engineering itself.