Agentic Large Language Models for Conceptual Systems Engineering and Design

By ds66 · 2024-11-13

Table of Contents

  1. Introduction

  2. The Challenge of Early-Stage Engineering Design

  3. Language Models in Engineering: An Evolving Frontier

  4. Two-Agent vs. Multi-Agent Architectures

  5. Case Study: Solar-Powered Water Filtration System

  6. The Design-State Graph (DSG): Structure and Purpose

  7. Experimental Setup and Methodology

  8. Comparing LLaMA 3.3 70B and DeepSeek R1 70B

  9. JSON Validity and Embodiment Tagging

  10. Requirements Coverage: A Bottleneck

  11. Code Compatibility and Execution

  12. Workflow Completion Rates

  13. Runtime and Graph Size Comparison

  14. Why Multi-Agent Systems (MAS) Enhance Design Granularity

  15. The Role of Reasoning-Distilled LLMs

  16. Limitations of Current Systems

  17. Recommendations for Future Development

  18. Implications for Engineering Education and Practice

  19. The Future of Agentic Design Tools

  20. Conclusion

1. Introduction

Artificial intelligence is changing how we build things—from microchips to entire infrastructure systems. Large Language Models (LLMs), known for their success in natural language processing, are being tested in increasingly technical domains, including systems engineering and early-stage design. The question is not just whether LLMs can answer questions, but whether they can reason iteratively across multiple engineering tasks.


This article explores how agentic LLM systems, specifically two-agent systems (2AS) and structured multi-agent systems (MAS), perform in conceptual engineering design, using the design of a solar-powered water filtration system as the benchmark.

2. The Challenge of Early-Stage Engineering Design

Early-stage design is messy. It involves ambiguous requirements, iterative decomposition of problems, selection of physical components, and constant reevaluation. Human engineers perform well because they understand context, draw from diverse domains, and adapt in real time. For LLMs to participate meaningfully, they must:

  • Maintain task continuity across steps

  • Generate structured outputs (e.g., JSON, executable code)

  • Integrate physical and functional reasoning

But current LLM-based workflows often collapse under the weight of such complexity.

3. Language Models in Engineering: An Evolving Frontier

LLMs like LLaMA 3.3 70B and DeepSeek R1 70B (reasoning-distilled) are now capable of:

  • Reading design briefs and extracting requirements

  • Decomposing complex functions

  • Writing physics-based simulation code in Python

  • Collaborating via agent-like protocols

However, few studies have evaluated how to structure these models into effective workflows for technical design.

4. Two-Agent vs. Multi-Agent Architectures

The two systems tested in this research are:

a. Two-Agent System (2AS)

  • One generator

  • One reflector

  • Operates in a loop

  • Simpler, with less computational overhead

b. Multi-Agent System (MAS)

  • Nine specialized roles, each assigned specific design tasks

  • Communicate and iterate via a shared design space

  • Structured orchestration

  • Increased complexity, but potential for better detail and continuity
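The 2AS loop above can be sketched in a few lines. This is a minimal illustration, not the study's actual implementation: `call_llm` is a hypothetical stand-in for whatever model API is used, and the "COMPLETE" stop signal is an assumed convention. The point is the shape of the loop: generate, reflect, revise, stop.

```python
# Minimal sketch of a two-agent (generator + reflector) loop.
# `call_llm` is a hypothetical model-API stand-in; the loop shape is the point.

def two_agent_loop(brief, call_llm, max_rounds=5):
    # Generator proposes an initial design from the brief.
    design = call_llm(f"Generate a design for: {brief}")
    for _ in range(max_rounds):
        # Reflector critiques the current design.
        critique = call_llm(f"Critique this design: {design}")
        if "COMPLETE" in critique:  # assumed stop signal from the reflector
            break
        # Generator revises in light of the critique.
        design = call_llm(
            f"Revise the design.\nDesign: {design}\nCritique: {critique}"
        )
    return design
```

A MAS replaces this single generate/reflect pair with nine such roles coordinating through a shared design space, which is where the extra orchestration cost comes from.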

5. Case Study: Solar-Powered Water Filtration System

The design task provided to both systems involved creating a solar-powered water filtration system based on a "cahier des charges" (design brief). Goals included:

  • Clean water output for rural communities

  • Solar-powered energy source

  • Modular components

  • Simulatable system in Python

The systems were judged on how well they handled each step—from interpreting goals to writing working simulator code.

6. The Design-State Graph (DSG): Structure and Purpose

A central innovation was the Design-State Graph (DSG), a JSON-serializable graph used to capture:

  • Extracted requirements

  • Functional blocks

  • Physical embodiments

  • Linked physics models

Each node in the graph corresponds to a design component. The DSG provides a persistent structure that both agents and humans can interpret, evaluate, and iterate on.
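A DSG node might look like the following. The field names here are illustrative assumptions, not the paper's exact schema; what matters is that every node ties a requirement, a function, a physical embodiment, and a physics model together in a structure that round-trips cleanly through JSON.

```python
import json

# Illustrative DSG fragment; field names are assumptions, not the paper's schema.
dsg = {
    "nodes": [
        {
            "id": "F1",
            "requirement": "Deliver >= 100 L/day of potable water",
            "function": "Filter particulates and pathogens",
            "embodiment": "ceramic membrane cartridge",
            "physics_model": "darcy_flow.py",
        }
    ],
    "edges": [],
}

# The defining property: the graph must be JSON-serializable so both
# agents and humans can read, evaluate, and iterate on it.
serialized = json.dumps(dsg)
assert json.loads(serialized) == dsg
```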

7. Experimental Setup and Methodology

Each model (LLaMA 3.3 70B and DeepSeek R1 70B) was run in both 2AS and MAS modes:

  • 60 experiments total

    • 2 models × 2 agent setups × 3 temperatures × 5 seeds

  • Metrics included:

    • JSON validity

    • Requirement coverage

    • Embodiment tagging

    • Code execution success

    • Workflow completion flags

    • Runtime

    • DSG size
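The 60-run grid is just the Cartesian product of the four factors. A quick sketch (the temperature values below are assumed for illustration; the article does not list them):

```python
from itertools import product

models = ["llama-3.3-70b", "deepseek-r1-70b"]
setups = ["2AS", "MAS"]
temperatures = [0.2, 0.6, 1.0]  # illustrative values, not from the study
seeds = range(5)

# Every (model, setup, temperature, seed) combination is one experiment.
runs = list(product(models, setups, temperatures, seeds))
assert len(runs) == 60  # 2 x 2 x 3 x 5
```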

8. Comparing LLaMA 3.3 70B and DeepSeek R1 70B

LLaMA 3.3 70B

  • High fluency

  • Poor at task persistence

  • Inconsistent workflow completion

DeepSeek R1 70B (Reasoning-Distilled)

  • Superior reasoning chains

  • Better task completion detection

  • More stable across seeds

  • Less verbose, more structured outputs

DeepSeek clearly benefited from its reinforcement learning training regime, which emphasizes coherent and goal-oriented thinking.

9. JSON Validity and Embodiment Tagging

All configurations—regardless of agent setup or model—maintained 100% JSON validity and successfully tagged physical embodiments. This is a strong baseline, indicating that LLMs can reliably produce structured machine-readable outputs.
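The JSON-validity metric is the least demanding of the gates: does the model's output parse at all? A minimal validity check, as one might implement it, is simply:

```python
import json

def is_valid_json(text):
    """Return True if `text` parses as JSON -- the simplest validity gate."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

assert is_valid_json('{"nodes": []}')       # well-formed output passes
assert not is_valid_json('{"nodes": [}')    # a single stray brace fails
```

Passing this gate 100% of the time says nothing yet about whether the *content* of the graph is any good, which is exactly the gap the next metric exposes.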

10. Requirements Coverage: A Bottleneck

Despite correct JSON and embodiment tags, requirement coverage remained below 20% across all runs. This means:

  • Most extracted designs missed key constraints from the design brief

  • Many outputs focused on mechanical or electrical components but skipped user needs or safety parameters

This highlights the gap between linguistic fluency and design fidelity.
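Requirement coverage can be measured as the fraction of brief requirements that reappear in the DSG. The sketch below uses exact string matching for simplicity (a real evaluator would likely need semantic matching); the example requirements are illustrative, drawn loosely from the brief in Section 5:

```python
def requirement_coverage(brief_requirements, dsg_requirements):
    """Fraction of brief requirements that appear in the DSG (exact-match sketch)."""
    covered = set(brief_requirements) & set(dsg_requirements)
    return len(covered) / len(brief_requirements)

brief = ["clean water output", "solar power", "modular components",
         "simulatable in Python", "safe drinking-water limits"]
extracted = ["solar power"]  # a typical run: hardware captured, user needs dropped

assert requirement_coverage(brief, extracted) == 0.2  # 1 of 5
```

A score like this, at or under the 20% ceiling reported, is what the runs in the study kept producing: structurally valid graphs built on a small slice of the brief.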

11. Code Compatibility and Execution

Python-based simulation code generation was a mixed success:

  • In 2AS setups, code compatibility peaked at 100% in a few runs

  • MAS runs averaged <50%, likely due to more complex interdependencies

  • Syntax was usually correct, but logic often failed to reflect real physics

Code generation remains a major challenge—valid ≠ functional.
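The valid-vs-functional distinction is easy to make concrete: syntactic validity means the generated code compiles, while functionality means it also produces physically sensible results. A sketch, using a hypothetical simulator snippet with made-up constants:

```python
# "Valid" code compiles; "functional" code also reflects real physics.
# The snippet and its constants are illustrative, not from the study.

snippet = """
def daily_output_liters(panel_watts, liters_per_watt_hour, sun_hours):
    return panel_watts * liters_per_watt_hour * sun_hours
"""

# Gate 1 -- syntactic validity: this passes or raises SyntaxError.
compile(snippet, "<sim>", "exec")

# Gate 2 -- functional check: run it and test the physics.
ns = {}
exec(snippet, ns)
output = ns["daily_output_liters"](100, 0.5, 5)
assert output == 250  # 100 W * 0.5 L/Wh * 5 h
```

Most generated code in the study cleared the first gate; it was the second, physics-level gate where logic often failed.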

12. Workflow Completion Rates

Only DeepSeek R1 70B consistently flagged when the task was complete, showing awareness of design progress. In contrast, LLaMA 3.3 often looped or skipped critical steps without feedback.

This kind of meta-cognition (knowing you’ve completed a task) is essential for scaling LLMs into autonomous agents.

13. Runtime and Graph Size Comparison

MAS workflows took longer but yielded:

  • More detailed graphs (5–6 nodes)

  • Greater modularity

  • Clearer separation of requirements and embodiments

2AS systems completed faster, but the DSGs were shallow, often collapsing the entire system into one or two vague blocks.

14. Why Multi-Agent Systems (MAS) Enhance Design Granularity

MAS offers benefits including:

  • Role specialization (e.g., one agent extracts requirements, another writes code)

  • Built-in error correction (agents can reflect or challenge earlier steps)

  • Persistent memory via DSG

This results in deeper system understanding, although at the cost of increased coordination.

15. The Role of Reasoning-Distilled LLMs

DeepSeek R1 70B proved especially valuable due to:

  • Chain-of-thought generation

  • Consistent completion detection

  • High logical accuracy in decomposition tasks

Its training method—reinforcement learning without supervised fine-tuning—may be a game-changer in agentic design.

16. Limitations of Current Systems

Despite promising results, key limitations persist:

  • Low requirement recall (<20%)

  • Incomplete simulator functionality

  • Lack of visual design feedback

  • No integration with CAD or real-world simulators

These systems are still prototypes, not replacements for engineering teams.

17. Recommendations for Future Development

To improve performance:

  1. Add retrieval-augmented generation to fetch relevant design cases

  2. Incorporate feedback loops from real simulators (e.g., MATLAB, SolidWorks)

  3. Train with engineering-specific corpora, not just internet data

  4. Develop agent memory banks to recall design history across sessions

  5. Explore graph neural networks for DSG refinement

18. Implications for Engineering Education and Practice

Agentic LLMs offer potential benefits in:

  • Rapid ideation for student projects

  • Teaching functional decomposition

  • Creating documentation and test benches

  • Translating requirements into functional blocks

They could become teaching assistants or design collaborators in the near future.

19. The Future of Agentic Design Tools

We are entering a new era where LLMs:

  • Don’t just respond, but collaborate

  • Maintain task states and design intent

  • Work in teams of agents with different goals

  • Deliver structured designs ready for validation or prototyping

Such systems could democratize engineering design, enabling global participation and rapid innovation.

20. Conclusion

This comprehensive evaluation of agentic LLM systems in conceptual systems engineering design reveals both exciting capabilities and sobering limitations. Multi-agent orchestration improves design fidelity, while reasoning-distilled models like DeepSeek R1 70B are crucial for workflow reliability.

However, until requirement coverage and code functionality improve, these tools will remain augmentative, not autonomous. With targeted improvements, agentic LLMs could transform not just design workflows but how we define creativity and engineering itself.