ArchGPT: harnessing large language models for supporting renovation and conservation of traditional architectural heritage

Experiments settings
In our experiments, the LLM controllers we used include Alpaca-7b [34], Vicuna-7b [35], and GPT-3.5. To make the outputs of the LLM more stable, we followed the practice of HuggingGPT and set the decoding temperature to 0. For the VLM with Image Captioning capability, we loaded the BLIP-base model [32] from Hugging Face to generate captions for images.
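The captioning tool can be loaded in a few lines. The following is a minimal sketch, assuming the publicly released Salesforce/blip-image-captioning-base checkpoint on Hugging Face stands in for the BLIP-base model and that deterministic decoding is used, mirroring the temperature-0 setting of the LLM controller:

```python
# Sketch of the Image Captioning tool (assumed checkpoint and settings).
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_image(image_path: str) -> str:
    """Return a natural-language caption describing the building photo."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    # Default (non-sampling) decoding keeps the caption deterministic.
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```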
For the CLS model, we fine-tuned the ViT-B/16 model from CLIP on a custom-made architectural classification dataset of 1000 images, achieving a classification accuracy of 90.6% on the test set. In Fig. 5, we also provide a confusion matrix for the 5-category classification, showing strong performance across most categories. However, the model's precision on the Improved Buildings category is slightly lower (0.852), possibly because the features of improved buildings overlap substantially with those of other categories (such as renovated or preserved buildings), leading to some misclassifications. Overall, the CLS task-specific model assists ArchGPT well in identifying the input architectural type, supplementing the LLMs with knowledge relevant to architectural renovation.
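A minimal sketch of such a classifier is shown below, assuming a linear classification head on top of the CLIP ViT-B/16 image encoder; the category names and the linear-probe setup are illustrative, and the exact fine-tuning recipe used for the CLS model may differ:

```python
# Sketch of the CLS task-specific model: a 5-way classifier on the CLIP
# ViT-B/16 image encoder. Category names and head design are illustrative.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, CLIPImageProcessor

CATEGORIES = ["Preserved", "Renovated", "Improved", "Retained", "Transformed"]

class ArchClassifier(nn.Module):
    def __init__(self, num_classes: int = len(CATEGORIES)):
        super().__init__()
        self.backbone = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16")
        hidden = self.backbone.config.hidden_size  # 768 for ViT-B/16
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(pixel_values=pixel_values).pooler_output
        return self.head(feats)

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch16")
model = ArchClassifier()  # in practice, load the fine-tuned checkpoint here
model.eval()

@torch.no_grad()
def classify(image) -> str:
    inputs = processor(images=image, return_tensors="pt")
    logits = model(inputs["pixel_values"])
    return CATEGORIES[int(logits.argmax(dim=-1))]
```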
Based on architectural renovation guidelines, we utilized GPT-4 to generate 2000 pairs of queries and documents I (\(I_t\) and \(I_c\) from the guideline introduced earlier) to create a retrieval dataset. Specifically, we provided GPT-4 with prompts such as “I'm giving you {document:I}, please generate a series of questions users might ask, following the format query:xx” to produce the 2000 data pairs. I serves as the ground truth for its queries and is used to evaluate the retrieval accuracy of our proposed retrieval algorithm.
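The pair-generation step can be sketched as follows; the model name, the parsing of the query:xx lines, and the generate_queries helper are assumptions for illustration, not the exact script used:

```python
# Sketch of generating (query, document) pairs with GPT-4, following the
# prompt template quoted above. Chunking of the guideline into documents I
# is not shown; model name and parsing logic are assumptions.
from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = (
    "I'm giving you {{document:{doc}}}, please generate a series of questions "
    "users might ask, following the format query:xx"
)

def generate_queries(document: str, n: int = 10) -> list[tuple[str, str]]:
    """Return (query, document) pairs; the document is the retrieval ground truth."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(doc=document)}],
    )
    text = response.choices[0].message.content
    queries = [line.split("query:", 1)[1].strip()
               for line in text.splitlines() if "query:" in line]
    return [(q, document) for q in queries[:n]]
```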
As illustrated in Table 1, we compared three information retrieval methods: BM25, BERT, and BM25+BERT. The results show that BM25+BERT outperforms both BM25 and BERT on all key metrics, indicating that the combined method, which integrates the lexical matching prior of BM25 with the semantic understanding of BERT, achieves the highest retrieval accuracy.
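A common way to realize such a hybrid retriever is to fuse normalized BM25 scores with dense cosine similarities. The sketch below uses rank_bm25 and sentence-transformers with an illustrative fusion weight alpha, which is not necessarily the combination used in ArchGPT:

```python
# Sketch of a BM25+BERT hybrid retriever. The fusion weight `alpha` and the
# sentence-transformers checkpoint are illustrative choices.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

class HybridRetriever:
    def __init__(self, documents: list[str], alpha: float = 0.5):
        self.documents = documents
        self.alpha = alpha
        self.bm25 = BM25Okapi([doc.split() for doc in documents])
        self.encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
        self.doc_emb = self.encoder.encode(documents, convert_to_tensor=True)

    def search(self, query: str, top_k: int = 3) -> list[str]:
        # Lexical scores from BM25 (term-matching prior), normalized to [0, 1].
        bm25_scores = np.array(self.bm25.get_scores(query.split()))
        bm25_scores = bm25_scores / (bm25_scores.max() + 1e-8)
        # Semantic scores from the BERT-style sentence encoder.
        q_emb = self.encoder.encode(query, convert_to_tensor=True)
        sem_scores = util.cos_sim(q_emb, self.doc_emb).cpu().numpy().flatten()
        # Weighted fusion of the two signals.
        scores = self.alpha * bm25_scores + (1 - self.alpha) * sem_scores
        best = np.argsort(-scores)[:top_k]
        return [self.documents[i] for i in best]
```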

Fig. 5 Confusion matrix for the architectural classification task, where “Pres.” stands for Preserved Buildings, “Reno.” represents Renovated Buildings, “Impr.” denotes Improved Buildings, “Reta.” signifies Retained Buildings, and “Tran.” corresponds to Transformed Buildings. The matrix showcases model performance in terms of Precision and Recall for each category
Qualitative results
Fig. 2 shows three demos of the ArchGPT pipeline. In the first demo, the user's request involves understanding the content of an image and is unrelated to the architectural category. Therefore, ArchGPT plans only to use the VLM to generate an image caption that supplements information on the appearance of the building. With this architectural information, the LLM then makes predictions about the building's history and provides its reasons. Finally, ArchGPT receives positive feedback from the user, completing the Normal Dialogue workflow.
The process of the second demo is similar to the first, except that this time ArchGPT parses the task as Building Repair Guidance. It therefore needs to retrieve the corresponding repair guidelines from the architectural guideline document to augment the LLM's knowledge, eventually providing a repair suggestion that complies with the guidelines and fulfills the user's request. However, the response provided by the LLM is too generic, so after integrating the user's feedback, ArchGPT executes the entire Building Repair Guidance workflow again.

Fig. 6 The dialogue process of ArchGPT in practical usage scenarios
The third demo is more complex, requiring ArchGPT to call upon and coordinate the use of more external tools to complete the image editing task. Specifically, ArchGPT first parses the user’s request as a Repair Rendering Generation task, then uses the CLS and VLM to parse image information to supplement the image editing requirements, and finally inputs the supplemented editing requirements and the original image into the ControlNet model to obtain a reasonable edited image.
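A sketch of this final editing step is given below, assuming a Canny-edge ControlNet from the diffusers library as the conditioning mechanism and treating the assembled editing requirement (user request plus CLS category and VLM caption) as the text prompt; the actual checkpoints and conditioning used by ArchGPT may differ:

```python
# Sketch of the Repair Rendering Generation step: the original building photo
# conditions a ControlNet (here via Canny edges, an assumed choice) and the
# supplemented editing requirement serves as the text prompt.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

def render_repair(image_path: str, edit_prompt: str) -> Image.Image:
    """edit_prompt = user request + CLS category + VLM caption, as assembled by ArchGPT."""
    image = np.array(Image.open(image_path).convert("RGB"))
    edges = cv2.Canny(image, 100, 200)
    condition = Image.fromarray(np.stack([edges] * 3, axis=-1))  # structural condition
    return pipe(edit_prompt, image=condition, num_inference_steps=30).images[0]
```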
The above examples demonstrate that ArchGPT can indeed leverage the intelligence of LLMs to accurately parse user intentions and methodically call upon tools to solve real problems.
In Fig. 6, we present the dialogue flow of ArchGPT in real-use scenarios. The first flow depicts a case where ArchGPT fails to process a user's question correctly. ArchGPT mistakenly parses the user's intent as Repair Rendering Generation, subsequently calling the CLS and VLM models to supplement the text of the “prompt” field, and finally using ControlNet to obtain a photo of the repaired building. However, the user is not interested in updates to the house's exterior but rather in receiving suggestions for renovations inside the house. The second flow demonstrates a correct Repair Rendering Generation process that satisfies the user.
Quantitative evaluation
In ArchGPT, task parsing plays a crucial role throughout the workflow, as it determines which tasks will be executed in the subsequent pipeline. We therefore treat the quality of task parsing as a measure of the LLM's capability as a controller within ArchGPT, and we quantitatively evaluate both whether ArchGPT can complete task parsing and how strong this ability is.
Metric To quantify whether ArchGPT has the ability to complete task parsing, we track how many attempts it takes ArchGPT to parse the “task” field into the correct task type (Normal Dialogue, Building Repair Guidance, or Repair Rendering Generation) after receiving a user request, without considering other fields. Additionally, to quantify how strong this ability is, whenever ArchGPT correctly parses the “task” field within four attempts (note that each new attempt includes the feedback from the user), we use GPT-4 as a critic to evaluate whether the task parsing dictionary is reasonable (following HuggingGPT).
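The attempt-counting metric can be sketched as follows, where parse_request (the LLM controller call) and make_feedback are hypothetical helpers standing in for ArchGPT's actual parsing and feedback loop:

```python
# Minimal sketch of the attempt-counting metric; helper functions are
# hypothetical stand-ins for ArchGPT's parsing and feedback mechanisms.
TASK_TYPES = {"Normal Dialogue", "Building Repair Guidance", "Repair Rendering Generation"}
MAX_ATTEMPTS = 4

def attempts_until_correct(request: str, gold_task: str, parse_request, make_feedback):
    """Return the 1-based attempt at which the 'task' field is correct, or None."""
    history = [request]
    for attempt in range(1, MAX_ATTEMPTS + 1):
        parsed = parse_request("\n".join(history))  # LLM returns a task parsing dictionary
        if parsed.get("task") in TASK_TYPES and parsed["task"] == gold_task:
            return attempt
        history.append(make_feedback(parsed, gold_task))  # feedback included in the next attempt
    return None
```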
When using GPT-4 to judge whether the task parsing dictionaries generated by ArchGPT are reasonable, the prompt given to GPT-4 is: “We will next provide examples of high-quality and low-quality task parsing dictionaries that interpret user requests. There are five examples of each type, presented in the format ‘High-quality examples: High-quality examples, Low-quality examples: Low-quality examples’. Please learn from these to develop your evaluation skills. Afterward, I will give you some new examples, and you only need to answer ‘High-quality’ or ‘Low-quality’.”
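A sketch of the critic call is shown below; the five high-quality and five low-quality examples are placeholders, and only the prompt structure follows the description above:

```python
# Sketch of the GPT-4 critic call. Example dictionaries are placeholders.
import json
from openai import OpenAI

client = OpenAI()

CRITIC_PREFIX = (
    "We will next provide examples of high-quality and low-quality task parsing "
    "dictionaries that interpret user requests. There are five examples of each type, "
    "presented in the format 'High-quality examples: {high}, Low-quality examples: {low}'. "
    "Please learn from these to develop your evaluation skills. Afterward, I will give you "
    "some new examples, and you only need to answer 'High-quality' or 'Low-quality'."
)

def judge(parsing_dict: dict, high_examples: list[dict], low_examples: list[dict]) -> str:
    prompt = CRITIC_PREFIX.format(high=json.dumps(high_examples), low=json.dumps(low_examples))
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "user", "content": prompt},
            {"role": "user", "content": f"New example: {json.dumps(parsing_dict)}"},
        ],
    )
    return response.choices[0].message.content.strip()  # "High-quality" or "Low-quality"
```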
Dataset We created 100 requests for each of the three task types, totaling 300 requests, written by 6 architecture students. These requests, together with their user-annotated task type labels, form the evaluation dataset used to assess whether ArchGPT can complete task parsing. In evaluating how strong ArchGPT's task parsing ability is, GPT-4 judges whether the task parsing dictionaries generated by ArchGPT are of “High-quality” or “Low-quality”. Note that the dictionaries evaluated are those that the LLM has successfully and correctly parsed within no more than three attempts.
Performance Our experimental evaluation covered various LLMs, including Alpaca-7b, Vicuna-7b, and the GPT-3.5 model. In Tables 2 and 3, Alpaca and Vicuna refer to Alpaca-7b and Vicuna-7b, respectively.
In Table 2, GPT-3.5 demonstrates superior performance across all three task types, especially on the Normal Dialogue and Repair Rendering Generation tasks, where it significantly outperforms Alpaca-7b and Vicuna-7b in the number of correct parses. This result indicates that GPT-3.5 understands user requests and classifies them into the corresponding tasks with higher accuracy and efficiency, reflecting its strong planning capability in complex scenarios. Table 3 shows that GPT-3.5 yields a high proportion of high-quality task parsing dictionaries, particularly on the Repair Rendering Generation task, where the proportion reaches 96/100. This further confirms that GPT-3.5 not only parses tasks accurately but also generates high-quality parsing dictionaries. In contrast, although Vicuna-7b performs better than Alpaca-7b on the Repair Rendering Generation task, the two perform similarly on the other task types, and both fall below GPT-3.5. These results not only demonstrate the capability of GPT-3.5 as a controller for task parsing and execution but also suggest that improving the complex task planning abilities of LLMs is crucial for future research and development.
Human evaluation
In addition to objective evaluations, we also follow HuggingGPT to invite human experts to perform subjective assessments in our experiments. The significance of incorporating human evaluations lies in their ability to provide nuanced insights that go beyond the quantitative metrics typically used in objective evaluations. While objective metrics are essential for measuring performance, they often fail to capture the qualitative aspects of how well an AI system meets user needs in real-world scenarios. By involving human experts, particularly in a specialized field such as architectural heritage, we ensure that the evaluations consider practical, contextual, and experiential factors that are critical for the successful application of AI technologies.
In our experiments, from our custom set of 300 requests, we extracted 30 requests from each of the three task types, providing a total of 90 requests to the different LLMs. Three architectural heritage experts from Nanchang University evaluated the performance of ArchGPT across three stages (Task Parsing, Tool Utilization, and Answer). The metrics assessed at each stage are described below, and a sketch of the vote aggregation follows the list (results were decided by a vote of these three experts, with agreement of at least two counting as a pass).
\(\circ\) Task parsing: We collect the number of requests whose task type is parsed correctly on the first attempt and the corresponding number of reasonable dictionaries. Correctness refers to the number of requests for which the LLM classifies the task type correctly on the first attempt; Rationality refers to the number of reasonable task parsing dictionaries generated, as judged by the human experts.
\(\circ\) Tool utilization: In the Tool Utilization phase, we use the correctly classified and reasonable task parsing dictionaries to guide the LLM in calling different tools to complete the entire workflow. However, even when the task parsing dictionary plans the workflow reasonably, the LLM may still fail at parameter passing or tool invocation because of its limited instruction-following ability, so that no effective output is obtained from the tools or task-specific models and the workflow cannot be completed. We therefore define Completion as the number of workflows in which the tools are called correctly and the intermediate outputs are collected so that the entire workflow is completed. We also count the Numbers of tools and task-specific models used in completed workflows to evaluate the LLM's task completion efficiency.
\(\circ\) Answer: We use Success to denote how many of the 90 requests received responses (answers generated by ArchGPT) that satisfied the users' requirements.
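The vote aggregation referenced above can be sketched as follows, assuming each request receives three binary judgments per metric and that agreement of at least two experts counts as a pass; the data layout is an assumption:

```python
# Sketch of aggregating the three experts' votes into pass counts per metric
# (Correctness, Rationality, Completion, Success). Data layout is assumed.
from collections import Counter

def tally(votes: list) -> Counter:
    """votes[i][metric] is the list of three expert judgments for request i."""
    passed = Counter()
    for item in votes:
        for metric, judgments in item.items():
            if sum(judgments) >= 2:  # majority of the 3 experts agree
                passed[metric] += 1
    return passed

# Example: one request judged on two of the metrics.
print(tally([{"Correctness": [True, True, False], "Success": [False, True, False]}]))
# Counter({'Correctness': 1})
```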
From Table 4, we see that all LLMs can fulfill user requests in one go with a success rate of over fifty percent. If feedback were introduced through multi-turn dialogue, success rates would be expected to increase further. We also observe that Alpaca and Vicuna exhibit similar levels of Correctness and Numbers, but Vicuna leads significantly in Rationality and Completion. We therefore believe that the reasonableness of the parsed dictionaries and the strength of the instruction-following capability of an LLM are crucial for successful responses. Comparing the three LLMs, GPT-3.5 is notably superior to open-source LLMs such as Alpaca-7b and Vicuna-7b from the Task Parsing stage through to the Answer stage, with less dependency on tools. This aligns with the previous objective assessments and underscores the necessity of a strong LLM as the controller within an AI agent. In summary, ArchGPT fully exploits the potential of LLMs, demonstrating the feasibility of AI agents in solving architectural tasks, with notable efficiency in Task Parsing and Tool Utilization.
We organize the evaluation data from human experts across different applications and LLMs in Table 5 to showcase the diverse applications and effectiveness of ArchGPT. Comparing performance on the Normal Dialogue and Building Repair Guidance tasks, ArchGPT performs better on Building Repair Guidance even though it requires more specialized architectural knowledge. We attribute this advantage to the retrieval module proposed in ArchGPT. To substantiate this, we conduct ablation experiments on the retrieval module within the Building Repair Guidance tasks. As shown in Table 6, without the retrieval module, the performance of all models decreases significantly (from 21 to 12, from 24 to 11, and from 27 to 15). This underscores the importance of the retrieval-augmented generation mechanism in ArchGPT for enhancing performance.