About the project
Objective
This project aims to develop and demonstrate AgentVision, a novel video–language model that combines video-based deep learning with multimodal large language model (mLLM) reasoning to enhance environmental perception for autonomous driving. Focusing on pedestrian crossing intention prediction, the project involves developing and training AgentVision on open datasets, deploying it on KTH’s Research Concept Vehicle (RCV-E) at ITRL, and testing it at the Arlanda track. By integrating contextual reasoning with real-time perception, the project seeks to improve the safety, robustness, and trustworthiness of autonomous vehicle systems, providing a blueprint for future intelligent and cooperative mobility solutions.
Background
Autonomous vehicles (AVs) depend on visual perception for situational awareness, yet existing systems face major challenges in complex real-world environments. Traditional vision models struggle with limited visibility, occlusions, and ambiguous interactions among multiple agents. Moreover, deep-learning-based perception relies heavily on large, high-quality datasets, which limits its robustness and generalization across diverse traffic conditions. These challenges hinder accurate understanding of pedestrian behavior and other dynamic elements essential for safe navigation. Developing more adaptable, context-aware perception systems is therefore crucial for enhancing reliability, interpreting complex scenes, and ensuring safety in intelligent transportation and autonomous driving applications.
Cross-disciplinary collaboration
The project is an interdisciplinary collaboration among KTH’s ABE, SCI, and EECS schools and the startup FleetMQ, integrating expertise in transport science and causal AI to advance intelligent mobility innovation.
PI: Zhenliang Ma
Co-PI: Mikael Nybacka