Vector Space Reasoning for Vision-Only Pose and Trajectory Estimation in Dynamic Environments
Recent advances in vision-based autonomous systems have demonstrated that multi-camera setups can enable robust perception without relying on active sensors such as LiDAR or radar. For example, end-to-end systems like those employed in Tesla’s perception stack achieve object detection, tracking, and motion estimation by processing raw image data through geometric rectification, deep convolutional backbones (e.g., RegNet), feature pyramid networks (FPN), and transformer-based modules.
These developments suggest the feasibility of cost-effective, camera-only pipelines for pose and trajectory estimation in complex environments. Such systems could support a broad range of applications, including autonomous vehicles, robotics, augmented reality (AR), and human-computer interaction. However, several challenges persist: occlusions, calibration inconsistencies in multi-camera systems, rapid scene dynamics, and the integration of physical motion constraints into learned representations.
This thesis will investigate a vision-only, multi-camera approach to estimating object and human pose and trajectory, with a particular focus on reasoning over learned latent feature spaces and incorporating kinematic priors. The proposed architecture is inspired by recent developments such as BEVFormer and TriFormer. It will consist of the following components (a schematic sketch follows the list):
(i) image rectification and alignment for multi-camera input,
(ii) a deep convolutional backbone (e.g., RegNet) for spatial feature extraction,
(iii) feature pyramid networks (FPN) for multi-scale representation, and
(iv) transformer-based modules for spatiotemporal fusion.
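As a concrete starting point, the sketch below wires these four stages together in PyTorch. The specific RegNet variant, FPN levels, token layout, and all dimensions are illustrative assumptions rather than the final design; rectification (i) is assumed to happen in preprocessing.

```python
# Minimal sketch of stages (i)-(iv); module choices and sizes are assumptions.
import torch
import torch.nn as nn
from torchvision.models import regnet_y_400mf
from torchvision.ops import FeaturePyramidNetwork


class MultiCamPoseBackbone(nn.Module):
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # (ii) deep convolutional backbone (RegNet-Y 400MF as a placeholder).
        trunk = regnet_y_400mf(weights=None)
        self.stem, self.stages = trunk.stem, trunk.trunk_output
        # (iii) FPN over the last three RegNet stages (stage widths of Y-400MF).
        self.fpn = FeaturePyramidNetwork([104, 208, 440], embed_dim)
        # (iv) transformer encoder fusing tokens across camera views; temporal
        # fusion would additionally stack frames along the token axis.
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, cameras, 3, H, W), already rectified and aligned (i).
        b, n = images.shape[:2]
        x = self.stem(images.flatten(0, 1))
        feats = {}
        for name, stage in self.stages.named_children():
            x = stage(x)
            if name in ("block2", "block3", "block4"):
                feats[name] = x
        p = self.fpn(feats)["block4"]                     # coarsest FPN level
        tokens = p.flatten(2).transpose(1, 2)             # (b*n, HW, embed_dim)
        tokens = tokens.reshape(b, -1, tokens.shape[-1])  # merge camera tokens
        return self.fusion(tokens)                        # fused latent features
```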
Vector Space Reasoning and Kinematic Priors
The core contribution of this work lies in the integration of kinematic priors into structured latent feature spaces—referred to as vector space reasoning. In this context, the vector space represents a structured latent domain (e.g., bird’s-eye view grids or rectified camera-aligned feature maps), over which transformer attention mechanisms operate to model temporal dynamics. Kinematic priors to be considered include assumptions of constant velocity or bounded acceleration, smoothness of trajectories, and non-holonomic constraints for vehicles. These priors will be incorporated through two complementary strategies:
(1) Loss-level integration, via regularization terms penalizing physically implausible motion estimates, and
(2) Architecture-level integration, by biasing attention masks or feature propagation in the temporal domain to reflect expected motion behavior (both strategies are sketched below).
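The sketch below illustrates both strategies under simple assumptions: a finite-difference regularizer for the bounded-acceleration and smoothness priors, and an additive attention bias that restricts temporal attention to a local window. The time step, acceleration bound, and window size are placeholders.

```python
# Hedged sketch of both integration strategies; all thresholds are placeholders.
import torch


def kinematic_regularizer(traj: torch.Tensor, dt: float = 0.1,
                          max_accel: float = 8.0) -> torch.Tensor:
    """Loss-level prior: penalize physically implausible, non-smooth motion.

    traj: (batch, time, 2 or 3) predicted positions sampled at a fixed step dt.
    """
    vel = (traj[:, 1:] - traj[:, :-1]) / dt    # finite-difference velocity
    acc = (vel[:, 1:] - vel[:, :-1]) / dt      # finite-difference acceleration
    # Bounded-acceleration term: only magnitudes above max_accel are penalized.
    accel_excess = (acc.norm(dim=-1) - max_accel).clamp(min=0.0)
    # Smoothness term: discourage abrupt velocity changes of any magnitude.
    smoothness = (vel[:, 1:] - vel[:, :-1]).pow(2).sum(-1)
    return accel_excess.mean() + 0.1 * smoothness.mean()


def temporal_locality_bias(num_steps: int, window: int = 3) -> torch.Tensor:
    """Architecture-level prior: additive attention mask that keeps each frame
    attending only to temporally nearby frames (slowly varying motion)."""
    idx = torch.arange(num_steps)
    dist = (idx[None, :] - idx[:, None]).abs()
    bias = torch.zeros(num_steps, num_steps)
    bias[dist > window] = float("-inf")        # block attention beyond the window
    return bias
```

The bias tensor can be supplied as the additive float mask argument of `nn.MultiheadAttention` or `nn.TransformerEncoder`, so this strategy requires no change to the attention modules themselves, only to how their masks are constructed.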
This fusion of learned features with physically informed constraints is expected to enhance robustness under occlusion, improve temporal coherence, and yield better generalization to unseen dynamics.
Model Outputs
The system will output temporally consistent 3D object bounding boxes and trajectory predictions. For vehicle motion (nuScenes/KITTI), outputs will include position, orientation, and motion vectors over time. In extended experiments involving human subjects (e.g., from indoor datasets), 2D/3D keypoint tracks may also be estimated for articulated pose tracking.
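To make this output interface concrete, the dataclass below sketches one possible per-object record; the field names, units, and the yaw-only orientation convention are assumptions for illustration, not a fixed specification.

```python
# Illustrative output schema; field names, units, and conventions are assumptions.
from dataclasses import dataclass
from typing import Optional
import numpy as np


@dataclass
class TrackedObject:
    """Temporally consistent 3D detection with a predicted trajectory."""
    track_id: int
    category: str                  # e.g. "car", "pedestrian"
    center: np.ndarray             # (x, y, z) position in meters
    size: np.ndarray               # (length, width, height) in meters
    yaw: float                     # heading about the vertical axis, in radians
    velocity: np.ndarray           # (vx, vy) in m/s
    future_trajectory: np.ndarray  # (T, 2) predicted BEV positions over T steps
    keypoints_3d: Optional[np.ndarray] = None  # (J, 3) joints for human subjects
```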
Possible Research Questions
- To what extent can a vision-only system reliably estimate human and object pose and trajectory in dynamic, cluttered environments using consumer-grade hardware?
- How can spatial and temporal fusion be optimized for multi-camera setups with overlapping fields of view to enhance motion understanding?
- What is the role of vector space reasoning and integrated kinematic constraints in improving accuracy and robustness?
- How well can the system generalize across different domains (e.g., indoor vs. outdoor environments, human vs. vehicle motion)?
- What are the most effective methods for injecting kinematic priors into neural architectures: loss-based regularization, architectural inductive biases, or input-level augmentation?
The project will compare transformer-based architectures that reason over spatial-temporal features in both bird’s-eye view and camera-space representations. The evaluation will examine how different modeling approaches—BEV-based fusion using geometric priors versus temporal attention-based learning—affect the system’s ability to estimate pose and motion from purely visual input. Key performance metrics will include 3D IoU, Average Displacement Error (ADE), Final Displacement Error (FDE), and trajectory continuity scores.
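For reference, ADE and FDE are taken here as the mean and final-step Euclidean distance between predicted and ground-truth positions; the minimal implementation below reflects that assumption.

```python
# ADE/FDE as assumed above: mean / final-step Euclidean error between
# predicted and ground-truth BEV trajectories of equal length.
import numpy as np


def ade(pred: np.ndarray, gt: np.ndarray) -> float:
    """Average Displacement Error over all time steps; pred, gt: (T, 2)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def fde(pred: np.ndarray, gt: np.ndarray) -> float:
    """Final Displacement Error at the last predicted time step."""
    return float(np.linalg.norm(pred[-1] - gt[-1]))
```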
The project will be supervised by Dr. Yuliya Shapovalova from Radboud University.
Contact: yuliya.shapovalova@ru.nl
References
1. Li, Y., Zhang, Y., Wang, X., & Sun, J. (2023). BEVFormer: Bird’s-Eye View Spatial-Temporal Transformers for Multi-Camera 3D Object Detection. CVPR 2023, pp. 16856–16866.
2. Wu, Y., Fang, T., & Wang, Y. (2024). TriFormer: Transformer-based Multi-Camera 3D Detection via Adaptive Multi-View Fusion. ICCV 2024, pp. 1427–1436.
3. Caesar, H., Bankiti, V., Lang, A. H., Vora, S., Liong, V. E., Xu, Q., ... & Beijbom, O. (2020). nuScenes: A multimodal dataset for autonomous driving. CVPR 2020, pp. 11621–11631.
4. Geiger, A., Lenz, P., & Urtasun, R. (2012). Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. CVPR 2012, pp. 3354–3361.