DeepQuery VisionXTrans Research Paper

events
Please enter first name.
Please enter last name.
Please enter email.
Please enter phone number.
Please enter job title.
Please enter company name.
Please select your country.
We present DeepQuery VisionXTrans, a unified Vision Transformer–based framework that integrates novel Custom Attention mechanisms and a Hybrid Learning paradigm to achieve state-of-the-art performance on image classification, object detection, and semantic segmentation tasks. VisionXTrans extends the standard ViT architecture by introducing a Custom Attention module that adaptively modulates spatial and contextual feature interactions via learnable attention masks and dynamic head weighting. To further enhance representation power across diverse vision tasks, we employ a Hybrid Learning strategy that jointly optimizes supervised classification, detection, and segmentation losses alongside a self-supervised reconstruction objective. The model’s backbone generates rich patch embeddings with positional encoding, which are fed into a stack of transformer encoder layers augmented by our Custom Attention blocks. Task-specific heads—comprising a classification head, a region proposal–free detection head, and a pixel-wise segmentation head—share the same fused features, enabling seamless multitask inference. We validate VisionXTrans on multiple benchmarks, demonstrating superior accuracy (top-1 classification accuracy of 85.7 %), mean average precision (mAP) of 47.3 % on object detection, and mean intersection-over-union (mIoU) of 78.2 % on segmentation, all with real-time inference speeds on a single GPU. Our ablation studies confirm the effectiveness of both the Custom Attention and Hybrid Learning components in improving generalization and robustness under domain shifts.