DeepQuery VisionXTrans: A Unified Vision Transformer Framework with Custom Attention and Hybrid Learning

We present DeepQuery VisionXTrans, a unified Vision Transformer–based framework that integrates a novel Custom Attention mechanism and a Hybrid Learning paradigm to achieve state-of-the-art performance on image classification, object detection, and semantic segmentation. VisionXTrans extends the standard ViT architecture with a Custom Attention module that adaptively modulates spatial and contextual feature interactions via learnable attention masks and dynamic head weighting. To further strengthen representations across diverse vision tasks, we employ a Hybrid Learning strategy that jointly optimizes supervised classification, detection, and segmentation losses alongside a self-supervised reconstruction objective. The backbone produces rich patch embeddings with positional encoding, which are fed into a stack of transformer encoder layers augmented by our Custom Attention blocks. Task-specific heads for classification, region-proposal-free detection, and pixel-wise segmentation share the same fused features, enabling seamless multitask inference. We validate VisionXTrans on multiple benchmarks, achieving 85.7% top-1 classification accuracy, 47.3% mean average precision (mAP) on object detection, and 78.2% mean intersection-over-union (mIoU) on segmentation, all at real-time inference speeds on a single GPU. Ablation studies confirm that both the Custom Attention and Hybrid Learning components improve generalization and robustness under domain shift.
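
To make the Custom Attention design concrete, the following is a minimal PyTorch-style sketch of a self-attention block with a learnable additive attention mask and dynamic per-head weighting. The class and parameter names (CustomAttention, mask_logits, head_gate) are illustrative assumptions, not the paper's released implementation.

import torch
import torch.nn as nn

class CustomAttention(nn.Module):
    """Multi-head self-attention with a learnable additive attention mask
    and dynamic per-head weighting (illustrative sketch)."""

    def __init__(self, dim, num_heads, num_patches):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5

        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

        # Learnable attention mask added to the attention logits, one per head.
        self.mask_logits = nn.Parameter(torch.zeros(num_heads, num_patches, num_patches))
        # Dynamic head weighting: a small gate predicts one weight per head
        # from the mean token representation.
        self.head_gate = nn.Linear(dim, num_heads)

    def forward(self, x):
        # x: (B, N, C); N must equal num_patches for the mask to broadcast.
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each: (B, heads, N, head_dim)

        attn = (q @ k.transpose(-2, -1)) * self.scale   # (B, heads, N, N)
        attn = attn + self.mask_logits                  # learnable spatial mask
        attn = attn.softmax(dim=-1)

        gate = torch.sigmoid(self.head_gate(x.mean(dim=1)))   # (B, heads)
        out = (attn @ v) * gate[:, :, None, None]             # rescale each head
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

In this sketch, the learnable mask biases which spatial positions attend to one another, while the gate rescales each head's contribution on a per-image basis.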
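
The Hybrid Learning objective can likewise be sketched as a weighted sum of the supervised task losses and a self-supervised reconstruction term. The individual loss choices and the weights below are placeholders for illustration, not values reported in the paper.

import torch.nn.functional as F

def hybrid_loss(outputs, targets, weights=(1.0, 1.0, 1.0, 0.5)):
    """Weighted sum of supervised task losses and a self-supervised
    reconstruction term (placeholder losses and weights)."""
    w_cls, w_det, w_seg, w_rec = weights

    cls_loss = F.cross_entropy(outputs["cls_logits"], targets["labels"])
    det_loss = F.l1_loss(outputs["boxes"], targets["boxes"])              # stand-in box regression term
    seg_loss = F.cross_entropy(outputs["seg_logits"], targets["masks"])   # per-pixel class labels
    rec_loss = F.mse_loss(outputs["reconstruction"], targets["pixels"])   # reconstruction objective

    return w_cls * cls_loss + w_det * det_loss + w_seg * seg_loss + w_rec * rec_loss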
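
Finally, a possible layout for the shared task-specific heads is sketched below, assuming all three heads consume the same fused token features from the encoder; the head structure and dimensions are hypothetical.

import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """Classification, detection, and segmentation heads sharing one set of
    fused backbone features (illustrative layout)."""

    def __init__(self, dim, num_classes, num_queries=100, seg_classes=21):
        super().__init__()
        self.cls_head = nn.Linear(dim, num_classes)
        # Proposal-free detection: predict a fixed set of boxes plus class scores per image.
        self.det_head = nn.Linear(dim, num_queries * (4 + num_classes))
        # Pixel-wise segmentation: per-patch logits, upsampled to full resolution downstream.
        self.seg_head = nn.Linear(dim, seg_classes)

    def forward(self, tokens):              # tokens: (B, N, dim) fused features
        pooled = tokens.mean(dim=1)          # global feature for image-level tasks
        cls_logits = self.cls_head(pooled)
        det_out = self.det_head(pooled)
        seg_logits = self.seg_head(tokens)   # one prediction per patch token
        return cls_logits, det_out, seg_logits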