SLAN: Self-Locator Aided Network for Vision-Language Understanding
Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models
TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance
In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval
Preserving Modality Structure Improves Multi-Modal Learning
Distribution-Aware Prompt Tuning for Vision-Language Models
SupFusion: Supervised LiDAR-Camera Fusion for 3D Object Detection
Fg-T2M: Fine-Grained Text-Driven Human Motion Generation via Diffusion Model
Cross-Modal Orthogonal High-Rank Augmentation for RGB-Event Transformer-Trackers
eP-ALM: Efficient Perceptual Augmentation of Language Models
Generating Visual Scenes from Touch
Muscles in Action
Multi-Event Video-Text Retrieval
Referring Image Segmentation using Text Supervision
Audio-Visual Deception Detection: DOLOS Dataset and Parameter-Efficient Crossmodal Learning
CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-Training
Speech2Lip: High-Fidelity Speech to Lip Generation by Learning from a Short Video
GrowCLIP: Data-Aware Automatic Model Growing for Large-Scale Contrastive Language-Image Pre-Training
ChartReader: A Unified Framework for Chart Derendering and Comprehension without Heuristic Rules
Boosting Multi-Modal Model Performance with Adaptive Gradient Modulation
ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data
Robust Referring Video Object Segmentation with Cyclic Structural Consensus
Fantasia3D: Disentangling Geometry and Appearance for High-Quality Text-to-3D Content Creation
CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation
Narrator: Towards Natural Control of Human-Scene Interaction Generation via Relationship Reasoning