
自学教程:计算机视觉与模式识别学术论文最新成果

51自学网 2023-06-29 16:36:16

计算机视觉与模式识别学术

cs.CV 方向,今日共计71篇

Transformer(2篇)

【1】 DProST: 6-DoF Object Pose Estimation Using Space Carving and Dynamic Projective Spatial Transformer 标题:DProST:基于空间雕刻和动态投影空间变换的6-DOF目标位姿估计 链接:https://arxiv.org/abs/2112.08775

作者:Jaewoo Park,Nam Ik Cho 摘要:预测物体的姿态是计算机视觉的核心任务。大多数基于深度学习的姿势估计方法要求CAD数据使用3D中间表示或投影2D外观。但是,当感兴趣对象的CAD数据不可用时,不能使用这些方法。此外,现有的方法并没有准确反映学习过程中的视角扭曲。此外,由于自遮挡导致的信息丢失还没有得到很好的研究。在这方面,我们提出了一种新的姿态估计系统,该系统由一个空间雕刻模块组成,该模块重建一个参考3D特征以替换CAD数据。此外,我们的新变换模块动态投影空间变换器(DProST)在考虑透视失真的同时变换参考3D特征以反映姿势。此外,我们还通过一种新的双向Z缓冲(BiZ-buffer)方法克服了自遮挡问题,该方法提取了对象的前视图和自遮挡后视图。最后,我们提出了一种透视网格距离损失(PGDL),可以在没有CAD数据的情况下稳定地学习姿态估计器。实验结果表明,我们的方法在LINEMOD数据集上的性能优于最新的方法,在LINEMOD-OCCLUSION数据集上的性能与网络训练中需要CAD数据的方法相当。 摘要:Predicting the pose of an object is a core computer vision task. Most deep learning-based pose estimation methods require CAD data to use 3D intermediate representations or project 2D appearance. However, these methods cannot be used when CAD data for objects of interest are unavailable. Besides, the existing methods did not precisely reflect the perspective distortion to the learning process. In addition, information loss due to self-occlusion has not been studied well. In this regard, we propose a new pose estimation system consisting of a space carving module that reconstructs a reference 3D feature to replace the CAD data. Moreover, Our new transformation module, Dynamic Projective Spatial Transformer (DProST), transforms a reference 3D feature to reflect the pose while considering perspective distortion. Also, we overcome the self-occlusion problem by a new Bidirectional Z-buffering (BiZ-buffer) method, which extracts both the front view and the self-occluded back view of the object. Lastly, we suggest a Perspective Grid Distance Loss (PGDL), enabling stable learning of the pose estimator without CAD data. Experimental results show that our method outperforms the state-of-the-art method on the LINEMOD dataset and comparable performance on LINEMOD-OCCLUSION dataset even compared to the methods that require CAD data in network training.
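
下面给出一个极简的 PyTorch 草图,演示"按给定位姿将参考3D网格做透视投影后在2D特征图上采样"这一投影式空间变换的基本思路;其中 projective_sample 函数、网格尺寸与相机内参均为本示例假设,并非 DProST 的官方实现。

```python
import torch
import torch.nn.functional as F

def projective_sample(feat_2d, R, t, K, grid_size=16, half_extent=0.1):
    """feat_2d: (1, C, H, W) 图像特征; R: (3,3) 旋转; t: (3,) 平移; K: (3,3) 相机内参(均为假设输入)"""
    # 在物体坐标系下构建以原点为中心的立方体参考网格
    lin = torch.linspace(-half_extent, half_extent, grid_size)
    zz, yy, xx = torch.meshgrid(lin, lin, lin, indexing="ij")
    pts = torch.stack([xx, yy, zz], dim=-1).reshape(-1, 3)          # (G^3, 3)
    # 按位姿变换到相机坐标系并做透视投影
    pts_cam = pts @ R.T + t
    uv = pts_cam @ K.T
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)                     # 像素坐标
    H, W = feat_2d.shape[-2:]
    u = uv[:, 0] / (W - 1) * 2 - 1                                  # 归一化到 [-1, 1]
    v = uv[:, 1] / (H - 1) * 2 - 1
    grid = torch.stack([u, v], dim=-1).view(1, grid_size ** 3, 1, 2)
    sampled = F.grid_sample(feat_2d, grid, align_corners=True)      # (1, C, G^3, 1)
    return sampled.view(1, -1, grid_size, grid_size, grid_size)     # 随位姿变化的3D特征体

feat = torch.randn(1, 8, 64, 64)
R, t = torch.eye(3), torch.tensor([0.0, 0.0, 0.5])
K = torch.tensor([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
print(projective_sample(feat, R, t, K).shape)   # torch.Size([1, 8, 16, 16, 16])
```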

【2】 TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning 标题:TransZero++:用于零样本学习的交叉属性引导Transformer 链接:https://arxiv.org/abs/2112.08643

作者:Shiming Chen,Ziming Hong,Guo-Sen Xie,Jian Zhao,Xinge You,Shuicheng Yan,Ling Shao 备注:This is an extention of AAAI'22 paper (TransZero). Submitted to TPAMI. arXiv admin note: substantial text overlap with arXiv:2112.01683 摘要:Zero-Shot学习(Zero-shot learning,ZSL)通过将语义知识从可见类转移到不可见类来解决新的类识别问题。现有的基于注意的模型仅利用单向注意来学习单个图像中的次区域特征,忽略了视觉特征的可转移性和区分性属性定位。在本文中,我们提出了一种称为TransZero++的跨属性引导Transformer网络,用于细化视觉特征和学习ZSL中语义增强视觉嵌入表示的准确属性定位。TransZero++由属性$\rightarrow$可视转换器子网(AVT)和属性$\rightarrow$可视转换器子网(VAT)组成。具体地说,AVT首先采用特征增强编码器来缓解交叉数据集问题,并通过减少区域特征之间纠缠的相对几何关系来提高视觉特征的可转移性。然后,使用属性$\rightarrow$视觉解码器定位与给定图像中每个属性最相关的图像区域,以实现基于属性的视觉特征表示。类似地,VAT使用类似的特征增强编码器来细化视觉特征,这些特征进一步应用于visual$\rightarrow$属性解码器,以学习基于视觉的属性特征。通过进一步引入语义协作损失,两个属性引导的转换器通过语义协作学习相互学习语义增强的视觉嵌入。大量的实验表明,TransZero++在三个具有挑战性的ZSL基准上获得了最新的结果。代码可从以下网址获得:\url{https://github.com/shiming-chen/TransZero_pp}. 摘要:Zero-shot learning (ZSL) tackles the novel class recognition problem by transferring semantic knowledge from seen classes to unseen ones. Existing attention-based models have struggled to learn inferior region features in a single image by solely using unidirectional attention, which ignore the transferability and discriminative attribute localization of visual features. In this paper, we propose a cross attribute-guided Transformer network, termed TransZero++, to refine visual features and learn accurate attribute localization for semantic-augmented visual embedding representations in ZSL. TransZero++ consists of an attribute$\rightarrow$visual Transformer sub-net (AVT) and a visual$\rightarrow$attribute Transformer sub-net (VAT). Specifically, AVT first takes a feature augmentation encoder to alleviate the cross-dataset problem, and improves the transferability of visual features by reducing the entangled relative geometry relationships among region features. Then, an attribute$\rightarrow$visual decoder is employed to localize the image regions most relevant to each attribute in a given image for attribute-based visual feature representations. Analogously, VAT uses the similar feature augmentation encoder to refine the visual features, which are further applied in visual$\rightarrow$attribute decoder to learn visual-based attribute features. By further introducing semantical collaborative losses, the two attribute-guided transformers teach each other to learn semantic-augmented visual embeddings via semantical collaborative learning. Extensive experiments show that TransZero++ achieves the new state-of-the-art results on three challenging ZSL benchmarks. The codes are available at: \url{https://github.com/shiming-chen/TransZero_pp}.
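
为说明"属性→视觉"交叉注意力的基本写法,下面给出一个极简草图:以属性语义嵌入作为 query,在图像区域特征上做注意力,得到基于属性的视觉表征;类名、维度与属性数均为示例假设,并非 TransZero++ 的官方实现。

```python
import torch
import torch.nn as nn

class AttributeVisualCrossAttention(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, attr_emb, region_feat):
        """attr_emb: (B, A, D) 属性语义嵌入; region_feat: (B, R, D) 图像区域特征"""
        # 每个属性作为 query,在区域特征上做注意力,得到基于属性的视觉表征
        out, attn_w = self.attn(query=attr_emb, key=region_feat, value=region_feat)
        return out, attn_w          # attn_w 可视为属性在各图像区域上的定位权重

m = AttributeVisualCrossAttention()
attr = torch.randn(2, 85, 256)      # 例如 85 个属性
regions = torch.randn(2, 49, 256)   # 7x7 区域特征
out, w = m(attr, regions)
print(out.shape, w.shape)           # (2, 85, 256) (2, 85, 49)
```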

检测相关(10篇)

【1】 The MVTec 3D-AD Dataset for Unsupervised 3D Anomaly Detection and Localization 标题:用于无监督三维异常检测和定位的MVTec 3D-AD数据集 链接:https://arxiv.org/abs/2112.09045

作者:Paul Bergmann,Xin Jin,David Sattlegger,Carsten Steger 备注:Accepted for presentation at VISAPP 2022 摘要:我们介绍了第一个用于无监督异常检测和定位任务的综合3D数据集。它的灵感来源于现实世界中的视觉检查场景,在这种场景中,模型必须检测制造产品上的各种类型的缺陷,即使它只接受无异常数据的训练。有些缺陷表现为物体几何结构的异常。这些会导致数据的三维表示出现重大偏差。我们使用高分辨率工业3D传感器获取10种不同对象类别的深度扫描。对于所有对象类别,我们提供了一个训练和验证集,每个训练和验证集仅由无异常样本的扫描组成。相应的测试集包含显示各种缺陷的样品,如划痕、凹痕、孔洞、污染或变形。为每个异常测试样本提供精确的地面真值注释。我们的数据集上3D异常检测方法的初始基准表明有很大的改进空间。 摘要:We introduce the first comprehensive 3D dataset for the task of unsupervised anomaly detection and localization. It is inspired by real-world visual inspection scenarios in which a model has to detect various types of defects on manufactured products, even if it is trained only on anomaly-free data. There are defects that manifest themselves as anomalies in the geometric structure of an object. These cause significant deviations in a 3D representation of the data. We employed a high-resolution industrial 3D sensor to acquire depth scans of 10 different object categories. For all object categories, we present a training and validation set, each of which solely consists of scans of anomaly-free samples. The corresponding test sets contain samples showing various defects such as scratches, dents, holes, contaminations, or deformations. Precise ground-truth annotations are provided for every anomalous test sample. An initial benchmark of 3D anomaly detection methods on our dataset indicates a considerable room for improvement.

【2】 MVSS-Net: Multi-View Multi-Scale Supervised Networks for Image Manipulation Detection 标题:MVSS-Net:用于图像篡改检测的多视点多尺度监督网络 链接:https://arxiv.org/abs/2112.08935

作者:Chengbo Dong,Xinru Chen,Ruohan Hu,Juan Cao,Xirong Li 备注:arXiv admin note: substantial text overlap with arXiv:2104.06832 摘要:图像操纵检测的关键研究问题是如何学习对新数据操纵敏感的可概括特征,同时具体防止真实图像上的假警报。目前的研究强调敏感性,而忽视了特异性。在本文中,我们通过多视图特征学习和多尺度监控来解决这两个问题。前者通过利用篡改区域周围的噪声分布和边界伪影,旨在学习语义不可知的特征,从而获得更一般化的特征。后者允许我们从依赖于语义分割损失的现有技术所考虑的非平凡的真实图像中学习。我们的思想是通过一个新的网络来实现的,我们称之为MVSS-Net及其增强版MVSS-Net++。在六个公共基准数据集上的综合实验证明了MVSS网络系列在像素级和图像级操作检测方面的可行性。 摘要:The key research question for image manipulation detection is how to learn generalizable features that are sensitive to manipulations in novel data, whilst specific to prevent false alarms on authentic images. Current research emphasizes the sensitivity, with the specificity mostly ignored. In this paper we address both aspects by multi-view feature learning and multi-scale supervision. By exploiting noise distribution and boundary artifacts surrounding tampered regions, the former aims to learn semantic-agnostic and thus more generalizable features. The latter allows us to learn from authentic images which are nontrivial to be taken into account by the prior art that relies on a semantic segmentation loss. Our thoughts are realized by a new network which we term MVSS-Net and its enhanced version MVSS-Net++. Comprehensive experiments on six public benchmark datasets justify the viability of the MVSS-Net series for both pixel-level and image-level manipulation detection.

【3】 Toward Minimal Misalignment at Minimal Cost in One-Stage and Anchor-Free Object Detection 标题:在单阶段无锚框目标检测中以最小代价实现最小失准 链接:https://arxiv.org/abs/2112.08902

作者:Shuaizheng Hao,Hongzhe Liu,Ningwei Wang,Cheng Xu 摘要:常见的目标检测模型由分类和回归两个分支组成,由于任务驱动因素的不同,这两个分支对同一尺度、同一空间位置的特征具有不同的敏感性。基于点的预测方法基于高分类置信点具有高回归质量的假设,导致了失准问题。我们的分析表明,该问题进一步具体包括尺度失调和空间失调。我们的目标是以最小的成本解决这一现象:对头部网络进行微调,并用一种新的标签分配方法取代刚性的标签分配方法。我们的实验表明,与基线FCOS(一种单阶段、无锚定的目标检测模型)相比,我们的模型在不同主干的情况下持续获得约3个AP改进,证明了我们方法的简单性和效率。 摘要:Common object detection models consist of classification and regression branches, due to different task drivers, these two branches have different sensibility to the features from the same scale level and the same spatial location. The point-based prediction method, which is based on the assumption that the high classification confidence point has the high regression quality, leads to the misalignment problem. Our analysis shows, the problem is further composed of scale misalignment and spatial misalignment specifically. We aim to resolve the phenomenon at minimal cost: a minor adjustment of the head network and a new label assignment method replacing the rigid one. Our experiments show that, compared to the baseline FCOS, a one-stage and anchor-free object detection model, our model consistently get around 3 AP improvement with different backbones, demonstrating both simplicity and efficiency of our method.

【4】 Improved YOLOv5 network for real-time multi-scale traffic sign detection 标题:用于实时多尺度交通标志检测的改进YOLOv5网络 链接:https://arxiv.org/abs/2112.08782

作者:Junfan Wang,Yi Chen,Mingyu Gao,Zhekang Dong 摘要:交通标志检测对于无人驾驶系统来说是一项具有挑战性的任务,尤其是对于多尺度目标的检测和检测的实时性问题。在交通标志检测过程中,目标的尺度变化很大,这会对检测精度产生一定的影响。特征金字塔被广泛用于解决这一问题,但它可能会破坏不同尺度交通标志的特征一致性。此外,在实际应用中,常规方法难以在保证实时检测的同时提高多尺度交通标志的检测精度。在本文中,我们提出了一种改进的特征金字塔模型AF-FPN,该模型利用自适应注意模块(AAM)和特征增强模块(FEM)来减少特征地图生成过程中的信息损失,增强特征金字塔的表示能力。我们用AF-FPN替换了YOLOv5中原有的特征金字塔网络,在保证实时检测的前提下,提高了YOLOv5网络对多尺度目标的检测性能。此外,还提出了一种新的自动学习数据扩充方法,以丰富数据集,提高模型的鲁棒性,使其更适合实际场景。在清华腾讯100K(TT100K)数据集上的大量实验结果表明,与几种最先进的方法相比,该方法具有有效性和优越性。 摘要:Traffic sign detection is a challenging task for the unmanned driving system, especially for the detection of multi-scale targets and the real-time problem of detection. In the traffic sign detection process, the scale of the targets changes greatly, which will have a certain impact on the detection accuracy. Feature pyramid is widely used to solve this problem but it might break the feature consistency across different scales of traffic signs. Moreover, in practical application, it is difficult for common methods to improve the detection accuracy of multi-scale traffic signs while ensuring real-time detection. In this paper, we propose an improved feature pyramid model, named AF-FPN, which utilizes the adaptive attention module (AAM) and feature enhancement module (FEM) to reduce the information loss in the process of feature map generation and enhance the representation ability of the feature pyramid. We replaced the original feature pyramid network in YOLOv5 with AF-FPN, which improves the detection performance for multi-scale targets of the YOLOv5 network under the premise of ensuring real-time detection. Furthermore, a new automatic learning data augmentation method is proposed to enrich the dataset and improve the robustness of the model to make it more suitable for practical scenarios. Extensive experimental results on the Tsinghua-Tencent 100K (TT100K) dataset demonstrate the effectiveness and superiority of the proposed method when compared with several state-of-the-art methods.

【5】 Radio-Assisted Human Detection 标题:无线电辅助人体检测 链接:https://arxiv.org/abs/2112.08743

作者:Chengrun Qiu,Dongheng Zhang,Yang Hu,Houqiang Li,Qibin Sun,Yan Chen 摘要:在本文中,我们提出了一个无线电辅助人体检测框架,将无线电信息纳入最先进的检测方法中,包括基于锚的单级检测器和两级检测器。我们从无线电信号中提取无线电定位和识别信息来辅助人体检测,从而大大缓解了误报和漏报问题。对于这两种检测器,我们使用基于无线电定位的置信度评分修正来提高检测性能。对于两阶段检测方法,我们建议利用无线电定位生成的区域建议,而不是依赖于区域建议网络(RPN)。此外,利用无线识别信息,提出了一种具有无线定位约束的非最大值抑制方法,进一步抑制误检测,减少漏检。在模拟的微软COCO数据集和加州理工学院行人数据集上的实验表明,借助无线电信息,先进检测方法的平均精度(mAP)和漏检率可以得到提高。最后,我们在真实场景中进行了实验,以验证我们提出的方法在实践中的可行性。 摘要:In this paper, we propose a radio-assisted human detection framework by incorporating radio information into the state-of-the-art detection methods, including anchor-based onestage detectors and two-stage detectors. We extract the radio localization and identifer information from the radio signals to assist the human detection, due to which the problem of false positives and false negatives can be greatly alleviated. For both detectors, we use the confidence score revision based on the radio localization to improve the detection performance. For two-stage detection methods, we propose to utilize the region proposals generated from radio localization rather than relying on region proposal network (RPN). Moreover, with the radio identifier information, a non-max suppression method with the radio localization constraint has also been proposed to further suppress the false detections and reduce miss detections. Experiments on the simulative Microsoft COCO dataset and Caltech pedestrian datasets show that the mean average precision (mAP) and the miss rate of the state-of-the-art detection methods can be improved with the aid of radio information. Finally, we conduct experiments in real-world scenarios to demonstrate the feasibility of our proposed method in practice.
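
下面用 NumPy 给出一个示意草图:先按检测框中心与无线电定位点的距离修正置信度,再在 NMS 中加入"对应不同无线标识的框不互相抑制"的约束;其中的函数名、距离阈值与修正系数均为假设,仅用于说明思路,并非论文官方实现。

```python
import numpy as np

def revise_scores(boxes, scores, radio_xy, sigma=50.0, boost=0.2):
    """boxes: (N,4) xyxy; radio_xy: (M,2) 无线电定位投影到图像平面的位置(假设已对齐)"""
    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
    d = np.linalg.norm(centers[:, None, :] - radio_xy[None, :, :], axis=-1)   # (N, M)
    nearest = d.min(axis=1)
    # 检测框离某个无线电定位越近,置信度被抬升越多(高斯权重)
    revised = np.clip(scores + boost * np.exp(-(nearest ** 2) / (2 * sigma ** 2)), 0, 1)
    return revised, d.argmin(axis=1)

def iou(a, b):
    x1, y1 = np.maximum(a[0], b[0]), np.maximum(a[1], b[1])
    x2, y2 = np.minimum(a[2], b[2]), np.minimum(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda z: (z[2] - z[0]) * (z[3] - z[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def radio_nms(boxes, scores, radio_id, iou_thr=0.5):
    order = np.argsort(-scores)
    keep = []
    for i in order:
        # 与已保留框重叠大且对应同一无线标识时才抑制;不同标识则保留,减少漏检
        if all(iou(boxes[i], boxes[j]) < iou_thr or radio_id[i] != radio_id[j] for j in keep):
            keep.append(i)
    return keep

boxes = np.array([[10, 10, 50, 90], [12, 12, 52, 92], [100, 20, 140, 100]], float)
scores = np.array([0.6, 0.55, 0.7])
radio = np.array([[30.0, 50.0], [120.0, 60.0]])
scores2, rid = revise_scores(boxes, scores, radio)
print(radio_nms(boxes, scores2, rid))
```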

【6】 QAHOI: Query-Based Anchors for Human-Object Interaction Detection 标题:QAHOI:基于查询的人-物交互检测锚 链接:https://arxiv.org/abs/2112.08647

作者:Junwen Chen,Keiji Yanai 摘要:人-物交互(HOI)检测作为目标检测任务的下游,需要定位人和对象对,并从图像中提取人和对象之间的语义关系。近年来,单阶段方法由于其高效性而成为这项任务的新趋势。然而,这些方法侧重于检测可能的交互点或过滤人类-对象对,忽略了不同对象在空间尺度上的位置和大小的变化。为了解决这个问题,我们提出了一种基于转换器的方法QAHOI(基于查询的人-对象交互检测锚),它利用多尺度体系结构从不同的空间尺度提取特征,并使用基于查询的锚来预测HOI实例的所有元素。我们进一步研究了一个强大的主干显著提高了QAHOI的准确性,并且基于Transformer主干的QAHOI在HICO-DET基准上比最新的先进方法有很大的优势。源代码位于$\href{https://github.com/cjw2021/QAHOI}{\text{此https URL}}$。 摘要:Human-object interaction (HOI) detection as a downstream of object detection tasks requires localizing pairs of humans and objects and extracting the semantic relationships between humans and objects from an image. Recently, one-stage approaches have become a new trend for this task due to their high efficiency. However, these approaches focus on detecting possible interaction points or filtering human-object pairs, ignoring the variability in the location and size of different objects at spatial scales. To address this problem, we propose a transformer-based method, QAHOI (Query-Based Anchors for Human-Object Interaction detection), which leverages a multi-scale architecture to extract features from different spatial scales and uses query-based anchors to predict all the elements of an HOI instance. We further investigate that a powerful backbone significantly increases accuracy for QAHOI, and QAHOI with a transformer-based backbone outperforms recent state-of-the-art methods by large margins on the HICO-DET benchmark. The source code is available at $\href{https://github.com/cjw2021/QAHOI}{\text{this https URL}}$.

【7】 Frequency Spectrum Augmentation Consistency for Domain Adaptive Object Detection 标题:一种域自适应目标检测的频谱增强一致性算法 链接:https://arxiv.org/abs/2112.08605

作者:Rui Liu,Yahong Han,Yaowei Wang,Qi Tian 摘要:领域自适应目标检测(DAOD)旨在提高训练和测试数据来自不同领域时检测器的泛化能力。考虑到显著的域差距,一些典型的方法,例如基于CycleGAN的方法,采用中间域逐步桥接源域和目标域。然而,基于CycleGAN的中间域缺少pix或实例级别的对象检测监控,这导致语义差异。为了解决这个问题,在本文中,我们引入了一个频谱增强一致性(FSAC)框架,其中包含四种不同的低频滤波器操作。这样,我们就可以得到一系列的增广数据作为中间域。具体来说,我们提出了一个两阶段优化框架。在第一阶段中,我们利用所有原始和扩充的源数据来训练目标检测器。在第二阶段,采用带有伪标签的增广源数据和目标数据进行预测一致性的自训练。并利用均值教师优化的教师模型进一步修正伪标签。在实验中,我们分别对单目标DAOD和复合目标DAOD进行了评估,证明了我们的方法的有效性。 摘要:Domain adaptive object detection (DAOD) aims to improve the generalization ability of detectors when the training and test data are from different domains. Considering the significant domain gap, some typical methods, e.g., CycleGAN-based methods, adopt the intermediate domain to bridge the source and target domains progressively. However, the CycleGAN-based intermediate domain lacks the pix- or instance-level supervision for object detection, which leads to semantic differences. To address this problem, in this paper, we introduce a Frequency Spectrum Augmentation Consistency (FSAC) framework with four different low-frequency filter operations. In this way, we can obtain a series of augmented data as the intermediate domain. Concretely, we propose a two-stage optimization framework. In the first stage, we utilize all the original and augmented source data to train an object detector. In the second stage, augmented source and target data with pseudo labels are adopted to perform the self-training for prediction consistency. And a teacher model optimized using Mean Teacher is used to further revise the pseudo labels. In the experiment, we evaluate our method on the single- and compound- target DAOD separately, which demonstrate the effectiveness of our method.
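
作为低频滤波操作的一个最小示例,下面的草图对图像做 FFT、只保留圆形低通区域后再逆变换,得到可作为"中间域"的增广图像;滤波形状与半径均为本示例假设,并非 FSAC 中四种滤波操作的具体实现。

```python
import numpy as np

def low_freq_filter(img, radius=0.1):
    """img: (H, W) 或 (H, W, C) 浮点图像; radius: 保留低频的归一化半径(假设值)"""
    img = np.asarray(img, dtype=np.float32)
    if img.ndim == 2:
        img = img[..., None]
    H, W, C = img.shape
    yy, xx = np.mgrid[0:H, 0:W]
    dist = np.sqrt((yy - H / 2) ** 2 + (xx - W / 2) ** 2)
    mask = (dist <= radius * min(H, W)).astype(np.float32)       # 圆形低通掩码
    out = np.empty_like(img)
    for c in range(C):
        spec = np.fft.fftshift(np.fft.fft2(img[..., c]))         # 频谱中心化
        out[..., c] = np.real(np.fft.ifft2(np.fft.ifftshift(spec * mask)))
    return out.squeeze()

aug = low_freq_filter(np.random.rand(64, 64, 3), radius=0.15)
print(aug.shape)   # (64, 64, 3)
```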

【8】 FIgLib & SmokeyNet: Dataset and Deep Learning Model for Real-Time Wildland Fire Smoke Detection 标题:FIgLib&SmokeyNet:野外火灾烟雾实时探测的数据集和深度学习模型 链接:https://arxiv.org/abs/2112.08598

作者:Anshuman Dewangan,Yash Pande,Hans-Werner Braun,Frank Vernon,Ismael Perez,Ilkay Atlintas,Gary Cottrell,Mai H. Nguyen 摘要:近几年来,美国西部野火的规模和频率急剧增加。在火灾风险高的日子里,小火点会迅速扩大并失去控制。早期检测初始烟雾引发的火灾有助于在火灾变得难以管理之前对此类火灾作出反应。过去用于野火烟雾探测的深度学习方法受到小数据集或不可靠数据集的影响,这使得很难将性能推断到真实场景中。在这项工作中,我们展示了Fire Ignition Library(FIgLib),这是一个公开可用的数据集,包含近25000张标记的野火烟雾图像,这些图像来自部署在南加州的固定视角摄像机。我们还介绍了SmokeyNet,这是一种新的深度学习体系结构,它利用摄像机图像中的时空信息进行实时野火烟雾探测。当在FIgLib数据集上进行训练时,SmokeyNet的表现优于可比基线,并与人类绩效相媲美。我们希望,FIgLib数据集和SmokeyNet体系结构的可用性将激发对野火烟雾探测深度学习方法的进一步研究,从而实现自动通知系统,缩短野火响应时间。 摘要:The size and frequency of wildland fires in the western United States have dramatically increased in recent years. On high fire-risk days, a small fire ignition can rapidly grow and get out of control. Early detection of fire ignitions from initial smoke can assist the response to such fires before they become difficult to manage. Past deep learning approaches for wildfire smoke detection have suffered from small or unreliable datasets that make it difficult to extrapolate performance to real-world scenarios. In this work, we present the Fire Ignition Library (FIgLib), a publicly-available dataset of nearly 25,000 labeled wildfire smoke images as seen from fixed-view cameras deployed in Southern California. We also introduce SmokeyNet, a novel deep learning architecture using spatio-temporal information from camera imagery for real-time wildfire smoke detection. When trained on the FIgLib dataset, SmokeyNet outperforms comparable baselines and rivals human performance. We hope that the availability of the FIgLib dataset and the SmokeyNet architecture will inspire further research into deep learning methods for wildfire smoke detection, leading to automated notification systems that reduce the time to wildfire response.

【9】 Twitter-COMMs: Detecting Climate, COVID, and Military Multimodal Misinformation 标题:Twitter-COMMs:检测气候、COVID和军事领域的多模态错误信息 链接:https://arxiv.org/abs/2112.08594

作者:Giscard Biamby,Grace Luo,Trevor Darrell,Anna Rohrbach 备注:11 pages, 6 figures 摘要:检测脱离上下文的媒体(例如推特上"图文不符"的图像)通常需要检测两种模态之间的不一致。本文描述了我们针对DARPA语义取证(SemaFor)项目的图像-文本不一致性检测挑战的方法。首先,我们收集了Twitter-COMMs,这是一个包含88.4万条推文的大规模多模态数据集,涉及气候变化、COVID-19和军用车辆等主题。我们基于最先进的CLIP模型训练我们的方法,并利用自动生成的随机负样本和困难负样本。然后在一个隐藏的人工生成评估集上测试我们的方法。我们在项目排行榜上取得了最好的成绩,与零样本CLIP基线相比,在高精度区域的检测率提高了11%。 摘要:Detecting out-of-context media, such as "miscaptioned" images on Twitter, often requires detecting inconsistencies between the two modalities. This paper describes our approach to the Image-Text Inconsistency Detection challenge of the DARPA Semantic Forensics (SemaFor) Program. First, we collect Twitter-COMMs, a large-scale multimodal dataset with 884k tweets relevant to the topics of Climate Change, COVID-19, and Military Vehicles. We train our approach, based on the state-of-the-art CLIP model, leveraging automatically generated random and hard negatives. Our method is then tested on a hidden human-generated evaluation set. We achieve the best result on the program leaderboard, with 11% detection improvement in a high precision regime over a zero-shot CLIP baseline.

【10】 Insta-VAX: A Multimodal Benchmark for Anti-Vaccine and Misinformation Posts Detection on Social Media 标题:Insta-VAX:社交媒体上抗疫苗和错误信息帖子检测的多模态基准 链接:https://arxiv.org/abs/2112.08470

作者:Mingyang Zhou,Mahasweta Chakraborti,Sijia Qian,Zhou Yu,Jingwen Zhang 摘要:在社交媒体上分享抗疫苗帖子(包括错误信息帖子)已被证明会造成混乱,降低公众对疫苗的信心,导致疫苗犹豫和对疫苗的抵触。近年来,此类反疫苗帖子以各种语言和视觉形式在在线网络中迅速增多,对有效的内容审核和追踪提出了巨大挑战。本文扩展了以往利用文本信息理解疫苗信息的工作,介绍了Insta-VAX,这是一个新的多模态数据集,由64,957篇与人类疫苗相关的Instagram帖子组成。我们对该数据集应用了由两位训练有素的专家评委验证的众包注释流程。然后,我们对几个最先进的NLP和计算机视觉分类器进行了基准测试,以检测这些帖子是否表现出抗疫苗态度,以及它们是否包含错误信息。大量的实验和分析表明,多模态模型比单模态模型能更准确地对帖子进行分类,但在视觉上下文理解和外部知识协同方面仍需改进。该数据集和分类器有助于监测和追踪疫苗讨论,以支持社会科学和公共卫生领域应对疫苗错误信息问题的努力。 摘要:Sharing of anti-vaccine posts on social media, including misinformation posts, has been shown to create confusion and reduce the publics confidence in vaccines, leading to vaccine hesitancy and resistance. Recent years have witnessed the fast rise of such anti-vaccine posts in a variety of linguistic and visual forms in online networks, posing a great challenge for effective content moderation and tracking. Extending previous work on leveraging textual information to understand vaccine information, this paper presents Insta-VAX, a new multi-modal dataset consisting of a sample of 64,957 Instagram posts related to human vaccines. We applied a crowdsourced annotation procedure verified by two trained expert judges to this dataset. We then bench-marked several state-of-the-art NLP and computer vision classifiers to detect whether the posts show anti-vaccine attitude and whether they contain misinformation. Extensive experiments and analyses demonstrate the multimodal models can classify the posts more accurately than the uni-modal models, but still need improvement especially on visual context understanding and external knowledge cooperation. The dataset and classifiers contribute to monitoring and tracking of vaccine discussions for social scientific and public health efforts in combating the problem of vaccine misinformation.

分类|识别相关(10篇)

【1】 Decoupling and Recoupling Spatiotemporal Representation for RGB-D-based Motion Recognition 标题:基于RGB-D的运动识别时空表示的解耦与重耦 链接:https://arxiv.org/abs/2112.09129

作者:Benjia Zhou,Pichao Wang,Jun Wan,Yanyan Liang,Fan Wang,Du Zhang,Zhen Lei,Hao Li,Rong Jin 备注:open sourced; codes and models are available:this https URL; transformer-based method 摘要:解耦时空表示是指将时空特征分解为与维度无关的因子。尽管以前基于RGB-D的运动识别方法通过紧密耦合的多模态时空表示获得了令人满意的性能,但它们仍然存在以下问题:(1)由于紧密的时空纠缠建模,在小数据集下难以进行优化;(ii)信息冗余,因为它通常包含大量与分类弱相关的边缘信息;(iii)由于后期融合不足,多模态时空信息之间的交互作用较低。为了缓解这些缺点,我们提出了对基于RGB-D的运动识别的时空表示进行解耦和重耦。具体来说,我们将学习时空表示的任务分解为三个子任务:(1)通过时空解耦建模网络学习高质量和维度无关的特征。(2) 重新耦合解耦表示以建立更强的时空依赖性。(3) 引入跨模态自适应后验融合(CAPF)机制,从RGB-D数据中获取跨模态时空信息。这些新颖设计的无缝结合形成了强健的时空表示,并在四个公共运动数据集上实现了比最先进方法更好的性能。我们的代码可在https://github.com/damo-cv/MotionRGBD. 摘要:Decoupling spatiotemporal representation refers to decomposing the spatial and temporal features into dimension-independent factors. Although previous RGB-D-based motion recognition methods have achieved promising performance through the tightly coupled multi-modal spatiotemporal representation, they still suffer from (i) optimization difficulty under small data setting due to the tightly spatiotemporal-entangled modeling;(ii) information redundancy as it usually contains lots of marginal information that is weakly relevant to classification; and (iii) low interaction between multi-modal spatiotemporal information caused by insufficient late fusion. To alleviate these drawbacks, we propose to decouple and recouple spatiotemporal representation for RGB-D-based motion recognition. Specifically, we disentangle the task of learning spatiotemporal representation into 3 sub-tasks: (1) Learning high-quality and dimension independent features through a decoupled spatial and temporal modeling network. (2) Recoupling the decoupled representation to establish stronger space-time dependency. (3) Introducing a Cross-modal Adaptive Posterior Fusion (CAPF) mechanism to capture cross-modal spatiotemporal information from RGB-D data. Seamless combination of these novel designs forms a robust spatialtemporal representation and achieves better performance than state-of-the-art methods on four public motion datasets. Our code is available at https://github.com/damo-cv/MotionRGBD.
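
为直观说明"时空解耦"的含义,下面给出一个常见的 (2+1)D 式分解草图:把一个 3D 时空卷积拆成先空间、后时间两步建模;这只是示意性的分解方式,并非该论文网络的具体结构。

```python
import torch
import torch.nn as nn

class DecoupledSTConv(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        # 空间卷积只在 H、W 上建模,时间卷积只在 T 上建模
        self.spatial = nn.Conv3d(c_in, c_out, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(c_out, c_out, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                   # x: (B, C, T, H, W)
        x = self.act(self.spatial(x))       # 先只做空间建模
        return self.act(self.temporal(x))   # 再只做时间建模

x = torch.randn(2, 3, 8, 56, 56)
print(DecoupledSTConv(3, 16)(x).shape)      # (2, 16, 8, 56, 56)
```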

【2】 Progressive Graph Convolution Network for EEG Emotion Recognition 标题:渐进图卷积网络在脑电情感识别中的应用 链接:https://arxiv.org/abs/2112.09069

作者:Yijin Zhou,Fu Li,Yang Li,Youshuo Ji,Guangming Shi,Wenming Zheng,Lijian Zhang,Yuanfang Chen,Rui Cheng 备注:11 pages, 5 figures 摘要:神经科学领域的研究揭示了情绪模式与大脑功能区域之间的关系,表明不同大脑区域之间的动态关系是影响通过脑电图(EEG)确定的情绪识别的关键因素。此外,在脑电情感识别中,我们可以观察到,基于相同的脑电数据,粗粒度情感之间比细粒度情感之间存在更清晰的边界;这表明大的粗粒度和小的细粒度情感变化同时存在。因此,从粗粒度到细粒度的渐进分类过程可能有助于脑电情感识别。因此,在本研究中,我们提出了一种渐进图卷积网络(PGCN),用于捕捉EEG情绪信号中的这一固有特征,并逐步学习区分性EEG特征。为了适应不同的脑电模式,我们构建了一个双图模块来描述不同脑电通道之间的内在关系,包含了神经科学研究中大脑区域的动态功能连接和静态空间接近信息。此外,出于对粗粒度和细粒度情绪之间关系的观察,我们采用了一个双头模块,该模块使PGCN能够逐步学习更多区分性EEG特征,从粗粒度(容易)到细粒度类别(困难),参考情绪的层次特征。为了验证我们模型的性能,在两个公共数据集:SEED-IV和多模态生理情绪数据库(MPED)上进行了大量实验。 摘要:Studies in the area of neuroscience have revealed the relationship between emotional patterns and brain functional regions, demonstrating that dynamic relationships between different brain regions are an essential factor affecting emotion recognition determined through electroencephalography (EEG). Moreover, in EEG emotion recognition, we can observe that clearer boundaries exist between coarse-grained emotions than those between fine-grained emotions, based on the same EEG data; this indicates the concurrence of large coarse- and small fine-grained emotion variations. Thus, the progressive classification process from coarse- to fine-grained categories may be helpful for EEG emotion recognition. Consequently, in this study, we propose a progressive graph convolution network (PGCN) for capturing this inherent characteristic in EEG emotional signals and progressively learning the discriminative EEG features. To fit different EEG patterns, we constructed a dual-graph module to characterize the intrinsic relationship between different EEG channels, containing the dynamic functional connections and static spatial proximity information of brain regions from neuroscience research. Moreover, motivated by the observation of the relationship between coarse- and fine-grained emotions, we adopt a dual-head module that enables the PGCN to progressively learn more discriminative EEG features, from coarse-grained (easy) to fine-grained categories (difficult), referring to the hierarchical characteristic of emotion. To verify the performance of our model, extensive experiments were conducted on two public datasets: SEED-IV and multi-modal physiological emotion database (MPED).

【3】 A CNN based method for Sub-pixel Urban Land Cover Classification using Landsat-5 TM and Resourcesat-1 LISS-IV Imagery 标题:基于CNN的Landsat-5 TM和Resourcesat-1 LISS-IV影像城市土地覆盖亚像素分类方法 链接:https://arxiv.org/abs/2112.08841

作者:Krishna Kumar Perikamana,Krishnachandran Balakrishnan,Pratyush Tripathy 备注:29 pages, 14 figures (including appendix), 8 tables (including appendix) 摘要:城市土地覆盖的时间序列数据在分析城市增长模式、不透水地表与植被分布的变化及其对城市微气候的影响方面具有重要的应用价值。由于可免费获取的影像时间序列较长,Landsat数据非常适合此类分析,但传统的逐像素硬分类无法充分发挥Landsat数据的潜力。本文提出了一种利用Landsat-5 TM和Resourcesat-1 LISS-IV传感器时间重叠的亚像素分类方法。我们训练了一个卷积神经网络,从30米分辨率的Landsat-5 TM数据预测土地覆盖比例图。参考土地覆盖比例由2011年班加罗尔5.8米分辨率LISS-IV影像的硬分类结果估算得到。此外,我们使用2009年孟买的数据并与随机森林分类器的结果进行比较,证明了该模型的泛化能力和更优的性能。对于班加罗尔(2011年)和孟买(2009年)的数据,我们的CNN模型在30米单元水平上对建筑和植被比例预测的平均绝对百分比误差均在7.2到11.3之间。与最近大多使用有限空间范围数据进行验证的研究不同,我们的模型使用两个特大城市在两个不同时期的完整空间范围数据进行了训练和验证。因此,它可以从Landsat-5 TM时间序列数据可靠地生成30米分辨率的建筑和植被比例图,用于分析长期城市增长模式。 摘要:Time series data of urban land cover is of great utility in analyzing urban growth patterns, changes in distribution of impervious surface and vegetation and resulting impacts on urban micro climate. While Landsat data is ideal for such analysis due to the long time series of free imagery, traditional per-pixel hard classification fails to yield full potential of the Landsat data. This paper proposes a sub-pixel classification method that leverages the temporal overlap of Landsat-5 TM and Resourcesat-1 LISS-IV sensors. We train a convolutional neural network to predict fractional land cover maps from 30m Landsat-5 TM data. The reference land cover fractions are estimated from a hard-classified 5.8m LISS-IV image for Bengaluru from 2011. Further, we demonstrate the generalizability and superior performance of the proposed model using data for Mumbai from 2009 and comparing it to the results obtained using a Random Forest classifier. For both Bengaluru (2011) and Mumbai (2009) data, Mean Absolute Percentage Error of our CNN model is in the range of 7.2 to 11.3 for both built-up and vegetation fraction prediction at the 30m cell level. Unlike most recent studies where validation is conducted using data for a limited spatial extent, our model has been trained and validated using data for the complete spatial extent of two mega cities for two different time periods. Hence it can reliably generate 30m built-up and vegetation fraction maps from Landsat-5 TM time series data to analyze long term urban growth patterns.

【4】 Pure Noise to the Rescue of Insufficient Data: Improving Imbalanced Classification by Training on Random Noise Images 标题:纯噪声拯救数据不足:通过随机噪声图像训练改进不平衡分类 链接:https://arxiv.org/abs/2112.08810

作者:Shiran Zada,Itay Benou,Michal Irani 摘要:尽管在视觉识别任务方面取得了显著的进展,但当训练数据稀少或高度不平衡时,深层神经网络仍然难以很好地概括,这使得它们极易受到现实世界示例的影响。在本文中,我们提出了一种令人惊讶的简单但高效的方法来缓解这一限制:使用纯噪声图像作为额外的训练数据。与通常使用加性噪声或对抗性噪声进行数据增强不同,我们提出了一种完全不同的观点,即直接在纯随机噪声图像上进行训练。我们提出了一种新的分布感知路由批量规范化层(DAR-BN),该层支持在同一网络中对纯噪声图像和自然图像进行训练。这会鼓励泛化并抑制过度拟合。我们提出的方法显著提高了非平衡分类性能,在大量长尾图像分类数据集(CIFAR-10-LT、CIFAR-100-LT、ImageNet LT、Places LT和CelebA-5)上获得了最先进的结果。此外,我们的方法非常简单,易于作为一种通用的新增强工具(在现有增强的基础上)使用,并且可以合并到任何训练方案中。它不需要任何专门的数据生成或训练程序,从而保持训练的快速和高效 摘要:Despite remarkable progress on visual recognition tasks, deep neural-nets still struggle to generalize well when training data is scarce or highly imbalanced, rendering them extremely vulnerable to real-world examples. In this paper, we present a surprisingly simple yet highly effective method to mitigate this limitation: using pure noise images as additional training data. Unlike the common use of additive noise or adversarial noise for data augmentation, we propose an entirely different perspective by directly training on pure random noise images. We present a new Distribution-Aware Routing Batch Normalization layer (DAR-BN), which enables training on pure noise images in addition to natural images within the same network. This encourages generalization and suppresses overfitting. Our proposed method significantly improves imbalanced classification performance, obtaining state-of-the-art results on a large variety of long-tailed image classification datasets (CIFAR-10-LT, CIFAR-100-LT, ImageNet-LT, Places-LT, and CelebA-5). Furthermore, our method is extremely simple and easy to use as a general new augmentation tool (on top of existing augmentations), and can be incorporated in any training scheme. It does not require any specialized data generation or training procedures, thus keeping training fast and efficient
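
下面给出一个示意草图,说明"为纯噪声图像与自然图像分别维护一套 BN 统计量并按样本来源路由"的基本写法;DAR-BN 的具体设计以论文为准,此处的类名与接口均为本示例假设。

```python
import torch
import torch.nn as nn

class RoutedBatchNorm2d(nn.Module):
    def __init__(self, num_features):
        super().__init__()
        self.bn_natural = nn.BatchNorm2d(num_features)   # 自然图像的统计量
        self.bn_noise = nn.BatchNorm2d(num_features)      # 纯噪声图像的统计量

    def forward(self, x, is_noise):
        """x: (B, C, H, W); is_noise: (B,) 布尔张量,标记该样本是否为纯噪声图像"""
        out = torch.empty_like(x)
        if (~is_noise).any():
            out[~is_noise] = self.bn_natural(x[~is_noise])
        if is_noise.any():
            out[is_noise] = self.bn_noise(x[is_noise])
        return out

x = torch.randn(4, 8, 16, 16)
flag = torch.tensor([False, False, True, True])
print(RoutedBatchNorm2d(8)(x, flag).shape)   # (4, 8, 16, 16)
```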

【5】 Feature Erasing and Diffusion Network for Occluded Person Re-Identification 标题:基于特征擦除和扩散网络的遮挡行人重识别 链接:https://arxiv.org/abs/2112.08740

作者:Zhikang Wang,Feng Zhu,Shixiang Tang,Rui Zhao,Lihuo He,Jiangning Song 备注:10 pages, 5 figures 摘要:遮挡人再识别(ReID)旨在将遮挡人图像与不同摄像机视角下的整体图像进行匹配。目标行人(TP)通常受到非行人遮挡(NPO)和非目标行人(NTP)的干扰。以前的方法主要关注于增强模型对NPO的鲁棒性,而忽略了NTP的特征污染。在本文中,我们提出了一种新的特征擦除和扩散网络(FED)来同时处理NPO和NTP。具体而言,我们提出的遮挡消除模块(OEM)消除了NPO特征,辅助NPO增强策略模拟整体行人图像上的NPO并生成精确的遮挡遮罩。随后,我们将行人表征与其他记忆特征进行扩散,以在特征空间中合成NTP特征,该特征空间由一种新的特征扩散模块(FDM)通过可学习的交叉注意机制实现。在原始设备制造商(OEM)遮挡评分的指导下,特征扩散过程主要在可见人体部位进行,从而保证了合成NTP特征的质量。通过在我们提出的FED网络中联合优化OEM和FDM,我们可以大大提高模型对TP的感知能力,并减轻NPO和NTP的影响。此外,所提出的FDM仅作为训练的辅助模块,在推理阶段将被丢弃,从而引入很少的推理计算开销。对封闭式和整体式个人里德基准测试的实验证明了美联储优于最新技术,美联储在封闭式里德测试中达到86.3%的排名1,超过其他人至少4.7%。 摘要:Occluded person re-identification (ReID) aims at matching occluded person images to holistic ones across different camera views. Target Pedestrians (TP) are usually disturbed by Non-Pedestrian Occlusions (NPO) and NonTarget Pedestrians (NTP). Previous methods mainly focus on increasing model's robustness against NPO while ignoring feature contamination from NTP. In this paper, we propose a novel Feature Erasing and Diffusion Network (FED) to simultaneously handle NPO and NTP. Specifically, NPO features are eliminated by our proposed Occlusion Erasing Module (OEM), aided by the NPO augmentation strategy which simulates NPO on holistic pedestrian images and generates precise occlusion masks. Subsequently, we Subsequently, we diffuse the pedestrian representations with other memorized features to synthesize NTP characteristics in the feature space which is achieved by a novel Feature Diffusion Module (FDM) through a learnable cross attention mechanism. With the guidance of the occlusion scores from OEM, the feature diffusion process is mainly conducted on visible body parts, which guarantees the quality of the synthesized NTP characteristics. By jointly optimizing OEM and FDM in our proposed FED network, we can greatly improve the model's perception ability towards TP and alleviate the influence of NPO and NTP. Furthermore, the proposed FDM only works as an auxiliary module for training and will be discarded in the inference phase, thus introducing little inference computational overhead. Experiments on occluded and holistic person ReID benchmarks demonstrate the superiority of FED over state-of-the-arts, where FED achieves 86.3% Rank-1 accuracy on Occluded-REID, surpassing others by at least 4.7%.

【6】 META: Mimicking Embedding via oThers' Aggregation for Generalizable Person Re-identification 标题:META:通过他人聚合模仿嵌入以实现可泛化的行人重识别 链接:https://arxiv.org/abs/2112.08684

作者:Boqiang Xu,Jian Liang,Lingxiao He,Zhenan Sun 摘要:域概括(DG)人员再识别(ReID)旨在在训练时不访问目标域数据的情况下跨未知域进行测试,这是一个现实但具有挑战性的问题。与为不同领域假设相同模型的方法不同,混合专家(MoE)利用多个领域特定的网络来利用领域之间的互补信息,获得了令人印象深刻的结果。然而,现有的基于MoE的DG-ReID方法随着源域数目的增加,模型尺寸越来越大,并且大多数方法忽略了域不变特性的利用。为了解决上述两个问题,本文提出了一种新的DG-ReID方法,称为通过他人聚合(META)模拟嵌入。为了避免较大的模型尺寸,META专家不为每个源域添加分支网络,而是共享除批处理规范化层之外的所有参数。除了多个专家之外,META还利用实例规范化(IN)并将其引入到全局分支中,以追求跨域的不变特性。同时,META通过标准化统计考虑未知目标样本和源域的相关性,并开发了一个聚合网络来自适应地集成多个专家来模拟未知目标域。得益于提出的一致性损失和一种幕式训练算法,我们可以期望元模拟嵌入一个真正看不见的目标域。大量的实验证明,META大大超过了最先进的DG ReID方法。 摘要:Domain generalizable (DG) person re-identification (ReID) aims to test across unseen domains without access to the target domain data at training time, which is a realistic but challenging problem. In contrast to methods assuming an identical model for different domains, Mixture of Experts (MoE) exploits multiple domain-specific networks for leveraging complementary information between domains, obtaining impressive results. However, prior MoE-based DG ReID methods suffer from a large model size with the increase of the number of source domains, and most of them overlook the exploitation of domain-invariant characteristics. To handle the two issues above, this paper presents a new approach called Mimicking Embedding via oThers' Aggregation (META) for DG ReID. To avoid the large model size, experts in META do not add a branch network for each source domain but share all the parameters except for the batch normalization layers. Besides multiple experts, META leverages Instance Normalization (IN) and introduces it into a global branch to pursue invariant features across domains. Meanwhile, META considers the relevance of an unseen target sample and source domains via normalization statistics and develops an aggregation network to adaptively integrate multiple experts for mimicking unseen target domain. Benefiting from a proposed consistency loss and an episodic training algorithm, we can expect META to mimic embedding for a truly unseen target domain. Extensive experiments verify that META surpasses state-of-the-art DG ReID methods by a large margin.
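
下面的草图演示"多个专家共享除 BN 外的全部参数、每个源域只拥有自己一组 BN 层"的参数共享方式;聚合网络与一致性损失等细节从略,类名与维度均为示例假设,并非 META 的官方实现。

```python
import torch
import torch.nn as nn

class SharedConvPerDomainBN(nn.Module):
    def __init__(self, c_in, c_out, num_domains):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, padding=1)   # 卷积参数在所有域之间共享
        self.bns = nn.ModuleList(nn.BatchNorm2d(c_out) for _ in range(num_domains))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x, domain_idx):
        # 仅 BN 层按源域区分,其余参数共享,避免模型随源域数线性增大
        return self.act(self.bns[domain_idx](self.conv(x)))

block = SharedConvPerDomainBN(3, 16, num_domains=3)
x = torch.randn(4, 3, 32, 32)
print(block(x, domain_idx=1).shape)   # 来自第 1 个源域的 batch: (4, 16, 32, 32)
```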

【7】 Analysis and Evaluation of Kinect-based Action Recognition Algorithms 标题:基于Kinect的动作识别算法分析与评价 链接:https://arxiv.org/abs/2112.08626

作者:Lei Wang 备注:Master's thesis, 22 pages 摘要:人体动作识别在不同的领域得到了广泛的应用,但仍然存在许多具有挑战性的问题,如不同的视点、遮挡、光照条件、人体大小和动作执行速度等。为了应对这些挑战,开发了Kinect深度传感器来记录实时深度序列,该序列对人类衣服的颜色和照明条件不敏感。文献中已经报道了许多识别人类行为的方法,如HON4D、HOPC、RBD和HDG,它们分别使用4D表面法线、点云、基于骨架的模型和深度梯度来从深度视频或骨架数据中捕获鉴别信息。在本研究项目中,将使用五个基准数据集对上述四种算法的性能进行分析和评估,这些数据集涵盖了噪声、视点变化、背景杂波和遮挡等具有挑战性的问题。我们还实现并改进了HDG算法,并使用UWA3D多视图活动数据集将其应用于交叉视图动作识别。此外,我们在HDG中使用不同的特征向量组合进行性能评估。实验结果表明,我们改进的HDG算法优于其他三种最先进的交叉视角动作识别算法。 摘要:Human action recognition still exists many challenging problems such as different viewpoints, occlusion, lighting conditions, human body size and the speed of action execution, although it has been widely used in different areas. To tackle these challenges, the Kinect depth sensor has been developed to record real time depth sequences, which are insensitive to the color of human clothes and illumination conditions. Many methods on recognizing human action have been reported in the literature such as HON4D, HOPC, RBD and HDG, which use the 4D surface normals, pointclouds, skeleton-based model and depth gradients respectively to capture discriminative information from depth videos or skeleton data. In this research project, the performance of four aforementioned algorithms will be analyzed and evaluated using five benchmark datasets, which cover challenging issues such as noise, change of viewpoints, background clutters and occlusions. We also implemented and improved the HDG algorithm, and applied it in cross-view action recognition using the UWA3D Multiview Activity dataset. Moreover, we used different combinations of individual feature vectors in HDG for performance evaluation. The experimental results show that our improvement of HDG outperforms other three state-of-the-art algorithms for cross-view action recognition.

【8】 Rethinking Nearest Neighbors for Visual Classification 标题:视觉分类中最近邻问题的再思考 链接:https://arxiv.org/abs/2112.08459

作者:Menglin Jia,Bor-Chun Chen,Zuxuan Wu,Claire Cardie,Serge Belongie,Ser-Nam Lim 摘要:神经网络分类器已成为当前“先训练后微调”视觉分类范式的事实选择。在本文中,我们研究了$k$-最近邻(k-NN)分类器,这是前深度学习时代的一种经典的无模型学习方法,作为对现代基于神经网络方法的补充。作为一种惰性学习方法,k-NN简单地将测试图像与训练集中的top-k邻域之间的距离进行聚合。我们采用k-NN,通过监督或自我监督的方法在两个步骤中生成预训练的视觉表示:(1)利用k-NN预测概率作为训练过程中容易或困难示例的指示。(2) 将k-NN预测分布与增广分类器的分布进行线性插值。通过对大量分类任务的大量实验,我们的研究揭示了k-NN集成的通用性和灵活性,并提供了额外的见解:(1)k-NN获得了有竞争力的结果,有时甚至优于标准线性分类器。(2) 结合k-NN对于参数分类器性能较差和/或数据量较低的任务尤其有利。我们希望这些发现将鼓励人们重新思考深度学习前的作用,这是计算机视觉中的经典方法。我们的代码可从以下网址获得:https://github.com/KMnP/nn-revisit. 摘要:Neural network classifiers have become the de-facto choice for current "pre-train then fine-tune" paradigms of visual classification. In this paper, we investigate $k$-Nearest-Neighbor (k-NN) classifiers, a classical model-free learning method from the pre-deep learning era, as an augmentation to modern neural network based approaches. As a lazy learning method, k-NN simply aggregates the distance between the test image and top-k neighbors in a training set. We adopt k-NN with pre-trained visual representations produced by either supervised or self-supervised methods in two steps: (1) Leverage k-NN predicted probabilities as indications for easy \vs~hard examples during training. (2) Linearly interpolate the k-NN predicted distribution with that of the augmented classifier. Via extensive experiments on a wide range of classification tasks, our study reveals the generality and flexibility of k-NN integration with additional insights: (1) k-NN achieves competitive results, sometimes even outperforming a standard linear classifier. (2) Incorporating k-NN is especially beneficial for tasks where parametric classifiers perform poorly and / or in low-data regimes. We hope these discoveries will encourage people to rethink the role of pre-deep learning, classical methods in computer vision. Our code is available at: https://github.com/KMnP/nn-revisit.
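
下面给出一个极简草图,演示如何用预训练特征做 k-NN 预测,并将其与线性分类器的 softmax 分布做线性插值;其中温度、k 与插值系数均为假设值,仅用于说明思路。

```python
import torch
import torch.nn.functional as F

def knn_probs(query, bank_feat, bank_label, num_classes, k=5, T=0.07):
    """query: (B, D); bank_feat: (N, D) 训练集特征库; bank_label: (N,) 对应标签"""
    q = F.normalize(query, dim=1)
    b = F.normalize(bank_feat, dim=1)
    sim = q @ b.T                                        # 余弦相似度 (B, N)
    val, idx = sim.topk(k, dim=1)
    w = F.softmax(val / T, dim=1)                        # 按相似度对近邻加权
    onehot = F.one_hot(bank_label[idx], num_classes).float()   # (B, k, C)
    return (w.unsqueeze(-1) * onehot).sum(dim=1)         # (B, C)

def fused_prediction(classifier_logits, query, bank_feat, bank_label, alpha=0.5):
    p_cls = F.softmax(classifier_logits, dim=1)
    p_knn = knn_probs(query, bank_feat, bank_label, p_cls.size(1))
    return alpha * p_cls + (1 - alpha) * p_knn           # 线性插值后的最终分布

feat = torch.randn(8, 128)
bank = torch.randn(1000, 128)
labels = torch.randint(0, 10, (1000,))
logits = torch.randn(8, 10)
print(fused_prediction(logits, feat, bank, labels).shape)   # (8, 10)
```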

【9】 Classification of diffraction patterns using a convolutional neural network in single particle imaging experiments performed at X-ray free-electron lasers 标题:X射线自由电子激光器单粒子成像实验中用卷积神经网络分类衍射图 链接:https://arxiv.org/abs/2112.09020

作者:Dameli Assalauova,Alexandr Ignatenko,Fabian Isensee,Sergey Bobkov,Darya Trofimova,Ivan A. Vartanyants 备注:Main text: 28 pages, 7 figures, Supporting Information: 12 pages, 6 figures 摘要:X射线自由电子激光器(XFELs)的单粒子成像(SPI)特别适合于确定粒子在其自然环境中的三维结构。为了成功地重建,必须从大量获得的衍射图案中分离出一次击中产生的衍射图案。我们建议将此任务描述为一个图像分类问题,并使用卷积神经网络(CNN)结构解决它。开发了两种CNN配置:一种最大化F1分数,另一种强调高回忆。我们还将CNN与期望最大化(EM)选择以及大小过滤相结合。我们观察到,与我们先前工作中使用的EM选择相比,CNN选择的功率谱密度函数的对比度较低。然而,我们基于CNN的选择的重建给出了类似的结果。将CNN引入SPI实验可以简化重建流程,使研究人员能够动态地对模式进行分类,从而使他们能够严格控制实验的持续时间。我们认为,在描述良好的SPI分析工作流中引入基于非标准人工智能(AI)的解决方案可能有利于SPI实验的未来发展。 摘要:Single particle imaging (SPI) at X-ray free electron lasers (XFELs) is particularly well suited to determine the 3D structure of particles in their native environment. For a successful reconstruction, diffraction patterns originating from a single hit must be isolated from a large number of acquired patterns. We propose to formulate this task as an image classification problem and solve it using convolutional neural network (CNN) architectures. Two CNN configurations are developed: one that maximises the F1-score and one that emphasises high recall. We also combine the CNNs with expectation maximization (EM) selection as well as size filtering. We observed that our CNN selections have lower contrast in power spectral density functions relative to the EM selection, used in our previous work. However, the reconstruction of our CNN-based selections gives similar results. Introducing CNNs into SPI experiments allows streamlining the reconstruction pipeline, enables researchers to classify patterns on the fly, and, as a consequence, enables them to tightly control the duration of their experiments. We think that bringing non-standard artificial intelligence (AI) based solutions in a well-described SPI analysis workflow may be beneficial for the future development of the SPI experiments.

【10】 Classification Under Ambiguity: When Is Average-K Better Than Top-K? 标题:歧义下的分类:Average-K何时优于Top-K? 链接:https://arxiv.org/abs/2112.08851

作者:Titouan Lorieul,Alexis Joly,Dennis Shasha 备注:53 pages, 21 figures 摘要:当可能有多个标签时,选择单个标签可能会导致精度低。一个常见的替代方法,称为top-$K$分类,是选择一些数字$K$(通常约为5),并返回得分最高的$K$标签。不幸的是,对于明确的情况,$K>1$太多,对于非常模糊的情况,$K\leq 5$(例如)可能太小。另一种明智的策略是使用自适应方法,其中返回的标签数量随计算模糊度的函数而变化,但必须在所有样本上平均到某个特定的$K$。我们表示该替代平均值-$K$分类。本文正式描述了当平均$K$分类比固定顶部$K$分类能够获得更低的错误率时的模糊度分布。此外,它为固定大小和自适应分类器提供了自然的估计过程,并证明了它们的一致性。最后,它报告了对真实世界图像数据集的实验,揭示了在实践中平均$K$分类比最高$K$分类的好处。总的来说,当模糊度被精确地知道时,平均值-$K$永远不会比最高值-$K$差,而且在我们的实验中,当它被估计时,这也成立。 摘要:When many labels are possible, choosing a single one can lead to low precision. A common alternative, referred to as top-$K$ classification, is to choose some number $K$ (commonly around 5) and to return the $K$ labels with the highest scores. Unfortunately, for unambiguous cases, $K>1$ is too many and, for very ambiguous cases, $K \leq 5$ (for example) can be too small. An alternative sensible strategy is to use an adaptive approach in which the number of labels returned varies as a function of the computed ambiguity, but must average to some particular $K$ over all the samples. We denote this alternative average-$K$ classification. This paper formally characterizes the ambiguity profile when average-$K$ classification can achieve a lower error rate than a fixed top-$K$ classification. Moreover, it provides natural estimation procedures for both the fixed-size and the adaptive classifier and proves their consistency. Finally, it reports experiments on real-world image data sets revealing the benefit of average-$K$ classification over top-$K$ in practice. Overall, when the ambiguity is known precisely, average-$K$ is never worse than top-$K$, and, in our experiments, when it is estimated, this also holds.
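
下面的草图演示 average-K 的一种直观做法:在验证集上选一个全局概率阈值,使平均返回标签数约等于 K,从而让模糊样本返回更多标签、明确样本返回更少;阈值选取方式为本示例假设,并非论文中的估计过程。

```python
import numpy as np

def calibrate_threshold(val_probs, target_k):
    """val_probs: (N, C) 各样本的类别概率;返回使平均返回标签数约为 target_k 的阈值"""
    flat = np.sort(val_probs.ravel())[::-1]
    n = val_probs.shape[0]
    # 取第 n*target_k 大的概率作为阈值,则平均每个样本约返回 target_k 个标签
    return flat[min(int(n * target_k), flat.size - 1)]]

def average_k_predict(probs, threshold):
    # 每个样本返回数量可变的标签集合,但整体平均约为 K
    return [np.where(p >= threshold)[0] for p in probs]

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(20) * 0.3, size=100)   # 模拟带歧义的预测分布
thr = calibrate_threshold(probs, target_k=5)
preds = average_k_predict(probs, thr)
print(np.mean([len(p) for p in preds]))              # 平均标签数应接近 5
```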

分割|语义相关(8篇)

【1】 HODOR: High-level Object Descriptors for Object Re-segmentation in Video Learned from Static Images 标题:HODOR:从静态图像中学习的用于视频对象再分割的高级对象描述符 链接:https://arxiv.org/abs/2112.09131

作者:Ali Athar,Jonathon Luiten,Alexander Hermans,Deva Ramanan,Bastian Leibe 摘要:现有最先进的视频对象分割(VOS)方法学习帧之间的低级别像素到像素的对应关系,以便在视频中传播对象遮罩。这需要大量密集注释的视频数据,注释成本很高,并且由于视频中的帧高度相关,因此在很大程度上是冗余的。有鉴于此,我们提出了HODOR:一种通过有效利用带注释的静态图像来理解对象外观和场景上下文来解决VOS的新方法。我们将图像帧中的对象实例和场景信息编码为健壮的高级描述符,然后使用这些描述符在不同帧中重新分割这些对象。因此,与未经视频注释训练的现有方法相比,HODOR在DAVIS和YouTube VOS基准上实现了最先进的性能。在没有任何架构修改的情况下,HODOR还可以利用循环一致性从单个带注释的视频帧周围的视频上下文中学习,而其他方法则依赖于密集的、时间一致的注释。 摘要:Existing state-of-the-art methods for Video Object Segmentation (VOS) learn low-level pixel-to-pixel correspondences between frames to propagate object masks across video. This requires a large amount of densely annotated video data, which is costly to annotate, and largely redundant since frames within a video are highly correlated. In light of this, we propose HODOR: a novel method that tackles VOS by effectively leveraging annotated static images for understanding object appearance and scene context. We encode object instances and scene information from an image frame into robust high-level descriptors which can then be used to re-segment those objects in different frames. As a result, HODOR achieves state-of-the-art performance on the DAVIS and YouTube-VOS benchmarks compared to existing methods trained without video annotations. Without any architectural modification, HODOR can also learn from video context around single annotated video frames by utilizing cyclic consistency, whereas other methods rely on dense, temporally consistent annotations.

【2】 Neural Style Transfer and Unpaired Image-to-Image Translation to deal with the Domain Shift Problem on Spheroid Segmentation 标题:利用神经风格迁移与非成对图像到图像翻译应对球状体分割中的域偏移问题 链接:https://arxiv.org/abs/2112.09043

作者:Manuel García-Domínguez,César Domínguez,Jónathan Heras,Eloy Mata,Vico Pascual 摘要:背景和目标。域转移是机器学习模型的一个推广问题,当训练集的数据分布与模型部署时遇到的数据分布不同时,就会出现这种问题。由于实验条件、设备和捕获设置的变化,这在生物医学图像分割中很常见。在这项工作中,我们通过研究肿瘤球体分割背景下的神经风格转换算法和未配对图像到图像的转换方法来应对这一挑战。方法。我们已经用4种深度学习分割模型说明了球体分割中的域转移问题,当使用训练分布后的图像进行测试时,这些模型的IoU超过97%,但当应用于在不同条件下捕获的图像时,其性能下降到84%。为了解决这个问题,我们探索了3种风格转换算法(NST、深度图像类比和STROTSS)和6种未配对图像到图像转换算法(CycleGAN、DualGAN、ForkGAN、GANILLA、CUT和FastCUT)。这些算法已集成到一个高级API中,该API有助于将它们应用到发生域转移问题的其他上下文中。后果通过使用样式转换和图像到图像的转换算法,我们将这4种分割模型应用于在不同条件下捕获的图像,大大提高了性能。特别是,有2种样式转换算法(NST和深度图像模拟)和1种未配对图像到图像转换算法(CycleGAN),可在0.24到76.07的范围内改进模型的IoU。因此,达到与使用模型获得的性能相似的性能将应用于遵循训练分布的图像。 摘要:Background and objectives. Domain shift is a generalisation problem of machine learning models that occurs when the data distribution of the training set is different to the data distribution encountered by the model when it is deployed. This is common in the context of biomedical image segmentation due to the variance of experimental conditions, equipment, and capturing settings. In this work, we address this challenge by studying both neural style transfer algorithms and unpaired image-to-image translation methods in the context of the segmentation of tumour spheroids. Methods. We have illustrated the domain shift problem in the context of spheroid segmentation with 4 deep learning segmentation models that achieved an IoU over 97% when tested with images following the training distribution, but whose performance decreased up to an 84\% when applied to images captured under different conditions. In order to deal with this problem, we have explored 3 style transfer algorithms (NST, deep image analogy, and STROTSS), and 6 unpaired image-to-image translations algorithms (CycleGAN, DualGAN, ForkGAN, GANILLA, CUT, and FastCUT). These algorithms have been integrated into a high-level API that facilitates their application to other contexts where the domain-shift problem occurs. Results. We have considerably improved the performance of the 4 segmentation models when applied to images captured under different conditions by using both style transfer and image-to-image translation algorithms. In particular, there are 2 style transfer algorithms (NST and deep image analogy) and 1 unpaired image-to-image translations algorithm (CycleGAN) that improve the IoU of the models in a range from 0.24 to 76.07. Therefore, reaching a similar performance to the one obtained with the models are applied to images following the training distribution.

【3】 Activation Modulation and Recalibration Scheme for Weakly Supervised Semantic Segmentation 标题:弱监督语义分割的激活调制和重校准方案 链接:https://arxiv.org/abs/2112.08996

作者:Jie Qin,Jie Wu,Xuefeng Xiao,Lujun Li,Xingang Wang 备注:Accepted by AAAI2022 摘要:图像级弱监督语义分割(WSSS)是一项基本但极具挑战性的计算机视觉任务,有助于场景理解和自动驾驶。现有的方法大多采用基于分类的类激活图(CAMs)作为初始的伪标签,这些伪标签往往集中在有区别的图像区域,并且缺乏用于分割任务的定制特征。为了缓解这一问题,我们提出了一种新的激活调制和再校准(AMR)方案,该方案利用聚光灯分支和补偿分支获得加权CAM,从而提供再校准监督和任务特定概念。具体地说,注意调制模块(AMM)用于从信道空间顺序的角度重新安排特征重要性的分布,这有助于显式地建模信道相关性和空间编码,以自适应地调制面向分段的激活响应。此外,我们还针对双分支引入了一种交叉伪监督机制,它可以被视为一种语义相似的正则化机制来相互细化两个分支。大量的实验表明,AMR在PASCAL VOC 2012数据集上建立了一种新的最先进的性能,不仅超过了当前使用图像监督级别训练的方法,还超过了一些依赖于更强监督的方法,如显著性标签。实验还表明,我们的方案是即插即用的,可以与其他方法结合以提高性能。 摘要:Image-level weakly supervised semantic segmentation (WSSS) is a fundamental yet challenging computer vision task facilitating scene understanding and automatic driving. Most existing methods resort to classification-based Class Activation Maps (CAMs) to play as the initial pseudo labels, which tend to focus on the discriminative image regions and lack customized characteristics for the segmentation task. To alleviate this issue, we propose a novel activation modulation and recalibration (AMR) scheme, which leverages a spotlight branch and a compensation branch to obtain weighted CAMs that can provide recalibration supervision and task-specific concepts. Specifically, an attention modulation module (AMM) is employed to rearrange the distribution of feature importance from the channel-spatial sequential perspective, which helps to explicitly model channel-wise interdependencies and spatial encodings to adaptively modulate segmentation-oriented activation responses. Furthermore, we introduce a cross pseudo supervision for dual branches, which can be regarded as a semantic similar regularization to mutually refine two branches. Extensive experiments show that AMR establishes a new state-of-the-art performance on the PASCAL VOC 2012 dataset, surpassing not only current methods trained with the image-level of supervision but also some methods relying on stronger supervision, such as saliency label. Experiments also reveal that our scheme is plug-and-play and can be incorporated with other approaches to boost their performance.

【4】 Slot-VPS: Object-centric Representation Learning for Video Panoptic Segmentation 标题:Slot-VPS:视频全景分割中的以对象为中心的表示学习 链接:https://arxiv.org/abs/2112.08949

作者:Yi Zhou,Hui Zhang,Hana Lee,Shuyang Sun,Pingjun Li,Yangguang Zhu,ByungIn Yoo,Xiaojuan Qi,Jae-Joon Han 摘要:视频全景分割(VPS)的目的是为每个像素分配一个类别标签,在所有帧中唯一地分割和识别所有对象实例。经典解决方案通常将VPS任务分解为多个子任务,并使用多个代理(例如框和遮罩、中心和偏移)来表示对象。然而,这种分而治之的策略需要在空间和时间域进行复杂的后处理,并且容易受到代理任务失败的影响。在本文中,受以对象为中心的学习的启发,我们学习紧凑而健壮的对象表示,我们提出了Slot-VPS,这是该任务的第一个端到端框架。我们对视频中的所有全景实体进行编码,包括前景实例和背景语义,并使用一种称为全景插槽的统一表示。所提出的视频全景检索器将相干时空对象的信息检索并编码到全景时隙中,使其能够以统一的方式定位、分割、区分和关联对象。最后,输出的全景窗口可以直接转换为视频中全景对象的类、掩码和对象ID。我们进行了广泛的消融研究,并在两个基准数据集Cityscapes VPS(\textit{val}和测试集)和VIPER(\textit{val}集)上证明了我们方法的有效性,分别达到了63.7、63.3和56.2 VPQ的最新性能。 摘要:Video Panoptic Segmentation (VPS) aims at assigning a class label to each pixel, uniquely segmenting and identifying all object instances consistently across all frames. Classic solutions usually decompose the VPS task into several sub-tasks and utilize multiple surrogates (e.g. boxes and masks, centres and offsets) to represent objects. However, this divide-and-conquer strategy requires complex post-processing in both spatial and temporal domains and is vulnerable to failures from surrogate tasks. In this paper, inspired by object-centric learning which learns compact and robust object representations, we present Slot-VPS, the first end-to-end framework for this task. We encode all panoptic entities in a video, including both foreground instances and background semantics, with a unified representation called panoptic slots. The coherent spatio-temporal object's information is retrieved and encoded into the panoptic slots by the proposed Video Panoptic Retriever, enabling it to localize, segment, differentiate, and associate objects in a unified manner. Finally, the output panoptic slots can be directly converted into the class, mask, and object ID of panoptic objects in the video. We conduct extensive ablation studies and demonstrate the effectiveness of our approach on two benchmark datasets, Cityscapes-VPS (\textit{val} and test sets) and VIPER (\textit{val} set), achieving new state-of-the-art performance of 63.7, 63.3 and 56.2 VPQ, respectively.

【5】 Search for temporal cell segmentation robustness in phase-contrast microscopy videos 标题:相衬显微镜视频中时间细胞分割鲁棒性的研究 链接:https://arxiv.org/abs/2112.08817

作者:Estibaliz Gómez-de-Mariscal,Hasini Jayatilaka,Özgün Çiçek,Thomas Brox,Denis Wirtz,Arrate Muñoz-Barrutia 摘要:研究细胞形态随时间的变化对于理解细胞迁移机制至关重要。在这项工作中,我们提出了一个基于深度学习的工作流程,以分割嵌入三维胶原基质中的癌细胞,并用相差显微镜成像。我们的方法使用转移学习和循环卷积长短时记忆单元来利用过去的时间信息,并提供一致的分割结果。最后,我们提出了一种研究癌细胞形态的几何表征方法。我们的方法在时间上提供了稳定的结果,并且对不同的权重初始化或训练数据采样具有鲁棒性。我们引入了一个新的用于二维细胞分割和跟踪的带注释数据集,以及一个开源实现来复制实验或使其适应新的图像处理问题。 摘要:Studying cell morphology changes in time is critical to understanding cell migration mechanisms. In this work, we present a deep learning-based workflow to segment cancer cells embedded in 3D collagen matrices and imaged with phase-contrast microscopy. Our approach uses transfer learning and recurrent convolutional long-short term memory units to exploit the temporal information from the past and provide a consistent segmentation result. Lastly, we propose a geometrical-characterization approach to studying cancer cell morphology. Our approach provides stable results in time, and it is robust to the different weight initialization or training data sampling. We introduce a new annotated dataset for 2D cell segmentation and tracking, and an open-source implementation to replicate the experiments or adapt them to new image processing problems.

【6】 Dense Video Captioning Using Unsupervised Semantic Information 标题:基于无监督语义信息的密集视频字幕 链接:https://arxiv.org/abs/2112.08455

作者:Valter Estevam,Rayson Laroca,Helio Pedrini,David Menotti 摘要:我们介绍了一种学习无监督语义视觉信息的方法,其前提是复杂事件(例如分钟)可以分解为简单事件(例如几秒钟),并且这些简单事件在多个复杂事件之间共享。我们将长视频分割成短帧序列,用三维卷积神经网络提取它们的潜在表示。聚类方法用于对产生可视码本的表示进行分组(即,长视频由聚类标签给出的整数序列表示)。通过编码码本条目的共现概率矩阵来学习稠密表示。我们将演示此表示如何在仅具有视觉功能的场景中利用密集视频字幕任务的性能。作为这种方法的结果,我们能够在双模变换(BMT)方法中替换音频信号,并产生具有可比性能的时序方案。此外,与只探索视觉特征的方法相比,我们采用普通变换方法将视觉信号与我们的描述符连接起来,以实现字幕显示方面的最先进性能,以及与多模态方法的竞争性能。我们的代码可在https://github.com/valterlej/dvcusi. 摘要:We introduce a method to learn unsupervised semantic visual information based on the premise that complex events (e.g., minutes) can be decomposed into simpler events (e.g., a few seconds), and that these simple events are shared across several complex events. We split a long video into short frame sequences to extract their latent representation with three-dimensional convolutional neural networks. A clustering method is used to group representations producing a visual codebook (i.e., a long video is represented by a sequence of integers given by the cluster labels). A dense representation is learned by encoding the co-occurrence probability matrix for the codebook entries. We demonstrate how this representation can leverage the performance of the dense video captioning task in a scenario with only visual features. As a result of this approach, we are able to replace the audio signal in the Bi-Modal Transformer (BMT) method and produce temporal proposals with comparable performance. Furthermore, we concatenate the visual signal with our descriptor in a vanilla transformer method to achieve state-of-the-art performance in captioning compared to the methods that explore only visual features, as well as a competitive performance with multi-modal methods. Our code is available at https://github.com/valterlej/dvcusi.
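
下面给出一个极简草图,演示"把短片段特征聚类成视觉码本,再统计码本条目的共现概率矩阵"的流程;聚类数、共现窗口等均为假设,并非论文官方实现。

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(clip_feats, num_words=64, seed=0):
    """clip_feats: (N, D) 由 3D CNN 提取的短片段潜在表示"""
    km = KMeans(n_clusters=num_words, random_state=seed, n_init=10).fit(clip_feats)
    return km            # 之后一段长视频可表示为聚类标签组成的整数序列

def cooccurrence(label_seq, num_words, window=2):
    # 统计整数序列中码本条目在局部窗口内的共现次数,并归一化为概率矩阵
    co = np.zeros((num_words, num_words), dtype=np.float64)
    for i, a in enumerate(label_seq):
        for j in range(max(0, i - window), min(len(label_seq), i + window + 1)):
            if j != i:
                co[a, label_seq[j]] += 1
    return co / max(co.sum(), 1.0)

feats = np.random.rand(500, 256)
km = build_codebook(feats, num_words=16)
seq = km.predict(np.random.rand(120, 256))   # 一段长视频 -> 整数序列
print(cooccurrence(seq, 16).shape)           # (16, 16)
```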

【7】 Quality monitoring of federated Covid-19 lesion segmentation 标题:联邦式Covid-19病变分割的质量监测 链接:https://arxiv.org/abs/2112.08974

作者:Camila Gonzalez,Christian Harder,Amin Ranem,Ricarda Fischbach,Isabel Kaltenborn,Armin Dadras,Andreas Bucher,Anirban Mukhopadhyay 摘要:联邦学习是训练健壮的深度学习模型以分割胸部CT中Covid-19相关发现的最有希望的方法。通过去中心化的学习方式,可以利用来自不同来源和采集协议的异构数据,同时确保患者隐私。然而,持续监控模型的性能至关重要。当涉及弥漫性肺部病变的分割时,快速目视检查不足以评估质量,而由放射科专家对所有网络输出进行彻底检查也不可行。在这项工作中,我们提出了一组轻量级指标,可在每家医院本地计算,再汇总用于联邦系统的集中监控。我们的线性模型在分布外数据集上检测出70%以上的低质量分割结果,从而能够可靠地提示模型性能的下降。 摘要:Federated Learning is the most promising way to train robust Deep Learning models for the segmentation of Covid-19-related findings in chest CTs. By learning in a decentralized fashion, heterogeneous data can be leveraged from a variety of sources and acquisition protocols whilst ensuring patient privacy. It is, however, crucial to continuously monitor the performance of the model. Yet when it comes to the segmentation of diffuse lung lesions, a quick visual inspection is not enough to assess the quality, and thorough monitoring of all network outputs by expert radiologists is not feasible. In this work, we present an array of lightweight metrics that can be calculated locally in each hospital and then aggregated for central monitoring of a federated system. Our linear model detects over 70% of low-quality segmentations on an out-of-distribution dataset and thus reliably signals a decline in model performance.

【8】 Automated segmentation of 3-D body composition on computed tomography 标题:基于CT的三维体成分自动分割 链接:https://arxiv.org/abs/2112.08968

作者:Lucy Pu,Syed F. Ashraf,Naciye S Gezer,Iclal Ocak,Rajeev Dhupar 摘要:目的:开发并验证一种计算机工具,用于自动同时分割计算机断层扫描(CT)显示的以下组织的身体成分:内脏脂肪(VAT)、皮下脂肪(SAT)、肌间脂肪(IMAT)、骨骼肌(SM)和骨骼。方法:使用从肿瘤影像档案(TCIA)获得的100个CT扫描队列-50个全身正电子发射断层扫描(PET)-CT,25个胸部和25个腹部。手动注释五种不同的身体成分(VAT、SAT、IMAT、SM和骨骼)。为了提高效率,采用了边训练边注释的策略。UNet模型使用已注释的案例进行训练。然后,使用该模型对其余案例进行半自动注释。使用10倍交叉验证方法来开发和验证几种卷积神经网络(CNN)的性能,包括UNet、递归剩余UNet(R2Unet)和UNet++。在训练CNN模型时,使用了三维面片采样操作。对单独训练的CNN模型进行测试,看它们是否比联合分割它们能获得更好的性能。配对样本t检验用于检验统计显著性。结果:在三个CNN模型中,UNet在联合分割五种身体成分方面表现出最佳的整体性能,骰子系数为0.840+/-0.091、0.908+/-0.067、0.603+/-0.084、0.889+/-0.027和0.884+/-0.031,Jaccard指数为0.734+/-0.119、0.837+/-0.096、0.437+/-0.082、0.800+/-0.042、0.793+/-0.049,分别用于VAT、SAT、IMAT、SM和bone。结论:CNN模型在分割身体成分方面没有显著差异,但联合分割身体成分比单独分割效果更好。 摘要:Purpose: To develop and validate a computer tool for automatic and simultaneous segmentation of body composition depicted on computed tomography (CT) scans for the following tissues: visceral adipose (VAT), subcutaneous adipose (SAT), intermuscular adipose (IMAT), skeletal muscle (SM), and bone. Approach: A cohort of 100 CT scans acquired from The Cancer Imaging Archive (TCIA) was used - 50 whole-body positron emission tomography (PET)-CTs, 25 chest, and 25 abdominal. Five different body compositions were manually annotated (VAT, SAT, IMAT, SM, and bone). A training-while-annotating strategy was used for efficiency. The UNet model was trained using the already annotated cases. Then, this model was used to enable semi-automatic annotation for the remaining cases. The 10-fold cross-validation method was used to develop and validate the performance of several convolutional neural networks (CNNs), including UNet, Recurrent Residual UNet (R2Unet), and UNet++. A 3-D patch sampling operation was used when training the CNN models. The separately trained CNN models were tested to see if they could achieve a better performance than segmenting them jointly. Paired-samples t-test was used to test for statistical significance. Results: Among the three CNN models, UNet demonstrated the best overall performance in jointly segmenting the five body compositions with a Dice coefficient of 0.840+/-0.091, 0.908+/-0.067, 0.603+/-0.084, 0.889+/-0.027, and 0.884+/-0.031, and a Jaccard index of 0.734+/-0.119, 0.837+/-0.096, 0.437+/-0.082, 0.800+/-0.042, 0.793+/-0.049, respectively for VAT, SAT, IMAT, SM, and bone. Conclusion: There were no significant differences among the CNN models in segmenting body composition, but jointly segmenting body compositions achieved a better performance than segmenting them separately.
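
该摘要以 Dice 系数与 Jaccard 指数评估分割质量,下面给出这两种指标基于二值掩膜的一个简短参考实现(仅作说明,与论文代码无关)。

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice = 2|A∩B| / (|A| + |B|),pred 与 target 为同尺寸二值掩膜。"""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def jaccard_index(pred, target, eps=1e-7):
    """Jaccard = |A∩B| / |A∪B|。"""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (inter + eps) / (union + eps)

# 示例:随机生成的三维掩膜
rng = np.random.default_rng(0)
a, b = rng.random((64, 64, 64)) > 0.5, rng.random((64, 64, 64)) > 0.5
print(dice_coefficient(a, b), jaccard_index(a, b))
```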

Zero/Few Shot|迁移|域适配|自适应(2篇)

【1】 UMAD: Universal Model Adaptation under Domain and Category Shift 标题:UMAD:领域和类别转换下的通用模型适应 链接:https://arxiv.org/abs/2112.08553

作者:Jian Liang,Dapeng Hu,Jiashi Feng,Ran He 摘要:学习拒绝目标域中的未知样本(源类中不存在)对于无监督域自适应(UDA)非常重要。存在两种典型的UDA场景,即开放集和开放部分集,后者假设并非所有源类都出现在目标域中。然而,大多数以前的方法都是为一个UDA场景设计的,并且在另一个UDA场景中的性能总是很差。此外,它们在适应过程中还需要标记的源数据,这限制了它们在数据隐私敏感应用程序中的可用性。为了解决这些问题,本文提出了一个通用模型适应(UMAD)框架,该框架既可以处理UDA场景,又不需要访问源数据,也不需要事先了解域之间的类别转换。具体来说,我们的目标是学习一个源模型和一个精心设计的双头分类器,并将其提供给目标域。在适应过程中,我们开发了一个信息一致性评分,以帮助区分未知样本和已知样本。为了在目标域实现双边自适应,我们进一步最大化局部互信息,使已知样本与源分类器对齐,并分别利用熵损失将未知样本推离源分类边界。在开放集和开放部分集UDA场景上的实验表明,UMAD作为一种不访问源数据的统一方法,其性能与最先进的依赖数据的方法相当,甚至更高。 摘要:Learning to reject unknown samples (not present in the source classes) in the target domain is fairly important for unsupervised domain adaptation (UDA). There exist two typical UDA scenarios, i.e., open-set, and open-partial-set, and the latter assumes that not all source classes appear in the target domain. However, most prior methods are designed for one UDA scenario and always perform badly on the other UDA scenario. Moreover, they also require the labeled source data during adaptation, limiting their usability in data privacy-sensitive applications. To address these issues, this paper proposes a Universal Model ADaptation (UMAD) framework which handles both UDA scenarios without access to the source data nor prior knowledge about the category shift between domains. Specifically, we aim to learn a source model with an elegantly designed two-head classifier and provide it to the target domain. During adaptation, we develop an informative consistency score to help distinguish unknown samples from known samples. To achieve bilateral adaptation in the target domain, we further maximize localized mutual information to align known samples with the source classifier and employ an entropic loss to push unknown samples far away from the source classification boundary, respectively. Experiments on open-set and open-partial-set UDA scenarios demonstrate that UMAD, as a unified approach without access to source data, exhibits comparable, if not superior, performance to state-of-the-art data-dependent methods.

【2】 Adaptation and Attention for Neural Video Coding 标题:神经视频编码中的自适应与注意力机制 链接:https://arxiv.org/abs/2112.08767

作者:Nannan Zou,Honglei Zhang,Francesco Cricri,Ramin G. Youvalari,Hamed R. Tavakoli,Jani Lainema,Emre Aksu,Miska Hannuksela,Esa Rahtu 摘要:神经图像编码代表了目前最先进的图像压缩方法。然而,在视频领域还有很多工作要做。在这项工作中,我们提出了一种端到端学习的视频编解码器,该编解码器围绕自适应和注意力的概念引入了若干架构创新与训练创新。我们的编解码器由一个帧内编解码器与一个帧间编解码器组成。作为一项架构创新,我们建议训练帧间编解码器模型,使运动估计过程能根据输入视频的分辨率自适应调整。第二项架构创新是一个新的神经模块,它结合了基于分裂注意力(split-attention)的神经网络与DenseNet的思想。最后,我们建议在推理时过拟合一组解码器端的乘性参数。通过消融研究以及与现有技术的比较,我们展示了所提技术在编码增益方面的优势。我们将我们的编解码器与分别代表最先进传统编解码器和端到端学习编解码器的VVC/H.266和RLVC进行比较,并与2021年CLIC竞赛中表现最好的端到端学习方法E2E_T_OL进行比较。我们的编解码器明显优于E2E_T_OL,并在部分设置下与VVC和RLVC相比具有优势。 摘要:Neural image coding represents now the state-of-the-art image compression approach. However, a lot of work is still to be done in the video domain. In this work, we propose an end-to-end learned video codec that introduces several architectural novelties as well as training novelties, revolving around the concepts of adaptation and attention. Our codec is organized as an intra-frame codec paired with an inter-frame codec. As one architectural novelty, we propose to train the inter-frame codec model to adapt the motion estimation process based on the resolution of the input video. A second architectural novelty is a new neural block that combines concepts from split-attention based neural networks and from DenseNets. Finally, we propose to overfit a set of decoder-side multiplicative parameters at inference time. Through ablation studies and comparisons to prior art, we show the benefits of our proposed techniques in terms of coding gains. We compare our codec to VVC/H.266 and RLVC, which represent the state-of-the-art traditional and end-to-end learned codecs, respectively, and to the top performing end-to-end learned approach in 2021 CLIC competition, E2E_T_OL. Our codec clearly outperforms E2E_T_OL, and compare favorably to VVC and RLVC in some settings.

半弱无监督|主动学习|不确定性(7篇)

【1】 Masked Feature Prediction for Self-Supervised Visual Pre-Training 标题:用于自监督视觉预训练的掩蔽特征预测 链接:https://arxiv.org/abs/2112.09133

作者:Chen Wei,Haoqi Fan,Saining Xie,Chao-Yuan Wu,Alan Yuille,Christoph Feichtenhofer 备注:Technical report 摘要:我们提出了用于视频模型自监督预训练的掩蔽特征预测(MaskFeat)。我们的方法首先随机掩蔽输入序列的一部分,然后预测被掩蔽区域的特征。我们研究了五种不同类型的特征,发现方向梯度直方图(HOG)这种手工设计的特征描述符在性能和效率方面都表现出色。我们观察到,HOG中的局部对比度归一化对于取得好结果至关重要,这与早期使用HOG进行视觉识别的工作一致。我们的方法可以学习丰富的视觉知识并驱动基于Transformer的大型模型。在不使用额外模型权重或监督的情况下,在无标签视频上预训练的MaskFeat使用MViT-L在Kinetics-400上取得86.7%、在Kinetics-600上取得88.3%、在Kinetics-700上取得80.4%、在AVA上取得38.8 mAP、在SSv2上取得75.0%的前所未有的结果。MaskFeat还可进一步推广到图像输入(可视作单帧视频),并在ImageNet上获得有竞争力的结果。 摘要:We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions. We study five different types of features and find Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good results, which is in line with earlier work using HOG for visual recognition. Our approach can learn abundant visual knowledge and drive large-scale Transformer-based models. Without using extra model weights or supervision, MaskFeat pre-trained on unlabeled videos achieves unprecedented results of 86.7% with MViT-L on Kinetics-400, 88.3% on Kinetics-600, 80.4% on Kinetics-700, 38.8 mAP on AVA, and 75.0% on SSv2. MaskFeat further generalizes to image input, which can be interpreted as a video with a single frame and obtains competitive results on ImageNet.
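
作为理解 MaskFeat 做法的一个极简示意(并非官方实现),下面演示"随机遮蔽图像块,并以被遮蔽块的 HOG 特征作为回归目标"的核心步骤;patch 大小与 HOG 参数均为假设值,实际训练中模型以未遮蔽部分为输入去回归这些目标。

```python
import numpy as np
from skimage.feature import hog

def maskfeat_targets(image, patch=16, mask_ratio=0.4, seed=0):
    """随机选择被遮蔽的图像块,并返回其 HOG 特征作为回归目标(示意)。"""
    h, w = image.shape
    ph, pw = h // patch, w // patch
    rng = np.random.default_rng(seed)
    n_mask = int(ph * pw * mask_ratio)
    masked_ids = rng.choice(ph * pw, size=n_mask, replace=False)
    targets = []
    for idx in masked_ids:
        r, c = divmod(idx, pw)
        block = image[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
        # HOG 内部带有局部对比度归一化,摘要指出这一点对效果很关键
        targets.append(hog(block, orientations=9, pixels_per_cell=(8, 8),
                           cells_per_block=(2, 2)))
    return masked_ids, np.stack(targets)

ids, tgt = maskfeat_targets(np.random.rand(224, 224))
print(tgt.shape)   # 每个被遮蔽块对应一个 HOG 目标向量
```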

【2】 Contrastive Spatio-Temporal Pretext Learning for Self-supervised Video Representation 标题:用于自监督视频表示的时空对比借口学习 链接:https://arxiv.org/abs/2112.08913

作者:Yujia Zhang,Lai-Man Po,Xuyuan Xu,Mengyang Liu,Yexin Wang,Weifeng Ou,Yuzhi Zhao,Wing-Yin Yu 备注:Accepted by AAAI 2022, Preprint version with Appendix 摘要:时空表示学习是视频自监督表示的关键。最近的方法主要使用对比学习和借口任务。然而,这些方法通过潜在空间中的特征相似性来区分样本实例,同时忽略学习表示的中间状态,从而限制了整体性能。在这项工作中,考虑到采样实例的相似程度作为中间状态,我们提出了一种新的借口任务-时空重叠率(STOR)预测。这源于观察到人类能够辨别视频在空间和时间上的重叠率。此任务鼓励模型区分两个生成样本的STOR以学习表示。此外,我们采用了一种将借口任务与对比学习相结合的联合优化方法来进一步增强时空表征学习。我们还研究了该方案中各组成部分之间的相互影响。大量实验表明,我们提出的STOR任务可以同时支持对比学习和借口任务。该联合优化方案可以显著改善视频理解中的时空表示。该守则可于https://github.com/Katou2/CSTP. 摘要:Spatio-temporal representation learning is critical for video self-supervised representation. Recent approaches mainly use contrastive learning and pretext tasks. However, these approaches learn representation by discriminating sampled instances via feature similarity in the latent space while ignoring the intermediate state of the learned representations, which limits the overall performance. In this work, taking into account the degree of similarity of sampled instances as the intermediate state, we propose a novel pretext task - spatio-temporal overlap rate (STOR) prediction. It stems from the observation that humans are capable of discriminating the overlap rates of videos in space and time. This task encourages the model to discriminate the STOR of two generated samples to learn the representations. Moreover, we employ a joint optimization combining pretext tasks with contrastive learning to further enhance the spatio-temporal representation learning. We also study the mutual influence of each component in the proposed scheme. Extensive experiments demonstrate that our proposed STOR task can favor both contrastive learning and pretext tasks. The joint optimization scheme can significantly improve the spatio-temporal representation in video understanding. The code is available at https://github.com/Katou2/CSTP.
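
下面是"时空重叠率(STOR)"的一个示意性计算草图(具体定义以论文为准):给定同一视频中两个采样片段各自的时间窗口与空间裁剪框,这里假设用时间重叠率与空间 IoU 的乘积作为重叠率,作为自监督预测目标。

```python
def interval_overlap(a_start, a_end, b_start, b_end):
    """一维区间的 IoU,用于时间维度。"""
    inter = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    union = (a_end - a_start) + (b_end - b_start) - inter
    return inter / union if union > 0 else 0.0

def box_iou(box_a, box_b):
    """空间裁剪框 (x1, y1, x2, y2) 的 IoU。"""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def stor(clip_a, clip_b):
    """clip = (t_start, t_end, crop_box);这里假设 STOR 为时间重叠与空间 IoU 的乘积。"""
    return interval_overlap(clip_a[0], clip_a[1], clip_b[0], clip_b[1]) * \
           box_iou(clip_a[2], clip_b[2])

print(stor((0, 16, (0, 0, 112, 112)), (8, 24, (56, 56, 168, 168))))
```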

【3】 Self-supervised Enhancement of Latent Discovery in GANs 标题:GANS中潜在发现的自我监督增强 链接:https://arxiv.org/abs/2112.08835

作者:Silpa Vadakkeeveetil Sreelatha,Adarsh Kappiyath,S Sumitra 备注:Accepted to the 36th AAAI Conference on Artificial Intelligence (AAAI 2022) 摘要:已经提出了几种在预先训练的GANs的潜在空间中发现可解释方向的方法。与监督方法相比,无监督方法发现的潜在语义相对较少,因为它们不使用预先训练的属性分类器。我们提出了规模排名估计器(SRE),它是使用自我监督训练。SRE在现有无监督解纠缠技术获得的方向上增强解纠缠。这些方向被更新,以保持潜在空间中每个方向内的变化顺序。对发现方向的定性和定量评估表明,我们提出的方法显著改善了各种数据集中的解纠缠。我们还表明,学习的SRE可以用于执行基于属性的图像检索任务,而无需进一步训练。 摘要:Several methods for discovering interpretable directions in the latent space of pre-trained GANs have been proposed. Latent semantics discovered by unsupervised methods are relatively less disentangled than supervised methods since they do not use pre-trained attribute classifiers. We propose Scale Ranking Estimator (SRE), which is trained using self-supervision. SRE enhances the disentanglement in directions obtained by existing unsupervised disentanglement techniques. These directions are updated to preserve the ordering of variation within each direction in latent space. Qualitative and quantitative evaluation of the discovered directions demonstrates that our proposed method significantly improves disentanglement in various datasets. We also show that the learned SRE can be used to perform Attribute-based image retrieval task without further training.

【4】 An Unsupervised Way to Understand Artifact Generating Internal Units in Generative Neural Networks 标题:一种无监督理解产生式神经网络中伪迹生成内部单元的方法 链接:https://arxiv.org/abs/2112.08814

作者:Haedong Jeong,Jiyeon Han,Jaesik Choi 备注:AAAI22 accepted paper 摘要:尽管生成性对抗网络(GAN)的图像生成性能有了显著改善,但仍观察到低视觉保真度的生成。由于广泛使用的GAN指标更多地关注模型的整体性能,因此对单个世代的质量评估或缺陷世代的检测具有挑战性。虽然最近的研究试图检测导致伪影的featuremap单元并评估单个样本,但这些方法需要额外的资源,如外部网络或大量训练数据来近似真实的数据流形。在这项工作中,我们提出了局部激活的概念,并设计了一个关于局部激活的度量来检测工件的生成,而无需额外的监督。我们的经验证明,我们的方法可以检测和纠正来自具有各种数据集的GAN的工件生成。最后,我们讨论了几何分析,以部分揭示所提出的概念和低视觉保真度之间的关系。 摘要:Despite significant improvements on the image generation performance of Generative Adversarial Networks (GANs), generations with low visual fidelity still have been observed. As widely used metrics for GANs focus more on the overall performance of the model, evaluation on the quality of individual generations or detection of defective generations is challenging. While recent studies try to detect featuremap units that cause artifacts and evaluate individual samples, these approaches require additional resources such as external networks or a number of training data to approximate the real data manifold. In this work, we propose the concept of local activation, and devise a metric on the local activation to detect artifact generations without additional supervision. We empirically verify that our approach can detect and correct artifact generations from GANs with various datasets. Finally, we discuss a geometrical analysis to partially reveal the relation between the proposed concept and low visual fidelity.

【5】 Lacuna Reconstruction: Self-supervised Pre-training for Low-Resource Historical Document Transcription 标题:缺陷性重建:低资源历史文献抄写的自我监督预训练 链接:https://arxiv.org/abs/2112.08692

作者:Nikolai Vogler,Jonathan Parkes Allen,Matthew Thomas Miller,Taylor Berg-Kirkpatrick 摘要:我们提出了一种自我监督的预训练方法,用于学习手写和印刷历史文档转录的丰富视觉语言表示。在监督下对我们预先训练的编码器表示进行微调,以实现两种语言的低资源文档转录后,(1)一组异构的手写伊斯兰手稿图像和(2)早期现代英语印刷文档,与从头开始训练的同一监督模型相比,我们显示了识别准确率的显著提高,仅需30行图像转录进行训练。我们的蒙面语言模型风格预训练策略,其中模型经过训练,能够从同一行中采样的干扰物中识别真实的蒙面视觉表示,鼓励学习对涂鸦书写风格和文档中存在的打印噪音保持不变的鲁棒语境化语言表示。 摘要:We present a self-supervised pre-training approach for learning rich visual language representations for both handwritten and printed historical document transcription. After supervised fine-tuning of our pre-trained encoder representations for low-resource document transcription on two languages, (1) a heterogeneous set of handwritten Islamicate manuscript images and (2) early modern English printed documents, we show a meaningful improvement in recognition accuracy over the same supervised model trained from scratch with as few as 30 line image transcriptions for training. Our masked language model-style pre-training strategy, where the model is trained to be able to identify the true masked visual representation from distractors sampled from within the same line, encourages learning robust contextualized language representations invariant to scribal writing style and printing noise present across documents.

【6】 Performance or Trust? Why Not Both. Deep AUC Maximization with Self-Supervised Learning for COVID-19 Chest X-ray Classifications 标题:性能还是可信?为何不兼得?基于自监督学习的深度AUC最大化在新冠肺炎胸片分类中的应用 链接:https://arxiv.org/abs/2112.08363

作者:Siyuan He,Pengcheng Xi,Ashkan Ebadi,Stephane Tremblay,Alexander Wong 备注:None 摘要:有效的表征学习是提高医学图像分析模型性能的关键。在训练深度学习模型时,通常必须在性能和信任之间进行折衷,这两者对于医疗应用都至关重要。此外,采用交叉熵损失优化的模型在多数阶级中往往会出现不必要的过度自信,而在少数阶级中则会出现过度谨慎。在这项工作中,我们集成了一个新的代理损失与自我监督学习的计算机辅助筛选COVID-19患者使用放射线图像。此外,我们采用了一个新的量化分数来衡量模型的可信度。对特征学习方法和损失函数的性能和信任度进行了研究。比较表明,在自监督模型上利用新的代理损失可以产生高性能和可信的标签有效网络。 摘要:Effective representation learning is the key in improving model performance for medical image analysis. In training deep learning models, a compromise often must be made between performance and trust, both of which are essential for medical applications. Moreover, models optimized with cross-entropy loss tend to suffer from unwarranted overconfidence in the majority class and over-cautiousness in the minority class. In this work, we integrate a new surrogate loss with self-supervised learning for computer-aided screening of COVID-19 patients using radiography images. In addition, we adopt a new quantification score to measure a model's trustworthiness. Ablation study is conducted for both the performance and the trust on feature learning methods and loss functions. Comparisons show that leveraging the new surrogate loss on self-supervised models can produce label-efficient networks that are both high-performing and trustworthy.

【7】 Improving Unsupervised Stain-To-Stain Translation using Self-Supervision and Meta-Learning 标题:利用自我监督和元学习改进无监督染色翻译 链接:https://arxiv.org/abs/2112.08837

作者:Nassim Bouteldja,Barbara Mara Klinkhammer,Tarek Schlaich,Peter Boor,Dorit Merhof 摘要:在数字病理学中,许多图像分析任务都面临着需要大量耗时的手动数据注释的挑战,以应对图像域中的各种变化源。基于图像到图像转换的无监督域自适应通过在不需要人工开销的情况下处理变量而在该领域获得了越来越重要的地位。在这里,我们通过无监督的染色到染色的转换来处理不同组织染色的变化,以实现深度学习分割模型的染色独立适用性。我们使用CycleGANs在肾脏组织病理学中进行染色-染色转换,并提出两种新的方法来提高转换效率。首先,我们将先验分割网络集成到CycleGAN中,以便通过语义指导进行自我监督、面向应用程序的翻译优化;其次,我们将额外的通道合并到翻译输出中,以隐式分离人工元信息,否则编码用于处理欠确定的重建。后者显示出部分优于未改性CycleGAN的性能,但前者在所有染色中表现最好,为大多数肾脏结构(如肾小球、小管和静脉)提供了78%到92%的实例级Dice分数。然而,CycleGANs在其他结构(如动脉)的翻译方面表现有限。我们的研究还发现,与原始染色中的分割相比,所有染色中所有结构的性能都有所降低。我们的研究表明,目前的无监督技术似乎不太可能产生普遍适用的假污渍。 摘要:In digital pathology, many image analysis tasks are challenged by the need for large and time-consuming manual data annotations to cope with various sources of variability in the image domain. Unsupervised domain adaptation based on image-to-image translation is gaining importance in this field by addressing variabilities without the manual overhead. Here, we tackle the variation of different histological stains by unsupervised stain-to-stain translation to enable a stain-independent applicability of a deep learning segmentation model. We use CycleGANs for stain-to-stain translation in kidney histopathology, and propose two novel approaches to improve translational effectivity. First, we integrate a prior segmentation network into the CycleGAN for a self-supervised, application-oriented optimization of translation through semantic guidance, and second, we incorporate extra channels to the translation output to implicitly separate artificial meta-information otherwise encoded for tackling underdetermined reconstructions. The latter showed partially superior performances to the unmodified CycleGAN, but the former performed best in all stains providing instance-level Dice scores ranging between 78% and 92% for most kidney structures, such as glomeruli, tubules, and veins. However, CycleGANs showed only limited performance in the translation of other structures, e.g. arteries. Our study also found somewhat lower performance for all structures in all stains when compared to segmentation in the original stain. Our study suggests that with current unsupervised technologies, it seems unlikely to produce generally applicable fake stains.

时序|行为识别|姿态|视频|运动估计(2篇)

【1】 Stable Long-Term Recurrent Video Super-Resolution 标题:稳定的长期循环视频超分辨率 链接:https://arxiv.org/abs/2112.08950

作者:Benjamin Naoto Chiche,Arnaud Woiselle,Joana Frontera-Pons,Jean-Luc Starck 备注:9 pages, 8 figures 摘要:与基于滑动窗口的模型相比,基于深度学习(DL)的视频超分辨率(VSR)中的递归模型具有更高的计算效率、时间感受野和时间一致性,因此得到了广泛的应用。然而,当对呈现低运动的长视频序列(即场景的某些部分几乎不移动)进行推断时,递归模型通过递归处理发散,产生高频伪影。据我们所知,没有任何关于VSR的研究指出这种不稳定性问题,这对于一些实际应用来说是至关重要的。视频监控是一个典型的例子,在这种情况下会出现这种伪影,因为相机和场景都会长时间保持静止。在这项工作中,我们暴露了现有的循环VSR网络在低运动的长序列上的不稳定性。我们在我们创建的一个新的长序列数据集准静态视频集上演示了它。最后,基于Lipschitz稳定性理论,我们提出了一种新的循环VSR网络框架,它既稳定又有竞争性。基于此框架,我们提出了一种新的递归VSR网络,即中间递归视频超分辨率(MRVSR)。我们通过经验证明了它在低运动的长序列上的竞争性能。 摘要:Recurrent models have gained popularity in deep learning (DL) based video super-resolution (VSR), due to their increased computational efficiency, temporal receptive field and temporal consistency compared to sliding-window based models. However, when inferring on long video sequences presenting low motion (i.e. in which some parts of the scene barely move), recurrent models diverge through recurrent processing, generating high frequency artifacts. To the best of our knowledge, no study about VSR pointed out this instability problem, which can be critical for some real-world applications. Video surveillance is a typical example where such artifacts would occur, as both the camera and the scene stay static for a long time. In this work, we expose instabilities of existing recurrent VSR networks on long sequences with low motion. We demonstrate it on a new long sequence dataset Quasi-Static Video Set, that we have created. Finally, we introduce a new framework of recurrent VSR networks that is both stable and competitive, based on Lipschitz stability theory. We propose a new recurrent VSR network, coined Middle Recurrent Video Super-Resolution (MRVSR), based on this framework. We empirically show its competitive performance on long sequences with low motion.
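
该摘要提到基于 Lipschitz 稳定性理论来约束循环 VSR 网络,使其在长序列、低运动场景下不发散。实现这类约束的一种常见手段(这里只是示意,并非论文给出的具体网络)是对循环路径上的卷积施加谱归一化,从而限制循环映射的 Lipschitz 常数。下面是一个 PyTorch 最小草图。

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class RecurrentVSRCell(nn.Module):
    """极简循环超分单元:对循环分支卷积施加谱归一化以抑制长序列发散(示意)。"""
    def __init__(self, channels=32):
        super().__init__()
        self.input_conv = nn.Conv2d(3, channels, 3, padding=1)
        # 谱归一化将该卷积的谱范数约束在 1 附近,限制循环映射的 Lipschitz 常数
        self.recurrent_conv = spectral_norm(nn.Conv2d(channels, channels, 3, padding=1))
        self.out_conv = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, frame, hidden):
        hidden = torch.relu(self.input_conv(frame) + self.recurrent_conv(hidden))
        return self.out_conv(hidden), hidden

cell = RecurrentVSRCell()
hidden = torch.zeros(1, 32, 64, 64)
with torch.no_grad():
    for _ in range(100):                     # 模拟长序列、低运动:同一帧反复输入
        out, hidden = cell(torch.zeros(1, 3, 64, 64), hidden)
print(out.shape, hidden.abs().max())
```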

【2】 Road-aware Monocular Structure from Motion and Homography Estimation 标题:道路感知的单目运动恢复结构与单应性估计 链接:https://arxiv.org/abs/2112.08635

作者:Wei Sui,Teng Chen,Jiaxin Zhang,Jiao Lu,Qian Zhang 备注:10 pages 摘要:运动结构(SFM)和地平面单应估计对于自主驾驶和其他机器人应用至关重要。近年来,深部神经网络分别用于SFM和单应性估计取得了很大进展。然而,直接应用现有的地平面单应性估计方法可能会失败,因为道路通常是场景的一小部分。此外,深度SFM方法的性能仍低于传统方法。在本文中,我们提出了一种方法,可以学习以端到端的方式解决这两个问题,从而提高这两个方面的性能。建议的网络包括深度CNN、姿势CNN和地面CNN。深度CNN和姿态CNN分别估计密集深度图和自我运动,求解SFM,而姿态CNN和地面CNN再加上单应层,则解决了地面估计问题。通过加强SFM和单应性估计结果之间的一致性,可以使用光度损失和单应性损失对整个网络进行端到端的训练,除了现成的分段器提供的道路分割外,没有任何地面真实性。在KITTI benchmark上进行了全面的实验,与各种最先进的方法相比,结果令人满意。 摘要:Structure from motion (SFM) and ground plane homography estimation are critical to autonomous driving and other robotics applications. Recently, much progress has been made in using deep neural networks for SFM and homography estimation respectively. However, directly applying existing methods for ground plane homography estimation may fail because the road is often a small part of the scene. Besides, the performances of deep SFM approaches are still inferior to traditional methods. In this paper, we propose a method that learns to solve both problems in an end-to-end manner, improving performance on both. The proposed networks consist of a Depth-CNN, a Pose-CNN and a Ground-CNN. The Depth-CNN and Pose-CNN estimate dense depth map and ego-motion respectively, solving SFM, while the Pose-CNN and Ground-CNN followed by a homography layer solve the ground plane estimation problem. By enforcing coherency between SFM and homography estimation results, the whole network can be trained end to end using photometric loss and homography loss without any groundtruth except the road segmentation provided by an off-the-shelf segmenter. Comprehensive experiments are conducted on KITTI benchmark to demonstrate promising results compared with various state-of-the-art approaches.

GAN|对抗|攻击|生成相关(6篇)

【1】 Ensembling Off-the-shelf Models for GAN Training 标题:用于GAN训练的现成模型集成 链接:https://arxiv.org/abs/2112.09130

作者:Nupur Kumari,Richard Zhang,Eli Shechtman,Jun-Yan Zhu 备注:GitHub: this https URL Project webpage: this https URL 摘要:大规模训练的出现产生了大量强大的视觉识别模型。然而,生成模型,如GANs,传统上是以无监督的方式从头开始训练的。从大量预先训练的视觉模型中获得的集体“知识”能否被用来改进训练?如果是的话,那么有这么多的模型可供选择,应该选择哪一个,它们以什么方式最有效?我们发现,预训练的计算机视觉模型可以显著提高性能时,用于集成鉴别器。值得注意的是,所选模型的特定子集会极大地影响性能。我们提出了一种有效的选择机制,通过探测预训练模型嵌入中真实和虚假样本之间的线性可分性,选择最精确的模型,并逐步将其添加到鉴别器集合中。有趣的是,我们的方法可以在有限的数据和大规模的环境中改进GAN训练。仅给出10k训练样本,我们在LSUN Cat上的FID与在1.6M图像上训练的StyleGAN2相匹配。在完整的数据集上,我们的方法将LSUN的猫、教堂和马类别的FID提高了1.5倍到2倍。 摘要:The advent of large-scale training has produced a cornucopia of powerful visual recognition models. However, generative models, such as GANs, have traditionally been trained from scratch in an unsupervised manner. Can the collective "knowledge" from a large bank of pretrained vision models be leveraged to improve GAN training? If so, with so many models to choose from, which one(s) should be selected, and in what manner are they most effective? We find that pretrained computer vision models can significantly improve performance when used in an ensemble of discriminators. Notably, the particular subset of selected models greatly affects performance. We propose an effective selection mechanism, by probing the linear separability between real and fake samples in pretrained model embeddings, choosing the most accurate model, and progressively adding it to the discriminator ensemble. Interestingly, our method can improve GAN training in both limited data and large-scale settings. Given only 10k training samples, our FID on LSUN Cat matches the StyleGAN2 trained on 1.6M images. On the full dataset, our method improves FID by 1.5x to 2x on cat, church, and horse categories of LSUN.
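
摘要中的模型选择机制是"在预训练模型的嵌入空间中探测真假样本的线性可分性"。下面给出这一探测步骤的简化草图:对每个候选预训练模型,用线性分类器的交叉验证精度衡量其嵌入空间里真/假样本的可分性,取精度最高者加入判别器集成。这里用随机特征代替真实嵌入,仅作说明。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def linear_separability(real_feats, fake_feats):
    """用线性分类器的交叉验证精度衡量真/假样本在某一嵌入空间中的可分性。"""
    X = np.concatenate([real_feats, fake_feats])
    y = np.concatenate([np.ones(len(real_feats)), np.zeros(len(fake_feats))])
    return cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=3).mean()

# 假设有若干候选预训练模型,各自给出真实图像与生成图像的特征(此处用随机特征模拟)
rng = np.random.default_rng(0)
candidates = {name: (rng.normal(size=(128, 64)), rng.normal(loc=shift, size=(128, 64)))
              for name, shift in [("model_a", 0.1), ("model_b", 0.5), ("model_c", 1.0)]}
scores = {name: linear_separability(r, f) for name, (r, f) in candidates.items()}
best = max(scores, key=scores.get)          # 可分性(精度)最高者优先加入判别器集成
print(scores, "->", best)
```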

【2】 GRAM: Generative Radiance Manifolds for 3D-Aware Image Generation 标题:GRAM:用于3D感知图像生成的生成辐射度流形 链接:https://arxiv.org/abs/2112.08867

作者:Yu Deng,Jiaolong Yang,Jianfeng Xiang,Xin Tong 摘要:3D感知图像生成建模旨在生成具有明确可控相机姿态的3D一致图像。最近的工作显示了在非结构化二维图像上训练神经辐射场(NeRF)发生器的良好效果,但仍然无法生成具有精细细节的高真实感图像。一个关键原因是,体积表示学习的高内存和计算成本极大地限制了训练期间用于辐射积分的点样本数量。采样不足不仅限制了生成器处理精细细节的表达能力,而且由于不稳定蒙特卡罗采样引起的噪声,还阻碍了有效的GAN训练。我们提出了一种新的方法,在二维流形上调节点采样和辐射场学习,在三维体积中体现为一组学习的隐式曲面。对于每个观察光线,我们计算光线曲面交点,并累积网络生成的它们的辐射度。通过训练和渲染这些辐射流形,我们的生成器可以生成高质量的图像,具有逼真的精细细节和强大的视觉3D一致性。 摘要:3D-aware image generative modeling aims to generate 3D-consistent images with explicitly controllable camera poses. Recent works have shown promising results by training neural radiance field (NeRF) generators on unstructured 2D images, but still can not generate highly-realistic images with fine details. A critical reason is that the high memory and computation cost of volumetric representation learning greatly restricts the number of point samples for radiance integration during training. Deficient sampling not only limits the expressive power of the generator to handle fine details but also impedes effective GAN training due to the noise caused by unstable Monte Carlo sampling. We propose a novel approach that regulates point sampling and radiance field learning on 2D manifolds, embodied as a set of learned implicit surfaces in the 3D volume. For each viewing ray, we calculate ray-surface intersections and accumulate their radiance generated by the network. By training and rendering such radiance manifolds, our generator can produce high quality images with realistic fine details and strong visual 3D consistency.

【3】 Towards Robust Neural Image Compression: Adversarial Attack and Model Finetuning 标题:面向鲁棒神经图像压缩:对抗性攻击与模型优化 链接:https://arxiv.org/abs/2112.08691

作者:Tong Chen,Zhan Ma 摘要:基于深度神经网络的图像压缩已经得到了广泛的研究。模型健壮性在很大程度上被忽略了,尽管它对于支持服务至关重要。我们通过向原始源图像注入少量噪声扰动来执行对抗性攻击,然后使用流行的学习图像压缩模型对这些对抗性示例进行编码。实验报告了对抗性示例重建过程中的严重失真,揭示了现有方法的普遍脆弱性,无论底层压缩模型(例如,网络架构、损耗函数、质量等级)和用于注入扰动的优化策略(例如,噪声阈值、信号距离测量)中使用的设置如何。随后,我们应用迭代对抗性微调来细化预训练模型。在每次迭代中,随机源图像和对抗性示例混合在一起,以更新底层模型。结果表明,通过显著提高压缩模型的鲁棒性,所提出的微调策略是有效的。总的来说,我们的方法简单、有效、可推广,对于开发健壮的图像压缩解决方案具有吸引力。所有材料均已在网站上公开https://njuvision.github.io/RobustNIC用于重复性研究。 摘要:Deep neural network based image compression has been extensively studied. Model robustness is largely overlooked, though it is crucial to service enabling. We perform the adversarial attack by injecting a small amount of noise perturbation to original source images, and then encode these adversarial examples using prevailing learnt image compression models. Experiments report severe distortion in the reconstruction of adversarial examples, revealing the general vulnerability of existing methods, regardless of the settings used in underlying compression model (e.g., network architecture, loss function, quality scale) and optimization strategy used for injecting perturbation (e.g., noise threshold, signal distance measurement). Later, we apply the iterative adversarial finetuning to refine pretrained models. In each iteration, random source images and adversarial examples are mixed to update underlying model. Results show the effectiveness of the proposed finetuning strategy by substantially improving the compression model robustness. Overall, our methodology is simple, effective, and generalizable, making it attractive for developing robust learnt image compression solution. All materials have been made publicly accessible at https://njuvision.github.io/RobustNIC for reproducible research.
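
针对摘要中"向源图像注入小扰动以最大化重建失真"的攻击流程,下面给出一个 FGSM 风格的单步攻击示意(PyTorch);其中 codec 只是任意可微的端到端图像压缩模型的占位符,并非论文或某个开源库的实际接口。

```python
import torch
import torch.nn.functional as F

def fgsm_attack_on_codec(codec, image, epsilon=2.0 / 255):
    """单步攻击示意:沿使重建误差增大的梯度方向注入 L∞ 有界扰动。"""
    image = image.clone().detach().requires_grad_(True)
    recon = codec(image)                          # 假设 codec(x) 返回重建图像
    loss = F.mse_loss(recon, image)               # 以重建失真作为攻击目标
    loss.backward()
    adv = image + epsilon * image.grad.sign()     # FGSM:符号梯度步
    return adv.clamp(0, 1).detach()

# 占位的"压缩-重建"网络,仅为演示可运行性;实际应替换为学习式图像压缩模型
codec = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(8, 3, 3, padding=1))
adv_image = fgsm_attack_on_codec(codec, torch.rand(1, 3, 256, 256))
print(adv_image.shape)
```

摘要中的对抗性微调即是在每轮训练中把此类对抗样本与随机源图像混合后更新压缩模型。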

【4】 StyleMC: Multi-Channel Based Fast Text-Guided Image Generation and Manipulation 标题:StyleMC:基于多通道的文本引导图像快速生成与处理 链接:https://arxiv.org/abs/2112.08493

作者:Umut Kocasari,Alara Dirik,Mert Tiftikci,Pinar Yanardag 备注:Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2022) 摘要:在GANs的潜在空间中发现有意义的方向以操纵语义属性通常需要大量的标记数据。最近的工作旨在通过利用对比语言图像预训练(CLIP)这一联合文本图像模型来克服这一局限性。虽然这些方法很有希望,但需要几个小时的预处理或训练才能实现所需的操作。在本文中,我们提出了StyleMC,一种快速有效的文本驱动图像生成和处理方法。StyleMC使用基于剪辑的丢失和身份丢失通过单个文本提示操作图像,而不会显著影响其他属性。与以前的工作不同,StyleMC只需要对每个文本提示符进行几秒钟的训练就可以找到稳定的全局方向,不需要进行提示工程,并且可以与任何预先训练过的StyleGAN2模型一起使用。我们展示了我们的方法的有效性,并将其与最先进的方法进行了比较。我们的代码可以在http://catlab-team.github.io/stylemc. 摘要:Discovering meaningful directions in the latent space of GANs to manipulate semantic attributes typically requires large amounts of labeled data. Recent work aims to overcome this limitation by leveraging the power of Contrastive Language-Image Pre-training (CLIP), a joint text-image model. While promising, these methods require several hours of preprocessing or training to achieve the desired manipulations. In this paper, we present StyleMC, a fast and efficient method for text-driven image generation and manipulation. StyleMC uses a CLIP-based loss and an identity loss to manipulate images via a single text prompt without significantly affecting other attributes. Unlike prior work, StyleMC requires only a few seconds of training per text prompt to find stable global directions, does not require prompt engineering and can be used with any pre-trained StyleGAN2 model. We demonstrate the effectiveness of our method and compare it to state-of-the-art methods. Our code can be found at http://catlab-team.github.io/stylemc.
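
作为理解 StyleMC 思路的示意,下面勾勒"用基于 CLIP 的损失与身份损失,为单条文本提示优化一个全局潜在方向"的循环;其中 generator、clip_image_encoder、text_feature、id_loss 均为占位(演示里用简单线性层和 MSE 代替),并非 StyleGAN2 或 CLIP 的实际 API。

```python
import torch

def find_global_direction(generator, clip_image_encoder, text_feature, id_loss,
                          latent_dim=512, steps=100, lr=0.05, id_weight=0.5):
    """优化一个共享方向 d,使 G(w + d) 的图像特征靠近文本特征,同时约束身份不变(示意)。"""
    direction = torch.zeros(latent_dim, requires_grad=True)
    opt = torch.optim.Adam([direction], lr=lr)
    for _ in range(steps):
        w = torch.randn(8, latent_dim)                          # 随机采样一批潜码
        edited = generator(w + direction)
        img_feat = clip_image_encoder(edited)
        target = text_feature.unsqueeze(0).expand_as(img_feat)
        clip_loss = 1 - torch.cosine_similarity(img_feat, target, dim=-1).mean()
        loss = clip_loss + id_weight * id_loss(generator(w), edited)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return direction.detach()

# 仅为演示可运行性的哑占位,真实场景应替换为 StyleGAN2 生成器与 CLIP 编码器
gen = torch.nn.Linear(512, 512)
enc = torch.nn.Linear(512, 64)
text_feat = torch.randn(64)
ident = lambda a, b: torch.nn.functional.mse_loss(a, b)
d = find_global_direction(gen, enc, text_feat, ident, steps=5)
print(d.norm())
```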

【5】 Positional Encoding Augmented GAN for the Assessment of Wind Flow for Pedestrian Comfort in Urban Areas 标题:位置编码增强型GAN用于城市地区行人舒适性的风流评价 链接:https://arxiv.org/abs/2112.08447

作者:Henrik Høiness,Kristoffer Gjerde,Luca Oggiano,Knut Erik Teigen Giljarhus,Massimiliano Ruocco 摘要:使用计算流体动力学(CFD)方法近似风场可能会非常耗时。创建用于交互式设计原型的工具,同时观察风流变化,需要更简单的模型来更快地模拟。深度学习中的数据驱动方法可能能够在很短的时间内给出类似的结果,而不是运行导致详细计算的数值近似。这项工作将使用CFD计算三维流场的问题重新表述为基于建筑物足迹的二维图像到图像转换问题,以预测行人高度水平的流场。我们研究了生成性对抗网络(GAN)的使用,如Pix2Pix[1]和CycleGAN[2],它们代表了各个领域中图像到图像转换任务的最新技术,以及U-Net自动编码器[3]。模型可以以数据驱动的方式了解数据集的基本分布,我们认为这有助于模型从CFD中了解基本的雷诺平均Navier-Stokes(RANS)方程。我们在不同的有高度信息和没有高度信息的三维断崖形建筑物上进行了新的模拟数据集实验。此外,我们对一系列模型的生成图像进行了广泛的定性和定量评估,并将其性能与CFD提供的模拟结果进行了比较。然后,我们展示了向输入中添加位置数据可以通过在不同的体系结构上注入此类信息来产生更准确的结果。此外,我们还表明,通过应用注意机制和频谱归一化来促进稳定的训练,模型的性能得到了提高。 摘要:Approximating wind flows using computational fluid dynamics (CFD) methods can be time-consuming. Creating a tool for interactively designing prototypes while observing the wind flow change requires simpler models to simulate faster. Instead of running numerical approximations resulting in detailed calculations, data-driven methods in deep learning might be able to give similar results in a fraction of the time. This work rephrases the problem from computing 3D flow fields using CFD to a 2D image-to-image translation-based problem on the building footprints to predict the flow field at pedestrian height level. We investigate the use of generative adversarial networks (GAN), such as Pix2Pix [1] and CycleGAN [2] representing state-of-the-art for image-to-image translation task in various domains as well as U-Net autoencoder [3]. The models can learn the underlying distribution of a dataset in a data-driven manner, which we argue can help the model learn the underlying Reynolds-averaged Navier-Stokes (RANS) equations from CFD. We experiment on novel simulated datasets on various three-dimensional bluff-shaped buildings with and without height information. Moreover, we present an extensive qualitative and quantitative evaluation of the generated images for a selection of models and compare their performance with the simulations delivered by CFD. We then show that adding positional data to the input can produce more accurate results by proposing a general framework for injecting such information on the different architectures. Furthermore, we show that the models performances improve by applying attention mechanisms and spectral normalization to facilitate stable training.

【6】 Lifelong Generative Modelling Using Dynamic Expansion Graph Model 标题:基于动态扩展图模型的终身创成式建模 链接:https://arxiv.org/abs/2112.08370

作者:Fei Ye,Adrian G. Bors 备注:Accepted in Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI 2022) 摘要:变分自编码器(VAE)在连续学习多个任务时性能会退化,这是由灾难性遗忘造成的。为了解决知识丢失问题,VAE通常采用生成重放(GR)机制或扩展网络架构(ENA)。在本文中,我们采用GR与ENA相结合的方法,通过推导负边际对数似然的上界来研究VAE的遗忘行为。这一理论分析为VAE在终身学习中如何忘记先前学到的知识提供了新的见解。分析表明,在ENA框架下,当考虑模型混合且不限制组件数量时,可以达到最佳性能。然而,基于ENA的方法可能需要过多的参数。这促使我们提出了一种新的动态扩展图模型(DEGM)。DEGM会比较每个新数据库与网络从以往任务中已学到的信息,根据其新颖程度来扩展自身架构。DEGM的训练优化了知识结构,刻画了与过去及最近学习任务相对应的联合概率表示。我们证明了DEGM在保证每个任务最佳性能的同时,也最小化了所需的参数数量。补充材料(SM)和源代码可在 https://github.com/dtuzi123/Expansion-Graph-Model 获取。 摘要:Variational Autoencoders (VAEs) suffer from degenerated performance, when learning several successive tasks. This is caused by catastrophic forgetting. In order to address the knowledge loss, VAEs are using either Generative Replay (GR) mechanisms or Expanding Network Architectures (ENA). In this paper we study the forgetting behaviour of VAEs using a joint GR and ENA methodology, by deriving an upper bound on the negative marginal log-likelihood. This theoretical analysis provides new insights into how VAEs forget the previously learnt knowledge during lifelong learning. The analysis indicates the best performance achieved when considering model mixtures, under the ENA framework, where there are no restrictions on the number of components. However, an ENA-based approach may require an excessive number of parameters. This motivates us to propose a novel Dynamic Expansion Graph Model (DEGM). DEGM expands its architecture, according to the novelty associated with each new databases, when compared to the information already learnt by the network from previous tasks. DEGM training optimizes knowledge structuring, characterizing the joint probabilistic representations corresponding to the past and more recently learned tasks. We demonstrate that DEGM guarantees optimal performance for each task while also minimizing the required number of parameters. Supplementary materials (SM) and source code are available in https://github.com/dtuzi123/Expansion-Graph-Model.

OCR|文本相关(2篇)

【1】 Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer 标题:通过视觉知识传授实现无平行数据的音文连点 链接:https://arxiv.org/abs/2112.08995

作者:Yanpeng Zhao,Jack Hessel,Youngjae Yu,Ximing Lu,Rowan Zellers,Yejin Choi 备注:Our code is available at this https URL 摘要:能够表示和描述环境声景的机器具有实用潜力,例如可用于音频标签和音频字幕系统。流行的学习范式一直依赖成对的音频-文本数据,然而这类数据在网络上几乎不可得。我们提出了VIP-ANT,它在不使用任何成对音频-文本数据的情况下诱导音频-文本(Audio-Text)对齐。我们的核心思想是让双模态的图像-文本表示与双模态的图像-音频表示共享图像模态;图像模态充当枢轴,在三模态嵌入空间中隐式地连接音频与文本。在没有成对音频-文本数据的困难Zero-Shot设置下,我们的模型在ESC50和US8K音频分类任务上展示了最先进的Zero-Shot性能,甚至在Clotho字幕检索(以音频为查询)上以2.2%的R@1优势超过了有监督的最先进水平。我们进一步研究了仅有极少量音频-文本监督的情形,发现例如在US8K上,仅几百对有监督的音频-文本数据就能将Zero-Shot音频分类精度提高8%。然而,为了在某些Zero-Shot任务上达到与人类相当的水平,我们的经验性缩放实验表明大约需要 $2^{21} \approx 2M$(约两百万)对有监督的音频-字幕数据。我们的工作为在几乎没有成对音频-文本数据的条件下学习音频与文本之间的联系开辟了新的途径。 摘要:Machines that can represent and describe environmental soundscapes have practical potential, e.g., for audio tagging and captioning systems. Prevailing learning paradigms have been relying on parallel audio-text data, which is, however, scarcely available on the web. We propose VIP-ANT that induces \textbf{A}udio-\textbf{T}ext alignment without using any parallel audio-text data. Our key idea is to share the image modality between bi-modal image-text representations and bi-modal image-audio representations; the image modality functions as a pivot and connects audio and text in a tri-modal embedding space implicitly. In a difficult zero-shot setting with no paired audio-text data, our model demonstrates state-of-the-art zero-shot performance on the ESC50 and US8K audio classification tasks, and even surpasses the supervised state of the art for Clotho caption retrieval (with audio queries) by 2.2\% R@1. We further investigate cases of minimal audio-text supervision, finding that, e.g., just a few hundred supervised audio-text pairs increase the zero-shot audio classification accuracy by 8\% on US8K. However, to match human parity on some zero-shot tasks, our empirical scaling experiments suggest that we would need about $2^{21} \approx 2M$ supervised audio-caption pairs. Our work opens up new avenues for learning audio-text connections with little to no parallel audio-text data.

【2】 SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning 标题:SGEITL:用于视觉常识推理的场景图增强图文学习 链接:https://arxiv.org/abs/2112.08587

作者:Zhecan Wang,Haoxuan You,Liunian Harold Li,Alireza Zareian,Suji Park,Yiqing Liang,Kai-Wei Chang,Shih-Fu Chang 备注:None 摘要:回答关于图像的复杂问题是机器智能的一个雄心勃勃的目标,它需要对图像、文本和常识的共同理解,以及强大的推理能力。近年来,多模态变换器在视觉常识推理(VCR)方面取得了巨大进展,它通过跨模态注意层共同理解视觉对象和文本标记。然而,这些方法并没有利用场景的丰富结构和对象之间的交互作用,这对于回答复杂的常识性问题至关重要。我们提出了一个场景图增强图像文本学习(SGEITL)框架,将视觉场景图融入常识推理。为了利用场景图结构,在模型结构层次上,我们提出了一种多跳图变换器,用于正则化跳之间的注意交互。在预训练方面,提出了一种场景图感知的预训练方法,以利用从视觉场景图中提取的结构知识。此外,我们还介绍了一种在弱监督的情况下使用文本注释来训练和生成与领域相关的视觉场景图的方法。在VCR和其他任务上进行的大量实验表明,与最先进的方法相比,性能显著提高,并证明了每个拟议组件的有效性。 摘要:Answering complex questions about images is an ambitious goal for machine intelligence, which requires a joint understanding of images, text, and commonsense knowledge, as well as a strong reasoning ability. Recently, multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning (VCR), by jointly understanding visual objects and text tokens through layers of cross-modality attention. However, these approaches do not utilize the rich structure of the scene and the interactions between objects which are essential in answering complex commonsense questions. We propose a Scene Graph Enhanced Image-Text Learning (SGEITL) framework to incorporate visual scene graphs in commonsense reasoning. To exploit the scene graph structure, at the model structure level, we propose a multihop graph transformer for regularizing attention interaction among hops. As for pre-training, a scene-graph-aware pre-training method is proposed to leverage structure knowledge extracted in the visual scene graph. Moreover, we introduce a method to train and generate domain-relevant visual scene graphs using textual annotations in a weakly-supervised manner. Extensive experiments on VCR and other tasks show a significant performance boost compared with the state-of-the-art methods and prove the efficacy of each proposed component.

人脸|人群计数(2篇)

【1】 Human Hands as Probes for Interactive Object Understanding 标题:人的手作为交互式物体理解的探针 链接:https://arxiv.org/abs/2112.09120

作者:Mohit Goyal,Sahil Modi,Rishabh Goyal,Saurabh Gupta 备注:Project website at this https URL 摘要:交互式对象理解,或者说我们可以对对象做什么,以及如何做,是计算机视觉的一个长期目标。在本文中,我们通过在以自我为中心的视频中观察人类的手来解决这个问题。我们证明了观察人的手与什么相互作用以及如何提供相关数据和必要的监督。注意手,容易定位和稳定活动对象进行学习,并揭示与对象发生交互的位置。通过分析手,我们可以了解我们可以对物体做什么,以及如何处理。我们将这些基本原则应用于EPIC-KITCHENS数据集,并通过观察以自我为中心的视频中的手,成功地学习了状态敏感特征和对象启示(交互区域和提供的抓握)。 摘要:Interactive object understanding, or what we can do to objects and how is a long-standing goal of computer vision. In this paper, we tackle this problem through observation of human hands in in-the-wild egocentric videos. We demonstrate that observation of what human hands interact with and how can provide both the relevant data and the necessary supervision. Attending to hands, readily localizes and stabilizes active objects for learning and reveals places where interactions with objects occur. Analyzing the hands shows what we can do to objects and how. We apply these basic principles on the EPIC-KITCHENS dataset, and successfully learn state-sensitive features, and object affordances (regions of interaction and afforded grasps), purely by observing hands in egocentric videos.

【2】 Intelli-Paint: Towards Developing Human-like Painting Agents 标题:INTILI-PAINT:发展仿人涂饰剂 链接:https://arxiv.org/abs/2112.08930

作者:Jaskirat Singh,Cameron Smith,Jose Echevarria,Liang Zheng 摘要:生成设计良好的艺术品通常非常耗时,并且假定人类画家具有高度的熟练程度。为了促进人类的绘画过程,已经在教机器如何“像人类一样绘画”方面进行了大量的研究,然后使用经过训练的代理作为人类用户的绘画辅助工具。然而,当前这方面的研究通常依赖于基于网格的渐进式分割策略,其中代理将整个图像分割为连续的更精细网格,然后并行绘制每个网格。这不可避免地导致人工绘画序列,人类用户不容易理解。为了解决这个问题,我们提出了一种新的绘画方法,它可以学习生成输出画布,同时展示更人性化的绘画风格。建议的绘制管道Intelli Paint由1)渐进分层策略组成,该策略允许代理首先绘制自然背景场景表示,然后以渐进方式添加每个前景对象。2) 我们还介绍了一种新的顺序笔画引导策略,它可以帮助绘画代理以语义感知的方式在不同的图像区域之间转移注意力。3) 最后,我们提出了一种笔画规则化策略,该策略允许所需笔画总数减少约60-80%,而生成画布的质量没有任何明显差异。通过定量和定性结果,我们表明,生成的代理不仅提高了输出画布生成的效率,而且展示了更自然的绘画风格,这将更好地帮助人类用户通过数字艺术品表达他们的想法。 摘要:The generation of well-designed artwork is often quite time-consuming and assumes a high degree of proficiency on part of the human painter. In order to facilitate the human painting process, substantial research efforts have been made on teaching machines how to "paint like a human", and then using the trained agent as a painting assistant tool for human users. However, current research in this direction is often reliant on a progressive grid-based division strategy wherein the agent divides the overall image into successively finer grids, and then proceeds to paint each of them in parallel. This inevitably leads to artificial painting sequences which are not easily intelligible to human users. To address this, we propose a novel painting approach which learns to generate output canvases while exhibiting a more human-like painting style. The proposed painting pipeline Intelli-Paint consists of 1) a progressive layering strategy which allows the agent to first paint a natural background scene representation before adding in each of the foreground objects in a progressive fashion. 2) We also introduce a novel sequential brushstroke guidance strategy which helps the painting agent to shift its attention between different image regions in a semantic-aware manner. 3) Finally, we propose a brushstroke regularization strategy which allows for ~60-80% reduction in the total number of required brushstrokes without any perceivable differences in the quality of the generated canvases. Through both quantitative and qualitative results, we show that the resulting agents not only show enhanced efficiency in output canvas generation but also exhibit a more natural-looking painting style which would better assist human users express their ideas through digital artwork.

图像视频检索|Re-id相关(1篇)

【1】 Self-Distilled Hashing for Deep Image Retrieval 标题:用于深度图像检索的自蒸馏散列方法 链接:https://arxiv.org/abs/2112.08816

作者:Young Kyun Jang,Geonmo Gu,Byungsoo Ko,Nam Ik Cho 摘要:在基于散列的图像检索系统中,原始图像的转换输入通常会生成不同的代码,从而降低检索精度。为了缓解这个问题,可以在训练期间应用数据扩充。然而,即使一个内容的增强样本在真实空间中是相似的,量化也可以将它们分散到遥远的汉明空间中。这会导致表现差异,从而阻碍训练并降低绩效。在这项工作中,我们提出了一种新的自蒸馏散列方案,以最大限度地减少差异,同时利用增强数据的潜力。通过将弱转换样本的哈希知识转移到强转换样本,我们使哈希代码对各种转换不敏感。我们还引入了基于散列代理的相似性学习和基于二进制交叉熵的量化损失来提供高质量的散列码。最后,我们构建了一个深度散列框架来生成区分性散列码。大量的基准测试验证了我们的自蒸馏改进了现有的深度散列方法,并且我们的框架实现了最先进的检索结果。代码将很快发布。 摘要:In hash-based image retrieval systems, the transformed input from the original usually generates different codes, deteriorating the retrieval accuracy. To mitigate this issue, data augmentation can be applied during training. However, even if the augmented samples of one content are similar in real space, the quantization can scatter them far away in Hamming space. This results in representation discrepancies that can impede training and degrade performance. In this work, we propose a novel self-distilled hashing scheme to minimize the discrepancy while exploiting the potential of augmented data. By transferring the hash knowledge of the weakly-transformed samples to the strong ones, we make the hash code insensitive to various transformations. We also introduce hash proxy-based similarity learning and binary cross entropy-based quantization loss to provide fine quality hash codes. Ultimately, we construct a deep hashing framework that generates discriminative hash codes. Extensive experiments on benchmarks verify that our self-distillation improves the existing deep hashing approaches, and our framework achieves state-of-the-art retrieval results. The code will be released soon.
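
下面用几行 PyTorch 勾勒"自蒸馏"这一核心思想(非官方实现):同一图像经弱增强与强增强后分别得到哈希输出,以弱增强分支(停止梯度)作为教师,约束强增强分支向其对齐,使哈希码对各种变换不敏感;哈希网络与特征维度均为占位假设。

```python
import torch
import torch.nn.functional as F

def self_distill_loss(hash_net, weak_view, strong_view):
    """弱增强输出作为教师(detach),强增强输出向其对齐(示意)。"""
    with torch.no_grad():
        teacher = torch.tanh(hash_net(weak_view))      # 近似二值的哈希输出
    student = torch.tanh(hash_net(strong_view))
    return 1 - F.cosine_similarity(student, teacher, dim=-1).mean()

# 哑示例:hash_net 可以是任意骨干加哈希层,这里用线性层代替
hash_net = torch.nn.Linear(2048, 64)                   # 64 位哈希码
weak = torch.randn(16, 2048)                           # 弱增强样本的特征(占位)
strong = weak + 0.1 * torch.randn_like(weak)           # 强增强样本的特征(占位)
print(self_distill_loss(hash_net, weak, strong))
```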

蒸馏|知识提取(1篇)

【1】 Feature Distillation Interaction Weighting Network for Lightweight Image Super-Resolution 标题:用于轻量级图像超分辨率的特征蒸馏交互加权网络 链接:https://arxiv.org/abs/2112.08655

作者:Guangwei Gao,Wenjie Li,Juncheng Li,Fei Wu,Huimin Lu,Yi Yu 备注:9 pages, 9 figures, 4 tables 摘要:基于卷积神经网络的单幅图像超分辨率(SISR)近年来取得了很大进展。然而,由于计算和内存开销,这些方法难以应用到实际场景中。同时,如何在有限的参数量和计算量约束下充分利用中间特征也是一个巨大的挑战。为缓解这些问题,我们提出了一种轻量级且高效的特征蒸馏交互加权网络(FDIWN)。具体地说,FDIWN以一系列专门设计的特征洗牌加权组(FSWG)作为主干,每个FSWG由若干新颖的互为宽残差蒸馏交互块(WDIB)构成。此外,为了更好地进行特征蒸馏,在WDIB中引入了宽恒等残差加权(WIRW)单元和宽卷积残差加权(WCRW)单元。我们还提出了宽残差蒸馏连接(WRDC)框架和自校准融合(SCF)单元,以便更灵活、更高效地与不同尺度的特征交互。大量实验表明,FDIWN在模型性能和效率之间取得了良好平衡,优于其他模型。代码可在 https://github.com/IVIPLab/FDIWN 获取。 摘要:Convolutional neural networks based single-image super-resolution (SISR) has made great progress in recent years. However, it is difficult to apply these methods to real-world scenarios due to the computational and memory cost. Meanwhile, how to take full advantage of the intermediate features under the constraints of limited parameters and calculations is also a huge challenge. To alleviate these issues, we propose a lightweight yet efficient Feature Distillation Interaction Weighted Network (FDIWN). Specifically, FDIWN utilizes a series of specially designed Feature Shuffle Weighted Groups (FSWG) as the backbone, and several novel mutual Wide-residual Distillation Interaction Blocks (WDIB) form an FSWG. In addition, Wide Identical Residual Weighting (WIRW) units and Wide Convolutional Residual Weighting (WCRW) units are introduced into WDIB for better feature distillation. Moreover, a Wide-Residual Distillation Connection (WRDC) framework and a Self-Calibration Fusion (SCF) unit are proposed to interact features with different scales more flexibly and efficiently. Extensive experiments show that our FDIWN is superior to other models to strike a good balance between model performance and efficiency. The code is available at https://github.com/IVIPLab/FDIWN.

点云|SLAM|雷达|激光|深度RGBD相关(1篇)

【1】 Multi-Camera LiDAR Inertial Extension to the Newer College Dataset 标题:多相机LiDAR惯性扩展到较新的学院数据集 链接:https://arxiv.org/abs/2112.08854

作者:Lintong Zhang,Marco Camurri,Maurice Fallon 摘要:在本文中,我们提出了一个4.5km步行距离的多相机激光雷达惯性数据集,作为对较新的大学数据集的扩展。全局快门多摄像头设备与IMU和激光雷达硬件同步。该数据集还提供了六个自由度(DoF)地面真实姿态,激光雷达频率为10hz。描述了三个数据收集,并举例说明了多摄像机视觉惯性里程计的使用。该扩展数据集包含小型和狭窄通道、大型开放空间以及植被覆盖区域,用于测试定位和绘图系统。此外,一些序列呈现出挑战性的情况,例如突然的灯光变化、无纹理的表面和攻击性的运动。该数据集可从以下网址获得:https://ori-drs.github.io/newer-college-dataset 摘要:In this paper, we present a multi-camera LiDAR inertial dataset of 4.5km walking distance as an expansion to the Newer College Dataset. The global shutter multi-camera device is hardware synchronized with the IMU and the LiDAR. This dataset also provides six Degrees of Freedom (DoF) ground truth poses, at the LiDAR frequency of 10hz. Three data collections are described and example usage of multi-camera visual-inertial odometry is demonstrated. This expansion dataset contains small and narrow passages, large scale open spaces as well as vegetated areas to test localization and mapping systems. Furthermore, some sequences present challenging situations such as abrupt lighting change, textureless surfaces, and aggressive motion. The dataset is available at: https://ori-drs.github.io/newer-college-dataset

多模态(1篇)

【1】 CrossLoc: Scalable Aerial Localization Assisted by Multimodal Synthetic Data 标题:CrossLoc:多模态合成数据辅助的可伸缩空中定位 链接:https://arxiv.org/abs/2112.09081

作者:Qi Yan,Jianhao Zheng,Simon Reding,Shanci Li,Iordan Doytchinov 备注:Preprint. Our code is available at this https URL 摘要:我们提出了一个视觉定位系统,学习估计摄像机姿态在现实世界中的帮助下,合成数据。尽管近年来取得了重大进展,但大多数基于学习的视觉定位方法都只针对单个领域,需要地理标记图像的密集数据库才能正常工作。为了缓解数据稀缺问题并提高神经定位模型的可伸缩性,我们引入了TOPO DataGen,这是一种多功能的合成数据生成工具,可在真实世界和虚拟世界之间平滑地进行遍历,它依赖于地理摄像机视点。提出了新的大规模模拟真实基准数据集,以展示和评估所述合成数据的效用。我们的实验表明,合成数据通常会提高神经网络在真实数据上的性能。此外,我们还介绍了CrossLoc,一种用于姿态估计的跨模态视觉表示学习方法,该方法通过自我监督充分利用场景坐标地面真实性。在没有任何额外数据的情况下,CrossLoc显著优于最先进的方法,并实现了更高的实际数据采样效率。我们的代码可在https://github.com/TOPO-EPFL/CrossLoc. 摘要:We present a visual localization system that learns to estimate camera poses in the real world with the help of synthetic data. Despite significant progress in recent years, most learning-based approaches to visual localization target at a single domain and require a dense database of geo-tagged images to function well. To mitigate the data scarcity issue and improve the scalability of the neural localization models, we introduce TOPO-DataGen, a versatile synthetic data generation tool that traverses smoothly between the real and virtual world, hinged on the geographic camera viewpoint. New large-scale sim-to-real benchmark datasets are proposed to showcase and evaluate the utility of the said synthetic data. Our experiments reveal that synthetic data generically enhances the neural network performance on real data. Furthermore, we introduce CrossLoc, a cross-modal visual representation learning approach to pose estimation that makes full use of the scene coordinate ground truth via self-supervision. Without any extra data, CrossLoc significantly outperforms the state-of-the-art methods and achieves substantially higher real-data sample efficiency. Our code is available at https://github.com/TOPO-EPFL/CrossLoc.

3D|3D重建等相关(1篇)

【1】 Looking Outside the Box to Ground Language in 3D Scenes 标题:跳出框框看3D场景中的落地语言 链接:https://arxiv.org/abs/2112.08879

作者:Ayush Jain,Nikolaos Gkanatsios,Ishita Mediratta,Katerina Fragkiadaki 备注:First two authors contributed equally 摘要:现有的语言定位(grounding)模型通常受限于对象提议瓶颈:预训练的检测器在场景中提议对象,模型学习从这些框提议中选择答案,而不再关注原始图像或三维点云。对象检测器通常在固定的对象和属性词汇表上训练,而这类词汇表对于开放域语言定位往往过于受限,因为话语可能涉及不同抽象层次的视觉实体,例如椅子、椅子腿或椅子前腿的尖端。我们提出了一个在3D场景中进行语言定位的模型,它绕过了框提议瓶颈,主要创新有三点:i)在语言流、点云特征流和3D框提议之间进行迭代注意;ii)采用带非参数实体查询的Transformer解码器,为对象和部件指代解码出三维框;iii)将对象检测视为对由候选类别标签列表构成的指代话语的定位,从而同时利用3D对象标注和语言定位标注进行联合监督。与之前流行的3D语言定位基准上的方法相比,这些创新带来了显著的量化收益(在SR3D基准上绝对提升高达+9%)。我们对每项创新进行了消融实验,以展示其对模型性能的贡献。在仅做少量修改后应用于2D图像上的语言定位时,它的性能与最先进方法相当,而收敛所需的GPU时间只有一半。代码和检查点将发布于 https://github.com/nickgkan/beauty_detr 摘要:Existing language grounding models often use object proposal bottlenecks: a pre-trained detector proposes objects in the scene and the model learns to select the answer from these box proposals, without attending to the original image or 3D point cloud. Object detectors are typically trained on a fixed vocabulary of objects and attributes that is often too restrictive for open-domain language grounding, where an utterance may refer to visual entities at various levels of abstraction, such as a chair, the leg of a chair, or the tip of the front leg of a chair. We propose a model for grounding language in 3D scenes that bypasses box proposal bottlenecks with three main innovations: i) Iterative attention across the language stream, the point cloud feature stream and 3D box proposals. ii) Transformer decoders with non-parametric entity queries that decode 3D boxes for object and part referentials. iii) Joint supervision from 3D object annotations and language grounding annotations, by treating object detection as grounding of referential utterances comprised of a list of candidate category labels. These innovations result in significant quantitative gains (up to +9% absolute improvement on the SR3D benchmark) over previous approaches on popular 3D language grounding benchmarks. We ablate each of our innovations to show its contribution to the performance of the model. When applied on language grounding on 2D images with minor changes, it performs on par with the state-of-the-art while converges in half of the GPU time. The code and checkpoints will be made available at https://github.com/nickgkan/beauty_detr

其他神经网络|深度学习|模型|建模(3篇)

【1】 Distilled Dual-Encoder Model for Vision-Language Understanding 标题:用于视觉语言理解的提炼双编码器模型 链接:https://arxiv.org/abs/2112.08723

作者:Zekun Wang,Wenhui Wang,Haichao Zhu,Ming Liu,Bing Qin,Furu Wei 备注:Work in progress 摘要:我们提出了一个跨模态注意提取框架来训练视觉语言理解任务(如视觉推理和视觉问答)的双编码器模型。双编码器模型比融合编码器模型具有更快的推理速度,并且能够在推理过程中对图像和文本进行预计算。然而,双编码器模型中使用的浅层交互模块不足以处理复杂的视觉语言理解任务。为了了解图像和文本之间的深层交互,我们引入了跨模式注意提取,它使用融合编码器模型的图像到文本和文本到图像的注意分布来指导我们的双编码器模型的训练。此外,我们还表明,将跨模态注意蒸馏应用于预训练和微调阶段可以实现进一步的改进。实验结果表明,提取的双编码器模型在视觉推理、视觉蕴涵和视觉问答任务方面具有很强的竞争力,同时比融合编码器模型具有更快的推理速度。我们的代码和模型将在https://github.com/kugwzk/Distilled-DualEncoder. 摘要:We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks, such as visual reasoning and visual question answering. Dual-encoder models have a faster inference speed than fusion-encoder models and enable the pre-computation of images and text during inference. However, the shallow interaction module used in dual-encoder models is insufficient to handle complex vision-language understanding tasks. In order to learn deep interactions of images and text, we introduce cross-modal attention distillation, which uses the image-to-text and text-to-image attention distributions of a fusion-encoder model to guide the training of our dual-encoder model. In addition, we show that applying the cross-modal attention distillation for both pre-training and fine-tuning stages achieves further improvements. Experimental results demonstrate that the distilled dual-encoder model achieves competitive performance for visual reasoning, visual entailment and visual question answering tasks while enjoying a much faster inference speed than fusion-encoder models. Our code and models will be publicly available at https://github.com/kugwzk/Distilled-DualEncoder.

【2】 Learning to Prompt for Continual Learning 标题:学会促进持续学习 链接:https://arxiv.org/abs/2112.08654

作者:Zifeng Wang,Zizhao Zhang,Chen-Yu Lee,Han Zhang,Ruoxi Sun,Xiaoqi Ren,Guolong Su,Vincent Perot,Jennifer Dy,Tomas Pfister 摘要:持续学习背后的主流范式是使模型参数适应非平稳数据分布,其中灾难性遗忘是核心挑战。典型的方法依赖于测试时的预演缓冲区或已知任务标识来检索所学知识和解决遗忘问题,而这项工作提出了一种新的持续学习范式,旨在训练更简洁的记忆系统,而不需要在测试时访问任务标识。我们的方法学习动态提示(L2P)一个预先训练的模型,以便在不同的任务转换下顺序学习任务。在我们提出的框架中,提示是可学习的小参数,保存在内存空间中。目标是优化提示以指导模型预测,并在保持模型可塑性的同时明确管理任务不变和任务特定知识。我们在具有不同挑战性的连续学习设置的流行图像分类基准下进行综合实验,其中L2P始终优于现有的最先进方法。令人惊讶的是,L2P即使没有预演缓冲区,也能与基于预演的方法取得竞争性的结果,并且直接适用于具有挑战性的任务无关的持续学习。源代码可在https://github.com/google-research/l2p. 摘要:The mainstream paradigm behind continual learning has been to adapt the model parameters to non-stationary data distributions, where catastrophic forgetting is the central challenge. Typical methods rely on a rehearsal buffer or known task identity at test time to retrieve learned knowledge and address forgetting, while this work presents a new paradigm for continual learning that aims to train a more succinct memory system without accessing task identity at test time. Our method learns to dynamically prompt (L2P) a pre-trained model to learn tasks sequentially under different task transitions. In our proposed framework, prompts are small learnable parameters, which are maintained in a memory space. The objective is to optimize prompts to instruct the model prediction and explicitly manage task-invariant and task-specific knowledge while maintaining model plasticity. We conduct comprehensive experiments under popular image classification benchmarks with different challenging continual learning settings, where L2P consistently outperforms prior state-of-the-art methods. Surprisingly, L2P achieves competitive results against rehearsal-based methods even without a rehearsal buffer and is directly applicable to challenging task-agnostic continual learning. Source code is available at https://github.com/google-research/l2p.
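
下面是 L2P 中"可学习提示池 + 按查询-键相似度选取提示并拼接到输入序列"这一机制的简化草图(PyTorch),池大小、提示长度和 top-k 等超参数均为假设值,细节可能与官方实现不同。

```python
import torch
import torch.nn as nn

class PromptPool(nn.Module):
    """可学习提示池:按查询特征与各键的余弦相似度选 top-k 提示,拼到 token 序列前。"""
    def __init__(self, pool_size=10, prompt_len=5, dim=768, top_k=3):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(pool_size, dim))
        self.prompts = nn.Parameter(torch.randn(pool_size, prompt_len, dim))
        self.top_k = top_k

    def forward(self, query, tokens):
        # query: [B, dim](如冻结骨干给出的全局特征),tokens: [B, L, dim]
        sim = torch.cosine_similarity(query.unsqueeze(1), self.keys.unsqueeze(0), dim=-1)
        idx = sim.topk(self.top_k, dim=-1).indices                  # [B, top_k]
        selected = self.prompts[idx]                                # [B, top_k, prompt_len, dim]
        selected = selected.flatten(1, 2)                           # [B, top_k*prompt_len, dim]
        return torch.cat([selected, tokens], dim=1)

pool = PromptPool()
out = pool(torch.randn(4, 768), torch.randn(4, 196, 768))
print(out.shape)   # [4, 3*5 + 196, 768]
```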

【3】 A comparative study of paired versus unpaired deep learning methods for physically enhancing digital rock image resolution 标题:物理提高数字岩石图像分辨率的成对深度学习方法与非成对深度学习方法的比较研究 链接:https://arxiv.org/abs/2112.08644

作者:Yufu Niu,Samuel J. Jackson,Naif Alqahtani,Peyman Mostaghimi,Ryan T. Armstrong 备注:26 pages, 11 figures, 4 tables 摘要:X射线显微计算机断层扫描(micro-CT)已被广泛用于表征地下多孔岩石的孔隙尺度几何结构。使用深度学习的超分辨率(SR)方法的最新发展允许在大空间尺度上对低分辨率(LR)图像进行数字增强,从而生成与高分辨率(HR)地面真实情况相当的SR图像。这绕过了传统的分辨率和视野权衡。一个突出的问题是成对(注册)LR和HR数据的使用,这在此类方法的训练步骤中经常需要,但很难获得。在这项工作中,我们严格比较了两种不同的最先进的SR深度学习技术,使用配对和非配对数据,以及相似的地面真实数据。第一种方法需要成对图像来训练卷积神经网络(CNN),而第二种方法使用未成对图像来训练生成性对抗网络(GAN)。利用具有复杂微孔结构的显微CT碳酸盐岩样品对这两种方法进行了比较。我们实施了各种基于图像和数值的验证以及实验验证,以定量评估这两种方法的物理精度和灵敏度。我们的定量结果表明,非配对GAN方法可以像成对CNN方法一样精确地重建超分辨率图像,并且训练时间和数据集要求相当。这开启了使用非配对深度学习方法进行微CT图像增强的新应用;在数据处理阶段不再需要图像配准。从数据存储平台分离的图像可以更有效地用于训练SR数字岩石应用的网络。这为非均质多孔介质中多尺度流动模拟的各种应用开辟了新的途径。 摘要:X-ray micro-computed tomography (micro-CT) has been widely leveraged to characterise pore-scale geometry in subsurface porous rock. Recent developments in super resolution (SR) methods using deep learning allow the digital enhancement of low resolution (LR) images over large spatial scales, creating SR images comparable to the high resolution (HR) ground truth. This circumvents traditional resolution and field-of-view trade-offs. An outstanding issue is the use of paired (registered) LR and HR data, which is often required in the training step of such methods but is difficult to obtain. In this work, we rigorously compare two different state-of-the-art SR deep learning techniques, using both paired and unpaired data, with like-for-like ground truth data. The first approach requires paired images to train a convolutional neural network (CNN) while the second approach uses unpaired images to train a generative adversarial network (GAN). The two approaches are compared using a micro-CT carbonate rock sample with complicated micro-porous textures. We implemented various image based and numerical verifications and experimental validation to quantitatively evaluate the physical accuracy and sensitivities of the two methods. Our quantitative results show that unpaired GAN approach can reconstruct super-resolution images as precise as paired CNN method, with comparable training times and dataset requirement. This unlocks new applications for micro-CT image enhancement using unpaired deep learning methods; image registration is no longer needed during the data processing stage. Decoupled images from data storage platforms can be exploited more efficiently to train networks for SR digital rock applications. This opens up a new pathway for various applications of multi-scale flow simulation in heterogeneous porous media.

其他(12篇)

【1】 ICON: Implicit Clothed humans Obtained from Normals 标题:ICON:由法线获得的隐式着装人体 链接:https://arxiv.org/abs/2112.09127

作者:Yuliang Xiu,Jinlong Yang,Dimitrios Tzionas,Michael J. Black 备注:21 pages, 18 figures, 7 tables. Project page: this https URL 摘要:当前用于学习逼真且可驱动的3D着装化身的方法,要么需要带姿态的3D扫描,要么需要在严格控制用户姿态下拍摄的2D图像。相比之下,我们的目标是仅从无约束姿态下拍摄的人物2D图像中学习化身。给定一组图像,我们的方法从每张图像中估计出一个精细的3D表面,然后将其组合成一个可驱动的化身。隐式函数非常适合第一个任务,因为它们可以捕捉头发或衣物等细节。然而,现有方法对多变的人体姿态并不鲁棒,常常生成肢体残缺或脱离躯干、细节缺失或形状不似人体的3D表面。问题在于这些方法使用了对全局姿态敏感的全局特征编码器。为了解决这个问题,我们提出了ICON("Implicit Clothed humans Obtained from Normals",由法线获得的隐式着装人体),改用局部特征。ICON包含两个主要模块,均利用SMPL(-X)人体模型。首先,ICON以SMPL(-X)法线为条件,推断出精细的着装人体法线(正面/背面)。其次,可见性感知的隐式曲面回归器生成人体占据场(occupancy field)的等值面。重要的是,在推断时,一个反馈循环会在"利用推断出的着装法线细化SMPL(-X)网格"与"进一步细化法线"之间交替进行。给定同一对象在不同姿态下的多帧重建结果,我们使用SCANimate从中生成一个可驱动的化身。在AGORA和CAPE数据集上的评估表明,即使训练数据非常有限,ICON在重建方面也优于最新技术。此外,它对分布外样本(例如真实场景(in-the-wild)的姿态/图像以及出画裁剪)也更加鲁棒。ICON向着从真实场景图像进行鲁棒3D着装人体重建迈出了一步,这使得直接从视频中创建具有个性化且自然的姿态相关衣物形变的化身成为可能。 摘要:Current methods for learning realistic and animatable 3D clothed avatars need either posed 3D scans or 2D images with carefully controlled user poses. In contrast, our goal is to learn the avatar from only 2D images of people in unconstrained poses. Given a set of images, our method estimates a detailed 3D surface from each image and then combines these into an animatable avatar. Implicit functions are well suited to the first task, as they can capture details like hair or clothes. Current methods, however, are not robust to varied human poses and often produce 3D surfaces with broken or disembodied limbs, missing details, or non-human shapes. The problem is that these methods use global feature encoders that are sensitive to global pose. To address this, we propose ICON ("Implicit Clothed humans Obtained from Normals"), which uses local features, instead. ICON has two main modules, both of which exploit the SMPL(-X) body model. First, ICON infers detailed clothed-human normals (front/back) conditioned on the SMPL(-X) normals. Second, a visibility-aware implicit surface regressor produces an iso-surface of a human occupancy field. Importantly, at inference time, a feedback loop alternates between refining the SMPL(-X) mesh using the inferred clothed normals and then refining the normals. Given multiple reconstructed frames of a subject in varied poses, we use SCANimate to produce an animatable avatar from them. Evaluation on the AGORA and CAPE datasets shows that ICON outperforms the state of the art in reconstruction, even with heavily limited training data. Additionally, it is much more robust to out-of-distribution samples, e.g., in-the-wild poses/images and out-of-frame cropping. ICON takes a step towards robust 3D clothed human reconstruction from in-the-wild images. This enables creating avatars directly from video with personalized and natural pose-dependent cloth deformation.

【2】 IS-COUNT: Large-scale Object Counting from Satellite Images with Covariate-based Importance Sampling 标题:IS-Count:基于协变量重要性采样的卫星图像大尺度目标计数 链接:https://arxiv.org/abs/2112.09126

作者:Chenlin Meng,Enci Liu,Willie Neiswanger,Jiaming Song,Marshall Burke,David Lobell,Stefano Ermon 备注:AAAI 2022 摘要:在许多环境和社会经济监测应用中,高分辨率卫星图像中的目标检测正在成为地面调查数据收集的可扩展替代方案。然而,由于购买图像和计算的成本很高,在大型地理区域执行目标检测的成本仍然高得令人望而却步。受传统调查数据收集策略的启发,我们提出了一种通过抽样估计大型地理区域的对象计数统计数据的方法。在给定成本预算的情况下,我们的方法通过从可学习的提案分布中抽样来选择少量具有代表性的领域。与穷举方法相比,使用重要性抽样,我们能够在仅处理一小部分图像后准确估计对象计数。我们的经验表明,所提出的框架在估算美国和非洲的建筑数量、肯尼亚的汽车数量、孟加拉国的砖窑数量和美国的游泳池数量方面取得了很好的效果,而与穷举法相比,只需要0.01%的卫星图像。 摘要:Object detection in high-resolution satellite imagery is emerging as a scalable alternative to on-the-ground survey data collection in many environmental and socioeconomic monitoring applications. However, performing object detection over large geographies can still be prohibitively expensive due to the high cost of purchasing imagery and compute. Inspired by traditional survey data collection strategies, we propose an approach to estimate object count statistics over large geographies through sampling. Given a cost budget, our method selects a small number of representative areas by sampling from a learnable proposal distribution. Using importance sampling, we are able to accurately estimate object counts after processing only a small fraction of the images compared to an exhaustive approach. We show empirically that the proposed framework achieves strong performance on estimating the number of buildings in the United States and Africa, cars in Kenya, brick kilns in Bangladesh, and swimming pools in the U.S., while requiring as few as 0.01% of satellite images compared to an exhaustive approach.
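下面用 NumPy 给出重要性采样估计总目标数的一个简化示意:proposal 分布正比于某个可免费获得的协变量,只对极少量被抽中的图块执行"检测",再用无偏估计量 (1/m)·Σ c_i/q_i 恢复总数。其中的协变量、图块数与数值均为虚构,仅用于说明估计量的形式,并非论文中可学习 proposal 的实现。

```python
import numpy as np

rng = np.random.default_rng(0)
n_tiles = 100_000
# 每个图块的真实目标数(现实中未知,这里仅用于和估计值对比)
true_counts = rng.poisson(lam=rng.gamma(2.0, 2.0, n_tiles))
# 可免费获得的粗粒度协变量(假设与目标数大致相关,例如人口密度)
covariate = (true_counts + rng.normal(0, 2, n_tiles)).clip(min=0) + 1e-3

q = covariate / covariate.sum()            # proposal 分布:按协变量成比例抽样
m = 500                                     # 预算:只对 0.5% 的图块跑目标检测
idx = rng.choice(n_tiles, size=m, p=q)      # 按 q 抽取图块
estimate = np.mean(true_counts[idx] / q[idx])   # 无偏估计 sum_i c_i

print("true total:", int(true_counts.sum()), "IS estimate:", round(estimate))
```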

【3】 RegionCLIP: Region-based Language-Image Pretraining 标题:RegionCLIP:基于区域的语言图像预训练 链接:https://arxiv.org/abs/2112.09106

作者:Yiwu Zhong,Jianwei Yang,Pengchuan Zhang,Chunyuan Li,Noel Codella,Liunian Harold Li,Luowei Zhou,Xiyang Dai,Lu Yuan,Yin Li,Jianfeng Gao 备注:Technical report 摘要:使用图像-文本对的对比语言图像预训练(CLIP)在Zero-Shot和迁移学习设置下的图像分类中都取得了令人印象深刻的结果。然而,我们发现,直接应用此类模型去识别图像区域以进行目标检测会导致性能低下,原因在于领域偏移:CLIP在训练时将整张图像与文本描述相匹配,而没有捕获图像区域与文本片段之间的细粒度对齐。为了缓解这个问题,我们提出了一种称为RegionCLIP的新方法,它显著扩展了CLIP以学习区域级视觉表示,从而实现图像区域和文本概念之间的细粒度对齐。我们的方法利用CLIP模型将图像区域与模板描述(caption)进行匹配,然后对模型进行预训练,使这些区域-文本对在特征空间中对齐。当把预训练模型迁移到开放词汇目标检测任务时,我们的方法在COCO和LVIS数据集的新类别上分别以3.8 AP50和2.2 AP显著超越现有最佳方法。此外,学习到的区域表示支持目标检测的Zero-Shot推断,在COCO和LVIS数据集上都展示了有希望的结果。我们的代码可在 https://github.com/microsoft/RegionCLIP 获取。 摘要:Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification in both zero-shot and transfer learning settings. However, we show that directly applying such models to recognize image regions for object detection leads to poor performance due to a domain shift: CLIP was trained to match an image as a whole to a text description, without capturing the fine-grained alignment between image regions and text spans. To mitigate this issue, we propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations, thus enabling fine-grained alignment between image regions and textual concepts. Our method leverages a CLIP model to match image regions with template captions and then pretrains our model to align these region-text pairs in the feature space. When transferring our pretrained model to the open-vocabulary object detection tasks, our method significantly outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on COCO and LVIS datasets, respectively. Moreover, the learned region representations support zero-shot inference for object detection, showing promising results on both COCO and LVIS datasets. Our code is available at https://github.com/microsoft/RegionCLIP.
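下面是一个说明"区域-文本对齐预训练"目标的简化 PyTorch 片段:假设区域特征来自 RoIAlign 加视觉编码器、文本特征来自模板描述经文本编码器,这里用随机向量代替真实 CLIP 编码器,仅演示对称 InfoNCE 对齐损失的计算方式,并非 RegionCLIP 的官方实现。

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_regions, dim = 8, 512
# 假设:区域特征来自 RoIAlign + 视觉编码器;文本特征来自 "a photo of a {concept}" 模板
region_feats = F.normalize(torch.randn(num_regions, dim), dim=-1)
text_feats   = F.normalize(torch.randn(num_regions, dim), dim=-1)

logit_scale = torch.tensor(100.0)                        # 温度的倒数(示意值)
logits = logit_scale * region_feats @ text_feats.t()     # [R, R] 区域-文本相似度
labels = torch.arange(num_regions)                       # 第 i 个区域与第 i 条描述配对
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
print("region-text contrastive loss:", loss.item())
```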

【4】 Solving Inverse Problems with NerfGANs 标题:用NerfGAN求解反问题 链接:https://arxiv.org/abs/2112.09061

作者:Giannis Daras,Wen-Sheng Chu,Abhishek Kumar,Dmitry Lagun,Alexandros G. Dimakis 备注:16 pages, 18 figures 摘要:我们介绍了一种使用NeRF风格生成模型求解反问题的新框架。我们关注的问题是:在已知相机参数的条件下,由单张二维图像重建三维场景。我们表明,朴素地直接优化潜空间会产生伪影,并导致新视图渲染质量低下。我们将此问题归因于三维几何中明显存在的体积遮挡物,它们会在新视图的渲染中显现出来。我们提出了一种新的辐射场正则化方法,以在仅有单视图观测的情况下获得更好的三维表面和更高质量的新视图。我们的方法可以自然地扩展到一般反问题,包括只能部分观察到单个视图的图像修复(inpainting)。我们通过实验评估了我们的方法,在广泛的任务中取得了优于基线的视觉效果和性能提升。与此前的最先进方法相比,我们的方法将MSE降低了30-40%,将LPIPS损失降低了15-25%。 摘要:We introduce a novel framework for solving inverse problems using NeRF-style generative models. We are interested in the problem of 3-D scene reconstruction given a single 2-D image and known camera parameters. We show that naively optimizing the latent space leads to artifacts and poor novel view rendering. We attribute this problem to volume obstructions that are clear in the 3-D geometry and become visible in the renderings of novel views. We propose a novel radiance field regularization method to obtain better 3-D surfaces and improved novel views given single view observations. Our method naturally extends to general inverse problems including inpainting where one observes only partially a single view. We experimentally evaluate our method, achieving visual improvements and performance boosts over the baselines in a wide range of tasks. Our method achieves $30-40\%$ MSE reduction and $15-25\%$ reduction in LPIPS loss compared to the previous state of the art.
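下面用一个占位的可微生成器演示"潜空间优化 + 正则项"求解单视图逆问题的基本流程;真实方法中的 NeRF 式生成器、相机模型与论文提出的辐射场正则在此均未实现,生成器 G、观测掩码与正则系数皆为假设。

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# 占位生成器:潜码 -> 给定相机位姿下的渲染图像(展平);真实方法中这里是 NeRF 式生成器
G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 3 * 32 * 32))
for p in G.parameters():
    p.requires_grad_(False)                        # 只优化潜码,不更新生成器

observation = torch.rand(3 * 32 * 32)              # 已知相机参数下的单视图观测(示意数据)
mask = (torch.rand_like(observation) > 0.3).float()  # 仅部分像素可见时即为 inpainting 设定

z = torch.zeros(64, requires_grad=True)
opt = torch.optim.Adam([z], lr=1e-2)
for step in range(200):
    rendered = G(z)
    recon = ((rendered - observation) ** 2 * mask).sum() / mask.sum()  # 只在可见像素上拟合
    reg = 1e-3 * z.pow(2).mean()                   # 对潜码的简单先验正则(示意)
    loss = recon + reg
    opt.zero_grad(); loss.backward(); opt.step()
print("final masked MSE:", recon.item())
```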

【5】 Towards Robust Real-time Audio-Visual Speech Enhancement 标题:面向鲁棒实时视听语音增强的研究 链接:https://arxiv.org/abs/2112.09060

作者:Mandar Gogate,Kia Dashtipour,Amir Hussain 摘要:人类大脑在上下文中利用异质的感觉信息来有效地执行包括视觉和听觉在内的认知任务。例如,在鸡尾酒会的情况下,人类的听觉皮层上下文整合视听(AV)线索,以便更好地感知语音。最近的研究表明,与纯音频语音增强(SE)模型相比,AV语音增强(SE)模型可以显著提高极低信噪比(SNR)环境下的语音质量和可懂度。然而,尽管在AV SE领域进行了大量研究,但开发低延迟的实时处理模型仍然是一项艰巨的技术挑战。在本文中,我们提出了一种新的低延迟非特定人AVSE框架,该框架可以推广到一系列视觉和声学噪声。特别地,提出了一种生成性对抗网络(GAN)来解决AV-SE中视觉缺陷的实际问题。此外,我们提出了一种基于深度神经网络的实时AV SE模型,该模型考虑了来自GAN的干净视觉语音输出,以提供更鲁棒的SE。使用客观的语音质量和可懂度指标以及主观列表测试,在合成和真实的有噪声AV语料库上对所提出的框架进行了评估。对比仿真结果表明,我们的实时AV SE框架优于最先进的SE方法,包括最新的基于DNN的SE模型。 摘要:The human brain contextually exploits heterogeneous sensory information to efficiently perform cognitive tasks including vision and hearing. For example, during the cocktail party situation, the human auditory cortex contextually integrates audio-visual (AV) cues in order to better perceive speech. Recent studies have shown that AV speech enhancement (SE) models can significantly improve speech quality and intelligibility in very low signal to noise ratio (SNR) environments as compared to audio-only SE models. However, despite significant research in the area of AV SE, development of real-time processing models with low latency remains a formidable technical challenge. In this paper, we present a novel framework for low latency speaker-independent AV SE that can generalise on a range of visual and acoustic noises. In particular, a generative adversarial networks (GAN) is proposed to address the practical issue of visual imperfections in AV SE. In addition, we propose a deep neural network based real-time AV SE model that takes into account the cleaned visual speech output from GAN to deliver more robust SE. The proposed framework is evaluated on synthetic and real noisy AV corpora using objective speech quality and intelligibility metrics and subjective listing tests. Comparative simulation results show that our real time AV SE framework outperforms state-of-the-art SE approaches, including recent DNN based SE models.

【6】 On the Uncertain Single-View Depths in Endoscopies 标题:内窥镜检查中单视深度的不确定性研究 链接:https://arxiv.org/abs/2112.08906

作者:Javier Rodríguez-Puigvert,David Recasens,Javier Civera,Rubén Martínez-Cantín 备注:10 pages 摘要:从内窥镜图像估计深度是一系列人工智能辅助技术的先决条件,即精确定位、测量肿瘤或识别未检查区域。由于结肠镜检查的领域特异性——一种可变形的低纹理环境,具有流体、恶劣的光照条件和突然的传感器运动——对多视图方法提出了挑战,因此单视图深度学习是一个很有前途的研究方向。在本文中,我们首次探索了贝叶斯深度网络在结肠镜检查中的单视图深度估计。它们的不确定性量化为这一关键应用领域提供了巨大的潜力。我们的具体贡献有两个方面:1)在三个不同的数据集中对用于深度估计的贝叶斯深度网络进行了详尽的分析,突出了关于合成到真实领域变化以及监督与自我监督方法的挑战和结论;2)一种考虑到教师不确定性的新型师生深度学习方法。 摘要:Estimating depth from endoscopic images is a pre-requisite for a wide set of AI-assisted technologies, namely accurate localization, measurement of tumors, or identification of non-inspected areas. As the domain specificity of colonoscopies -- a deformable low-texture environment with fluids, poor lighting conditions and abrupt sensor motions -- pose challenges to multi-view approaches, single-view depth learning stands out as a promising line of research. In this paper, we explore for the first time Bayesian deep networks for single-view depth estimation in colonoscopies. Their uncertainty quantification offers great potential for such a critical application area. Our specific contribution is two-fold: 1) an exhaustive analysis of Bayesian deep networks for depth estimation in three different datasets, highlighting challenges and conclusions regarding synthetic-to-real domain changes and supervised vs. self-supervised methods; and 2) a novel teacher-student approach to deep depth learning that takes into account the teacher uncertainty.
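下面给出用 MC Dropout 近似贝叶斯推断、输出逐像素深度均值与不确定性的一个极简 PyTorch 示意;网络结构、dropout 概率与采样次数均为假设,仅说明"多次随机前向取均值/方差"的做法,而非论文所用的具体贝叶斯网络。

```python
import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    """占位用的小型深度回归网络,卷积层间插入 Dropout2d 以便做蒙特卡洛采样。"""
    def __init__(self, p=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Dropout2d(p),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Dropout2d(p),
            nn.Conv2d(32, 1, 3, padding=1),
        )
    def forward(self, x):
        return self.net(x)

model = TinyDepthNet()
model.train()                       # 推断时保持 dropout 开启,才构成 MC Dropout
image = torch.rand(1, 3, 64, 64)    # 一帧内窥镜图像(示意)
with torch.no_grad():
    samples = torch.stack([model(image) for _ in range(20)])  # T 次随机前向
depth_mean = samples.mean(0)        # 逐像素深度估计
depth_std = samples.std(0)          # 逐像素(认知)不确定性
print(depth_mean.shape, depth_std.mean().item())
```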

【7】 Saliency Grafting: Innocuous Attribution-Guided Mixup with Calibrated Label Mixing 标题:显著嫁接:无伤大雅的归因导向混合和校准标签混合 链接:https://arxiv.org/abs/2112.08796

作者:Joonhyung Park,June Yong Yang,Jinwoo Shin,Sung Ju Hwang,Eunho Yang 备注:12 pages; Accepted to AAAI2022 摘要:Mixup方案通过混合一对样本来构造增强训练样本,最近因能提升神经网络的泛化能力而受到相当大的关注。Mixup的一个简单且广泛使用的扩展是与区域丢弃(regional dropout)类方法相结合:从一个样本中移除随机图块,并用另一个样本的特征加以替换。尽管这些方法简单有效,但由于其随机性,容易产生有害样本。为了解决这个问题,最近提出了"最大显著性"策略:它们只选择信息量最大的特征来避免这种现象。然而,由于总是确定性地选择显著性最大的区域,这些方法又缺乏样本多样性,给增强数据引入了偏差。在本文中,我们提出了一种新颖而简单的Mixup变体,兼具两者的优点。我们的想法包含两个要点。第一,通过随机采样特征并将其"嫁接"到另一个样本上,我们的方法能够有效地生成多样且有意义的样本。第二,以显著性校准的方式混合标签来生成嫁接样本的标签,从而纠正随机采样过程引入的监督误导。我们在CIFAR、Tiny-ImageNet和ImageNet数据集上的实验表明,我们的方案不仅在分类精度上优于当前最先进的增强策略,而且在应对数据损坏和物体遮挡等压力条件时也更为出色。 摘要:The Mixup scheme suggests mixing a pair of samples to create an augmented training sample and has gained considerable attention recently for improving the generalizability of neural networks. A straightforward and widely used extension of Mixup is to combine with regional dropout-like methods: removing random patches from a sample and replacing it with the features from another sample. Albeit their simplicity and effectiveness, these methods are prone to create harmful samples due to their randomness. To address this issue, 'maximum saliency' strategies were recently proposed: they select only the most informative features to prevent such a phenomenon. However, they now suffer from lack of sample diversification as they always deterministically select regions with maximum saliency, injecting bias into the augmented data. In this paper, we present, a novel, yet simple Mixup-variant that captures the best of both worlds. Our idea is two-fold. By stochastically sampling the features and 'grafting' them onto another sample, our method effectively generates diverse yet meaningful samples. Its second ingredient is to produce the label of the grafted sample by mixing the labels in a saliency-calibrated fashion, which rectifies supervision misguidance introduced by the random sampling procedure. Our experiments under CIFAR, Tiny-ImageNet, and ImageNet datasets show that our scheme outperforms the current state-of-the-art augmentation strategies not only in terms of classification accuracy, but is also superior in coping under stress conditions such as data corruption and object occlusion.
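下面是对"随机采样显著区域进行嫁接、并按显著性校准混合标签"这一思想的粗略 PyTorch 示意;显著性用输入梯度近似,分块大小、采样概率与标签校准方式均为假设,并非论文的官方实现。

```python
import torch
import torch.nn.functional as F

def saliency_map(model, x, y):
    """用输入梯度的绝对值近似显著性图,返回 [B,1,H,W]。"""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    return grad.abs().sum(dim=1, keepdim=True)

def saliency_graft(model, xa, ya, xb, yb, num_classes, block=8):
    sb = saliency_map(model, xb, yb)                    # 供体图像 B 的显著性
    sb_blocks = F.avg_pool2d(sb, block)                 # 按区块聚合显著性
    prob = torch.sigmoid(sb_blocks - sb_blocks.mean(dim=(2, 3), keepdim=True))
    mask_blocks = torch.bernoulli(prob)                 # 随机(而非确定性)选块
    mask = F.interpolate(mask_blocks, size=xa.shape[-2:], mode='nearest')
    mixed = xa * (1 - mask) + xb * mask                 # 把 B 的区块嫁接到 A 上
    # 标签按"被嫁接显著性占总显著性的比例"校准
    lam = (sb * mask).sum(dim=(1, 2, 3)) / sb.sum(dim=(1, 2, 3)).clamp(min=1e-8)
    ya_1h = F.one_hot(ya, num_classes).float()
    yb_1h = F.one_hot(yb, num_classes).float()
    mixed_label = (1 - lam)[:, None] * ya_1h + lam[:, None] * yb_1h
    return mixed, mixed_label

# 用法示意(模型与数据均为占位)
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3, padding=1), torch.nn.AdaptiveAvgPool2d(1),
                            torch.nn.Flatten(), torch.nn.Linear(8, 10))
xa, xb = torch.rand(4, 3, 32, 32), torch.rand(4, 3, 32, 32)
ya, yb = torch.randint(0, 10, (4,)), torch.randint(0, 10, (4,))
mixed_x, mixed_y = saliency_graft(model, xa, ya, xb, yb, num_classes=10)
print(mixed_x.shape, mixed_y.shape)
```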

【8】 Forensic Analysis of Synthetically Generated Scientific Images 标题:合成科学图像的取证分析 链接:https://arxiv.org/abs/2112.08739

作者:Sara Mandelli,Davide Cozzolino,Joao P. Cardenuto,Daniel Moreira,Paolo Bestagini,Walter Scheirer,Anderson Rocha,Luisa Verdoliva,Stefano Tubaro,Edward J. Delp 摘要:合成内容的广泛传播是一个严重的威胁,需要采取紧急对策。合成内容的生成并不局限于视频、照片或音频序列等多媒体数据,而是涵盖相当广泛的领域,也可能包括生物图像,例如western blot(蛋白质印迹)图像和显微图像。在这篇论文中,我们重点研究合成western blot图像的检测。生物医学文献中大量使用western blot图像,已有研究表明这类图像很容易被伪造,而且几乎难以通过目视检查或标准取证检测器发现其中的篡改。为了克服缺少公开可用数据集的问题,我们创建了一个新数据集,其中包含超过14K张原始western blot图像和18K张由三种不同的最先进生成方法生成的合成western blot图像。随后,我们研究了检测合成western blot图像的不同策略,既探索了二元分类方法,也探索了单类(one-class)检测器。在这两种场景下,我们在训练阶段都从未使用合成的western blot图像。结果表明,即使所用检测器并未针对这些科学图像的合成版本进行优化,合成的western blot图像仍能被较准确地检出。 摘要:The widespread diffusion of synthetically generated content is a serious threat that needs urgent countermeasures. The generation of synthetic content is not restricted to multimedia data like videos, photographs, or audio sequences, but covers a significantly vast area that can include biological images as well, such as western-blot and microscopic images. In this paper, we focus on the detection of synthetically generated western-blot images. Western-blot images are largely explored in the biomedical literature and it has been already shown how these images can be easily counterfeited with few hope to spot manipulations by visual inspection or by standard forensics detectors. To overcome the absence of a publicly available dataset, we create a new dataset comprising more than 14K original western-blot images and 18K synthetic western-blot images, generated by three different state-of-the-art generation methods. Then, we investigate different strategies to detect synthetic western blots, exploring binary classification methods as well as one-class detectors. In both scenarios, we never exploit synthetic western-blot images at training stage. The achieved results show that synthetically generated western-blot images can be spotted with good accuracy, even though the exploited detectors are not optimized over synthetic versions of these scientific images.
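下面用 scikit-learn 给出"仅用真实图像训练单类检测器"的一个示意:One-Class SVM 只在真实样本特征上拟合,测试时把偏离该分布的样本判为疑似合成;特征此处用随机向量代替(实际中可换成 CNN 或频域特征),数据与分布偏移均为虚构,仅说明单类检测的流程。

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
real_train = rng.normal(0.0, 1.0, size=(500, 64))   # 仅真实图像的特征参与训练
real_test  = rng.normal(0.0, 1.0, size=(100, 64))
fake_test  = rng.normal(1.5, 1.2, size=(100, 64))    # 假设合成图像的特征分布发生偏移

detector = OneClassSVM(kernel='rbf', nu=0.05, gamma='scale').fit(real_train)
pred_real = detector.predict(real_test)               # +1 = 判为真实, -1 = 判为异常(疑似合成)
pred_fake = detector.predict(fake_test)
print("real kept:", (pred_real == 1).mean(), "fake flagged:", (pred_fake == -1).mean())
```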

【9】 Use Image Clustering to Facilitate Technology Assisted Review 标题:使用图像聚类促进技术辅助审查 链接:https://arxiv.org/abs/2112.08604

作者:Haozhen Zhao,Fusheng Wei,Hilary Quatinetz,Han Qin,Adam Dabrowski 备注:2021 IEEE International Conference on Big Data (Big Data) 摘要:在过去十年中,GPU硬件和深度神经网络技术的突破已经彻底改变了计算机视觉领域,使图像分析潜力可用于一系列实际应用。电子发现中的技术辅助审查(TAR)虽然传统上主要处理文本内容,但现在越来越需要将多媒体内容纳入这一范围。在过去几年中,我们为TAR开发了创新的图像分析应用程序,如图像分类、图像聚类和对象检测等。在本文中,我们将根据服务客户的经验,讨论如何使用图像聚类应用程序来促进TAR。我们描述了在任务中利用图像聚类的一般工作流程,并使用实际项目的统计数据来展示在TAR中使用图像聚类的有效性。我们还总结了在TAR中使用图像聚类的经验教训和最佳实践。 摘要:During the past decade breakthroughs in GPU hardware and deep neural networks technologies have revolutionized the field of computer vision, making image analytical potentials accessible to a range of real-world applications. Technology Assisted Review (TAR) in electronic discovery though traditionally has dominantly dealt with textual content, is witnessing a rising need to incorporate multimedia content in the scope. We have developed innovative image analytics applications for TAR in the past years, such as image classification, image clustering, and object detection, etc. In this paper, we discuss the use of image clustering applications to facilitate TAR based on our experiences in serving clients. We describe our general workflow on leveraging image clustering in tasks and use statistics from real projects to showcase the effectiveness of using image clustering in TAR. We also summarize lessons learned and best practices on using image clustering in TAR.
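下面给出把图像聚类用于 TAR 审阅的一个极简流程示意:假设每张图片已由任一预训练模型编码成定长向量(此处以随机向量代替),再用 K-Means 分簇,并为每个簇挑出离质心最近的代表图供人工快速评估;簇数与代表图数量均为假设,并非文中工作流的具体参数。

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 512)).astype(np.float32)   # 1000 张图片的(假设)特征
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(embeddings)
labels = kmeans.labels_

# 每个簇取离质心最近的若干张作为"代表图",供审阅员快速判断整簇相关性
for c in range(3):                                              # 只演示前 3 个簇
    members = np.where(labels == c)[0]
    dists = np.linalg.norm(embeddings[members] - kmeans.cluster_centers_[c], axis=1)
    reps = members[np.argsort(dists)[:5]]
    print(f"cluster {c}: {len(members)} images, representatives {reps.tolist()}")
```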

【10】 Implicit Neural Representations for Deconvolving SAS Images 标题:去卷积SAS图像的隐式神经表示 链接:https://arxiv.org/abs/2112.08539

作者:Albert Reed,Thomas Blanford,Daniel C. Brown,Suren Jayasuriya 摘要:合成孔径声纳(SAS)图像分辨率受波形带宽和阵列几何结构的限制。具体而言,波形带宽决定了模糊场景中点散射体位置的点扩散函数(PSF)。理论上,使用场景PSF对重建的SAS图像进行去卷积,可以恢复散射体的原始分布,并产生更清晰的重建。然而,反褶积是一种对噪声高度敏感的不适定操作。在这项工作中,我们利用隐式神经表示法(INR)来解卷积SAS图像,INR是自然图像空间的强先验。重要的是,我们的方法不需要训练数据,因为我们以自我监督的方式通过综合优化分析来执行反卷积。我们用点散射模型创建的模拟SAS数据和空中圆形SAS捕获的真实数据验证了我们的方法。这项工作是将神经网络应用于SAS图像反褶积的重要第一步。 摘要:Synthetic aperture sonar (SAS) image resolution is constrained by waveform bandwidth and array geometry. Specifically, the waveform bandwidth determines a point spread function (PSF) that blurs the locations of point scatterers in the scene. In theory, deconvolving the reconstructed SAS image with the scene PSF restores the original distribution of scatterers and yields sharper reconstructions. However, deconvolution is an ill-posed operation that is highly sensitive to noise. In this work, we leverage implicit neural representations (INRs), shown to be strong priors for the natural image space, to deconvolve SAS images. Importantly, our method does not require training data, as we perform our deconvolution through an analysis-bysynthesis optimization in a self-supervised fashion. We validate our method on simulated SAS data created with a point scattering model and real data captured with an in-air circular SAS. This work is an important first step towards applying neural networks for SAS image deconvolution.
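下面是"分析-综合"式自监督反卷积的一个简化 PyTorch 示意:用坐标 MLP(带傅里叶位置编码)表示潜在的清晰场景,与已知 PSF 卷积后去拟合观测到的模糊图像;网络结构、PSF 形状与数据均为假设,仅说明无需训练数据的优化流程,并非论文的官方实现。

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
H = W = 64
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing='ij')
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)

def posenc(p, L=6):
    """简单的傅里叶位置编码:[N,2] -> [N, 4L]。"""
    freqs = 2.0 ** torch.arange(L) * math.pi
    enc = torch.cat([torch.sin(p[..., None] * freqs), torch.cos(p[..., None] * freqs)], dim=-1)
    return enc.reshape(p.shape[0], -1)

inr = nn.Sequential(nn.Linear(24, 128), nn.ReLU(), nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 1))
psf = torch.ones(1, 1, 7, 7) / 49.0                  # 假设已知的(此处为均匀)点扩散函数
observed = torch.rand(1, 1, H, W)                    # 重建得到的模糊 SAS 图像(示意数据)

opt = torch.optim.Adam(inr.parameters(), lr=1e-3)
for step in range(500):
    sharp = inr(posenc(coords)).reshape(1, 1, H, W)  # INR 给出的"清晰"场景
    blurred = F.conv2d(sharp, psf, padding=3)        # 用场景 PSF 前向模糊(综合)
    loss = F.mse_loss(blurred, observed)             # 与观测比较(分析)
    opt.zero_grad(); loss.backward(); opt.step()
print("fit loss:", loss.item())
```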

【11】 Visualizing the Loss Landscape of Winning Lottery Tickets 标题:中奖彩票的损失地形可视化 链接:https://arxiv.org/abs/2112.08538

作者:Robert Bain 备注:7 pages, 7 figures, 1 algorithm/pseudocode 摘要:深层神经网络底层的损失地形(loss landscape)对其训练有很大影响,但由于计算开销的限制,以往主要从理论角度对其进行研究。这项工作大大缩短了计算此类损失地形所需的时间,并将其用于研究通过迭代幅度剪枝发现的中奖彩票(winning lottery tickets)。我们还分享了与此前结论相矛盾的结果,即某些损失地形投影方法与模型可训练性及泛化误差之间先前声称的相关性并不成立。 摘要:The underlying loss landscapes of deep neural networks have a great impact on their training, but they have mainly been studied theoretically due to computational constraints. This work vastly reduces the time required to compute such loss landscapes, and uses them to study winning lottery tickets found via iterative magnitude pruning. We also share results that contradict previously claimed correlations between certain loss landscape projection methods and model trainability and generalization error.
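下面给出沿两个随机方向切片、在网格上评估损失以可视化损失地形的基本做法(未包含滤波器归一化等文献中常用的归一化与加速技巧),模型与数据均为示意。

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
x, y = torch.randn(256, 10), torch.randint(0, 2, (256,))
criterion = nn.CrossEntropyLoss()

theta = [p.detach().clone() for p in model.parameters()]   # 当前(或训练好的)参数
d1 = [torch.randn_like(p) for p in theta]                  # 两个随机方向
d2 = [torch.randn_like(p) for p in theta]

def loss_at(a, b):
    """在 theta + a*d1 + b*d2 处评估损失。"""
    with torch.no_grad():
        for p, t, u, v in zip(model.parameters(), theta, d1, d2):
            p.copy_(t + a * u + b * v)
        return criterion(model(x), y).item()

grid = torch.linspace(-1, 1, 11)
surface = torch.tensor([[loss_at(a.item(), b.item()) for b in grid] for a in grid])
print(surface.shape, surface.min().item(), surface.max().item())  # 可交给等高线/曲面绘图
```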

【12】 Predicting Levels of Household Electricity Consumption in Low-Access Settings 标题:低接入环境下的家庭用电量水平预测 链接:https://arxiv.org/abs/2112.08497

作者:Simone Fobi,Joel Mugyenyi,Nathaniel J. Williams,Vijay Modi,Jay Taneja 备注:Accepted to be published in Proceedings of IEEE Winter Conference on Applications of Computer Vision (WACV) 2022 摘要:在低收入环境中,电力公司最关键的信息是客户的预期用电量。在很大一部分家庭尚未通电的情况下,很难进行用电量评估。在此类环境中,预期用电量的绝对水平可能在每月5-100 kWh之间,导致这些客户之间差异很大。如果优先为大量低用电量用户接电,而不是用电量更高的用户,宝贵的资源就会面临浪费的风险。这是低收入环境下首个尝试预测单体建筑(而非行政区域汇总)用电量的此类研究。我们使用来自肯尼亚20000个带地理参考的电力客户(占肯尼亚居民客户的0.01%)的电费账单样本,在电气化前的日间卫星图像上训练卷积神经网络(CNN)。这通过一种两阶段方法实现:该方法借助一种新的建筑物分割方法,利用大量免费卫星图像,从而最大限度地利用稀缺且昂贵的客户数据。我们的方法表明,在单体建筑层面也能达到有竞争力的精度,从而应对用电量高度可变的挑战。这项工作表明,建筑自身的特征及其周边环境对预测用电量水平都很重要。我们还评估了在训练过程中加入低分辨率地理空间数据集(包括夜间灯光和人口普查数据)的效果。凭借对肯尼亚单体建筑层面的精细预测,这些结果已经开始为选址和配电级规划提供参考,而且没有理由不能推广到其他国家。 摘要:In low-income settings, the most critical piece of information for electric utilities is the anticipated consumption of a customer. Electricity consumption assessment is difficult to do in settings where a significant fraction of households do not yet have an electricity connection. In such settings the absolute levels of anticipated consumption can range from 5-100 kWh/month, leading to high variability amongst these customers. Precious resources are at stake if a significant fraction of low consumers are connected over those with higher consumption. This is the first study of its kind in low-income settings that attempts to predict a building's consumption and not that of an aggregate administrative area. We train a Convolutional Neural Network (CNN) over pre-electrification daytime satellite imagery with a sample of utility bills from 20,000 geo-referenced electricity customers in Kenya (0.01% of Kenya's residential customers). This is made possible with a two-stage approach that uses a novel building segmentation approach to leverage much larger volumes of no-cost satellite imagery to make the most of scarce and expensive customer data. Our method shows that competitive accuracies can be achieved at the building level, addressing the challenge of consumption variability. This work shows that the building's characteristics and its surrounding context are both important in predicting consumption levels. We also evaluate the addition of lower resolution geospatial datasets into the training process, including nighttime lights and census-derived data. The results are already helping inform site selection and distribution-level planning, through granular predictions at the level of individual structures in Kenya and there is no reason this cannot be extended to other countries.
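下面给出两阶段思路中第二阶段的一个极简 PyTorch 示意:以分割出的建筑为中心裁剪卫星图块,用小型 CNN 预测其用电量档位;网络结构、图块尺寸与档位划分均为假设,与论文实际模型无关。

```python
import torch
import torch.nn as nn

class ConsumptionNet(nn.Module):
    """占位用的小型 CNN:输入以建筑为中心的卫星图块,输出用电量档位(低/中/高)。"""
    def __init__(self, n_levels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, n_levels)
    def forward(self, tile):
        return self.head(self.features(tile))

model = ConsumptionNet()
tiles = torch.rand(8, 3, 64, 64)                 # 以分割出的建筑为中心的图块(示意)
levels = torch.randint(0, 3, (8,))               # 由账单数据离散化得到的消费档位标签(示意)
loss = nn.functional.cross_entropy(model(tiles), levels)
loss.backward()
print("loss:", loss.item())
```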

